lorenmt / mtan

The implementation of "End-to-End Multi-Task Learning with Attention" [CVPR 2019].
https://shikun.io/projects/multi-task-attention-network
MIT License

Unstable evaluation results on NYUv2 #29

Closed: Manchery closed this issue 4 years ago

Manchery commented 4 years ago

Hello, sorry to bother you. I ran the same code several times but got unstable evaluation results. Here are the numbers:

                 MEAN_IOU  PIX_ACC   ABS_ERR   REL_ERR   MEAN      MED       <11.25   <22.5    <30
semantic, run 1  0.1693    0.548
semantic, run 2  0.178     0.5685
depth, run 1                         0.6336    0.2739
depth, run 2                         0.6398    0.2585
normal, run 1                                            31.8452   26.0211   0.2161   0.4419   0.5633
normal, run 2                                            30.3995   23.9483   0.2406   0.4779   0.5975
normal, run 3                                            29.788    23.3448   0.2453   0.4861   0.6058
split, run 1     0.1858    0.5571    0.6323    0.288     32.4637   27.4159   0.2028   0.4184   0.5412
split, run 2     0.164     0.5255    0.6479    0.3159    33.8992   28.8828   0.1806   0.395    0.5174

Details:

  1. The results above are averages over the last 10 epochs.
  2. I used mostly default options, i.e. I ran:

python model_segnet_single.py --task [task] --dataroot nyuv2

and

python model_segnet_split.py --dataroot nyuv2

  3. I made some modifications to the code, mainly two:
    • moved scheduler.step() as described in the README (see the sketch after this list)
    • changed the batch size from 2 to 8
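
For reference, a minimal sketch of the scheduler.step() placement the README asks for: since PyTorch 1.1, the scheduler must be stepped after the optimizer, once per epoch. The model, data, and schedule below are placeholders of my own, not the repo's actual training loop.

import torch
import torch.nn as nn

# Hypothetical stand-ins; the real SegNet model and NYUv2 loaders come from the repo.
model = nn.Linear(8, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)
criterion = nn.MSELoss()
data = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(10)]

for epoch in range(200):
    for x, y in data:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()   # weight update first
    scheduler.step()       # then the LR schedule, once per epoch, after optimizer.step()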

UPDATE

To make my claim more convincing, here are the results from the unmodified code (except for the location of scheduler.step()):

              ABS_ERR  REL_ERR
depth, run 1  0.6615   0.2859
depth, run 2  0.7053   0.2996
lorenmt commented 4 years ago

Hi,

Thanks for your experiments and detailed analysis.

Yes, NYUv2 is quite a complicated dataset with a small number of samples, and that is one of the major reasons for such oscillation in the final results. I expect you would observe much smaller oscillation in the final performance on the CityScapes dataset, since it is easier and contains more samples.

What I can suggest is:

  1. Try the benchmarking technique I suggested in the README: relative performance improvement over single-task learning, i.e. each task's best performance in multi-task learning across all validation epochs, divided by the single-task validation performance. It should reduce such uncertainty (see the sketch after this list).

  2. Simply run 3 or more times and report the averaged performance.
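
For concreteness, a sketch of the per-task relative improvement this benchmark builds on (the helper name and sign convention are my assumptions; the exact definition is in the paper linked in the README). Lower-is-better metrics are sign-flipped, so a positive delta always means MTL beats the single-task baseline.

def relative_improvement(mtl, stl, lower_is_better):
    """Signed relative improvement of an MTL metric over its single-task value."""
    sign = -1.0 if lower_is_better else 1.0
    return sign * (mtl - stl) / stl

# e.g. depth ABS_ERR (lower is better): split run 1 (0.6323) vs single-task run 1 (0.6336)
print(relative_improvement(0.6323, 0.6336, lower_is_better=True))  # ~ +0.002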

Hope that helps. Sk.

Manchery commented 4 years ago

Hi,

Thanks for your suggestion. I agree that this is caused by the small number of samples, and I will simply run several times and report the average performance.

But for the recommended benchmark, relative improved performance, do you mean $\max_{\text{epoch}} \operatorname{avg}_{\text{task}}(\text{relative improvement})$ or $\operatorname{avg}_{\text{task}} \max_{\text{epoch}}(\text{relative improvement})$? I think it should be the former. The latter is less vulnerable to uncertainty, but I think it violates the spirit of MTL: one purpose of MTL is to reduce inference time, so we should use a single model to predict all tasks.

lorenmt commented 4 years ago

Yes, you should apply the first version; otherwise it would break the advantage of doing multi-task learning. For a more detailed explanation of this method, I would suggest taking a look at the paper I linked in the README.
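
To make the two readings concrete, here is a toy comparison with made-up delta values (nothing below comes from the repo). The first version picks the single epoch whose task-averaged improvement is best, so one checkpoint serves all tasks; the alternative averages each task's own best epoch, which implicitly evaluates a different checkpoint per task.

def first_version(deltas_by_task):
    """Max over epochs of the task-averaged relative improvement."""
    n_epochs = len(next(iter(deltas_by_task.values())))
    epoch_means = [sum(d[e] for d in deltas_by_task.values()) / len(deltas_by_task)
                   for e in range(n_epochs)]
    return max(epoch_means)

def per_task_best(deltas_by_task):
    """Average over tasks of each task's best epoch (breaks the one-model property)."""
    return sum(max(d) for d in deltas_by_task.values()) / len(deltas_by_task)

deltas = {"semantic": [0.01, 0.03, 0.02], "depth": [0.04, 0.00, 0.02]}
print(first_version(deltas))   # 0.025: epoch 1 is jointly best, one checkpoint for everything
print(per_task_best(deltas))   # 0.035: mixes semantic's epoch 2 with depth's epoch 1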

Manchery commented 4 years ago

I think I made a mathematical mistake: the first one is actually less vulnerable to uncertainty. Forget what I said.

Thank you for your patient replies.

Best wishes.