greatwallet opened this issue 2 years ago
Hey,
I just retrained the model on the CUB dataset with a batch size of 3 instead of 6, and got the following results, FYI:
| | FG-NMI | FG-ARI | Full-NMI | Full-ARI |
|---|---|---|---|---|
| paper | 46.0 | 21.0 | 43.5 | 19.6 |
| loading provided weights | 46.03 | 21.03 | 43.52 | 19.58 |
| own training run (batch size 3 instead of 6) | 44.10 | 19.90 | 41.28 | 18.70 |
| this github issue | 39.62 | 17.72 | 39.12 | 18.11 |
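For reference, the FG- variants restrict the clustering metrics to foreground pixels, while the Full variants score all pixels. A minimal sketch of how such numbers can be computed with scikit-learn, assuming `pred` and `gt` are flattened per-pixel part-label arrays and `fg_mask` is a boolean foreground mask (these names are my own, not the repo's):

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_scores(pred, gt, fg_mask=None):
    """NMI/ARI between predicted and ground-truth per-pixel part labels.

    pred, gt: 1-D integer label arrays of equal length (flattened pixels).
    fg_mask:  optional boolean array; when given, scores are restricted
              to foreground pixels (the FG- variants of the metrics).
    """
    if fg_mask is not None:
        pred, gt = pred[fg_mask], gt[fg_mask]
    nmi = normalized_mutual_info_score(gt, pred)
    ari = adjusted_rand_score(gt, pred)
    return nmi, ari
```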
Hello, thank you for your inspiring work! I tried to re-run the code on all of the datasets, but the results were not as promising as those reported in the paper.

For `CUB` and `DeepFashion`, I did not modify any of the code, but the metrics on `CUB` were poor. For `Pascal-Part`, I modified the training hyper-parameters according to your supplementary file. Also, for fairness of evaluation, I trained a foreground segmenter (a `DeepLabV2-ResNet50-2branch`) and used its predicted masks at evaluation. I conducted training on `Car`, `Cat`, and `Horse`. The results on `Horse` were OK, but those on `Cat` and `Car` were really not good.

Could you please provide some analysis of why the re-run results on `CUB` were not good? Also, for `Pascal-Part`, are there any training details left out of the paper, such that I could not reproduce the results? (BTW, could you provide the supervised masks for `PascalPart`?)
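In case it helps pin down where my run diverges, this is roughly what I mean by "used the predicted mask at evaluation", as a sketch; `part_model` and `fg_model` are hypothetical stand-ins, not the repo's actual modules:

```python
import torch

@torch.no_grad()
def predict_parts_with_fg(part_model, fg_model, image):
    """Evaluate part predictions under a *predicted* foreground mask.

    part_model: image -> per-pixel part logits, shape (B, K, H, W)
    fg_model:   image -> fg/bg logits,          shape (B, 2, H, W)
    Both are hypothetical stand-ins for the repo's actual models.
    """
    part_labels = part_model(image).argmax(dim=1)   # (B, H, W)
    fg = fg_model(image).argmax(dim=1)              # (B, H, W), 1 = foreground
    part_labels[fg == 0] = 0                        # force background label outside the mask
    return part_labels
```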