@jinhuan-hit Your baseline performance seems normal (within std). Can you provide more info here, e.g. what shell command you used for the 5 iterations and from which iteration the performance starts to degrade?
The bad-looking pseudo labels and the performance degradation are probably connected.
Thanks for your quick reply! I use the shell script dmt-city-8-1.sh. To my surprise, after the first iteration the metric degrades to 19.64 (ImageNet init). The model initialized from COCO is still around 59 or a little lower.
@jinhuan-hit It could have been a bad code merge; I'll rerun that shell tonight and get back to you tomorrow.
p.s. Do you observe similar issues in PASCAL VOC experiments?
Thanks for your reproduction! I'm looking forward to your results. I'm sorry, but I don't have the PASCAL VOC dataset, so the PASCAL VOC experiments haven't been done. Actually I'm more interested in the results on Cityscapes. Could you please share the results after each iteration?
@jinhuan-hit I've re-cloned the project and re-run dmt-city-8-1.sh. I found that the hyper-parameters were set wrong in that shell script and have now rectified them to the hyper-parameters in the paper.
However, even with the incorrect hyper-parameters, I still get these results:
dmt-city-8-1__p0--c: 59.43004488945007
dmt-city-8-1__p0--i: 60.73184013366699
dmt-city-8-1__p1--c: 60.44157147407532
dmt-city-8-1__p1--i: 56.25389814376831
dmt-city-8-1__p2--c: 61.65197491645813
dmt-city-8-1__p2--i: 59.52984690666199
dmt-city-8-1__p3--c: 62.01944947242737
dmt-city-8-1__p3--i: 60.99923849105835
dmt-city-8-1__p4--c: 62.59671449661255
dmt-city-8-1__p4--i: 61.493390798568726
dmt-city-8-1__p5--c: 62.48372197151184
dmt-city-8-1__p5--i: 61.651426553726196
Could you check your tensorboard log and see if there are abnormalities like NaN loss values?
With the correct script, you should expect something a bit higher than 63, since I got 63.3628 on this split back when the paper was written.
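For a quick automated check, a small sketch like this can scan the TensorBoard event files for NaN or Inf scalar values (the log directory path below is just a placeholder, not the repo's actual layout):
```python
# Sketch: scan a TensorBoard log for NaN/Inf scalars; the log path is a placeholder.
import math
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

acc = EventAccumulator('path/to/your/tensorboard/logdir')  # hypothetical path
acc.Reload()
for tag in acc.Tags()['scalars']:
    for event in acc.Scalars(tag):
        if math.isnan(event.value) or math.isinf(event.value):
            print(f'{tag}: NaN/Inf at step {event.step}')
```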
Another thought: could you show your pip list result and check your data_lists, are they similar to the ones I used?
Thanks for your kindness and reproduction! The hyper-parameters may influence the metric just a little; I don't think that's the main problem. I use pytorch 1.1, so maybe it is a torch version problem? I'm sorry, but I can't get access to the txt files. My 8_labeled_1.txt looks like this and contains 371 images. The 8_unlabeled_1.txt contains 2604 images. I just generated them with the script generate_splits.py.
cologne/cologne_000021_000019 strasbourg/strasbourg_000001_036232 hanover/hanover_000000_053437 hanover/hanover_000000_000164 dusseldorf/dusseldorf_000136_000019 dusseldorf/dusseldorf_000066_000019 aachen/aachen_000008_000019 strasbourg/strasbourg_000001_007524 strasbourg/strasbourg_000000_029179 bremen/bremen_000013_000019 monchengladbach/monchengladbach_000000_001294 dusseldorf/dusseldorf_000059_000019 tubingen/tubingen_000098_000019 strasbourg/strasbourg_000000_025089 ulm/ulm_000037_000019 zurich/zurich_000015_000019 tubingen/tubingen_000014_000019 tubingen/tubingen_000126_000019 darmstadt/darmstadt_000029_000019 hanover/hanover_000000_048274
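For reference, here is a tiny sketch (assuming whitespace-separated image IDs, as pasted above) to double-check that the split files really contain 371 and 2604 entries:
```python
# Sketch: sanity-check the split lists; assumes whitespace-separated image IDs.
for name in ('8_labeled_1.txt', '8_unlabeled_1.txt'):
    with open(name) as f:
        ids = f.read().split()
    print(name, '->', len(ids), 'images, e.g.', ids[:2])
```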
@jinhuan-hit Your dataset seems all right. The torch version could be an issue. I use torch 1.2.0 and torchvision 0.4.0 to be compatible with apex. But your ImageNet init result does seem very weird. Could you maybe use the exact torch and apex versions and re-clone the code to run the latest shell script again? The apex code is provided in README.md.
@voldemortX OK, I will use torch 1.2.0 and torchvision 0.4.0 to run it again. I will post any new results here. Thanks for your guidance. Also, could you please share your pseudo labels from phase 0 (p0), the 59 mIoU model? I visualized the pseudo labels, but they are not particularly good. Is that normal? Here is the visualization of aachen_000021_000019_gtFine_labelIds.npy.
This is the loss curve of the ImageNet init on phase 1. I don't know why the loss increases so much after a few steps. Maybe that's why the result is not good.
You're welcome. FYI, apex includes loss scaling, so it may be crucial for preventing gradient explosions, which are possibly what led to your 10-20 mIoU results. If you can't download Google Drive files, please let me know and I'll get them to you via other means.
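For context, this is roughly how apex amp and its loss scaling are usually wired into a training step (a minimal sketch with a toy model, not necessarily identical to this repo's code); the dynamic loss scaler is what guards fp16 gradients against overflow:
```python
# Sketch of typical apex amp usage (toy model); not necessarily this repo's exact code.
import torch
from apex import amp

model = torch.nn.Linear(10, 2).cuda()                      # toy stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')  # enable mixed precision

x = torch.randn(4, 10).cuda()
y = torch.randint(0, 2, (4,)).cuda()
loss = criterion(model(x), y)
optimizer.zero_grad()
with amp.scale_loss(loss, optimizer) as scaled_loss:       # dynamic loss scaling happens here
    scaled_loss.backward()
optimizer.step()
```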
Could you please share your pseudo labels from phase 0 (p0), the 59 mIoU model? I visualized the pseudo labels, but they are not particularly good. Is that normal? Here is the visualization of aachen_000021_000019_gtFine_labelIds.npy.
The first iteration only labels 20% of the pixels across the entire dataset, so it could look rather weird. Do you need the npy file or the visualized result?
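If it helps, a quick way to eyeball such an .npy file is something like this sketch; it assumes the array holds per-pixel label IDs with 255 as the ignore/unlabeled value:
```python
# Sketch: visualize a pseudo-label .npy; assumes 255 marks ignored/unlabeled pixels.
import numpy as np
import matplotlib.pyplot as plt

labels = np.load('aachen_000021_000019_gtFine_labelIds.npy')
print('labeled pixel ratio:', (labels != 255).mean())      # roughly 0.2 expected in iteration 1
plt.imshow(np.ma.masked_equal(labels, 255), cmap='tab20')  # masked pixels show as blank
plt.title('pseudo label (ignored pixels masked)')
plt.show()
```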
This is the loss curve of the ImageNet init on phase 1. I don't know why the loss increases so much after a few steps. Maybe that's why the result is not good.
That explains it! I think your gradients exploded in training. I just ran the corrected shell, and the loss curve looks like this (~0.1):
I think the most probable issue is that you installed a faulty apex version.
@jinhuan-hit In summary, you should start a new Python 3 virtual environment (virtualenv/conda) and install pytorch & torchvision as in README.md. Only change git clone ...apex... to use the one I provided from Google Drive.
@voldemortX Thanks for your kindness. I'll reply to your suggestions one by one.
1. My email is jinhuan_hit@163.com. We can communicate through it.
2. Either the npy file or the visualized result is fine; I just want to verify that phase 0 is OK.
3. Yes, the gradients overflow in my training procedure (ImageNet init, phase 0), like this.
4. Maybe that's why the result is not good. However, I asked my friend to download the apex package a while ago; maybe it's not compatible with torch 1.1 and torchvision 0.3.0? In my opinion, the apex package is only used to accelerate training, so is it OK for me to set mixed-precision to False? I want to try torch 1.2 and torchvision 0.4.0, but unfortunately the NVIDIA driver and CUDA version are not compatible with it. That may not be easy to fix, because other people are using the machine together with me.
@jinhuan-hit Yes, the apex installation needs exactly torch 1.2.0 & torchvision 0.4.0. Are you using CUDA 9? I'll send you the npy file through email as soon as my currently running scripts finish. The apex project from NVIDIA has various issues, and the new PyTorch 1.6 has native amp that is much more stable. But for reproduction, I'd like to keep the original environment setting here (NVIDIA's apex). It should be okay to set mixed-precision to False (if your card has enough memory; I did not try that myself), although I do believe gradient clipping is adopted in apex somehow. If you wish to do that, you need to re-train the baselines using full precision as well. NVIDIA apex does not support loading weights trained in mixed precision into full-precision training.
EDIT: Also, PyTorch >= 1.0 mostly satisfies backward compatibility (BC), which means a PyTorch 1.1 program can run on PyTorch 1.2, but I can't guarantee the opposite. I'm not very familiar with PyTorch's development before 1.5.
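For completeness, the native AMP mentioned above (PyTorch >= 1.6) would look roughly like this; it is shown only as the more stable alternative, not as what this repo uses for reproduction:
```python
# Sketch of PyTorch-native AMP (torch >= 1.6), shown only as the alternative mentioned above.
import torch

model = torch.nn.Linear(10, 2).cuda()                      # toy stand-in
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(4, 10).cuda()
y = torch.randint(0, 2, (4,)).cuda()
optimizer.zero_grad()
with torch.cuda.amp.autocast():                            # forward pass in mixed precision
    loss = criterion(model(x), y)
scaler.scale(loss).backward()                              # scaled backward, like apex's scale_loss
scaler.step(optimizer)
scaler.update()
```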
@voldemortX Yes, I'm using CUDA 9 and it is not compatible with torch 1.2. I set mixed precision to False and loading the weights seems okay? The result, 44.38, is better than before.
The 2nd epoch is only 8.76.
In summary, if I want to get a better result, maybe I should re-train the baselines using full precision, or try it on a cloud machine with torch 1.2 and torchvision 0.4.0.
Yes, gradients usually explode when you load a model trained with mixed precision and train without amp.
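If someone does continue from a mixed-precision checkpoint in full precision anyway, a generic safeguard (not something the paper or this repo prescribes) is to clip gradients manually before each optimizer step, e.g.:
```python
# Sketch: manual gradient clipping as a generic safeguard in full-precision training;
# the max_norm value and the toy model are illustrative only.
import torch

model = torch.nn.Linear(10, 2).cuda()                      # toy stand-in
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

x = torch.randn(4, 10).cuda()
y = torch.randint(0, 2, (4,)).cuda()
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
optimizer.step()
```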
Thanks for your reply, I have received your email! I compared the images with 20% labeled pixels and they are quite close to my reproduced baseline results, so the problem should be in the later pseudo-label training, possibly caused by a mismatch between the apex version and torch. I will verify this from both the full-precision and mixed-precision sides. One remaining question: if mixed precision is the problem, why did it not show up when training the baseline? That part puzzles me.
@jinhuan-hit apex, especially its older versions, does have quite a few problems; since it has always been a dev release, if you browse its repo you will find many similar issues about gradient explosion when resuming training... I originally wanted to upgrade to torch 1.6's native amp, but I was worried about reproducibility, and my resources are all invested in another repo, so I never got to it. The problem is mainly in apex's amp loading; baseline training is probably unaffected because it does not involve that load.
@voldemortX Understood, resources are tight. I see what you mean: the loading step later on may be the problem, which is why full-precision training that is not started from scratch went wrong; the other possibility, as you said, is that apex is unstable and may not be fully compatible with torch 1.1, which broke the training. I'm now running both full-precision and mixed-precision training, and I will post any updates here.
@voldemortX Thanks for pointing out the problem. It was indeed a mismatch between the apex and torch packages. Using the apex package you provided + torch 1.2 + torchvision 0.4 with mixed-precision training, everything looks normal so far; here are the results of the first two iterations. The full-precision results are not out yet; I'll post them once they are.
These results look quite normal, so there should be no problem. Thanks for your thorough investigation!
To summarize for non-Chinese speakers: in this issue we found that torch1.1 + apex leads to gradient explosion after loading (which happens after baseline training) and cannot be used for this project yet (if mixed-precision training is used).
Thanks for your careful answers. And that summary at the end is really thoughtful!
The full-precision results are as follows. The accuracy is slightly lower than with mixed-precision training, but it still reaches the level reported in the paper. Many thanks to the author for sharing!
You're welcome! Some models indeed show small differences between full precision and mixed precision, but overall the between-run variance is small. In the paper, the standard deviation of the mixed-precision Cityscapes experiment on this split is 0.62.
Since the question was resolved a long time ago, this issue is now closed. Feel free to reopen if there are further questions.
Thanks for sharing this great work! I have a question. When I train Cityscapes using 1/8 labeled data, the two models (init from COCO and ImageNet) can both reach nearly 59 mIoU on the val set, close to the 59.65 presented in the paper. However, after 5 iterations the metric descends to 53 (COCO) and 22 (ImageNet). I checked the pseudo labels using the 59 mIoU model and they are not particularly good. I don't know if that affected the results.