muchengxue0911 closed this issue 8 months ago
Thanks for your interest!
The result is strange; your commands should not produce it. Could you provide more information, such as which CLIP model you used and whether you modified the code?
I downloaded the CLIP model from your link: https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt . I only changed the VOC path and did not modify the code.
I used the same environment as you and re-ran the code. The result is normal for me.
1464 images to eval
1 0.7084182165521185
2 0.6600291742360047
{'Pixel Accuracy': 0.9101851882152913, 'Mean Accuracy': 0.8892877034348168, 'Frequency Weighted IoU': 0.8450196609472428, 'Mean IoU': 0.7084182165521185, 'Class IoU': {0: 0.8902073794443365, 1: 0.65836592550838, 2: 0.486400383690162, 3: 0.7686163773404336, 4: 0.5633237505301961, 5: 0.5757771927686446, 6: 0.8440604824288674, 7: 0.75357773540567, 8: 0.8705375719476196, 9: 0.4874239107791828, 10: 0.8117929379499486, 11: 0.6299910314353231, 12: 0.868078446176208, 13: 0.8069361322790224, 14: 0.7739665005600384, 15: 0.633124464824793, 16: 0.5484016946872441, 17: 0.8589386277035465, 18: 0.7053738870822491, 19: 0.7689125257829013, 20: 0.5729755892697185}}
I have uploaded the generated CAMs to Google Drive. You can run eval_cam.py on them to check whether the problem is caused by CAM generation or by evaluation.
Thanks for your help. I have solved the problem.
Thank you for your solid work. I am trying to reproduce your results, but I only get a CAM mIoU of 65.3. My environment is PyTorch 1.9.0 with CUDA 11.1 on an RTX 3090. I changed the file paths and ran the following commands:
CUDA_VISIBLE_DEVICES=0 python generate_cams_voc12.py --split_file ./voc12/train_aug.txt --num_workers 1 --cam_out_dir ./output/voc12/cams
python eval_cam.py --cam_out_dir ./output/voc12/cams --cam_type attn_highres --split_file ./voc12/train.txt
and the result is:
1464 images to eval
1 0.6530047267030804
2 0.6320998620387147
{'Pixel Accuracy': 0.8778208683287688, 'Mean Accuracy': 0.8145488664308644, 'Frequency Weighted IoU': 0.7928293309400717, 'Mean IoU': 0.6530047267030804, 'Class IoU': {0: 0.8535294292729055, 1: 0.6130334632912042, 2: 0.6508222055180352, 3: 0.7070871766635738, 4: 0.5137165538355816, 5: 0.5389611068698299, 6: 0.7805507459439889, 7: 0.6934887957581246, 8: 0.7957022407450646, 9: 0.3969126198448379, 10: 0.763830780381508, 11: 0.4783550023269957, 12: 0.8016436417834419, 13: 0.7574893461824923, 14: 0.7504945026731186, 15: 0.5885225316760763, 16: 0.520919667740022, 17: 0.7986631191640343, 18: 0.5163718776232233, 19: 0.6667840730078058, 20: 0.526220380462825}}
Is there any advice for reproducing your result?
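When two runs disagree like this, a per-class comparison can help narrow down where the gap comes from. The sketch below is not part of the repository. It takes the two 'Class IoU' dictionaries quoted in this thread, checks that each 'Mean IoU' is the plain average of its per-class values, and ranks the classes by how much the reproduced run falls short. The names `reference`, `reproduced`, `mean_iou` and `iou_gaps` are made up for illustration.

```python
# Per-class IoU copied from the two eval_cam.py outputs quoted in this thread.
# `reference` is the maintainer's run; `reproduced` is the run that reached 65.3.
reference = {
    0: 0.8902073794443365, 1: 0.65836592550838, 2: 0.486400383690162,
    3: 0.7686163773404336, 4: 0.5633237505301961, 5: 0.5757771927686446,
    6: 0.8440604824288674, 7: 0.75357773540567, 8: 0.8705375719476196,
    9: 0.4874239107791828, 10: 0.8117929379499486, 11: 0.6299910314353231,
    12: 0.868078446176208, 13: 0.8069361322790224, 14: 0.7739665005600384,
    15: 0.633124464824793, 16: 0.5484016946872441, 17: 0.8589386277035465,
    18: 0.7053738870822491, 19: 0.7689125257829013, 20: 0.5729755892697185,
}
reproduced = {
    0: 0.8535294292729055, 1: 0.6130334632912042, 2: 0.6508222055180352,
    3: 0.7070871766635738, 4: 0.5137165538355816, 5: 0.5389611068698299,
    6: 0.7805507459439889, 7: 0.6934887957581246, 8: 0.7957022407450646,
    9: 0.3969126198448379, 10: 0.763830780381508, 11: 0.4783550023269957,
    12: 0.8016436417834419, 13: 0.7574893461824923, 14: 0.7504945026731186,
    15: 0.5885225316760763, 16: 0.520919667740022, 17: 0.7986631191640343,
    18: 0.5163718776232233, 19: 0.6667840730078058, 20: 0.526220380462825,
}


def mean_iou(class_iou):
    """Unweighted average of the per-class IoU values."""
    return sum(class_iou.values()) / len(class_iou)


def iou_gaps(ref, other):
    """Return (class_id, ref - other) pairs, largest shortfall first."""
    gaps = [(cls, ref[cls] - other[cls]) for cls in ref]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(f"reference mIoU:  {mean_iou(reference):.4f}")
    print(f"reproduced mIoU: {mean_iou(reproduced):.4f}")
    for cls, gap in iou_gaps(reference, reproduced):
        print(f"class {cls:2d}: {gap:+.4f}")
```

On these numbers the largest shortfalls are in classes 18, 11 and 19, while class 2 is the only one where the reproduced run scores higher. A difference that is uneven across classes is worth keeping in mind when checking whether the CAM generation step or the evaluation step is responsible.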