jchenghu / ExpansionNet_v2

Implementation code of the work "Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning"
https://arxiv.org/abs/2208.06551
MIT License

Test inconsistency between "end_to_end" vs. "extract features then evaluate" #4

Closed · alexrame closed this 1 year ago

alexrame commented 1 year ago

Hello, thanks for the great paper and repo. I've been struggling with a potential bug when evaluating performance.

More specifically, with the weights rf_model.pth downloaded from your Google Drive, I ran the evaluation in two ways:

  1. First case: I launched python test.py --is_end_to_end True --save_model_path rf_model.pth and obtained [('CIDEr', 1.3965), ('Bleu_1', 0.8321), ('Bleu_2', 0.6864), ('Bleu_3', 0.5408), ('Bleu_4', 0.4168), ('ROUGE_L', 0.6059), ('SPICE', 0.2428), ('METEOR', 0.3036)]

  2. Second case: I extracted the features with python data_generator.py --save_path rf_model.pth --output_path features_rf.hdf5, then launched python test.py --is_end_to_end False --save_model_path rf_model.pth --features_path features_rf.hdf5 and obtained [('CIDEr', 1.3706), ('Bleu_1', 0.8246), ('Bleu_2', 0.6769), ('Bleu_3', 0.531), ('Bleu_4', 0.4079), ('ROUGE_L', 0.6004), ('SPICE', 0.2395), ('METEOR', 0.2993)]

How can we fix this inconsistency? Best, Alexandre

Remark: To make the second case possible, I only modified this line https://github.com/jchenghu/ExpansionNet_v2/blob/6ea9ce8330881493ca8e630dc2125c2248377723/test.py#LL299C62-L299C62: model.load_state_dict(checkpoint['model_state_dict'], strict=is_end_to_end)
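
For context, strict=False makes PyTorch skip mismatched keys instead of raising; in the non end-to-end case the model presumably lacks the Swin backbone, so the checkpoint's swin_transf.* entries would otherwise fail as unexpected keys. Roughly:

    # With strict=False, load_state_dict returns the non-matching keys
    # instead of raising a RuntimeError (standard PyTorch behavior).
    result = model.load_state_dict(checkpoint['model_state_dict'],
                                   strict=is_end_to_end)
    # When is_end_to_end is False, result.unexpected_keys presumably
    # lists the swin_transf.* backbone weights absent from this model.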

jchenghu commented 1 year ago

Hi @alexrame

Thank you for the detailed report. I was able to reproduce the error and investigated the issue.

You are right: there was indeed a bug in the function load_backbone_only_from_save in data_generator.py. The instruction

name = name.lstrip("swin_transf.")

worked in basically all cases but mangled one single norm layer's identifier in the state dict. str.lstrip treats its argument as a set of characters to strip rather than as an exact prefix, so "swin_transf.norm.weight / bias" was renamed to "orm.weight / bias" (the leading "n" of "norm" was stripped as well) instead of "norm.weight / bias", and that layer was consequently not loaded during feature generation.
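
This is easy to reproduce in plain Python:

    # str.lstrip("swin_transf.") strips any leading characters belonging
    # to the set {s, w, i, n, _, t, r, a, f, .}, not the literal prefix.
    print("swin_transf.norm.weight".lstrip("swin_transf."))
    # -> 'orm.weight'  (the leading 'n' of 'norm' is in the set too)
    print("swin_transf.layers.0.attn.qkv.weight".lstrip("swin_transf."))
    # -> 'layers.0.attn.qkv.weight'  ('l' is not in the set, so it survives)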

I fixed it in the latest commit (*, **) by replacing the line with

name = name[len(prefix):]
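
For reference, the surrounding renaming loop now looks roughly like this (a sketch, not the exact code in data_generator.py; the dict and variable names here are illustrative):

    # Illustrative sketch of the fixed key renaming in
    # load_backbone_only_from_save (names are hypothetical):
    prefix = 'swin_transf.'
    backbone_state_dict = {}
    for name, weight in checkpoint['model_state_dict'].items():
        if name.startswith(prefix):
            # Slice off the exact prefix by length, so
            # 'swin_transf.norm.weight' correctly becomes 'norm.weight'.
            backbone_state_dict[name[len(prefix):]] = weight
    backbone.load_state_dict(backbone_state_dict)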

Thank you again for the report; the bug was quite difficult to find. It may have slightly penalized the results reported in the paper, but future training runs (and users) will surely benefit from this discovery.

(*) I also included your "strict=is_end_to_end" option.
(**) Trivia: I also fixed some minor argument-naming inconsistencies.

Best regards, Jia

alexrame commented 1 year ago

Thanks for the quick reply. I can confirm that your commit fixes the issue. Best, Alex