jchenghu / ExpansionNet_v2

Implementation code of the work "Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning"
https://arxiv.org/abs/2208.06551
MIT License

Hi, I ran into trouble when trying to use the ensemble test. Please help me! Thanks! #10

Closed PanYuQi66666666 closed 6 months ago

PanYuQi66666666 commented 7 months ago

Ensembling Evaluation
Detected checkpoints: ['/mnt/workspace/ExpansionNet_v2/github_ignore_material/saves/first_base/phase6_checkpoint.pth', '/mnt/workspace/ExpansionNet_v2/github_ignore_material/saves/first_base/phase3_checkpoint.pth', '/mnt/workspace/ExpansionNet_v2/github_ignore_material/saves/first_base/phase5_checkpoint.pth', '/mnt/workspace/ExpansionNet_v2/github_ignore_material/saves/first_base/phase2_checkpoint.pth']
Traceback (most recent call last):
  File "/mnt/workspace/ExpansionNet_v2/test.py", line 455, in <module>
    spawn_train_processes(is_end_to_end=args.is_end_to_end,
  File "/mnt/workspace/ExpansionNet_v2/test.py", line 379, in spawn_train_processes
    mp.spawn(test,
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 163, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/mnt/workspace/ExpansionNet_v2/test.py", line 319, in test
    ddp_model = get_ensemble_model(model, checkpoints_list, rank=rank)
  File "/mnt/workspace/ExpansionNet_v2/test.py", line 235, in get_ensemble_model
    model.load_state_dict(checkpoint['model_state_dict'])
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for End_ExpansionNet_v2:
    Missing key(s) in state_dict: "swin_transf.patch_embed.proj.weight", "swin_transf.patch_embed.proj.bias", "swin_transf.patch_embed.norm.weight.....................", "

Please help me! Thanks!

jchenghu commented 7 months ago

Hi Pan,

If the problem persists, can you share the command that generated this output? I think it is caused by incorrectly specifying whether the models in the ensemble are end-to-end or not.

Also, I see that there are different types of models in your checkpoints folder, for instance:

.../phase6_checkpoint.pth -> this is end-to-end
.../phase3_checkpoint.pth -> this is end-to-end
.../phase5_checkpoint.pth -> this is not end-to-end
.../phase2_checkpoint.pth -> this is not end-to-end

(edit: before the edit, I got confused for a moment about phase2 and phase3, because phases are named after the steps in the README files; as a result, there is no phase1 or phase4. My bad.)

Try, for example, removing phase2 and phase5, or phase3 and phase6.

For now, the code supports ensembling only homogeneous architectures: all checkpoints should be either end-to-end (backbone + refining model) or not end-to-end (refining model only, without the backbone). In the first case, set the is_end_to_end argument to True; in the latter, set it to False.
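To check which type each checkpoint is before ensembling, something like the following minimal sketch might help. It is not part of the repository: the helper names (has_backbone, group_by_type, load_state_dicts) are hypothetical. It assumes that backbone weights are stored under names containing "swin_transf.", as the missing keys in the error message suggest, and that each checkpoint stores its weights under 'model_state_dict', as in the traceback.

```python
from collections import defaultdict

# Assumption: backbone parameters contain this prefix in their names, as
# suggested by the missing keys in the error ("swin_transf.patch_embed...").
BACKBONE_PREFIX = "swin_transf."


def has_backbone(state_dict_keys, prefix=BACKBONE_PREFIX):
    """Return True if any parameter name belongs to the backbone.

    Uses a substring match so that a wrapper prefix such as "module."
    does not hide the backbone keys.
    """
    return any(prefix in key for key in state_dict_keys)


def group_by_type(named_state_dicts):
    """Split {checkpoint_name: state_dict} into the two model types."""
    groups = defaultdict(list)
    for name, state_dict in named_state_dicts.items():
        kind = "end_to_end" if has_backbone(state_dict.keys()) else "not_end_to_end"
        groups[kind].append(name)
    return dict(groups)


def load_state_dicts(paths):
    """Load the 'model_state_dict' entry of each checkpoint on the CPU."""
    import torch  # imported lazily so the helpers above work without PyTorch

    return {p: torch.load(p, map_location="cpu")["model_state_dict"] for p in paths}
```

Calling group_by_type(load_state_dicts(checkpoints_list)) on the detected checkpoints should return a single group if the folder is homogeneous. If both "end_to_end" and "not_end_to_end" appear, keep only one group and set is_end_to_end accordingly.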

Let me know if it helps!

PanYuQi66666666 commented 6 months ago

@jchenghu Oh, I have solved the problem with your help, thanks!!

jchenghu commented 6 months ago

I'm glad I helped, you're welcome!

As usual, feel free to open a new issue in case of other problems/questions.