researchmm / TTSR

[CVPR'20] TTSR: Learning Texture Transformer Network for Image Super-Resolution

Issue with the train phase under DataParallel #32

Closed leefly072 closed 3 years ago

leefly072 commented 3 years ago

Dear author: this is a very interesting paper, and thank you very much for sharing the code. However, I run into a problem when training the model on four GPUs; the error below is raised. I would greatly appreciate your help. Thank you again.

Traceback (most recent call last):
  File "/data/lifei/TTSR-master/main.py", line 51, in &lt;module&gt;
    t.train(current_epoch=epoch, is_init=False)
  File "/data/lifei/TTSR-master/trainer.py", line 97, in train
    sr_lv1, sr_lv2, sr_lv3 = self.model(sr=sr).cuda()
  File "/home/lifei/.conda/envs/TTSR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lifei/.conda/envs/TTSR/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/lifei/.conda/envs/TTSR/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/lifei/.conda/envs/TTSR/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/lifei/.conda/envs/TTSR/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/lifei/.conda/envs/TTSR/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/lifei/.conda/envs/TTSR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/lifei/TTSR-master/model/TTSR.py", line 22, in forward
    self.LTE_copy.load_state_dict(self.LTE.state_dict()).cuda()
  File "/home/lifei/.conda/envs/TTSR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 846, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LTE:
    Missing key(s) in state_dict: "slice1.0.weight", "slice1.0.bias", "slice2.2.weight", "slice2.2.bias", "slice2.5.weight", "slice2.5.bias", "slice3.7.weight", "slice3.7.bias", "slice3.10.weight", "slice3.10.bias".
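
If it helps narrow things down: the last application frame is the `self.LTE_copy.load_state_dict(self.LTE.state_dict())` call inside `TTSR.forward`. When the model is wrapped in `nn.DataParallel`, that line runs on every replica, and since PyTorch 1.5 the replicas created by `DataParallel` no longer expose their parameters through `state_dict()`, so the source dict can come back empty or partial and the load fails with missing keys. Below is a minimal sketch that seems to reproduce the same failure mode on a multi-GPU machine; the `Inner`/`Wrapper` classes are illustrative stand-ins, not the repository's modules.

```python
# Minimal sketch (not the TTSR code) of the failure mode: calling
# load_state_dict() inside forward() of a module wrapped in nn.DataParallel.
import torch
import torch.nn as nn


class Inner(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 3, padding=1)

    def forward(self, x):
        return self.conv(x)


class Wrapper(nn.Module):
    def __init__(self):
        super().__init__()
        self.lte = Inner()       # stands in for self.LTE
        self.lte_copy = Inner()  # stands in for self.LTE_copy

    def forward(self, x):
        # On a DataParallel replica, self.lte.state_dict() can miss the
        # parameters, so this raises "Missing key(s) in state_dict".
        self.lte_copy.load_state_dict(self.lte.state_dict())
        return self.lte_copy(x)


if torch.cuda.device_count() >= 2:  # DataParallel only replicates on 2+ GPUs
    model = nn.DataParallel(Wrapper().cuda())
    model(torch.randn(4, 3, 8, 8).cuda())  # raises the RuntimeError above
```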

23vil commented 3 years ago

I have the same problem. Did you solve this issue? I printed the state_dicts of LTE and LTE_copy, together with the GPU each one is running on:

LTE (GPU 1):      odict_keys([])
LTE_Copy (GPU 1): odict_keys([])
LTE (GPU 0):      odict_keys(['sub_mean.weight', 'sub_mean.bias'])
LTE_Copy (GPU 0): odict_keys(['slice1.0.weight', 'slice1.0.bias', 'slice2.2.weight', 'slice2.2.bias', 'slice2.5.weight', 'slice2.5.bias', 'slice3.7.weight', 'slice3.7.bias', 'slice3.10.weight', 'slice3.10.bias', 'sub_mean.weight', 'sub_mean.bias'])

If I set strict = False in load_state_dict then everything runs smoothly. But isn't that just ignoring the problem?
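
Setting `strict=False` does look like it only hides the symptom: on a replica the source `state_dict()` is missing the VGG slice weights, so `LTE_copy` may never actually receive `LTE`'s fine-tuned weights. One workaround to try (a sketch, not the authors' fix; it assumes `self.model` in trainer.py is the `nn.DataParallel`-wrapped TTSR) is to perform the copy on the underlying module before the parallel forward call, and remove the corresponding `load_state_dict` line from `TTSR.forward`:

```python
import torch.nn as nn


def sync_lte_copy(model: nn.Module) -> None:
    """Copy LTE's current weights into LTE_copy on the real module.

    `model` may be the bare TTSR network or its nn.DataParallel wrapper;
    attribute names (LTE, LTE_copy) follow the TTSR repo. Doing the copy
    here, outside forward(), keeps it off the DataParallel replicas.
    """
    net = model.module if isinstance(model, nn.DataParallel) else model
    net.LTE_copy.load_state_dict(net.LTE.state_dict())


# In trainer.py, right before the forward pass that extracts SR features:
#   sync_lte_copy(self.model)
#   sr_lv1, sr_lv2, sr_lv3 = self.model(sr=sr)
```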

Geniussh commented 1 year ago

@23vil Hey, I ran into exactly the same problem, and I saw your question posted on Stack Overflow as well as the PyTorch forum. Did you ever find out the cause?