
isl-org / DPT

Dense Prediction Transformers

MIT License


Training Error #50

Closed. kimsunkyung closed this issue 2 years ago

kimsunkyung commented 2 years ago

I want to train your model.

When I don't use nn.DataParallel, training works fine.

But when I use nn.DataParallel, I get this error:

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

I want to train on multiple GPUs. How can I do that?

ranftlr commented 2 years ago

The model in the main repo doesn't support DataParallel. You'll need to use DistributedDataParallel or rewrite the models accordingly. See #15 for a discussion. You could also try the redesigned models in the branch "dpt_scriptable". They might work with DataParallel, but we haven't tried it.
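
For readers who take the DistributedDataParallel route, here is a minimal sketch of the wrapping step. It is not code from this repository: `wrap_for_ddp` is a hypothetical helper name, and it assumes one process per GPU started by `torchrun`, which sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT. The CPU/gloo branch is only there so the sketch can be smoke-tested without GPUs.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_for_ddp(model: nn.Module) -> nn.Module:
    """Wrap `model` in DistributedDataParallel for the current process.

    Hypothetical helper (not part of DPT). Assumes the rendezvous
    environment variables were set by the launcher, e.g. torchrun.
    """
    use_cuda = torch.cuda.is_available()

    # One process group per process; NCCL for GPUs, gloo as a CPU fallback.
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl" if use_cuda else "gloo")

    if use_cuda:
        # Each process owns exactly one GPU, so the input and the weights
        # of its replica always live on the same device.
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        torch.cuda.set_device(local_rank)
        model = model.to(local_rank)
        return DDP(model, device_ids=[local_rank])

    # CPU fallback: no device_ids needed.
    return DDP(model)
```

A training script (say a hypothetical `train.py`) would call `wrap_for_ddp` on the model once at startup and be launched with something like `torchrun --nproc_per_node=2 train.py`. Each process should also move its input batches to its own device, and the data loader would normally use a `DistributedSampler` so that the processes see different samples.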

kimsunkyung commented 2 years ago

Thank you. I tried the 'dpt_scriptable' branch and it worked well.

XiaoyuShi97 commented 2 years ago

Hi, I ran into the same problem when training with the main branch code. As suggested, I switched to the dpt_scriptable branch, but now I get an error when loading the model:

Missing key(s) in state_dict: "pretrained.readout_oper1.0.project.0.weight", .....
Unexpected key(s) in state_dict: "pretrained.act_postprocess1.3.weight", ......

How can I fix it?

ranftlr commented 2 years ago

You also need to download the new weight files linked in the dpt_scriptable branch (the weights and models are the same, but the internal layout is different).
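
A quick way to confirm that a weight file and a model use different layouts is to compare their key names before loading. The sketch below is illustrative only: `diff_state_dict_keys` is a hypothetical helper, and it mirrors the "Missing" / "Unexpected" wording of the error above, where missing keys are those the model expects but the checkpoint lacks.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, both sorted.

    missing    -> keys the model expects but the checkpoint does not have
    unexpected -> keys the checkpoint has but the model does not expect
    """
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# Example with the two key names quoted in this thread.
missing, unexpected = diff_state_dict_keys(
    ["pretrained.readout_oper1.0.project.0.weight"],  # model (dpt_scriptable layout)
    ["pretrained.act_postprocess1.3.weight"],         # checkpoint (main-branch layout)
)
print("missing:", missing)
print("unexpected:", unexpected)
```

With a real model and checkpoint, the two arguments would be `model.state_dict().keys()` and the keys of the dictionary returned by `torch.load(path, map_location="cpu")`, where `path` points to whichever weight file you downloaded. If both lists come back empty, the file matches the branch you are running.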

XiaoyuShi97 commented 2 years ago

It works now. Thx a lot for your prompt reply!