Unfortunately, I have no experience with running PET in a multi-GPU setting (for all of our experiments, we've used a single GPU). As the error occurs at the embedding layer (`return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)`) and the error message says `arguments are located on different GPUs`, my best guess would be that the inputs are on a different device than the model's embedding weights.
The inputs are moved to GPU in this line in wrapper.py, and a `DataParallel` wrapper is put around the model in this line. I found this discussion of an issue that seems to be very similar (it also proposes a solution for the problem that might be worth trying).
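For reference, the usual `DataParallel` pattern looks roughly like this. This is only a minimal sketch, not PET's actual code; the module and shapes are illustrative stand-ins:

```python
import torch

# Stand-in for the wrapped transformer; in PET this would be the preprocessed model.
model = torch.nn.Embedding(30000, 768)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# DataParallel replicates the model across visible GPUs and scatters the batch itself,
# but the parameters must live on the primary device before wrapping.
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)

input_ids = torch.randint(0, 30000, (8, 128))
# Moving the batch to the same device as the embedding weights is what avoids the
# "arguments are located on different GPUs" error at torch.embedding(...).
input_ids = input_ids.to(device)

output = model(input_ids)
```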
Sadly, I don't have the time right now to dive deeper into this issue. If you get PET to work with multiple GPUs, feel free to create a pull request :)
Thank you for the feedback!
Commenting out lines 359-360 in wrapper.py helps in my case. The error is caused by declaring `self.model = torch.nn.DataParallel(self.model)` twice, once during training and once during evaluation.
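An alternative to commenting the lines out would be to guard the wrapping so it only happens once. This is just a sketch; `wrap_data_parallel` is a hypothetical helper, not part of PET:

```python
import torch

def wrap_data_parallel(model: torch.nn.Module) -> torch.nn.Module:
    """Wrap the model in DataParallel at most once, so both the train and eval
    code paths can call this safely without double-wrapping."""
    if torch.cuda.device_count() > 1 and not isinstance(model, torch.nn.DataParallel):
        model = torch.nn.DataParallel(model)
    return model

# Usage (replacing the unconditional wrapping in both places):
# self.model = wrap_data_parallel(self.model)
```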
Ah, that makes sense! Thanks, @rubbybbs - feel free to write a pull request with this modification if you find the time :)
Hello!
I would kindly ask you to revisit this question, if possible: I am trying to run the project with this setup (e.g. for the RTE task), with all dependencies installed from requirements.txt (from the master branch), and I am getting the following error only when using multiple GPUs (on a machine with a single GPU, it works).
Do you have any idea what I should change to make it work, or whether I am doing something wrong? I would appreciate any hints or remarks!