huggingface / transfer-learning-conv-ai

🦄 State-of-the-Art Conversational AI with Transfer Learning
MIT License

invalid device ordinal #53

Open gdet opened 4 years ago

gdet commented 4 years ago

Hello,

I followed the steps in your article and installed PyTorch with CUDA like this:

   pip3 install torch torchvision

I have Python 3.7, torch 1.1.0, and Ubuntu 18.04. When I try to run this command

  python -m torch.distributed.launch --nproc_per_node=8 ./train.py

I get this error

  WARNING:./train.py:Running process 2
  THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1573049306803/work/torch/csrc/cuda/Module.cpp line=37 error=101 : invalid device ordinal
  Traceback (most recent call last):
    File "./train.py", line 267, in <module>
      train()
    File "./train.py", line 147, in train
      torch.cuda.set_device(args.local_rank)
    File "/home/hatzimin/.conda/envs/maria_env/lib/python3.7/site-packages/torch/cuda/__init__.py", line 300, in set_device
      torch._C._cuda_setDevice(device)

I searched for the error but haven't managed to find a solution. If I run python ./train.py instead, I get no error.

Thank you

sshleifer commented 4 years ago

How many GPUs do you have on your machine? nproc_per_node must equal the number of GPUs on your machine.
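
A minimal sketch of that check, assuming PyTorch is installed. The helpers `visible_gpu_count` and `check_local_rank` are hypothetical and not part of train.py. The idea is to compare the local rank passed by the launcher against `torch.cuda.device_count()` before calling `torch.cuda.set_device`, so that a too-large `--nproc_per_node` fails with a readable message instead of `invalid device ordinal`.

```python
def visible_gpu_count():
    """Return the number of CUDA devices PyTorch can see (0 if torch is missing)."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


def check_local_rank(local_rank, device_count):
    """Hypothetical helper: reject a local rank that has no matching GPU."""
    if not 0 <= local_rank < device_count:
        raise ValueError(
            f"local_rank {local_rank} is out of range: only {device_count} "
            f"GPU(s) visible; set --nproc_per_node to at most {device_count}"
        )
    return local_rank


if __name__ == "__main__":
    # Use this number as the upper bound for --nproc_per_node.
    print("Visible GPUs:", visible_gpu_count())
```

In a training script, something like `torch.cuda.set_device(check_local_rank(args.local_rank, torch.cuda.device_count()))` would be one way to apply it.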

gdet commented 4 years ago

I have four. I had changed the number from 8 to 4, but one of the GPUs was already in use, so I got this error. Thank you!
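
The thread does not say which GPU was busy or exactly how the launch was changed. One common approach, sketched here as an assumption, is to expose only the free GPUs through the `CUDA_VISIBLE_DEVICES` environment variable and start one process per exposed GPU. The function `build_launch` and the GPU ids below are hypothetical.

```python
import os
import sys


def build_launch(free_gpu_ids, script="./train.py"):
    """Build the launcher command and environment for the given free GPUs.

    free_gpu_ids: physical GPU indices that are not in use (assumed known,
    for example from nvidia-smi).
    """
    env = dict(os.environ)
    # Only these GPUs are visible to the processes; they are renumbered from 0.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in free_gpu_ids)
    cmd = [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nproc_per_node={len(free_gpu_ids)}",  # one process per visible GPU
        script,
    ]
    return cmd, env


# Hypothetical case: four GPUs, with GPU 3 already in use.
cmd, env = build_launch([0, 1, 2])
print(" ".join(cmd))
# To actually start training: subprocess.run(cmd, env=env, check=True)
```

Because the visible devices are renumbered from 0, the local ranks 0 to 2 in this example each map to a valid device.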