In train.sh, the following line restricts the CUDA_VISIBLE_DEVICES environment variable to GPUs with compute capability 6.x or higher:
export CUDA_VISIBLE_DEVICES=$(python3 -c "import torch; x=[str(x) for x in range(torch.cuda.device_count()) if torch.cuda.get_device_capability(x)[0]>=6]; print(','.join(x))" 2>/dev/null)
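For reference, the filtering logic inside that one-liner boils down to something like the sketch below. The `visible_devices` helper is my own illustration (not part of the repo); it takes a list of `(major, minor)` capability tuples like those returned by `torch.cuda.get_device_capability`:

```python
def visible_devices(capabilities, min_major=6):
    """Build a CUDA_VISIBLE_DEVICES string from (major, minor) compute
    capability tuples, keeping only devices with major >= min_major.
    Mirrors the filter in train.sh; min_major=6 is the repo's threshold."""
    return ",".join(
        str(i)
        for i, (major, _minor) in enumerate(capabilities)
        if major >= min_major
    )

# A Tesla K80 reports compute capability 3.7, so the default threshold
# of 6 filters it out entirely:
print(visible_devices([(3, 7)]))                # → "" (no devices pass)
print(visible_devices([(3, 7)], min_major=3))   # → "0"
```

With the default threshold, a machine whose only GPU is a K80 ends up with an empty CUDA_VISIBLE_DEVICES, which is what makes the training script report that no CUDA device is available.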
I don't know the reason for this limitation, but it caused an issue in my case: I use a Tesla K80, which has compute capability 3.x. When I ran the training script, the error said that no CUDA device was available.
After removing this limitation (and setting export CUDA_VISIBLE_DEVICES=1 manually), I was able to run the training procedure correctly.
Is this version limitation really needed, or can we remove it?