Open regisss opened 2 years ago
From what I can see, GPU access seems to be unavailable, which could have many causes.
If the problem is caused by another process still using the device, as you suggested, try running "nvidia-smi" (if you are using an Nvidia GPU) to see which processes are currently using the GPU.
Also, check whether any dependencies/libraries are out of date on your local device.
Hope this helps.
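For anyone who wants to script that check, here is a minimal sketch that calls nvidia-smi from Python and lists the processes holding a GPU. It assumes an Nvidia setup with nvidia-smi on the PATH; the function names (parse_compute_apps, list_gpu_processes) are made up for illustration, and it does not apply to HPUs.

```python
import csv
import io
import shutil
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' CSV rows into dicts."""
    rows = []
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for fields in reader:
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        pid, name, memory = (field.strip() for field in fields)
        rows.append({"pid": pid, "name": name, "used_memory": memory})
    return rows


def list_gpu_processes():
    """Return the processes using an Nvidia GPU, or None if the query is not possible."""
    if shutil.which("nvidia-smi") is None:
        return None  # nvidia-smi is not installed or not on the PATH
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,process_name,used_memory",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None  # e.g. the driver is not loaded
    return parse_compute_apps(result.stdout)
```

Calling list_gpu_processes() before starting a run gives a list of dicts, one per process; an empty list suggests no compute process is holding the GPU at that moment.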
Unit and integration tests currently need to be run with
pytest tests/test_gaudi_configuration.py tests/test_trainer_distributed.py tests/test_trainer.py
Otherwise, for instance with "pytest tests/", test_trainer is executed before test_trainer_distributed and the latter fails without any error message.
The following code snippet in training_args.py should actually not be executed in single-card mode and is responsible for this error:
However, even when this is corrected, I still get the following error:
I think this is because one process may still be running on an HPU when Torch tries to acquire devices.
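One way to probe this hypothesis is to run each test file in its own pytest process, in the order given above, so that every process has exited before the next one starts. The sketch below is only an illustration: the helper names (build_commands, run_in_order) are hypothetical, and whether a separate process is enough for the HPU to be released is an assumption, not something verified here.

```python
import subprocess
import sys

# Order taken from the pytest command quoted in this issue.
TEST_FILES = [
    "tests/test_gaudi_configuration.py",
    "tests/test_trainer_distributed.py",
    "tests/test_trainer.py",
]


def build_commands(test_files, python=sys.executable):
    """Build one 'python -m pytest <file>' command per test file."""
    return [[python, "-m", "pytest", path] for path in test_files]


def run_in_order(commands):
    """Run each command to completion, one after another, and return the exit codes."""
    exit_codes = []
    for command in commands:
        # subprocess.run blocks until the child process has exited.
        completed = subprocess.run(command)
        exit_codes.append(completed.returncode)
    return exit_codes
```

Running run_in_order(build_commands(TEST_FILES)) from the repository root returns one exit code per file; any non-zero value marks a failing file. If the distributed tests pass this way but fail under "pytest tests/", that would support the idea that a lingering process is holding the device.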