huggingface / optimum-habana

Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU)

Error in tests when test_trainer is run before test_trainer_distributed #31

Open regisss opened 2 years ago

regisss commented 2 years ago

Unit and integration tests currently need to be run with pytest tests/test_gaudi_configuration.py tests/test_trainer_distributed.py tests/test_trainer.py. Otherwise, for instance with pytest tests/, test_trainer will be executed before test_trainer_distributed and the latter will fail without any error message.
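
Until the root cause is fixed, one workaround is to pin the collection order in conftest.py so that a plain pytest tests/ run matches the working order. This is only a sketch, not code from the repo; the file names are taken from the command above:

# conftest.py (hypothetical workaround, not part of optimum-habana)
def pytest_collection_modifyitems(config, items):
    # Lower numbers run first; files not listed keep their place at the end.
    order = {
        "test_gaudi_configuration.py": 0,
        "test_trainer_distributed.py": 1,
        "test_trainer.py": 2,
    }
    items.sort(key=lambda item: order.get(item.fspath.basename, 99))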

The following code snippet in training_args.py should actually not be executed in single-card mode and is responsible for this error:

try:
    # Probe MPI to detect a multi-process (distributed) launch.
    global mpi_comm
    from mpi4py import MPI

    mpi_comm = MPI.COMM_WORLD
    world_size = mpi_comm.Get_size()
    if world_size > 1:
        # Distributed run: record this process's rank.
        rank = mpi_comm.Get_rank()
        self.local_rank = rank
    else:
        # Raising a bare string is invalid in Python 3; raise a real
        # exception so control falls through to the except branch.
        raise RuntimeError("Single MPI process")
except Exception:
    logger.info("Single node run")
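
A possible correction (my own sketch, not code from the repo) is to skip the mpi4py probe unless an MPI launcher actually started several processes. OMPI_COMM_WORLD_SIZE is what Open MPI's mpirun exports; other launchers use different variables, so the check below is an assumption:

import os

# Only probe MPI when an Open MPI launch is detected; "self" and
# "logger" stand in for the surrounding training_args.py context.
if int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1")) > 1:
    from mpi4py import MPI

    mpi_comm = MPI.COMM_WORLD
    self.local_rank = mpi_comm.Get_rank()
else:
    logger.info("Single node run")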

However, even when this is corrected, I still get the following error:

Traceback (most recent call last):
  File "/root/shared/optimum-habana/tests/test_trainer_distributed.py", line 117, in <module>
    trainer = GaudiTrainer(
  File "/usr/local/lib/python3.8/dist-packages/optimum/habana/trainer.py", line 118, in __init__
    super().__init__(
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 382, in __init__
    self._move_model_to_device(model, args.device)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 548, in _move_model_to_device
    model = model.to(device)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 899, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 593, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 897, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: Device acquire failed.

I think this happens because one process may still be holding an HPU when Torch tries to acquire the device.
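
If that is the case, a retry loop around the device transfer should confirm it: the move would eventually succeed once the previous process releases the HPU. A minimal sketch (retry count and delay are arbitrary; model and device stand in for the objects in the traceback):

import time

def move_with_retry(model, device, retries=5, delay=2.0):
    # Retry the transfer in case a previous process is still holding
    # the device and has not released it yet.
    for attempt in range(retries):
        try:
            return model.to(device)
        except RuntimeError as e:
            if "Device acquire failed" not in str(e) or attempt == retries - 1:
                raise
            time.sleep(delay)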

AaTekle commented 1 year ago

This could be occurring for several reasons; from what I can see, GPU access is unavailable. For example:

  1. Improper GPU configuration
  2. GPU drivers not installed
  3. GPU in use by another process

If this problem is occurring because another process is in play, as you stated, try running "nvidia-smi" if you are using Nvidia (on Habana Gaudi, the analogous tool is "hl-smi") to see which processes are using the device.
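
A small helper along those lines, assuming the management tool is on PATH (hl-smi on Gaudi machines, nvidia-smi on Nvidia ones):

import shutil
import subprocess

def list_device_processes():
    # Prefer Habana's hl-smi on Gaudi; fall back to nvidia-smi.
    tool = shutil.which("hl-smi") or shutil.which("nvidia-smi")
    if tool is None:
        raise RuntimeError("No device management tool found on PATH")
    # Both tools print a summary table that includes running processes.
    return subprocess.run([tool], capture_output=True, text=True, check=True).stdout

print(list_device_processes())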

Also, check whether any dependencies/libraries are out of date on your local device.

Hope this can help.