Error in tests when test_trainer is run before test_trainer_distributed

Unit and integration tests currently need to be run in an explicit order: pytest tests/test_gaudi_configuration.py tests/test_trainer_distributed.py tests/test_trainer.py. Otherwise, for instance with pytest tests/, test_trainer is executed before test_trainer_distributed, and the latter fails without any error message.

The following code snippet in training_args.py should actually not be executed in single-card mode and is responsible for this error:

try:
    global mpi_comm
    from mpi4py import MPI

    mpi_comm = MPI.COMM_WORLD
    world_size = mpi_comm.Get_size()
    if world_size > 1:
        rank = mpi_comm.Get_rank()
        self.local_rank = rank
    else:
        raise ("Single MPI process")
except Exception as e:
    logger.info("Single node run")

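One possible way to keep this block out of single-card runs is to check whether the process was started by an MPI launcher before importing mpi4py at all, since importing mpi4py.MPI initializes MPI by default. The following is only a sketch under assumptions: the helper names mpi_world_size_from_env and detect_local_rank are hypothetical and not part of optimum-habana, and OMPI_COMM_WORLD_SIZE is the variable set by Open MPI's mpirun, so other launchers would need a different check.

```python
import os
from typing import Mapping, Optional


def mpi_world_size_from_env(env: Mapping[str, str] = os.environ) -> int:
    """Return the world size advertised by the launcher, or 1 if none is set.

    Assumption: the job is launched with Open MPI, which exports
    OMPI_COMM_WORLD_SIZE to every rank.
    """
    try:
        return int(env.get("OMPI_COMM_WORLD_SIZE", "1"))
    except ValueError:
        # A malformed value is treated as a single-process run.
        return 1


def detect_local_rank(env: Mapping[str, str] = os.environ) -> Optional[int]:
    """Return the MPI rank in a multi-process run, or None in single-card mode."""
    if mpi_world_size_from_env(env) <= 1:
        # Single-card mode: mpi4py is never imported, so MPI is not initialized.
        return None

    # Imported lazily, only when an MPI launcher started this process.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    return comm.Get_rank() if comm.Get_size() > 1 else None
```

With such a helper, the training arguments would set self.local_rank only when detect_local_rank() returns a value, and log "Single node run" otherwise.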
However, even when this is corrected, I still get the following error:

Traceback (most recent call last):
  File "/root/shared/optimum-habana/tests/test_trainer_distributed.py", line 117, in <module>
    trainer = GaudiTrainer(
  File "/usr/local/lib/python3.8/dist-packages/optimum/habana/trainer.py", line 118, in __init__
    super().__init__(
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 382, in __init__
    self._move_model_to_device(model, args.device)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 548, in _move_model_to_device
    model = model.to(device)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 899, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 593, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 897, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: Device acquire failed.

I think this happens because a process may still be running on an HPU when Torch tries to acquire the devices.
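If that hypothesis is right, one workaround to try is running each test file in its own pytest process, one after the other, so that the previous process has exited before the next one asks for a device. This is an untested sketch: the script and its function names are hypothetical, and it has not been verified that process exit is enough for the HPU to be released.

```python
import subprocess
import sys
from typing import List, Sequence

# Order taken from the command that currently works.
TEST_FILES = [
    "tests/test_gaudi_configuration.py",
    "tests/test_trainer_distributed.py",
    "tests/test_trainer.py",
]


def build_commands(test_files: Sequence[str]) -> List[List[str]]:
    """Build one pytest invocation per test file, using the current interpreter."""
    return [[sys.executable, "-m", "pytest", path] for path in test_files]


def run_sequentially(test_files: Sequence[str]) -> int:
    """Run each test file in a separate process and stop at the first failure."""
    for command in build_commands(test_files):
        # subprocess.run blocks until the child process has exited.
        result = subprocess.run(command)
        if result.returncode != 0:
            return result.returncode
    return 0


# Example usage from the repository root:
#     sys.exit(run_sequentially(TEST_FILES))
```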

huggingface/optimum-habana, issue #31