huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

Issue with transformers 4.30.0 and Tutorial Fine-tune BERT for Text Classification on AWS Trainium #136

Closed: aws-amerrez closed this issue 2 months ago

aws-amerrez commented 12 months ago

Running the tutorial to fine-tune BERT with optimum-neuron fails with the error "Found no NVIDIA driver on your system." when using transformers==4.30.0.

Tutorial: https://www.philschmid.de/getting-started-trainium

When downgrading transformers to 4.28.0, the issue does not occur.

The package versions are the same as those suggested in the tutorial:

transformers                  4.30.0
aws-neuronx-runtime-discovery 2.9
libneuronxla                  0.5.326
neuronx-cc                    2.7.0.40+f7c6cf2a3
neuronx-hwm                   2.7.0.3+0092b9d34
optimum-neuron                0.0.7
tensorboard-plugin-neuronx    2.5.37.0
torch-neuronx                 1.13.1.1.8.0
torch-xla                     1.13.1+torchneuron7
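For anyone checking their own environment against the list above, the installed versions can be printed from Python (a minimal sketch using importlib.metadata; the names are the pip distribution names listed above):

# Sketch: print the installed versions of the packages listed above.
# Requires Python 3.8+ for importlib.metadata.
from importlib.metadata import version, PackageNotFoundError

packages = [
    "transformers",
    "optimum-neuron",
    "torch-neuronx",
    "torch-xla",
    "neuronx-cc",
]

for name in packages:
    try:
        print(f"{name}: {version(name)}")
    except PackageNotFoundError:
        print(f"{name}: not installed")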

Error:

Traceback (most recent call last):
  File "/home/ubuntu/train.py", line 147, in <module>
    main()
  File "/home/ubuntu/train.py", line 143, in main
    training_function(args)
  File "/home/ubuntu/train.py", line 93, in training_function
    training_args = TrainingArguments(
  File "<string>", line 111, in __init__
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/training_args.py", line 1340, in __post_init__
    and (self.device.type != "cuda")
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/training_args.py", line 1764, in device
    return self._setup_devices
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/utils/generic.py", line 54, in __get__
    cached = self.fget(obj)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/training_args.py", line 1695, in _setup_devices
    self.distributed_state = PartialState(backend=self.ddp_backend)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/state.py", line 197, in __init__
    torch.cuda.set_device(self.device)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 229, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 39526) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
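For reference, the failure does not need the full training script. Constructing TrainingArguments alone under torchrun already triggers the CUDA device setup shown in the traceback above. A minimal reproduction sketch (the file name train_args_repro.py and the batch size are placeholders, not part of the tutorial):

# train_args_repro.py: minimal reproduction sketch.
# With transformers==4.30.0 on a trn1 instance that has no GPU, TrainingArguments.__post_init__
# resolves self.device through accelerate's PartialState, which calls torch.cuda.set_device
# and raises "Found no NVIDIA driver on your system."
from transformers import TrainingArguments

if __name__ == "__main__":
    # Launch with: torchrun --nproc_per_node=2 train_args_repro.py
    args = TrainingArguments(output_dir="bert-test", per_device_train_batch_size=8)
    print(args.device)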
michaelbenayoun commented 10 months ago

Can you try with the main branch? And if it fails, can you try with transformers >= 4.31 please?
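In the meantime, the Trainium-aware trainer classes from optimum-neuron avoid going through the plain transformers device setup. A minimal sketch, assuming the NeuronTrainingArguments/NeuronTrainer classes available in recent optimum-neuron releases (older releases may expose them under different names), with a tiny in-memory dataset standing in for the tutorial's data:

# Sketch: use optimum-neuron's Trainium-aware trainer classes in place of the
# plain transformers TrainingArguments/Trainer. Class names are taken from
# recent optimum-neuron releases and may differ in older versions such as 0.0.7.
from datasets import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.neuron import NeuronTrainer, NeuronTrainingArguments

model_id = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Tiny in-memory dataset just to keep the sketch self-contained; the tutorial
# prepares its own tokenized dataset instead.
raw = Dataset.from_dict({"text": ["great product", "terrible service"], "label": [1, 0]})

def tokenize(batch):
    # Static padding helps avoid recompilation on Neuron devices.
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=128)

train_dataset = raw.map(tokenize, batched=True)

training_args = NeuronTrainingArguments(
    output_dir="bert-neuron-test",
    per_device_train_batch_size=8,
    num_train_epochs=1,
)

trainer = NeuronTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()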

HuggingFaceDocBuilderDev commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!

HuggingFaceDocBuilderDev commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!