huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

AttributeError: can't set attribute 'deepspeed_plugin' #735

Open anushka0415 opened 1 week ago

anushka0415 commented 1 week ago

System Info

accelerate                    1.1.1
neuronx-cc                    2.14.227.0+2d4f85be
neuronx-distributed           0.8.0
neuronx-distributed-training  1.0.0
optimum                       1.22.0
optimum-neuron                0.0.25
torch                         2.1.2
torch-neuronx                 2.1.2.2.3.1
torch-xla                     2.1.4
torchvision                   0.16.2
triton                        2.1.0
trl                           0.12.1

Who can help?

@michaelbenayoun @JingyaHuang


Reproduction (minimal, reproducible, runnable)

set -ex

export NEURON_FUSE_SOFTMAX=1
export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
export MALLOC_ARENA_MAX=64
# NEURON_CC_FLAGS value is truncated in the original report
export NEURON_CC_FLAGS="--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cach…"

PROCESSES_PER_NODE=2
NUM_EPOCHS=1
TP_DEGREE=2
PP_DEGREE=1
BS=1
GRADIENT_ACCUMULATION_STEPS=8
LOGGING_STEPS=1
MODEL_NAME="meta-llama/Meta-Llama-3-8B"
OUTPUT_DIR=output-$SLURM_JOB_ID

if [ "$NEURON_EXTRACT_GRAPHS_ONLY" = "1" ]; then
    MAX_STEPS=$((LOGGING_STEPS + 5))
else
    MAX_STEPS=-1
fi

XLA_USE_BF16=1 neuron_parallel_compile torchrun --nproc_per_node $PROCESSES_PER_NODE train.py \
    --model_id $MODEL_NAME \
    --num_train_epochs $NUM_EPOCHS \
    --do_train \
    --learning_rate 5e-5 \
    --warmup_ratio 0.03 \
    --max_steps $MAX_STEPS \
    --per_device_train_batch_size $BS \
    --per_device_eval_batch_size $BS \
    --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
    --gradient_checkpointing true \
    --bf16 \
    --zero_1 false \
    --tensor_parallel_size $TP_DEGREE \
    --pipeline_parallel_size $PP_DEGREE \
    --logging_steps $LOGGING_STEPS \
    --save_total_limit 1 \
    --output_dir $OUTPUT_DIR \
    --lr_scheduler_type "constant" \
    --overwrite_output_dir

Expected behavior

Compilation should complete successfully.

anushka0415 commented 1 week ago

Traceback (most recent call last):
  File "/home/ubuntu/bobble-poc/train_example/train.py", line 112, in <module>
    main()
  File "/home/ubuntu/bobble-poc/train_example/train.py", line 108, in main
    training_function(script_args, training_args)
  File "/home/ubuntu/bobble-poc/train_example/train.py", line 76, in training_function
    trainer = NeuronSFTTrainer(
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 1753, in __init__
    super().__init__(
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 179, in __init__
    super().__init__(*args, **kwargs)
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 1514, in __init__
    return Trainer.__init__(self, *args, **kwargs)
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
    return func(*args, **kwargs)
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/transformers/trainer.py", line 430, in __init__
    self.create_accelerator_and_postprocess()
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 279, in create_accelerator_and_postprocess
    self.accelerator = NeuronAccelerator(
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/optimum/neuron/accelerate/accelerator.py", line 153, in __init__
    super().__init__(**full_kwargs)
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/accelerate/accelerator.py", line 415, in __init__
    self.state = AcceleratorState(
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/optimum/neuron/accelerate/state.py", line 151, in __init__
    self.deepspeed_plugin = None
AttributeError: can't set attribute 'deepspeed_plugin'

vedant123454 commented 1 week ago

Issue: Incorrect Variable Name in state.py

In optimum/neuron/accelerate/state.py, at line 151, the code currently sets:

self.deepspeed_plugin = None

This should be corrected to:

self.deepspeed_plugins = None

Make this change in a local checkout of the repo and build optimum-neuron from source.
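
For context, the error occurs because newer accelerate releases appear to expose deepspeed_plugin on AcceleratorState as a read-only property backed by a plural deepspeed_plugins attribute, so a plain attribute assignment raises. Below is a minimal sketch of that failure mode, using a hypothetical StateSketch class in place of the real AcceleratorState:

# Hypothetical stand-in for accelerate's AcceleratorState; sketch only.
class StateSketch:
    def __init__(self):
        self.deepspeed_plugins = None  # writable backing attribute (plural)

    @property
    def deepspeed_plugin(self):
        # Read-only view with no setter, mirroring newer accelerate versions.
        return self.deepspeed_plugins

state = StateSketch()
print(state.deepspeed_plugin)  # reading works: prints None
try:
    state.deepspeed_plugin = None  # same assignment as state.py line 151
except AttributeError as exc:
    print(exc)  # on Python 3.10: can't set attribute 'deepspeed_plugin'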

michaelbenayoun commented 3 days ago

@vedant123454's solution might work.

As accelerate is a fast-moving library, and we extend it quite a bit in optimum-neuron to make everything work, we pin the supported version and bump it with every release. Right now, the officially supported accelerate version is 0.29.2, but 1.1.1 is installed on your system.
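
Until a release that supports accelerate 1.x is out, the simplest workaround is to downgrade with pip install accelerate==0.29.2. To catch such a mismatch before training starts, a minimal guard along these lines is one option (the pinned version comes from the comment above; packaging should already be available since accelerate depends on it):

# Hedged sketch: fail fast when the installed accelerate version does not
# match the one this optimum-neuron release officially supports.
import accelerate
from packaging import version

SUPPORTED = "0.29.2"  # per the maintainer's comment; adjust for your release
if version.parse(accelerate.__version__) != version.parse(SUPPORTED):
    raise RuntimeError(
        f"optimum-neuron expects accelerate=={SUPPORTED}, found "
        f"{accelerate.__version__}; run `pip install accelerate=={SUPPORTED}`."
    )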