anushka0415 opened 1 week ago
Traceback (most recent call last):
  File "/home/ubuntu/bobble-poc/train_example/train.py", line 112, in <module>
  File "/home/ubuntu/bobble-poc/train_example/train.py", line 108, in main
    training_function(script_args, training_args)
  File "/home/ubuntu/bobble-poc/train_example/train.py", line 76, in training_function
    trainer = NeuronSFTTrainer(
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 1753, in __init__
    super().__init__(
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 179, in __init__
    super().__init__(*args, **kwargs)
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 1514, in __init__
    return Trainer.__init__(self, *args, **kwargs)
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
    return func(*args, **kwargs)
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/transformers/trainer.py", line 430, in __init__
    self.create_accelerator_and_postprocess()
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 279, in create_accelerator_and_postprocess
    self.accelerator = NeuronAccelerator(
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/optimum/neuron/accelerate/accelerator.py", line 153, in __init__
    super().__init__(**full_kwargs)
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/accelerate/accelerator.py", line 415, in __init__
    self.state = AcceleratorState(
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/optimum/neuron/accelerate/state.py", line 151, in __init__
    self.deepspeed_plugin = None
AttributeError: can't set attribute 'deepspeed_plugin'
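
For context: recent accelerate releases expose deepspeed_plugin as a read-only property on AcceleratorState, which is why the plain assignment above fails. A quick way to confirm this from the affected environment (a sketch, assuming accelerate 1.x is installed):

# Confirm the cause: if deepspeed_plugin is a property on the class,
# plain attribute assignment raises the AttributeError shown above.
python -c "import accelerate; print(accelerate.__version__)"
python -c "from accelerate.state import AcceleratorState; print(type(AcceleratorState.deepspeed_plugin))"
# <class 'property'> here means the assignment in state.py cannot succeed.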
In the file state.py, at line 151, the code currently sets:

self.deepspeed_plugin = None

This should be corrected to:

self.deepspeed_plugins = None

Make the change in the repo and build it from source.
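
A minimal sketch of that build-from-source workaround (the clone URL, the sed rewrite, and the editable install are assumptions about the setup, not commands from the comment; match the checkout to your installed release before patching):

# Sketch: apply the suggested one-line rename and reinstall from source.
git clone https://github.com/huggingface/optimum-neuron.git
cd optimum-neuron
sed -i 's/self\.deepspeed_plugin = None/self.deepspeed_plugins = None/' optimum/neuron/accelerate/state.py
pip install -e .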
@vedant123454's solution might work. As accelerate is a fast-moving library, and we extend it quite a bit in optimum-neuron to make everything work, we actually bump the version for every release. Right now, the officially supported version for accelerate is 0.29.2, but 1.1.1 is installed on your system.
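
Given that, the simpler fix is to realign accelerate with the supported pin (a sketch; run it inside the same virtualenv that appears in the traceback):

pip install "accelerate==0.29.2"
python -c "import accelerate; print(accelerate.__version__)"  # should now print 0.29.2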
System Info
Who can help?
@michaelbenayoun @JingyaHuang
Reproduction (minimal, reproducible, runnable)
set -ex

export NEURON_FUSE_SOFTMAX=1
export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
export MALLOC_ARENA_MAX=64
export NEURON_CC_FLAGS="--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cach>

PROCESSES_PER_NODE=2
NUM_EPOCHS=1
TP_DEGREE=2
PP_DEGREE=1
BS=1
GRADIENT_ACCUMULATION_STEPS=8
LOGGING_STEPS=1
MODEL_NAME="meta-llama/Meta-Llama-3-8B"
OUTPUT_DIR=output-$SLURM_JOB_ID

if [ "$NEURON_EXTRACT_GRAPHS_ONLY" = "1" ]; then
    MAX_STEPS=$((LOGGING_STEPS + 5))
else
    MAX_STEPS=-1
fi

XLA_USE_BF16=1 neuron_parallel_compile torchrun --nproc_per_node $PROCESSES_PER_NODE train.py \
    --model_id $MODEL_NAME \
    --num_train_epochs $NUM_EPOCHS \
    --do_train \
    --learning_rate 5e-5 \
    --warmup_ratio 0.03 \
    --max_steps $MAX_STEPS \
    --per_device_train_batch_size $BS \
    --per_device_eval_batch_size $BS \
    --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
    --gradient_checkpointing true \
    --bf16 \
    --zero_1 false \
    --tensor_parallel_size $TP_DEGREE \
    --pipeline_parallel_size $PP_DEGREE \
    --logging_steps $LOGGING_STEPS \
    --save_total_limit 1 \
    --output_dir $OUTPUT_DIR \
    --lr_scheduler_type "constant" \
    --overwrite_output_dir
Expected behavior
Compilation should pass.