yahavb opened this issue 3 weeks ago
Do you have the names of the missing packages, by any chance?
```bash
docker run -it --privileged -v /home/ec2-user:/home/ubuntu/ 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training-neuronx:1.13.1-transformers4.36.2-neuronx-py310-sdk2.18.0-ubuntu20.04 bash
apt-get update
...
pip install --upgrade pip
...
pip3 install peft trl
...
git clone https://github.com/huggingface/optimum-neuron.git
cd optimum-neuron
pip3 install .
...
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
neuronx-cc 2.13.66.0+6dfecc895 requires protobuf<3.20, but you have protobuf 3.20.3 which is incompatible.
...
```
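One possible way to clear this particular conflict, assuming nothing else in the image hard-requires a newer protobuf, is to pin protobuf back into the range that neuronx-cc declares:

```bash
# Pin protobuf below 3.20 to satisfy neuronx-cc 2.13.66.0
# (assumption: no other installed package needs protobuf>=3.20).
pip3 install "protobuf<3.20"
```

It may also help to install the checkout with its Neuron extras, e.g. `pip3 install ".[neuronx]"` (assuming the repo defines a `neuronx` extra), so pip resolves the Neuron-pinned dependencies in one pass.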
This is the launch script for the precompilation step:

```bash
#!/bin/bash
set -ex

export NEURON_FUSE_SOFTMAX=1
export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
export MALLOC_ARENA_MAX=64
export NEURON_CC_FLAGS="--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/"

PROCESSES_PER_NODE=8
NUM_EPOCHS=1
TP_DEGREE=2
PP_DEGREE=1
BS=1
GRADIENT_ACCUMULATION_STEPS=8
LOGGING_STEPS=1
MODEL_NAME="meta-llama/Meta-Llama-3-8B"
OUTPUT_DIR=output-$SLURM_JOB_ID

if [ "$NEURON_EXTRACT_GRAPHS_ONLY" = "1" ]; then
    MAX_STEPS=$((LOGGING_STEPS + 5))
else
    MAX_STEPS=-1
fi

XLA_USE_BF16=1 neuron_parallel_compile torchrun --nproc_per_node $PROCESSES_PER_NODE docs/source/training_tutorials/sft_lora_finetune_llm.py \
  --model_id $MODEL_NAME \
  --num_train_epochs $NUM_EPOCHS \
  --do_train \
  --learning_rate 5e-5 \
  --warmup_ratio 0.03 \
  --max_steps $MAX_STEPS \
  --per_device_train_batch_size $BS \
  --per_device_eval_batch_size $BS \
  --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
  --gradient_checkpointing true \
  --bf16 \
  --zero_1 false \
  --tensor_parallel_size $TP_DEGREE \
  --pipeline_parallel_size $PP_DEGREE \
  --logging_steps $LOGGING_STEPS \
  --save_total_limit 1 \
  --output_dir $OUTPUT_DIR \
  --lr_scheduler_type "constant" \
  --overwrite_output_dir
```
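Note that `OUTPUT_DIR=output-$SLURM_JOB_ID` expands to just `output-` when the script runs outside Slurm, as the trace below shows. A guarded fallback avoids that (assuming any unique suffix is acceptable):

```bash
# Fall back to a fixed suffix when SLURM_JOB_ID is unset or empty
# (assumption: the suffix only needs to distinguish runs).
OUTPUT_DIR=output-${SLURM_JOB_ID:-local}
```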
```
....
+ export NEURON_FUSE_SOFTMAX=1
+ NEURON_FUSE_SOFTMAX=1
+ export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
+ NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
+ export MALLOC_ARENA_MAX=64
+ MALLOC_ARENA_MAX=64
+ export 'NEURON_CC_FLAGS=--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/'
+ NEURON_CC_FLAGS='--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/'
+ PROCESSES_PER_NODE=8
+ NUM_EPOCHS=1
+ TP_DEGREE=2
+ PP_DEGREE=1
+ BS=1
+ GRADIENT_ACCUMULATION_STEPS=8
+ LOGGING_STEPS=1
+ MODEL_NAME=meta-llama/Meta-Llama-3-8B
+ OUTPUT_DIR=output-
+ '[' '' = 1 ']'
+ MAX_STEPS=-1
+ XLA_USE_BF16=1
+ neuron_parallel_compile torchrun --nproc_per_node 8 docs/source/training_tutorials/sft_lora_finetune_llm.py --model_id meta-llama/Meta-Llama-3-8B --num_train_epochs 1 --do_train --learning_rate 5e-5 --warmup_ratio 0.03 --max_steps -1 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 8 --gradient_checkpointing true --bf16 --zero_1 false --tensor_parallel_size 2 --pipeline_parallel_size 1 --logging_steps 1 --save_total_limit 1 --output_dir output- --lr_scheduler_type constant --overwrite_output_dir
Traceback (most recent call last):
  File "/usr/local/bin/neuron_parallel_compile", line 5, in <module>
    from optimum.neuron.utils.neuron_parallel_compile import main
  File "/usr/local/lib/python3.10/site-packages/optimum/neuron/utils/neuron_parallel_compile.py", line 8, in <module>
    from torch_neuronx.parallel_compile.neuron_parallel_compile import LOGGER as torch_neuronx_logger
ModuleNotFoundError: No module named 'torch_neuronx.parallel_compile'
```
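A quick diagnostic (a sketch, not a fix) is to check where `torch_neuronx` is imported from and whether the installed wheel ships the `parallel_compile` subpackage at all:

```bash
# Print the torch_neuronx install location and probe for the missing subpackage;
# find_spec returns None if torch_neuronx.parallel_compile does not exist.
python -c "import torch_neuronx, importlib.util as u; print(torch_neuronx.__file__); print(u.find_spec('torch_neuronx.parallel_compile'))"
pip3 show torch-neuronx
```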
I tried to grab the Neuron runtime and tools:
```bash
echo 'deb https://apt.repos.neuron.amazonaws.com jammy main' > /etc/apt/sources.list.d/neuron.list
wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add - && apt-get update
apt-get install -y aws-neuronx-collectives=2.* aws-neuronx-runtime-lib=2.* aws-neuronx-tools=2.*
echo "export PATH=/opt/aws/neuron/bin:\$PATH" >> /root/.bashrc
PATH="${PATH}:/opt/aws/neuron/bin"
```
After this, `python -c "import torch_neuronx"` runs without errors, but it does not help.
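Two things seem worth checking here (both assumptions on my side): the DLC tag ends in `ubuntu20.04`, i.e. focal, while the apt line above adds the `jammy` (22.04) suite; and these apt packages only cover the runtime and tools, not the `torch-neuronx` Python wheel that `neuron_parallel_compile` imports from. Listing the installed Python packages makes any mismatch visible:

```bash
# List the Neuron-related Python packages actually installed in the container.
pip3 list | grep -Ei "neuron|torch|optimum"
```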
I then removed `neuron_parallel_compile` from the command and got:

```
...
Traceback (most recent call last):
File "
```
So I tried reinstalling with `pip install torch-neuronx optimum[neuron] transformers` and still got the same `ModuleNotFoundError: No module named 'torch_xla.distributed.spmd'` error.
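For what it's worth, `torch_xla.distributed.spmd` only appeared in the 2.x line of torch-xla (to my knowledge), while this DLC is built on PyTorch 1.13, so a current `optimum-neuron` checkout may simply expect a newer torch/torch-xla pair than the image provides. A one-liner to confirm the installed version (a diagnostic sketch):

```bash
# Print the torch-xla version bundled with the DLC; releases before 2.x
# predate the torch_xla.distributed.spmd module.
python -c "import torch_xla; print(torch_xla.__version__)"
```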
System Info

Who can help?

@michaelbenayoun @JingyaHuang

Information

Tasks

An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction (minimal, reproducible, runnable)
The precompilation step in https://huggingface.co/docs/optimum-neuron/en/training_tutorials/sft_lora_finetune_llm fails on many missing packages. Is there a specific DLC we can use?
Expected behavior
Running the tutorial successfully. The "Fine-tune and Test Llama-3 8B on AWS Trainium" tutorial works without issue with the same settings.