huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

Missing packages when running the "Supervised Fine-Tuning of Llama 3 8B on one AWS Trainium instance" sample #720

Open yahavb opened 3 weeks ago

yahavb commented 3 weeks ago

System Info

PyTorch 1.13.1 with NeuronX Training and HuggingFace transformers
Neuron 2.18.0
Python - Version Options - 3.10 (py310)
DLC 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training-neuronx:1.13.1-transformers4.36.2-neuronx-py310-sdk2.18.0-ubuntu20.04

Who can help?

@michaelbenayoun @JingyaHuang

Information

Tasks

Reproduction (minimal, reproducible, runnable)

The precompilation step in https://huggingface.co/docs/optimum-neuron/en/training_tutorials/sft_lora_finetune_llm fails because of many missing packages. Is there a specific DLC we can use?
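For reference, a minimal way to see which of the tutorial's dependencies the DLC already ships before installing anything extra (a diagnostic sketch; the grep pattern is only illustrative):

pip list | grep -Ei "neuron|torch|transformers|optimum|peft|trl|datasets"
pip check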

Expected behavior

The tutorial should run successfully. The "Fine-tune and Test Llama-3 8B on AWS Trainium" tutorial works without issue with the same settings.

michaelbenayoun commented 3 weeks ago

Do you have the names of the missing packages, by any chance?

yahavb commented 3 weeks ago
docker run -it --privileged  -v /home/ec2-user:/home/ubuntu/ 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training-neuronx:1.13.1-transformers4.36.2-neuronx-py310-sdk2.18.0-ubuntu20.04 bash

apt-get update 
...
pip install --upgrade pip
....
pip3 install peft trl
...
git clone https://github.com/huggingface/optimum-neuron.git
cd optimum-neuron
pip3 install .
....
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
neuronx-cc 2.13.66.0+6dfecc895 requires protobuf<3.20, but you have protobuf 3.20.3 which is incompatible.
....
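One way to work around the protobuf conflict reported above, assuming nothing else in the image strictly needs protobuf>=3.20, is to pin it back to a version neuronx-cc accepts after the extra installs (a sketch, not part of the tutorial):

pip3 install "protobuf<3.20"
pip3 check   # verify the neuronx-cc constraint is satisfied again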
#!/bin/bash
set -ex

export NEURON_FUSE_SOFTMAX=1
export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
export MALLOC_ARENA_MAX=64
export NEURON_CC_FLAGS="--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/"

PROCESSES_PER_NODE=8

NUM_EPOCHS=1
TP_DEGREE=2
PP_DEGREE=1
BS=1
GRADIENT_ACCUMULATION_STEPS=8
LOGGING_STEPS=1
MODEL_NAME="meta-llama/Meta-Llama-3-8B"
OUTPUT_DIR=output-$SLURM_JOB_ID

if [ "$NEURON_EXTRACT_GRAPHS_ONLY" = "1" ]; then
    MAX_STEPS=$((LOGGING_STEPS + 5))
else
    MAX_STEPS=-1
fi

XLA_USE_BF16=1 neuron_parallel_compile torchrun --nproc_per_node $PROCESSES_PER_NODE docs/source/training_tutorials/sft_lora_finetune_llm.py \
  --model_id $MODEL_NAME \
  --num_train_epochs $NUM_EPOCHS \
  --do_train \
  --learning_rate 5e-5 \
  --warmup_ratio 0.03 \
  --max_steps $MAX_STEPS \
  --per_device_train_batch_size $BS \
  --per_device_eval_batch_size $BS \
  --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
  --gradient_checkpointing true \
  --bf16 \
  --zero_1 false \
  --tensor_parallel_size $TP_DEGREE \
  --pipeline_parallel_size $PP_DEGREE \
  --logging_steps $LOGGING_STEPS \
  --save_total_limit 1 \
  --output_dir $OUTPUT_DIR \
  --lr_scheduler_type "constant" \
  --overwrite_output_dir
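A side note, unrelated to the failure itself: inside a plain container there is no Slurm, so $SLURM_JOB_ID is empty and the trace below ends up with OUTPUT_DIR=output-. A defensive variant (hypothetical, not from the tutorial) would be:

OUTPUT_DIR=output-${SLURM_JOB_ID:-local}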
....
+ export NEURON_FUSE_SOFTMAX=1
+ NEURON_FUSE_SOFTMAX=1
+ export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
+ NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
+ export MALLOC_ARENA_MAX=64
+ MALLOC_ARENA_MAX=64
+ export 'NEURON_CC_FLAGS=--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/'
+ NEURON_CC_FLAGS='--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/'
+ PROCESSES_PER_NODE=8
+ NUM_EPOCHS=1
+ TP_DEGREE=2
+ PP_DEGREE=1
+ BS=1
+ GRADIENT_ACCUMULATION_STEPS=8
+ LOGGING_STEPS=1
+ MODEL_NAME=meta-llama/Meta-Llama-3-8B
+ OUTPUT_DIR=output-
+ '[' '' = 1 ']'
+ MAX_STEPS=-1
+ XLA_USE_BF16=1
+ neuron_parallel_compile torchrun --nproc_per_node 8 docs/source/training_tutorials/sft_lora_finetune_llm.py --model_id meta-llama/Meta-Llama-3-8B --num_train_epochs 1 --do_train --learning_rate 5e-5 --warmup_ratio 0.03 --max_steps -1 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 8 --gradient_checkpointing true --bf16 --zero_1 false --tensor_parallel_size 2 --pipeline_parallel_size 1 --logging_steps 1 --save_total_limit 1 --output_dir output- --lr_scheduler_type constant --overwrite_output_dir
Traceback (most recent call last):
  File "/usr/local/bin/neuron_parallel_compile", line 5, in <module>
    from optimum.neuron.utils.neuron_parallel_compile import main
  File "/usr/local/lib/python3.10/site-packages/optimum/neuron/utils/neuron_parallel_compile.py", line 8, in <module>
    from torch_neuronx.parallel_compile.neuron_parallel_compile import LOGGER as torch_neuronx_logger
ModuleNotFoundError: No module named 'torch_neuronx.parallel_compile'
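A quick way to check whether the torch-neuronx build inside the DLC actually ships the parallel_compile subpackage that optimum-neuron imports here (a diagnostic sketch; the package path is derived at runtime):

pip3 show torch-neuronx torch-xla
# list the installed package directory; a missing parallel_compile/ folder means the wheel predates that module
ls "$(python -c 'import os, torch_neuronx; print(os.path.dirname(torch_neuronx.__file__))')"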

I tried to install the Neuron runtime packages and tools:

echo 'deb https://apt.repos.neuron.amazonaws.com jammy main' > /etc/apt/sources.list.d/neuron.list
wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add - && apt-get update
apt-get install -y aws-neuronx-collectives=2.* aws-neuronx-runtime-lib=2.* aws-neuronx-tools=2.*
echo "export PATH=/opt/aws/neuron/bin:\$PATH" >> /root/.bashrc
PATH="${PATH}:/opt/aws/neuron/bin"

After that, python -c "import torch_neuronx" runs without errors, but it did not help.
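The apt packages above install the system-level runtime and tools rather than Python wheels, so they cannot add the missing Python module. The exact import that optimum-neuron performs (see the traceback above) can be checked directly:

python -c "from torch_neuronx.parallel_compile.neuron_parallel_compile import LOGGER" \
  && echo "parallel_compile is available" \
  || echo "the installed torch-neuronx wheel does not ship parallel_compile"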

I then removed neuron_parallel_compile and got:

...
Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 11, in <module>
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
    from optimum.neuron import NeuronHfArgumentParser as HfArgumentParser
  File "/usr/local/lib/python3.10/site-packages/optimum/neuron/__init__.py", line 18, in <module>
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
    from .trainers import Seq2SeqTrainiumTrainer, TrainiumTrainer
  File "/usr/local/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 20, in <module>
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
    from transformers import Seq2SeqTrainer, Trainer
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 26, in <module>
  File "/usr/local/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1462, in __getattr__
    from .trainer import Trainer
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 180, in <module>
    import torch_xla.distributed.spmd as xs
ModuleNotFoundError: No module named 'torch_xla.distributed.spmd'
...

So I tried reinstalling:

pip install torch-neuronx optimum[neuron] transformers

and still got the same ModuleNotFoundError: No module named 'torch_xla.distributed.spmd' error
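The trainer.py shown in the traceback imports torch_xla.distributed.spmd, which is not present in the torch-xla that ships with the torch 1.13 based SDK, so the installed transformers is likely newer than the 4.36.2 the DLC tag advertises (probably pulled in by pip3 install . or the later pip install). A hedged sketch to check the versions and pin transformers back to the tagged one; note that optimum-neuron from the main branch may itself require a newer transformers, so the right pin may differ:

pip3 show transformers torch-xla | grep -E "^(Name|Version)"
# 4.36.2 is taken from the DLC image tag; adjust if optimum-neuron requires otherwise
pip3 install "transformers==4.36.2"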