huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

Llama3-8B finetuning shows runtime error of TDRV:v2_cc_execute #658

Open jianyinglangaws opened 1 month ago

jianyinglangaws commented 1 month ago

### System Info

The same script works with `Neuron SDK 2.18.0` and `optimum-neuron v0.0.22`, but it fails with the latest software stack:

(aws_neuron_venv_pytorch) [ec2-user@ip-172-31-29-22 text-generation]$ yum list | grep neuron
aws-neuronx-collectives.x86_64                                    2.21.46.0_69b77134b-1                       @neuron         
aws-neuronx-dkms.noarch                                           2.17.17.0-dkms                              @neuron         
aws-neuronx-runtime-lib.x86_64                                    2.21.41.0_fb1705f5f-1                       @neuron         
aws-neuronx-tools.x86_64                                          2.18.3.0-1                                  @neuron         
aws-neuron-dkms.noarch                                            2.3.26.0-dkms                               neuron          
aws-neuron-dkms.src                                               2.3.26.0-dkms                               neuron          
aws-neuron-k8-plugin.x86_64                                       1.9.3.0-1                                   neuron          
aws-neuron-k8-scheduler.x86_64                                    1.9.3.0-1                                   neuron          
aws-neuron-runtime.x86_64                                         1.6.24.0-1                                  neuron          
aws-neuron-runtime-base.x86_64                                    1.6.21.0-1                                  neuron          
aws-neuron-tools.x86_64                                           2.1.4.0-1                                   neuron          
aws-neuronx-dkms.src                                              2.17.17.0-dkms                              neuron          
aws-neuronx-gpsimd-customop.x86_64                                0.2.3.0-1                                   neuron          
aws-neuronx-gpsimd-customop-lib.x86_64                            0.11.4.0-1                                  neuron          
aws-neuronx-gpsimd-tools.x86_64                                   0.11.3.0_36dcb86d4-1                        neuron          
aws-neuronx-k8-plugin.x86_64                                      2.21.14.0-1                                 neuron          
aws-neuronx-k8-scheduler.x86_64                                   2.21.14.0-1                                 neuron          
aws-neuronx-oci-hook.x86_64                                       2.4.4.0-1                                   neuron          
tensorflow-model-server-neuron.x86_64                             2.8.0.2.3.0.0-0                             neuron          
tensorflow-model-server-neuronx.x86_64                            2.10.1.2.11.4.0-0                           neuron       
(aws_neuron_venv_pytorch) [ec2-user@ip-172-31-29-22 text-generation]$ pip list | grep neuron
aws-neuronx-runtime-discovery 2.9
libneuronxla                  2.0.2335
neuronx-cc                    2.13.66.0+6dfecc895
neuronx-distributed           0.7.0
optimum-neuron                0.0.23
torch-neuronx                 2.1.2.2.1.0
transformers-neuronx          0.10.0.21

This setup gives the following error:

```
745142719040221994+6bd63055/model.neff. Exiting with a successfully compiled graph.
2024-Jul-17 22:00:32 ERROR  TDRV:v2_cc_execute      [nec_dev 1, gid 1] MPMD execution is not supported. This is likely caused for some but not all ranks recompiling/reloading a graph, model: /tmp/ec2-user/neuroncc_compile_workdir/952ba576-b08c-4ac0-b0f1-12e560e5e362/model.MODULE_11203961494150985019+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  TDRV:v2_cc_execute      [nec_dev 3, gid 3] MPMD execution is not supported. This is likely caused for some but not all ranks recompiling/reloading a graph, model: /tmp/ec2-user/neuroncc_compile_workdir/43f12c55-64e6-45ad-a002-a0676ed72df9/model.MODULE_5948348338269179475+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  TDRV:v2_cc_execute      [nec_dev 4, gid 4] MPMD execution is not supported. This is likely caused for some but not all ranks recompiling/reloading a graph, model: /tmp/ec2-user/neuroncc_compile_workdir/ebe15e9e-3f50-4061-8ef1-9c80d9a78071/model.MODULE_12660838522657173708+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  TDRV:v2_cc_execute      [nec_dev 5, gid 5] MPMD execution is not supported. This is likely caused for some but not all ranks recompiling/reloading a graph, model: /tmp/ec2-user/neuroncc_compile_workdir/a12cbe53-3042-4c9a-86d4-4bbcac795471/model.MODULE_15938951179947649509+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  TDRV:v2_cc_execute      [nec_dev 6, gid 6] MPMD execution is not supported. This is likely caused for some but not all ranks recompiling/reloading a graph, model: /tmp/ec2-user/neuroncc_compile_workdir/eac078e9-79e3-49af-8695-c65797ca89c0/model.MODULE_3281029225498615900+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  TDRV:v2_cc_execute      [nec_dev 7, gid 7] MPMD execution is not supported. This is likely caused for some but not all ranks recompiling/reloading a graph, model: /tmp/ec2-user/neuroncc_compile_workdir/59a5b5cd-fff2-4315-a603-8a152f5186ca/model.MODULE_12429740934125521760+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  ENC:enc_dump_neff_info  [nec_dev 1, gid 1] NEFF: /tmp/ec2-user/neuroncc_compile_workdir/952ba576-b08c-4ac0-b0f1-12e560e5e362/model.MODULE_11203961494150985019+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  ENC:enc_dump_neff_info  [nec_dev 3, gid 3] NEFF: /tmp/ec2-user/neuroncc_compile_workdir/43f12c55-64e6-45ad-a002-a0676ed72df9/model.MODULE_5948348338269179475+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  ENC:enc_dump_neff_info  [nec_dev 4, gid 4] NEFF: /tmp/ec2-user/neuroncc_compile_workdir/ebe15e9e-3f50-4061-8ef1-9c80d9a78071/model.MODULE_12660838522657173708+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  ENC:enc_dump_neff_info  [nec_dev 5, gid 5] NEFF: /tmp/ec2-user/neuroncc_compile_workdir/a12cbe53-3042-4c9a-86d4-4bbcac795471/model.MODULE_15938951179947649509+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  ENC:enc_dump_neff_info  [nec_dev 6, gid 6] NEFF: /tmp/ec2-user/neuroncc_compile_workdir/eac078e9-79e3-49af-8695-c65797ca89c0/model.MODULE_3281029225498615900+6bd63055.neff
```

### Who can help?

_No response_

### Information

- [X] The official example scripts
- [ ] My own modified scripts

### Tasks

- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction (minimal, reproducible, runnable)

The script I used is as below:

Launch the instance with Amazon Linux 2023 and install the dependencies using the following steps.

Configure Linux for Neuron repository updates

```bash
sudo tee /etc/yum.repos.d/neuron.repo > /dev/null <<EOF
[neuron]
name=Neuron YUM Repository
baseurl=https://yum.repos.neuron.amazonaws.com
enabled=1
metadata_expire=0
EOF
sudo rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB
```

Update OS packages

sudo yum update -y

Install OS headers

sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) -y

Install git

sudo yum install git -y

install Neuron Driver

sudo yum install aws-neuronx-dkms-2.* -y

Install Neuron Runtime

sudo yum install aws-neuronx-collectives-2.* -y
sudo yum install aws-neuronx-runtime-lib-2.* -y

Install Neuron Tools

sudo yum install aws-neuronx-tools-2.* -y

Create python3 venv

sudo yum install -y libxcrypt-compat
sudo yum install -y gcc-c++
python3 -m venv /home/ec2-user/aws_neuron_venv_pytorch

Activate venv

source ~/aws_neuron_venv_pytorch/bin/activate

python -m pip install -U pip

Install Jupyter notebook kernel

pip install ipykernel
python3 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)"
pip install jupyter notebook
pip install environment_kernels

Set pip repository pointing to the Neuron repository

python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com

Install wget, awscli

python -m pip install wget
python -m pip install awscli

Install Neuron Compiler and Framework

python -m pip install neuronx-cc==2.* torch-neuronx torchvision

Install optimum-neuron

pip3 install --upgrade-strategy eager optimum[neuronx]

Download scripts

git clone https://github.com/huggingface/optimum-neuron.git

cd optimum-neuron/notebooks/text-generation/

Log in with your Hugging Face token to download gated models

huggingface-cli login --token YOUR_TOKEN

Create a Python file download_data.py to download and process the dataset under the directory optimum-neuron/notebooks/text-generation/:

```python
from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])

def format_dolly(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # join all the parts together
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt

print(format_dolly(dataset[randrange(len(dataset))]))

from transformers import AutoTokenizer

# Hugging Face model id
model_id = "meta-llama/Meta-Llama-3-8B"  # gated
# model_id = "meta-llama/Llama-2-7b-hf"  # gated

tokenizer = AutoTokenizer.from_pretrained(model_id)

from random import randint

# add utils method to path for loading dataset
import sys
sys.path.append("./scripts/utils")  # make sure you change this to the correct path
from pack_dataset import pack_dataset

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample

# apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))

# print random sample
print(dataset[randint(0, len(dataset))]["text"])

# tokenize dataset
dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
)

# chunk dataset
lm_dataset = pack_dataset(dataset, chunk_length=2048)  # We use 2048 as the maximum length for packing

# save train_dataset to disk
dataset_path = "tokenized_dolly"
lm_dataset.save_to_disk(dataset_path)
```

Run the above script:

python download_data.py
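
Before compiling, a quick sanity check of the saved dataset can rule out data issues. This is a minimal sketch, not part of the original report; it only assumes the `tokenized_dolly` directory written by the script above keeps an `input_ids` column, and uses the standard `datasets` API:

```python
from datasets import load_from_disk

# Reload the packed dataset written by download_data.py
lm_dataset = load_from_disk("tokenized_dolly")
print(lm_dataset)

# After packing, every sample is expected to be chunk_length (2048) tokens long
lengths = [len(ids) for ids in lm_dataset["input_ids"]]
print(f"samples: {len(lengths)}, min/max length: {min(lengths)}/{max(lengths)}")
```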

Compile the finetuning script on inf2.8xlarge with the compile_llama3.sh script

```bash
MALLOC_ARENA_MAX=64 neuron_parallel_compile torchrun --nproc_per_node=8 scripts/run_clm.py \
  --model_id "meta-llama/Meta-Llama-3-8B" \
  --dataset_path "tokenized_dolly" \
  --bf16 True \
  --learning_rate 5e-5 \
  --output_dir dolly_llama \
  --overwrite_output_dir True \
  --per_device_train_batch_size 1 \
  --gradient_checkpointing True \
  --tensor_parallel_size 8 \
  --max_steps 10 \
  --logging_steps 10 \
  --gradient_accumulation_steps 16
```

Run the finetuning on inf2.8xlarge with the run_llama3.sh script

```bash
MALLOC_ARENA_MAX=64 torchrun --nproc_per_node=8 scripts/run_clm.py \
  --model_id "meta-llama/Meta-Llama-3-8B" \
  --dataset_path "tokenized_dolly" \
  --bf16 True \
  --learning_rate 5e-5 \
  --output_dir dolly_llama \
  --overwrite_output_dir True \
  --skip_cache_push True \
  --per_device_train_batch_size 1 \
  --gradient_checkpointing True \
  --tensor_parallel_size 8 \
  --num_train_epochs 3 \
  --logging_steps 10 \
  --gradient_accumulation_steps 16
```



### Expected behavior

The run command should complete the fine-tuning and report performance numbers.
michaelbenayoun commented 1 month ago

It should be fixed on main. You might also encounter an MPMD issue after the first epoch, depending on your logging strategy; that is fixed in #654.
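
For anyone hitting the same error with the 0.0.23 release: one way to pick up the fix before the next release is to install optimum-neuron from source, e.g. `pip install "git+https://github.com/huggingface/optimum-neuron.git"` (a standard pip VCS install of the repository cloned above; exact extras and pins may differ from the official instructions).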

jianyinglangaws commented 1 month ago

The script now runs with Neuron SDK 2.19.1 and optimum-neuron main. However, the loss value is nan.

2024-07-22 20:10:28.000737:  280430  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.13.66.0+6dfecc895/MODULE_17784021259853473086+abb26765/model.neff. Exiting with a successfully compiled graph.
{'loss': nan, 'learning_rate': 4.796747967479675e-05, 'epoch': 0.12}                                                                                 
{'loss': nan, 'learning_rate': 4.59349593495935e-05, 'epoch': 0.24}                                                                                  
{'loss': nan, 'learning_rate': 4.390243902439025e-05, 'epoch': 0.36}                                                                                 
{'loss': nan, 'learning_rate': 4.186991869918699e-05, 'epoch': 0.48}                                                                                 
{'loss': nan, 'learning_rate': 3.983739837398374e-05, 'epoch': 0.6}                                                                                  
{'loss': nan, 'learning_rate': 3.780487804878049e-05, 'epoch': 0.72}                                                                                 
{'loss': nan, 'learning_rate': 3.577235772357724e-05, 'epoch': 0.84}                                                                                 
{'loss': nan, 'learning_rate': 3.373983739837399e-05, 'epoch': 0.96}                                                                                 
{'loss': nan, 'learning_rate': 3.170731707317073e-05, 'epoch': 1.09}                                                                                 
{'loss': nan, 'learning_rate': 2.9674796747967482e-05, 'epoch': 1.21}                                                                                
{'loss': nan, 'learning_rate': 2.764227642276423e-05, 'epoch': 1.33}                                                                                 
{'loss': nan, 'learning_rate': 2.5609756097560977e-05, 'epoch': 1.45}                                                                                
{'loss': nan, 'learning_rate': 2.3577235772357724e-05, 'epoch': 1.57}                                                                                
{'loss': nan, 'learning_rate': 2.1544715447154475e-05, 'epoch': 1.69}                                                                                
{'loss': nan, 'learning_rate': 1.9512195121951222e-05, 'epoch': 1.81}                                                                                
{'loss': nan, 'learning_rate': 1.747967479674797e-05, 'epoch': 1.93}                                                                                 
{'loss': nan, 'learning_rate': 1.5447154471544717e-05, 'epoch': 2.05}                                                                                
{'loss': nan, 'learning_rate': 1.3414634146341466e-05, 'epoch': 2.17}                                                                                
{'loss': nan, 'learning_rate': 1.1382113821138211e-05, 'epoch': 2.29}                                                                                
{'loss': nan, 'learning_rate': 9.34959349593496e-06, 'epoch': 2.41}                                                                                  
{'loss': nan, 'learning_rate': 7.317073170731707e-06, 'epoch': 2.53}                                                                                 
{'loss': nan, 'learning_rate': 5.2845528455284555e-06, 'epoch': 2.65}                                                                                
{'loss': nan, 'learning_rate': 3.2520325203252037e-06, 'epoch': 2.77}                                                                                
{'loss': nan, 'learning_rate': 1.2195121951219514e-06, 'epoch': 2.89}
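
While debugging the nan loss, it can help to stop on the first nan instead of running all epochs. This is a minimal sketch, not part of the original thread; it assumes scripts/run_clm.py builds a transformers-style Trainer that accepts extra callbacks (trainer.add_callback is standard transformers API, and whether the Neuron trainer wires it through is an assumption):

```python
import math

from transformers import TrainerCallback


class StopOnNanLoss(TrainerCallback):
    """Stop training as soon as a logged loss is nan, so the first failing step is easy to pin down."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is not None and math.isnan(loss):
            print(f"nan loss at global step {state.global_step}; stopping")
            control.should_training_stop = True
        return control


# Hypothetical usage inside run_clm.py, right after the trainer is created:
# trainer.add_callback(StopOnNanLoss())
```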