Open jianyinglangaws opened 4 months ago
It should be fixed on main.
You might also encounter an MPMD issue after the first epoch, depending on your logging strategy; this is fixed in #654.
The script runs with Neuron SDK 2.19.1 and optimum-neuron main. However, the loss value is nan at every logged step.
2024-07-22 20:10:28.000737: 280430 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.13.66.0+6dfecc895/MODULE_17784021259853473086+abb26765/model.neff. Exiting with a successfully compiled graph.
{'loss': nan, 'learning_rate': 4.796747967479675e-05, 'epoch': 0.12}
{'loss': nan, 'learning_rate': 4.59349593495935e-05, 'epoch': 0.24}
{'loss': nan, 'learning_rate': 4.390243902439025e-05, 'epoch': 0.36}
{'loss': nan, 'learning_rate': 4.186991869918699e-05, 'epoch': 0.48}
{'loss': nan, 'learning_rate': 3.983739837398374e-05, 'epoch': 0.6}
{'loss': nan, 'learning_rate': 3.780487804878049e-05, 'epoch': 0.72}
{'loss': nan, 'learning_rate': 3.577235772357724e-05, 'epoch': 0.84}
{'loss': nan, 'learning_rate': 3.373983739837399e-05, 'epoch': 0.96}
{'loss': nan, 'learning_rate': 3.170731707317073e-05, 'epoch': 1.09}
{'loss': nan, 'learning_rate': 2.9674796747967482e-05, 'epoch': 1.21}
{'loss': nan, 'learning_rate': 2.764227642276423e-05, 'epoch': 1.33}
{'loss': nan, 'learning_rate': 2.5609756097560977e-05, 'epoch': 1.45}
{'loss': nan, 'learning_rate': 2.3577235772357724e-05, 'epoch': 1.57}
{'loss': nan, 'learning_rate': 2.1544715447154475e-05, 'epoch': 1.69}
{'loss': nan, 'learning_rate': 1.9512195121951222e-05, 'epoch': 1.81}
{'loss': nan, 'learning_rate': 1.747967479674797e-05, 'epoch': 1.93}
{'loss': nan, 'learning_rate': 1.5447154471544717e-05, 'epoch': 2.05}
{'loss': nan, 'learning_rate': 1.3414634146341466e-05, 'epoch': 2.17}
{'loss': nan, 'learning_rate': 1.1382113821138211e-05, 'epoch': 2.29}
{'loss': nan, 'learning_rate': 9.34959349593496e-06, 'epoch': 2.41}
{'loss': nan, 'learning_rate': 7.317073170731707e-06, 'epoch': 2.53}
{'loss': nan, 'learning_rate': 5.2845528455284555e-06, 'epoch': 2.65}
{'loss': nan, 'learning_rate': 3.2520325203252037e-06, 'epoch': 2.77}
{'loss': nan, 'learning_rate': 1.2195121951219514e-06, 'epoch': 2.89}
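To confirm from a captured log that every logged step is affected, a small parser can help. This is a minimal sketch, assuming the training output was saved as text with one dictionary per line as above; `parse_log_line`, `summarize`, and `infer_total_steps` are hypothetical helper names, not part of optimum-neuron. The last helper back-solves the total step count from a logged learning rate, assuming a linear decay to zero without warmup, starting from the `--learning_rate 5e-5` value used in the run.

```python
import math
import re

# Matches the per-step dictionaries printed during training, e.g.
# {'loss': nan, 'learning_rate': 4.79e-05, 'epoch': 0.12}
LOG_RE = re.compile(
    r"\{'loss': (?P<loss>[^,]+), "
    r"'learning_rate': (?P<lr>[^,]+), "
    r"'epoch': (?P<epoch>[^}]+)\}"
)


def parse_log_line(line):
    """Return (loss, learning_rate, epoch) as floats, or None for other lines."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    return float(match["loss"]), float(match["lr"]), float(match["epoch"])


def summarize(lines):
    """Return (number of log entries, number of entries with a nan loss)."""
    entries = [e for e in map(parse_log_line, lines) if e is not None]
    nan_count = sum(1 for loss, _, _ in entries if math.isnan(loss))
    return len(entries), nan_count


def infer_total_steps(lr, step, base_lr=5e-5):
    """Assuming lr = base_lr * (1 - step / total), solve for total."""
    return round(step / (1 - lr / base_lr))


sample = [
    "{'loss': nan, 'learning_rate': 4.796747967479675e-05, 'epoch': 0.12}",
    "{'loss': nan, 'learning_rate': 4.59349593495935e-05, 'epoch': 0.24}",
]
total, nans = summarize(sample)
print(f"{nans}/{total} logged steps have a nan loss")
print("inferred total steps:", infer_total_steps(4.796747967479675e-05, 10))
```

Under the linear-decay assumption, the first entry corresponds to 246 optimizer steps in total, about 82 per epoch, which agrees with the logged epoch of 0.12 at step 10. The learning-rate schedule therefore appears to advance normally, and only the loss is affected.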
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
System Info
gives the following error.
Launch the instance with Amazon Linux 2023. Install the dependencies using the following script:
Configure Linux for Neuron repository updates
sudo tee /etc/yum.repos.d/neuron.repo > /dev/null <<EOF
[neuron]
name=Neuron YUM Repository
baseurl=https://yum.repos.neuron.amazonaws.com
enabled=1
metadata_expire=0
EOF
sudo rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB
Update OS packages
sudo yum update -y
Install OS headers
sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) -y
Install git
sudo yum install git -y
Install Neuron Driver
sudo yum install aws-neuronx-dkms-2.* -y
Install Neuron Runtime
sudo yum install aws-neuronx-collectives-2.* -y
sudo yum install aws-neuronx-runtime-lib-2.* -y
Install Neuron Tools
sudo yum install aws-neuronx-tools-2.* -y
Create python3 venv
sudo yum install -y libxcrypt-compat
sudo yum install -y gcc-c++
python3 -m venv /home/ec2-user/aws_neuron_venv_pytorch
Activate venv
source ~/aws_neuron_venv_pytorch/bin/activate
python -m pip install -U pip
Install Jupyter notebook kernel
pip install ipykernel
python3 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)"
pip install jupyter notebook
pip install environment_kernels
Set pip repository pointing to the Neuron repository
python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
Install wget, awscli
python -m pip install wget
python -m pip install awscli
Install Neuron Compiler and Framework
python -m pip install neuronx-cc==2.* torch-neuronx torchvision
Install optimum-neuron
pip3 install --upgrade-strategy eager optimum[neuronx]
Download scripts
git clone https://github.com/huggingface/optimum-neuron.git
cd optimum-neuron/notebooks/text-generation/
Login with your huggingface token ID to download gated models
huggingface-cli login --token YOUR_TOKEN
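Since the result depends on which versions the unpinned install commands above resolve to, it can help to record them alongside the report. The following is a minimal sketch using the standard-library `importlib.metadata`; `installed_versions` is a hypothetical helper, and the distribution names in `PACKAGES` are an assumption based on the pip commands above (in particular, that `optimum[neuronx]` installs a distribution named `optimum-neuron`).

```python
from importlib import metadata

# Distribution names assumed from the pip install commands above.
PACKAGES = ["neuronx-cc", "torch-neuronx", "torchvision", "optimum", "optimum-neuron"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

Run it inside the activated virtual environment so that it inspects the same packages the training script uses.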
Create a Python 3 file download_data.py under the directory optimum-neuron/notebooks/text-generation/ to download and process the dataset:
from datasets import load_dataset
from random import randrange
Load dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])
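def format_dolly(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # join all the parts together, skipping the context when it is empty
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt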
from random import randrange
print(format_dolly(dataset[randrange(len(dataset))]))
from transformers import AutoTokenizer
Hugging Face model id
model_id = "meta-llama/Meta-Llama-3-8B" # gated
model_id = "meta-llama/Llama-2-7b-hf" # gated
tokenizer = AutoTokenizer.from_pretrained(model_id)
from random import randint
add utils method to path for loading dataset
import sys
sys.path.append("./scripts/utils") # make sure you change this to the correct path
from pack_dataset import pack_dataset
template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample
apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))
print random sample
print(dataset[randint(0, len(dataset) - 1)]["text"])
tokenize dataset
dataset = dataset.map( lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features) )
chunk dataset
lm_dataset = pack_dataset(dataset, chunk_length=2048) # We use 2048 as the maximum length for packing
save train_dataset to disk
dataset_path = "tokenized_dolly"
lm_dataset.save_to_disk(dataset_path)
Run the above script:
python download_data.py
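The chunking step in the script above (pack_dataset with chunk_length=2048) concatenates the tokenized samples and cuts them into fixed-length sequences. The following is a minimal sketch of that idea on plain lists, not the repository's `pack_dataset` implementation, which may treat remainders and labels differently; `pack_token_ids` is a hypothetical name, and dropping the incomplete final chunk is an assumption made here for simplicity.

```python
from itertools import chain


def pack_token_ids(sequences, chunk_length):
    """Concatenate token-id lists and cut the stream into full chunks.

    Assumption: leftover tokens that do not fill a final chunk are dropped.
    """
    stream = list(chain.from_iterable(sequences))
    usable = (len(stream) // chunk_length) * chunk_length
    return [stream[i:i + chunk_length] for i in range(0, usable, chunk_length)]


# Three short "tokenized samples" packed into chunks of 4 tokens.
sequences = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
chunks = pack_token_ids(sequences, chunk_length=4)
print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Because samples are joined before chunking, a single chunk can contain the end of one prompt and the start of the next, which is why the script appends the tokenizer's EOS token to each sample.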
Compile the finetuning script on inf2.8xlarge with the compile_llama3.sh script
MALLOC_ARENA_MAX=64 neuron_parallel_compile torchrun --nproc_per_node=8 scripts/run_clm.py \
  --model_id "meta-llama/Meta-Llama-3-8B" \
  --dataset_path "tokenized_dolly" \
  --bf16 True \
  --learning_rate 5e-5 \
  --output_dir dolly_llama \
  --overwrite_output_dir True \
  --per_device_train_batch_size 1 \
  --gradient_checkpointing True \
  --tensor_parallel_size 8 \
  --max_steps 10 \
  --logging_steps 10 \
  --gradient_accumulation_steps 16
Run the finetuning on inf2.8xlarge with the run_llama3.sh script
MALLOC_ARENA_MAX=64 torchrun --nproc_per_node=8 scripts/run_clm.py \
  --model_id "meta-llama/Meta-Llama-3-8B" \
  --dataset_path "tokenized_dolly" \
  --bf16 True \
  --learning_rate 5e-5 \
  --output_dir dolly_llama \
  --overwrite_output_dir True \
  --skip_cache_push True \
  --per_device_train_batch_size 1 \
  --gradient_checkpointing True \
  --tensor_parallel_size 8 \
  --num_train_epochs 3 \
  --logging_steps 10 \
  --gradient_accumulation_steps 16
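For reference, the flags in the command above determine how many sequences contribute to each optimizer step. The following is a rough sketch of that arithmetic, assuming the usual convention that the workers of one tensor-parallel group process the same batch, so the data-parallel size is the number of processes divided by the tensor-parallel size; `effective_batch_size` is a hypothetical helper, not an optimum-neuron function.

```python
def effective_batch_size(world_size, tensor_parallel_size,
                         per_device_batch_size, grad_accum_steps):
    """Sequences per optimizer step, assuming TP workers share one batch."""
    if world_size % tensor_parallel_size != 0:
        raise ValueError("world_size must be divisible by tensor_parallel_size")
    data_parallel_size = world_size // tensor_parallel_size
    return data_parallel_size * per_device_batch_size * grad_accum_steps


# Values from the command above: --nproc_per_node=8, --tensor_parallel_size 8,
# --per_device_train_batch_size 1, --gradient_accumulation_steps 16.
sequences_per_step = effective_batch_size(8, 8, 1, 16)
tokens_per_step = sequences_per_step * 2048  # chunk_length used when packing
print(sequences_per_step, tokens_per_step)
```

With these settings the data-parallel size is 1, so each optimizer step covers 16 packed sequences, or 32768 tokens.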