huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

The model is re-compiled at the end of a training job even when a cache is saved to the current repo #129

Closed · samir-souza closed this issue 1 year ago

samir-souza commented 1 year ago

The following training script works fine: it completes the training job and pushes the Neuron cache to the defined repo. However, every time I run the script again, it loads the cache to train the model but always runs an additional compilation step at the end, before finishing the process. In my understanding it should use the cache and skip that step.

SCRIPT

```bash
CUSTOM_CACHE_REPO=
optimum-cli neuron cache set $CUSTOM_CACHE_REPO

MODEL_ID="gpt2"
MODEL_ID="bert-base-uncased"

torchrun --nproc_per_node=2 \
    examples/language-modeling/run_clm.py \
    --model_name_or_path $MODEL_ID \
    --learning_rate 5e-5 \
    --do_train \
    --num_train_epochs 1 \
    --output_dir output \
    --bf16 \
    --block_size 128 \
    --overwrite_output_dir \
    --per_device_train_batch_size 4 \
    --train_file ../datasets/eli5/train/data.csv
```
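For reference, the AWS Neuron SDK's `torch-neuronx` package ships a `neuron_parallel_compile` wrapper that extracts and compiles all training graphs up front, populating the compile cache before the real run. A minimal sketch, assuming the wrapper is on the PATH; it is not part of the report above:

```bash
# Pre-populate the Neuron compile cache: the training graphs are traced and
# compiled in parallel, but the training loop itself is not really executed.
neuron_parallel_compile torchrun --nproc_per_node=2 \
    examples/language-modeling/run_clm.py \
    --model_name_or_path $MODEL_ID \
    --learning_rate 5e-5 \
    --do_train \
    --num_train_epochs 1 \
    --output_dir output \
    --bf16 \
    --block_size 128 \
    --overwrite_output_dir \
    --per_device_train_batch_size 4 \
    --train_file ../datasets/eli5/train/data.csv
```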

OUTPUT

```
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
WARNING:__main__:Process rank: 1, device: xla:0, n_gpu: 0, distributed training: True, 16-bits training: False
WARNING:__main__:Process rank: 0, device: xla:1, n_gpu: 0, distributed training: True, 16-bits training: False
WARNING:datasets.builder:Found cached dataset csv (/home/ubuntu/.cache/huggingface/datasets/csv/default-ea5b785874d94a95/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d)
[the warning above repeats once per worker process]
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/csv/default-ea5b785874d94a95/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d/cache-552f5bdb20f2f014.arrow
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/csv/default-ea5b785874d94a95/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d/cache-fe6c229a644960e1.arrow
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/csv/default-ea5b785874d94a95/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d/cache-680aaafeb37df5d2.arrow
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/csv/default-ea5b785874d94a95/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d/cache-7f878c5dd541b1f8.arrow
/home/ubuntu/Optimum/optimvenv/lib/python3.8/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
[INFO|trainer.py:1769] 2023-07-07 13:56:01,140 >> ***** Running training *****
[INFO|trainer.py:1770] 2023-07-07 13:56:01,140 >>   Num examples = 229
[INFO|trainer.py:1771] 2023-07-07 13:56:01,140 >>   Num Epochs = 10
[INFO|trainer.py:1772] 2023-07-07 13:56:01,140 >>   Instantaneous batch size per device = 4
[INFO|trainer.py:1773] 2023-07-07 13:56:01,140 >>   Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:1774] 2023-07-07 13:56:01,140 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1775] 2023-07-07 13:56:01,140 >>   Total optimization steps = 290
[INFO|trainer.py:1776] 2023-07-07 13:56:01,141 >>   Number of trainable parameters = 124,439,808
  0%|          | 0/290 [00:00<?, ?it/s]
2023-07-07 13:56:01.000281: INFO ||NCC_WRAPPER||: Compile cache path: /tmp/tmpd29vse1h/neuron-compile-cache
2023-07-07 13:56:01.000284: INFO ||NCC_WRAPPER||: Using a cached neff at /tmp/tmpd29vse1h/neuron-compile-cache/neuronxcc-2.7.0.40+f7c6cf2a3/MODULE_7411938889920977621+d41d8cd9/model.neff. Exiting with a successfully compiled graph.
  0%|          | 1/290 [00:04<22:02, 4.58s/it]
2023-07-07 13:56:06.000202: INFO ||NCC_WRAPPER||: Compile cache path: /tmp/tmpd29vse1h/neuron-compile-cache
2023-07-07 13:56:06.000307: INFO ||NCC_WRAPPER||: Using a cached neff at /tmp/tmpd29vse1h/neuron-compile-cache/neuronxcc-2.7.0.40+f7c6cf2a3/MODULE_12206097846517805328+d41d8cd9/model.neff. Exiting with a successfully compiled graph.
2023-Jul-07 13:56:06.0950 1358943:1359077 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2023-Jul-07 13:56:06.0950 1358943:1359077 [0] init.cc:138 CCOM WARN OFI plugin initNet() failed, is EFA enabled?
  1%|          | 2/290 [00:05<12:38, 2.63s/it]
2023-07-07 13:56:10.000870: INFO ||NCC_WRAPPER||: Compile cache path: /tmp/tmpd29vse1h/neuron-compile-cache
2023-07-07 13:56:10.000993: INFO ||NCC_WRAPPER||: Using a cached neff at /tmp/tmpd29vse1h/neuron-compile-cache/neuronxcc-2.7.0.40+f7c6cf2a3/MODULE_3234126581774316598+d41d8cd9/model.neff. Exiting with a successfully compiled graph.
100%|█████████▉| 289/290 [00:40<00:00, 11.62it/s]
[INFO|trainer.py:2039] 2023-07-07 13:56:41,499 >>

Training completed. Do not forget to share your model on huggingface.co/models =)

2023-07-07 13:56:41.000807: INFO ||NCC_WRAPPER||: Compile cache path: /tmp/tmpd29vse1h/neuron-compile-cache
100%|██████████| 290/290 [00:57<00:00, 11.62it/s]
Selecting 9266 allocations
Selecting 2351 allocations
Selecting 10507 allocations
Selecting 541 allocations
Analyzing dependencies of Block1
Analyzing dependencies of Block1
Dependency reduction of sg0000
[compiler progress scales trimmed]
Compiler status PASS
2023-07-07 13:57:16.000068: INFO ||NCC_WRAPPER||: Compile cache path: /tmp/tmpd29vse1h/neuron-compile-cache
```
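Note the trailing `Compiler status PASS` after "Training completed": that is the extra compilation this issue is about. One way to confirm it produces a fresh NEFF rather than reusing a cached one is to list the cache contents by modification time. A sketch only; the `/tmp/tmpd29vse1h` path is the one printed by `||NCC_WRAPPER||` above and changes on every run:

```bash
# List compiled NEFFs with their modification times; an entry newer than the
# "Training completed" timestamp (13:56:41) indicates an end-of-job compilation.
CACHE_DIR=/tmp/tmpd29vse1h/neuron-compile-cache
find "$CACHE_DIR" -name 'model.neff' -printf '%T+ %p\n' | sort
```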

samir-souza commented 1 year ago

This issue was fixed in the latest release: 0.0.10
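For anyone landing here with the same symptom, upgrading the package picks up the fix (standard pip syntax; the version pin simply reflects the release mentioned above):

```bash
pip install --upgrade "optimum-neuron>=0.0.10"
```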