The following training script works fine: it completes the training job and pushes the Neuron cache to the defined repo. However, every time I run this script again it loads the cache to train the model, but it always runs an additional compilation step at the end, before finishing the process. In my understanding it should use the cache and skip this step.
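To make "uses the cache except for one graph" checkable, one option is to tally cache hits against fresh compilations in the captured run output. A minimal shell sketch; `train.log` and its contents below are stand-ins (in practice capture the real run with `torchrun ... 2>&1 | tee train.log`):

```shell
# Build a stand-in log containing the two NCC_WRAPPER message kinds seen in
# the real output (replace this with your actual tee'd training log).
cat > train.log <<'EOF'
INFO ||NCC_WRAPPER||: Using a cached neff at /tmp/cache/model.neff. Exiting with a successfully compiled graph.
INFO ||NCC_WRAPPER||: Using a cached neff at /tmp/cache/model.neff. Exiting with a successfully compiled graph.
Compiler status PASS
EOF

# "Using a cached neff" => cache hit; "Compiler status PASS" => a graph that
# neuronx-cc actually recompiled.
hits=$(grep -c "Using a cached neff" train.log)
compiles=$(grep -c "Compiler status PASS" train.log)
echo "cache hits: $hits, fresh compilations: $compiles"
```

With a full cache, the second run's count of fresh compilations should be zero; here it is the single trailing compilation that is the anomaly.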
SCRIPT

CUSTOM_CACHE_REPO=
optimum-cli neuron cache set $CUSTOM_CACHE_REPO

MODEL_ID="gpt2"
MODEL_ID="bert-base-uncased"

torchrun --nproc_per_node=2 \
    examples/language-modeling/run_clm.py \
    --model_name_or_path $MODEL_ID \
    --learning_rate 5e-5 \
    --do_train \
    --num_train_epochs 1 \
    --output_dir output \
    --bf16 \
    --block_size 128 \
    --overwrite_output_dir \
    --per_device_train_batch_size 4 \
    --train_file ../datasets/eli5/train/data.csv

OUTPUT

WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
WARNING:main:Process rank: 1, device: xla:0, n_gpu: 0distributed training: True, 16-bits training: False
WARNING:main:Process rank: 0, device: xla:1, n_gpu: 0distributed training: True, 16-bits training: False
WARNING:datasets.builder:Found cached dataset csv (/home/ubuntu/.cache/huggingface/datasets/csv/default-ea5b785874d94a95/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 836.19it/s]
WARNING:datasets.builder:Found cached dataset csv (/home/ubuntu/.cache/huggingface/datasets/csv/default-ea5b785874d94a95/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 538.84it/s]
WARNING:datasets.builder:Found cached dataset csv (/home/ubuntu/.cache/huggingface/datasets/csv/default-ea5b785874d94a95/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d)
WARNING:datasets.builder:Found cached dataset csv (/home/ubuntu/.cache/huggingface/datasets/csv/default-ea5b785874d94a95/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d)
WARNING:datasets.builder:Found cached dataset csv (/home/ubuntu/.cache/huggingface/datasets/csv/default-ea5b785874d94a95/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d)
WARNING:datasets.builder:Found cached dataset csv (/home/ubuntu/.cache/huggingface/datasets/csv/default-ea5b785874d94a95/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d)
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/csv/default-ea5b785874d94a95/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d/cache-552f5bdb20f2f014.arrow
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/csv/default-ea5b785874d94a95/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d/cache-fe6c229a644960e1.arrow
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/csv/default-ea5b785874d94a95/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d/cache-680aaafeb37df5d2.arrow
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/csv/default-ea5b785874d94a95/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d/cache-7f878c5dd541b1f8.arrow
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/csv/default-ea5b785874d94a95/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d/cache-552f5bdb20f2f014.arrow
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/csv/default-ea5b785874d94a95/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d/cache-fe6c229a644960e1.arrow
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/csv/default-ea5b785874d94a95/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d/cache-680aaafeb37df5d2.arrow
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/csv/default-ea5b785874d94a95/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d/cache-7f878c5dd541b1f8.arrow
/home/ubuntu/Optimum/optimvenv/lib/python3.8/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
warnings.warn(
[INFO|trainer.py:1769] 2023-07-07 13:56:01,140 >> Running training
[INFO|trainer.py:1770] 2023-07-07 13:56:01,140 >> Num examples = 229
[INFO|trainer.py:1771] 2023-07-07 13:56:01,140 >> Num Epochs = 10
[INFO|trainer.py:1772] 2023-07-07 13:56:01,140 >> Instantaneous batch size per device = 4
[INFO|trainer.py:1773] 2023-07-07 13:56:01,140 >> Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:1774] 2023-07-07 13:56:01,140 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1775] 2023-07-07 13:56:01,140 >> Total optimization steps = 290
[INFO|trainer.py:1776] 2023-07-07 13:56:01,141 >> Number of trainable parameters = 124,439,808
0%| | 0/290 [00:00<?, ?it/s]/home/ubuntu/Optimum/optimvenv/lib/python3.8/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
warnings.warn(
2023-07-07 13:56:01.000281: INFO ||NCC_WRAPPER||: Compile cache path: /tmp/tmpd29vse1h/neuron-compile-cache
2023-07-07 13:56:01.000284: INFO ||NCC_WRAPPER||: Using a cached neff at /tmp/tmpd29vse1h/neuron-compile-cache/neuronxcc-2.7.0.40+f7c6cf2a3/MODULE_7411938889920977621+d41d8cd9/model.neff. Exiting with a successfully compiled graph.
0%|▌ | 1/290 [00:04<22:02, 4.58s/it]2023-07-07 13:56:06.000202: INFO ||NCC_WRAPPER||: Compile cache path: /tmp/tmpd29vse1h/neuron-compile-cache
2023-07-07 13:56:06.000307: INFO ||NCC_WRAPPER||: Using a cached neff at /tmp/tmpd29vse1h/neuron-compile-cache/neuronxcc-2.7.0.40+f7c6cf2a3/MODULE_12206097846517805328+d41d8cd9/model.neff. Exiting with a successfully compiled graph.
2023-Jul-07 13:56:06.0950 1358943:1359077 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2023-Jul-07 13:56:06.0950 1358943:1359077 [0] init.cc:138 CCOM WARN OFI plugin initNet() failed is EFA enabled?
1%|█ | 2/290 [00:05<12:38, 2.63s/it]2023-07-07 13:56:10.000870: INFO ||NCC_WRAPPER||: Compile cache path: /tmp/tmpd29vse1h/neuron-compile-cache
2023-07-07 13:56:10.000993: INFO ||NCC_WRAPPER||: Using a cached neff at /tmp/tmpd29vse1h/neuron-compile-cache/neuronxcc-2.7.0.40+f7c6cf2a3/MODULE_3234126581774316598+d41d8cd9/model.neff. Exiting with a successfully compiled graph.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍| 289/290 [00:40<00:00, 11.62it/s][INFO|trainer.py:2039] 2023-07-07 13:56:41,499 >>
Training completed. Do not forget to share your model on huggingface.co/models =)
Selecting 2351 allocations 0% 10 20 30 40 50 60 70 80 90 100% |----|----|----|----|----|----|----|----|----|----|
Selecting 10507 allocations 0% 10 20 30 40 50 60 70 80 90 100% |----|----|----|----|----|----|----|----|----|----|
Selecting 541 allocations 0% 10 20 30 40 50 60 70 80 90 100% |----|----|----|----|----|----|----|----|----|----|
Analyzing dependencies of Block1 0% 10 20 30 40 50 60 70 80 90 100% |----|----|----|----|----|----|----|----|----|----|
Analyzing dependencies of Block1 0% 10 20 30 40 50 60 70 80 90 100% |----|----|----|----|----|----|----|----|----|----|
Dependency reduction of sg0000 0% 10 20 30 40 50 60 70 80 90 100% |----|----|----|----|----|----|----|----|----|----|
Compiler status PASS
2023-07-07 13:57:16.000068: INFO ||NCC_WRAPPER||: Compile cache path: /tmp/tmpd29vse1h/neuron-compile-cache
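Note that the compile cache path in the log is a per-run temporary directory (`/tmp/tmpd29vse1h/...`), so the local NEFF cache alone cannot survive between runs; everything depends on the sync with the Hub repo. One way to narrow the problem down is to pin the compiler cache to a persistent local directory and run twice. A sketch, assuming the Neuron SDK honours the `NEURON_COMPILE_CACHE_URL` environment variable (verify against your SDK release; the directory name is arbitrary):

```shell
# Assumption: NEURON_COMPILE_CACHE_URL redirects the neuronx-cc on-disk
# compile cache to a fixed directory (check the Neuron SDK docs for your
# release). The directory name here is arbitrary.
export NEURON_COMPILE_CACHE_URL="$PWD/neuron-compile-cache"
mkdir -p "$NEURON_COMPILE_CACHE_URL"
echo "compile cache pinned at: $NEURON_COMPILE_CACHE_URL"
# Run the torchrun command twice with this variable set; if the trailing
# compilation still ends with "Compiler status PASS" on the second run, the
# graph is a genuine cache miss rather than a lost temp directory.
```

If the second run is clean with a persistent local cache, the remaining suspect is the Hub cache repo missing that one graph (e.g. a graph traced only at the end of training, after the push).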