huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

Persistent Cache for `notebooks/text-classification/notebook.ipynb` #115

Open dangdana opened 1 year ago

dangdana commented 1 year ago

Hi,

I'm trying to run `notebooks/text-classification/notebook.ipynb`. Each time it runs, it recompiles from scratch, using a different /tmp/ path, e.g.:

    2023-06-27 17:59:23.000182: INFO ||NCC_WRAPPER||: Compile cache path: /tmp/tmpytdl1fxg

I've never actually seen the compilation complete this way. It may be progressing, but very slowly: after 30+ minutes it has made little progress (see below). Is this expected?

  0%|         | 1/1500 [00:11<4:41:45, 11.28s/it]2023-06-27 17:59:23.000182
.....................................  ............    ...

To remedy the compile-time issue, I tried using the Neuron SDK's persistent cache. With `neuron_parallel_compile`, I see a cache created and populated at e.g. /var/tmp/neuron-compile-cache/neuronxcc-2.7.0.40+f7c6cf2a3. However, for some reason this cache is not used when I run the training. Furthermore, whenever I rerun `neuron_parallel_compile`, 6-8 graphs are recompiled and added to the cache.

Is it possible to use the persistent cache with the notebook? Or otherwise compile and train in a timely way?
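In case it helps, here is roughly what I would expect to work for pinning the compiler cache to a fixed directory from inside the notebook, before any compilation is triggered. This is only a sketch: the `--cache_dir` option is taken from the Neuron SDK's persistent-cache documentation and the path is illustrative, so both should be checked against the SDK version shipped with the AMI.

```python
import os

# Sketch: make the Neuron compiler reuse a fixed persistent cache directory
# instead of a fresh /tmp/tmpXXXX path on every run. Must be set before the
# model is traced/compiled. The --cache_dir flag is assumed from the Neuron
# SDK persistent-cache docs; the path below is only an example.
os.environ["NEURON_CC_FLAGS"] = (
    os.environ.get("NEURON_CC_FLAGS", "")
    + " --cache_dir=/var/tmp/neuron-compile-cache"
)
```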

philschmid commented 1 year ago

Did you make any changes to the notebook? Where are you running the notebook? On the Hugging Face AMI?

dangdana commented 1 year ago

I did not make any changes to the notebook. I am running the notebook on the Hugging Face AMI.

Steps to reproduce:

  1. Launch a new Hugging Face AMI.
  2. Fix up a few things:
     pip3 install --upgrade ipywidgets  # fixes display errors in Jupyter
     pip3 install git+https://github.com/huggingface/accelerate.git#egg=accelerate  # see https://github.com/huggingface/optimum-neuron/issues/102
  3. Clone optimum-neuron and follow the notebook directions.

Here is the output of torchrun after 30+ minutes:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
is precompilation: None
is precompilation: None
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 16,000
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 1,500
  Number of trainable parameters = 109,486,854
  0%|                                                  | 0/1500 [00:00<?, ?it/s]2023-06-28 13:38:23.000086: INFO ||NCC_WRAPPER||: Compile cache path: /tmp/tmp1szb6a7r
.....Selecting 2 allocations
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Analyzing dependencies of Block1
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Analyzing dependencies of Block1
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Dependency reduction of sg0000
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************

Compiler status PASS
No Trainium cache name is saved locally. This means that only the official Trainium cache, and potentially a cache defined in $CUSTOM_CACHE_REPO will be used. You can create a Trainium cache repo by running the following command: `optimum-cli neuron cache create`. If the Trainium cache already exists you can set it by running the following command: `optimum-cli neuron cache set -n [name]`.
No Trainium cache name is saved locally. This means that only the official Trainium cache, and potentially a cache defined in $CUSTOM_CACHE_REPO will be used. You can create a Trainium cache repo by running the following command: `optimum-cli neuron cache create`. If the Trainium cache already exists you can set it by running the following command: `optimum-cli neuron cache set -n [name]`.
You do not have write access to aws-neuron/optimum-neuron-cache so you will not be able to push any cached compilation files. Please log in and/or use a custom Trainium cache.
  0%|                                       | 1/1500 [01:35<39:38:46, 95.21s/it]2023-06-28 13:39:58.000124: INFO ||NCC_WRAPPER||: Compile cache path: /tmp/tmp1szb6a7r
..............Selecting 193207 allocations
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
*************************************.*******.***.****

^ Could the issue be around access to the aws-neuron/optimum-neuron-cache?

Here are the cached models I see available:

ubuntu@ip-172-31-33-172:~/optimum-neuron/notebooks/text-classification$ optimum-cli neuron cache list aws-neuron/optimum-neuron-cache | grep name
Downloading (…)e/main/registry.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.23k/3.23k [00:00<00:00, 19.5MB/s]
Model name:     Helsinki-NLP/opus-mt-en-ro
Model name:     Helsinki-NLP/opus-mt-en-ro
Model name:     Helsinki-NLP/opus-mt-en-ro
Model name:     hf-internal-testing/tiny-random-gpt_neo
Model name:     hf-internal-testing/tiny-random-gpt_neo
Model name:     hf-internal-testing/tiny-random-gpt_neo
Model name:     hf-internal-testing/tiny-random-gpt_neo
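
Based on the messages printed in the training log above (`optimum-cli neuron cache create`, `optimum-cli neuron cache set -n [name]`, and `$CUSTOM_CACHE_REPO`), here is a sketch of what I think pointing training at a writable custom Trainium cache repo would look like. The repo name is a placeholder, and it assumes a Hugging Face token with write access is available.

```python
import os
from huggingface_hub import login

# Authenticate so compiled graphs can be pushed to a cache repo with write
# access (the public aws-neuron/optimum-neuron-cache is read-only for me).
login()  # or run `huggingface-cli login` / set HF_TOKEN beforehand

# Point optimum-neuron at a custom Trainium cache repo, as hinted by the
# $CUSTOM_CACHE_REPO message in the log. The repo name is a placeholder;
# it would first be created with `optimum-cli neuron cache create`.
os.environ["CUSTOM_CACHE_REPO"] = "my-org/optimum-neuron-cache"
```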
HuggingFaceDocBuilderDev commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!
