huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

run_clm.py tries to push Neuron graphs to optimum-neuron-cache even when no credentials are specified #331

Closed. 5cp closed this issue 6 months ago.

5cp commented 11 months ago

I previously ran the run_clm.py example and populated the public optimum-neuron-cache with the graphs. When I run subsequent jobs, a handful of smaller graphs are still JIT-compiled, but the larger graphs are all present in the cache and training proceeds quickly.

The job completes successfully, but at the end of the job there are many 'invalid username or password' errors while Optimum Neuron attempts to push the newly compiled graphs to the hub cache. This is undesirable and creates a lot of log noise. The push should not be attempted unless the user has requested it.
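For illustration, a minimal sketch of the kind of guard that would avoid this noise: skip the upload entirely when no Hugging Face token is configured locally. This is not the actual optimum-neuron code path, just a sketch using the public huggingface_hub API; the helper name maybe_push_to_cache is hypothetical.

# Hypothetical guard, not the optimum-neuron implementation:
# only attempt the Hub push when a local token is available.
from huggingface_hub import HfFolder

def maybe_push_to_cache(push_fn, *args, **kwargs):
    token = HfFolder.get_token()  # returns None when the user is not logged in
    if token is None:
        print("No Hugging Face token found; skipping cache push.")
        return
    # Forward the stored token explicitly so the push is only tried when authenticated.
    push_fn(*args, token=token, **kwargs)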

Error message:

Repository Not Found for url: https://huggingface.co/api/models/aws-neuron/optimum-neuron-cache/preupload/main.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.
Note: Creating a commit assumes that the repo already exists on the Huggingface Hub. Please use `create_repo` if it's not the case.
Did not push the cached model located at /tmp/tmpb7e4bn_8/neuronxcc-2.11.0.34+c5231f848/MODULE_15768223958212646773+d41d8cd9/model.hlo.pb to the repo named aws-neuron/optimum-neuron-cache because it already exists there. Use overwrite_existing=True if you want to overwrite the cache on the Hub.
Could not push the cached model to the repo aws-neuron/optimum-neuron-cache, most likely due to not having the write permission for this repo. Exact error:
401 Client Error. (Request ID: Root=1-65564fae-7c389aa421abc14a7a66f568;5a5e9ea7-fef5-428a-a88f-063460b1e48d)
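As a side note, one quick way to check whether valid credentials are configured before launching training (a sketch assuming the huggingface_hub Python API is installed, which it is in this environment):

# Check whether a valid Hugging Face token is configured locally.
from huggingface_hub import HfApi

api = HfApi()
try:
    user = api.whoami()  # raises if no valid token is stored
    print(f"Logged in as {user['name']}")
except Exception:
    print("No valid Hugging Face credentials found; pushes to the Hub cache will fail with 401.")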

Launch command:

torchrun --nproc_per_node=2 optimum-neuron/examples/language-modeling/run_clm.py \
  --model_name_or_path "PY007/TinyLlama-1.1B-intermediate-step-715k-1.5T" \
  --dataset_name wikitext \
  --dataset_config_name wikitext-2-raw-v1 \
  --per_device_train_batch_size 1 \
  --do_train \
  --max_steps 100 \
  --block_size 150 \
  --tensor_parallel_size 2 \
  --output_dir my_training_new/ \
  --gradient_accumulation_steps 8 \
  --logging_steps 5 \
  --bf16 \
  --overwrite_output_dir

Env:

Python packages:

accelerate                    0.23.0
aws-neuronx-runtime-discovery 2.9
datasets                      2.14.7
evaluate                      0.4.1
libneuronxla                  0.5.538
neuronx-cc                    2.11.0.34+c5231f848
neuronx-distributed           0.5.0
neuronx-hwm                   2.11.0.2+e34678757
optimum                       1.14.1
optimum-neuron                0.0.14.dev0
sagemaker-pytorch-training    2.8.1
torch                         1.13.1
torch-neuronx                 1.13.1.1.12.0
torch-xla                     1.13.1+torchneuronc
torchvision                   0.14.1
transformers                  4.35.0
dacorvo commented 6 months ago

@michaelbenayoun I think this is fixed.

michaelbenayoun commented 6 months ago

Yes, closing this since we don't rely on that anymore.