huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

Training gets stuck when cache upload errors. #369

Open philschmid opened 7 months ago

philschmid commented 7 months ago

I found a new edge case. If you push to the Trainium cache during training and the Hub returns a 500 while the neffs are uploading, the training gets stuck. My training has now been stuck for more than 30 minutes and is not finishing.

We should make sure that the training still finishes correctly when the upload fails.
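One way to picture the requested behavior is a guard around the upload call. This is only a minimal sketch, not how optimum-neuron is implemented: `upload_without_blocking`, `upload_fn`, and `timeout_s` are hypothetical names, and the thread does not say where the hang actually originates. The guard runs the upload in a daemon thread, treats any exception as non-fatal, and stops waiting after a timeout.

```python
import logging
import threading

logger = logging.getLogger(__name__)


def upload_without_blocking(upload_fn, timeout_s=60.0):
    """Run `upload_fn` (hypothetical upload callable) without letting it block training.

    Returns True only if the upload finished without raising within `timeout_s` seconds.
    """
    outcome = {}

    def _run():
        try:
            upload_fn()
            outcome["ok"] = True
        except Exception as exc:  # e.g. an HTTP 500 surfaced by the Hub client
            outcome["error"] = exc

    # A daemon thread cannot keep the process alive if the upload never returns.
    worker = threading.Thread(target=_run, daemon=True)
    worker.start()
    worker.join(timeout_s)

    if worker.is_alive():
        logger.warning("Cache upload still running after %.1fs, continuing.", timeout_s)
        return False
    if "error" in outcome:
        logger.warning("Cache upload failed, continuing: %s", outcome["error"])
        return False
    return True
```

With a guard like this, a failed or hung upload only costs the cache entry: the caller gets `False` back and the training loop can move on.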

error:

model.neff:  80%|████████████████████▊     | 16.0M/20.0M [00:00<00:00, 51.3MB/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_errors.py", line 269, in hf_raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://s3.us-east-1.amazonaws.com/lfs.huggingface.co/repos/bf/18/bf18727de0ab9f5939c4b3b52ca9cddeb0389a416e27697a6b48ccd639670f9e/7280eab103853156f8a7cb68aad5f51e6b05e5288e2d63f030ded8efb70ae90a?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA4N7VTDGO27GPWFUO%2F20231207%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231207T135039Z&X-Amz-Expires=86400&X-Amz-Signature=18c4306daf63bb4e89f53372e5e05b1b51fa3403604281529730ca0d7be6d1cc&X-Amz-SignedHeaders=host&partNumber=2&uploadId=.SU.MrPlZIcWlMwlHmJ5xZViOvbm4gvxrD5ZWC7aqNjbJ5lKt_gQIpwbCAAX8q32dzzWGrFcZqRnTMo4xdrD671JuIBEIWy6en54pZzChiNfxtATDVT3mq6o2KHpEKB2&x-id=UploadPart

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/_commit_api.py", line 391, in _wrapped_lfs_upload
    lfs_upload(operation=operation, lfs_batch_action=batch_action, token=token)
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/lfs.py", line 222, in lfs_upload
    _upload_multi_part(operation=operation, header=header, chunk_size=chunk_size, upload_url=upload_action["href"])
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/lfs.py", line 318, in _upload_multi_part
    else _upload_parts_iteratively(operation=operation, sorted_parts_urls=sorted_parts_urls, chunk_size=chunk_size)
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/lfs.py", line 375, in _upload_parts_iteratively
    hf_raise_for_status(part_upload_res)
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_errors.py", line 320, in hf_raise_for_status
    raise HfHubHTTPError(str(e), response=response) from e
huggingface_hub.utils._errors.HfHubHTTPError: 500 Server Error: Internal Server Error for url: https://s3.us-east-1.amazonaws.com/lfs.huggingface.co/repos/bf/18/bf18727de0ab9f5939c4b3b52ca9cddeb0389a416e27697a6b48ccd639670f9e/7280eab103853156f8a7cb68aad5f51e6b05e5288e2d63f030ded8efb70ae90a?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA4N7VTDGO27GPWFUO%2F20231207%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231207T135039Z&X-Amz-Expires=86400&X-Amz-Signature=18c4306daf63bb4e89f53372e5e05b1b51fa3403604281529730ca0d7be6d1cc&X-Amz-SignedHeaders=host&partNumber=2&uploadId=.SU.MrPlZIcWlMwlHmJ5xZViOvbm4gvxrD5ZWC7aqNjbJ5lKt_gQIpwbCAAX8q32dzzWGrFcZqRnTMo4xdrD671JuIBEIWy6en54pZzChiNfxtATDVT3mq6o2KHpEKB2&x-id=UploadPart

The above exception was the direct cause of the following exception:

michaelbenayoun commented 6 months ago

Can you provide the command to reproduce it please?

philschmid commented 6 months ago

Sorry, I don't have a command, but it could happen with any script. Steps to reproduce would be:

  1. Log in to hf.co with an account that has access to the Trainium cache.
  2. Run a training job.
  3. Force a 500 error during the upload phase of the cached .neff files.
  4. The training gets stuck.
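Step 3 is the hard part, since a real 500 from the Hub cannot be triggered on demand. A possible approach is fault injection with `unittest.mock`, sketched below on a toy stand-in. `CacheUploader`, `FakeHubServerError`, and `train` are hypothetical and do not correspond to optimum-neuron or huggingface_hub code. The toy loop already swallows the error, so it shows the desired outcome rather than reproducing the hang.

```python
from unittest import mock


class FakeHubServerError(RuntimeError):
    """Hypothetical stand-in for the HTTP 500 error raised by the Hub client."""


class CacheUploader:
    """Hypothetical placeholder for the component that pushes .neff files."""

    def push(self, path):
        return f"uploaded {path}"


def train(uploader, steps=3):
    """Toy training loop that pushes one cache file after every step."""
    completed = 0
    for step in range(steps):
        completed += 1
        try:
            uploader.push(f"step-{step}/model.neff")
        except FakeHubServerError:
            # Desired behavior: a failed upload must not stop the training.
            pass
    return completed


def run_with_forced_500():
    """Force every upload to fail and report (completed steps, upload attempts)."""
    uploader = CacheUploader()
    error = FakeHubServerError("500 Server Error: Internal Server Error")
    with mock.patch.object(uploader, "push", side_effect=error) as patched:
        completed = train(uploader)
    return completed, patched.call_count
```

In a real reproduction, the same `side_effect` trick would target whichever function performs the upload in the training script, which is not identified in this thread.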