Training gets stuck when cache upload errors.

philschmid commented 7 months ago

I found a new edge case. If you push to the Trainium cache during training and the hub returns a 500 while the neffs are being uploaded, the training gets stuck. My training has now been stuck for > 30 minutes and is not finishing.

We should make sure that the training still finishes correctly when the upload fails.
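
One way this could be approached is to run the upload behind a guard that catches any exception and also gives up after a timeout, so the training loop never waits on the hub indefinitely. This is only a sketch under assumptions: `push_with_guard`, `upload_fn`, and `timeout_s` are hypothetical names, not part of optimum-neuron or huggingface_hub, and the actual cause of the hang has not been confirmed here.

```python
import logging
import threading

logger = logging.getLogger(__name__)


def push_with_guard(upload_fn, timeout_s=60.0):
    """Run upload_fn() in a daemon thread.

    Returns True only if the upload finished without raising within timeout_s.
    upload_fn is a hypothetical zero-argument callable that performs the push.
    """
    outcome = {}

    def _run():
        try:
            upload_fn()
            outcome["ok"] = True
        except Exception as exc:  # any upload failure, e.g. an HTTP 500
            outcome["error"] = exc

    # A daemon thread does not keep the process alive if the upload hangs.
    worker = threading.Thread(target=_run, daemon=True)
    worker.start()
    worker.join(timeout_s)

    if worker.is_alive():
        logger.warning("Cache upload still running after %.0fs, continuing without it.", timeout_s)
        return False
    if "error" in outcome:
        logger.warning("Cache upload failed, continuing without it: %s", outcome["error"])
        return False
    return True
```

With a guard like this, the caller can treat a `False` return as "cache not updated" and carry on to the end of training instead of blocking.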

error:

model.neff:  80%|████████████████████▊     | 16.0M/20.0M [00:00<00:00, 51.3MB/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_errors.py", line 269, in hf_raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://s3.us-east-1.amazonaws.com/lfs.huggingface.co/repos/bf/18/bf18727de0ab9f5939c4b3b52ca9cddeb0389a416e27697a6b48ccd639670f9e/7280eab103853156f8a7cb68aad5f51e6b05e5288e2d63f030ded8efb70ae90a?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA4N7VTDGO27GPWFUO%2F20231207%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231207T135039Z&X-Amz-Expires=86400&X-Amz-Signature=18c4306daf63bb4e89f53372e5e05b1b51fa3403604281529730ca0d7be6d1cc&X-Amz-SignedHeaders=host&partNumber=2&uploadId=.SU.MrPlZIcWlMwlHmJ5xZViOvbm4gvxrD5ZWC7aqNjbJ5lKt_gQIpwbCAAX8q32dzzWGrFcZqRnTMo4xdrD671JuIBEIWy6en54pZzChiNfxtATDVT3mq6o2KHpEKB2&x-id=UploadPart

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/_commit_api.py", line 391, in _wrapped_lfs_upload
    lfs_upload(operation=operation, lfs_batch_action=batch_action, token=token)
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/lfs.py", line 222, in lfs_upload
    _upload_multi_part(operation=operation, header=header, chunk_size=chunk_size, upload_url=upload_action["href"])
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/lfs.py", line 318, in _upload_multi_part
    else _upload_parts_iteratively(operation=operation, sorted_parts_urls=sorted_parts_urls, chunk_size=chunk_size)
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/lfs.py", line 375, in _upload_parts_iteratively
    hf_raise_for_status(part_upload_res)
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_errors.py", line 320, in hf_raise_for_status
    raise HfHubHTTPError(str(e), response=response) from e
huggingface_hub.utils._errors.HfHubHTTPError: 500 Server Error: Internal Server Error for url: https://s3.us-east-1.amazonaws.com/lfs.huggingface.co/repos/bf/18/bf18727de0ab9f5939c4b3b52ca9cddeb0389a416e27697a6b48ccd639670f9e/7280eab103853156f8a7cb68aad5f51e6b05e5288e2d63f030ded8efb70ae90a?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA4N7VTDGO27GPWFUO%2F20231207%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231207T135039Z&X-Amz-Expires=86400&X-Amz-Signature=18c4306daf63bb4e89f53372e5e05b1b51fa3403604281529730ca0d7be6d1cc&X-Amz-SignedHeaders=host&partNumber=2&uploadId=.SU.MrPlZIcWlMwlHmJ5xZViOvbm4gvxrD5ZWC7aqNjbJ5lKt_gQIpwbCAAX8q32dzzWGrFcZqRnTMo4xdrD671JuIBEIWy6en54pZzChiNfxtATDVT3mq6o2KHpEKB2&x-id=UploadPart

The above exception was the direct cause of the following exception:

michaelbenayoun commented 6 months ago

Can you provide the command to reproduce it please?

philschmid commented 6 months ago

Sorry, I don't have a command, but it could happen with any script. Steps to reproduce would be:

1. Log in to hf.co with an account that has access to the Trainium cache.
2. Run a training.
3. Force a 500 error during the upload phase of the cached .neff files.
4. The training is stuck.
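
Since a real 500 from the hub cannot be triggered on demand, the "force a 500 error" step could be simulated by patching the upload call so that it raises. The sketch below shows the idea with `unittest.mock` on toy stand-ins: `CacheUploader`, `FakeServerError`, and `training_step_with_push` are all hypothetical and only illustrate the patching technique. In a real reproduction the patch target would be whichever upload function the training actually calls, which depends on the installed library versions.

```python
from unittest import mock


class FakeServerError(Exception):
    """Hypothetical stand-in for the HTTP 500 error raised by the hub client."""


class CacheUploader:
    """Hypothetical stand-in for the component that pushes cached .neff files."""

    def upload(self, path):
        return f"uploaded {path}"


def training_step_with_push(uploader, path):
    """Toy training step: push a cached file and report what happened."""
    try:
        uploader.upload(path)
        return "pushed"
    except FakeServerError:
        return "push failed, training continues"


uploader = CacheUploader()

# Force every upload to fail while the patch is active.
with mock.patch.object(
    CacheUploader, "upload", side_effect=FakeServerError("500 Server Error")
):
    result = training_step_with_push(uploader, "model.neff")
```

Running the same patch against the real training script would show whether the process exits or hangs once the upload raises.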

huggingface/optimum-neuron issue #369