I found a new edge case: when you train with pushing to the Trainium cache enabled and the Hub returns a 500 during the upload of the NEFFs, the training gets stuck. My training has now been stuck for more than 30 minutes and is not finishing.
We should make sure that the training finishes correctly even when the upload fails.
error:
model.neff: 80%|████████████████████▊ | 16.0M/20.0M [00:00<00:00, 51.3MB/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_errors.py", line 269, in hf_raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://s3.us-east-1.amazonaws.com/lfs.huggingface.co/repos/bf/18/bf18727de0ab9f5939c4b3b52ca9cddeb0389a416e27697a6b48ccd639670f9e/7280eab103853156f8a7cb68aad5f51e6b05e5288e2d63f030ded8efb70ae90a?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA4N7VTDGO27GPWFUO%2F20231207%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231207T135039Z&X-Amz-Expires=86400&X-Amz-Signature=18c4306daf63bb4e89f53372e5e05b1b51fa3403604281529730ca0d7be6d1cc&X-Amz-SignedHeaders=host&partNumber=2&uploadId=.SU.MrPlZIcWlMwlHmJ5xZViOvbm4gvxrD5ZWC7aqNjbJ5lKt_gQIpwbCAAX8q32dzzWGrFcZqRnTMo4xdrD671JuIBEIWy6en54pZzChiNfxtATDVT3mq6o2KHpEKB2&x-id=UploadPart

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/_commit_api.py", line 391, in _wrapped_lfs_upload
    lfs_upload(operation=operation, lfs_batch_action=batch_action, token=token)
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/lfs.py", line 222, in lfs_upload
    _upload_multi_part(operation=operation, header=header, chunk_size=chunk_size, upload_url=upload_action["href"])
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/lfs.py", line 318, in _upload_multi_part
    else _upload_parts_iteratively(operation=operation, sorted_parts_urls=sorted_parts_urls, chunk_size=chunk_size)
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/lfs.py", line 375, in _upload_parts_iteratively
    hf_raise_for_status(part_upload_res)
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_errors.py", line 320, in hf_raise_for_status
    raise HfHubHTTPError(str(e), response=response) from e
huggingface_hub.utils._errors.HfHubHTTPError: 500 Server Error: Internal Server Error for url: https://s3.us-east-1.amazonaws.com/lfs.huggingface.co/repos/bf/18/bf18727de0ab9f5939c4b3b52ca9cddeb0389a416e27697a6b48ccd639670f9e/7280eab103853156f8a7cb68aad5f51e6b05e5288e2d63f030ded8efb70ae90a?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA4N7VTDGO27GPWFUO%2F20231207%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231207T135039Z&X-Amz-Expires=86400&X-Amz-Signature=18c4306daf63bb4e89f53372e5e05b1b51fa3403604281529730ca0d7be6d1cc&X-Amz-SignedHeaders=host&partNumber=2&uploadId=.SU.MrPlZIcWlMwlHmJ5xZViOvbm4gvxrD5ZWC7aqNjbJ5lKt_gQIpwbCAAX8q32dzzWGrFcZqRnTMo4xdrD671JuIBEIWy6en54pZzChiNfxtATDVT3mq6o2KHpEKB2&x-id=UploadPart

The above exception was the direct cause of the following exception:
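What I have in mind is something along the lines of the sketch below: run the cache push in a helper thread with a timeout, and log Hub errors instead of letting them propagate, so a failed NEFF upload can neither crash nor hang the training loop. This is only a rough sketch, not the actual optimum-neuron code; `push_fn` stands in for whatever function performs the real upload, and the 300-second timeout is an arbitrary placeholder.

```python
# Minimal sketch, not the optimum-neuron implementation. Assumptions:
# `push_fn` is a placeholder for the real upload helper, and the
# timeout value is arbitrary.
import logging
import threading

logger = logging.getLogger(__name__)

def push_without_blocking_training(push_fn, timeout: float = 300.0) -> None:
    """Run the cache upload so it can neither crash nor hang training."""

    def _worker() -> None:
        try:
            push_fn()
        except Exception:
            # e.g. huggingface_hub.utils.HfHubHTTPError: 500 from the Hub/S3
            logger.warning(
                "Pushing the compiled NEFFs to the Hub failed; "
                "training continues without updating the cache.",
                exc_info=True,
            )

    # daemon=True lets the process exit even if the upload never returns.
    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        logger.warning(
            "NEFF upload still running after %.0fs; not waiting any longer.",
            timeout,
        )
```

The daemon thread matters for the hang case specifically: even if the upload never returns, the training (and the process) can still finish.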