huggingface / huggingface_hub

The official Python client for the Huggingface Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

`huggingface-cli upload` - `_wrapped_lfs_upload` should retry | `http_backoff` retry not working with `_upload_multi_part`/`SliceFileObj` #2539

Closed · hlky closed this 1 month ago

hlky commented 1 month ago
HTTP Error 500 thrown while requesting PUT https://hf-hub-lfs-us-east-1.s3-accelerate.amazonaws.com/repos/...
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/huggingface_hub/utils/_http.py", line 405, in hf_raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.9/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://huggingface.co/api/complete_multipart...
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/huggingface_hub/_commit_api.py", line 431, in _wrapped_lfs_upload
lfs_upload(operation=operation, lfs_batch_action=batch_action, headers=headers, endpoint=endpoint)
File "/usr/local/lib/python3.9/site-packages/huggingface_hub/lfs.py", line 246, in lfs_upload
_upload_multi_part(operation=operation, header=header, chunk_size=chunk_size, upload_url=upload_url)
File "/usr/local/lib/python3.9/site-packages/huggingface_hub/lfs.py", line 355, in _upload_multi_part
hf_raise_for_status(completion_res)
File "/usr/local/lib/python3.9/site-packages/huggingface_hub/utils/_http.py", line 459, in hf_raise_for_status
 raise _format(BadRequestError, message, response) from e
 huggingface_hub.errors.BadRequestError: (Request ID: Root=1-66e43174-2ebfb55456eac7aa1735c83b;cb20fa6b-0659-47af-8f62-2698e507c12b)
Bad request:
   Your proposed upload is smaller than the minimum allowed size
The above exception was the direct cause of the following exception:                                                                                                                                                                                         
File "/usr/local/lib/python3.9/site-packages/huggingface_hub/commands/huggingface_cli.py", line 57, in main
  File "/usr/local/lib/python3.9/site-packages/huggingface_hub/commands/huggingface_cli.py", line 57, in main
    service.run()
  File "/usr/local/lib/python3.9/site-packages/huggingface_hub/commands/upload.py", line 192, in run
    print(self._upload())
  File "/usr/local/lib/python3.9/site-packages/huggingface_hub/commands/upload.py", line 287, in _upload
    return self.api.upload_folder(
  File "/usr/local/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/huggingface_hub/hf_api.py", line 1473, in _inner
    return fn(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/huggingface_hub/hf_api.py", line 4990, in upload_folder
    commit_info = self.create_commit(
  File "/usr/local/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/huggingface_hub/hf_api.py", line 1473, in _inner
    return fn(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/huggingface_hub/hf_api.py", line 3901, in create_commit
    self.preupload_lfs_files(
  File "/usr/local/lib/python3.9/site-packages/huggingface_hub/hf_api.py", line 4448, in preupload_lfs_files
    _upload_lfs_files(
  File "/usr/local/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/huggingface_hub/_commit_api.py", line 446, in _upload_lfs_files
    thread_map(
  File "/usr/local/lib/python3.9/site-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "/usr/local/lib/python3.9/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
  File "/usr/local/lib/python3.9/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 609, in result_iterator
    yield fs.pop().result()
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/usr/lib64/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.9/site-packages/huggingface_hub/_commit_api.py", line 433, in _wrapped_lfs_upload
    raise RuntimeError(f"Error while uploading '{operation.path_in_repo}' to the Hub.") from exc
RuntimeError: Error while uploading 'part-00537.parquet' to the Hub.

Using `huggingface_hub@main` (3cd3286).

This issue wastes hours re-hashing files. As a side note, implementing the caching used by the `upload-large-folder` path in `upload` would be useful. I find `upload-large-folder` itself unsuitable: it hits API rate limits by creating too many commits, the number of commits in turn affects the Dataset Viewer, and in my testing, passing `--revision` to `upload-large-folder` gives errors on upload.

With regard to this particular issue, I'd suggest retrying in `_wrapped_lfs_upload` rather than raising the exception, since the failure seems to be caused by transient S3 errors. Something like:

    def _wrapped_lfs_upload(batch_action, attempt: int = 0) -> None:
        # Resolve the operation outside the `try` so the logging below can
        # always reference it.
        operation = oid2addop[batch_action["oid"]]
        try:
            lfs_upload(operation=operation, lfs_batch_action=batch_action, headers=headers, endpoint=endpoint)
        except Exception as exc:
            # Bound the retries so a persistent failure cannot recurse forever.
            if attempt >= 5:
                raise RuntimeError(f"Error while uploading '{operation.path_in_repo}' to the Hub.") from exc
            logger.debug(f"Error while uploading '{operation.path_in_repo}' to the Hub. Retrying ({attempt + 1}/5).")
            _wrapped_lfs_upload(batch_action, attempt + 1)
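(A bounded loop would work equally well; the recursive form above just stays close to the existing code, and the retry count of 5 is arbitrary.)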
hlky commented 1 month ago

I've applied the suggested retry fix locally and can confirm it works.

I notice that the built-in retry mechanism of `http_backoff` doesn't appear to be working; the only message from that function is:

Retrying in 1s [Retry 1/5].

It seems the cause of this issue is that the request body is consumed on the first attempt when using `SliceFileObj`, so each retry sends an empty body, which would explain the "smaller than the minimum allowed size" error above. `http_backoff` is supposed to record the initial position in `io_obj_initial_pos` (code):

io_obj_initial_pos = None
if "data" in kwargs and isinstance(kwargs["data"], io.IOBase):
    io_obj_initial_pos = kwargs["data"].tell()

and reset the cursor to it on each retry (code):

# If `data` is used and is a file object (or any IO), set back cursor to
# initial position.
if io_obj_initial_pos is not None:
    kwargs["data"].seek(io_obj_initial_pos)

However, it is never set, because `SliceFileObj` is not an `io.IOBase`. Here is a reproduction of the check as it runs in `_upload_parts_iteratively` (used by `lfs_upload` -> `_upload_multi_part`):

from huggingface_hub import CommitOperationAdd
from huggingface_hub.lfs import SliceFileObj
import io

chunk_size = 8192
part_idx = 0
operation = CommitOperationAdd("part-00666.parquet", "parquet/part-00666.parquet")
with operation.as_file() as fileobj:
    with SliceFileObj(
        fileobj,
        seek_from=chunk_size * part_idx,
        read_limit=chunk_size,
    ) as fileobj_slice:
        # Same check as in http_backoff: SliceFileObj fails the isinstance
        # test, so the initial position is never recorded.
        io_obj_initial_pos = None
        if isinstance(fileobj_slice, io.IOBase):
            io_obj_initial_pos = fileobj_slice.tell()
        print(io_obj_initial_pos)   # None
        print(type(fileobj_slice))  # <class 'huggingface_hub.lfs.SliceFileObj'>

The check in `http_backoff` should be changed to include `SliceFileObj`.
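A minimal sketch of that change, mirroring the snippet quoted above (whether this is the final shape of the fix is an assumption; `SliceFileObj` would need to be imported from `huggingface_hub.lfs`):

io_obj_initial_pos = None
# Treat SliceFileObj as a rewindable body too, not just io.IOBase, so the
# seek-back-on-retry branch can fire for multi-part LFS uploads.
if "data" in kwargs and isinstance(kwargs["data"], (io.IOBase, SliceFileObj)):
    io_obj_initial_pos = kwargs["data"].tell()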

I'd still suggest retrying in `_wrapped_lfs_upload` as well, in case there are any future issues. Of course, this assumes that other error types (e.g. Unauthorized) are caught before the upload itself; if that is not the case, fixing the retry mechanism of `http_backoff` alone should suffice.
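If that assumption doesn't hold, the retry in `_wrapped_lfs_upload` could instead be gated on the error type. A hypothetical helper (a sketch, not existing library code; the name `_should_retry` is made up) might look like:

import requests

def _should_retry(exc: Exception) -> bool:
    # Retry connection failures and server-side (5xx) errors, which are
    # likely transient; re-raise client errors (e.g. 401 Unauthorized).
    if isinstance(exc, requests.exceptions.ConnectionError):
        return True
    if isinstance(exc, requests.exceptions.HTTPError) and exc.response is not None:
        return exc.response.status_code >= 500
    return False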