I've applied the suggested retry fix locally and can confirm it works.
I notice that the built-in retry mechanism of `http_backoff` doesn't appear to be working; the only message from that function is:

`Retrying in 1s [Retry 1/5].`

It seems the cause of this issue is that `data` is consumed after the first attempt when using `SliceFileObj`.
`io_obj_initial_pos` should be set:

```python
io_obj_initial_pos = None
if "data" in kwargs and isinstance(kwargs["data"], io.IOBase):
    io_obj_initial_pos = kwargs["data"].tell()
```
and reset on retry:

```python
# If `data` is used and is a file object (or any IO), set back cursor to
# initial position.
if io_obj_initial_pos is not None:
    kwargs["data"].seek(io_obj_initial_pos)
```
However, it is not set. Reproduction of `_upload_parts_iteratively` (used by `lfs_upload` -> `_upload_multi_part`):
```python
from huggingface_hub import CommitOperationAdd
from huggingface_hub.lfs import SliceFileObj
import io

chunk_size = 8192
part_idx = 0
operation = CommitOperationAdd("part-00666.parquet", "parquet/part-00666.parquet")
with operation.as_file() as fileobj:
    with SliceFileObj(
        fileobj,
        seek_from=chunk_size * part_idx,
        read_limit=chunk_size,
    ) as fileobj_slice:
        io_obj_initial_pos = None
        if isinstance(fileobj_slice, io.IOBase):
            io_obj_initial_pos = fileobj_slice.tell()
        print(io_obj_initial_pos)   # None
        print(type(fileobj_slice))  # <class 'huggingface_hub.lfs.SliceFileObj'>
```
The check in `http_backoff` should be changed to include `SliceFileObj`.
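For illustration, a minimal sketch of the widened check (the standalone helper and the duck-typed fallback are my own framing, not the actual `http_backoff` code, which does this inline on `kwargs["data"]`):

```python
import io
from typing import Optional

from huggingface_hub.lfs import SliceFileObj


def initial_pos_for_retry(data) -> Optional[int]:
    """Return the cursor position to restore before each retry, or None."""
    # SliceFileObj does not subclass io.IOBase, so the existing
    # isinstance(data, io.IOBase) check skips it and the slice is left
    # consumed after the first attempt.
    if isinstance(data, (io.IOBase, SliceFileObj)):
        return data.tell()
    # Optional duck-typed fallback for any other file-like wrapper.
    if hasattr(data, "seek") and hasattr(data, "tell"):
        return data.tell()
    return None
```

With the position recorded, the existing `kwargs["data"].seek(io_obj_initial_pos)` on retry restores the slice to its start before the part is re-sent.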
I'd still suggest retrying in `_wrapped_lfs_upload` in case there are any future issues. Of course, this assumes that other error types (e.g. Unauthorized) will be caught before the upload itself; if that is not the case, then fixing the retry mechanism of `http_backoff` should suffice.
Using `huggingface_hub@main` (3cd3286)
This issue wastes hours re-hashing files after a failed upload.

As a side note, implementing the caching of the `upload-large-folder` path for `upload` would be useful. I find `upload-large-folder` unsuitable: it hits API rate limits by creating too many commits, the number of commits in turn affects the Dataset Viewer, and in my testing passing `--revision` to `upload-large-folder` gives errors on upload.

With regards to this particular issue, I'd suggest retrying in `_wrapped_lfs_upload` rather than raising the exception, which seems to be caused by transient S3 errors, something like:
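(The sketch below is illustrative rather than an exact patch: `retry_transient` is a hypothetical helper, and the blanket `except Exception` would need narrowing to the transient error types in practice.)

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def retry_transient(fn: Callable[[], T], max_retries: int = 5, base_wait: float = 1.0) -> T:
    """Call `fn`, retrying with exponential backoff if it raises."""
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_retries:
                raise
            wait = min(base_wait * 2 ** (attempt - 1), 60.0)
            logger.warning(
                "LFS upload failed (%s); retrying in %.0fs [Retry %d/%d]",
                exc, wait, attempt, max_retries,
            )
            time.sleep(wait)
    raise AssertionError("unreachable")
```

Inside `_wrapped_lfs_upload`, the existing `lfs_upload(...)` call would then become `retry_transient(lambda: lfs_upload(...))`, keeping the existing error for the case where all retries are exhausted.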