huggingface / huggingface_hub

The official Python client for the Huggingface Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

Content-Range header for multiple part request #2248

Open narugo1992 opened 2 months ago

narugo1992 commented 2 months ago

I'm developing a library to download individual files from tar archives stored in Hugging Face repositories:

It relies on the Range header of HTTP requests: requesting a tar archive with Range: bytes=xxx-yyy downloads only the specific file instead of the full archive.

In some cases we need to download many files from different tar archives, and many of them live in the same archive. So I'm considering using Range: bytes=xxx-yyy,zzz-ttt to download all of them with a single HTTP request. This could greatly improve batch download performance and also reduce the load on the Hugging Face CDN.
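
To make the idea concrete, here is a minimal sketch of how such a combined header value could be assembled from inclusive (start, end) byte offsets. The helper name `build_range_header` is only illustrative and is not part of any existing library.

```python
from typing import Iterable, Tuple


def build_range_header(ranges: Iterable[Tuple[int, int]]) -> str:
    """Build a Range header value from inclusive (start, end) byte offsets."""
    parts = []
    for start, end in ranges:
        if start < 0 or end < start:
            raise ValueError(f'invalid byte range: {start}-{end}')
        parts.append(f'{start}-{end}')
    if not parts:
        raise ValueError('at least one range is required')
    return 'bytes=' + ','.join(parts)


# Header dict that can be passed to an HTTP client such as requests.
headers = {'Range': build_range_header([(0, 99), (1200, 1369)])}
print(headers)
```

Both offsets are inclusive, so `(0, 99)` asks for exactly 100 bytes.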

But in my test, when using multiple ranges, the Content-Range header seems to be missing from the response.

from pprint import pprint

import requests

resp = requests.get(
    'https://huggingface.co/datasets/deepghs/yande_full/resolve/main/images/0008.tar',
    headers={
        'Range': 'bytes=0-99,1200-1369'
    }
)
print(resp)
pprint(dict(resp.headers))
print(len(resp.content))

The output looks like this: no Content-Range header is present. The content length seems fine, but I don't know which byte range each part corresponds to.

<Response [206]>
{'Accept-Ranges': 'bytes',
 'Connection': 'keep-alive',
 'Content-Disposition': "attachment; filename*=UTF-8''0008.tar; "
                        'filename="0008.tar";',
 'Content-Length': '570',
 'Content-Type': 'multipart/byteranges; '
                 'boundary=CloudFront:8C171B3C6DAD1DF1040C2DA33E27D04D',
 'Date': 'Wed, 24 Apr 2024 14:57:57 GMT',
 'ETag': '"820b63e3250678f8217c157c8b557712-135"',
 'Last-Modified': 'Sat, 20 Apr 2024 15:06:38 GMT',
 'Server': 'AmazonS3',
 'Vary': 'Origin',
 'Via': '1.1 db3cc869e0dda88ce4fa37dee230e06e.cloudfront.net (CloudFront)',
 'X-Amz-Cf-Id': 'VToeCDfStyG6NtjMCRVWdUqbHvojrQN8a29nE-tgh0zbMNF_80DMEg==',
 'X-Amz-Cf-Pop': 'TXL50-P6',
 'X-Cache': 'RefreshHit from cloudfront',
 'x-amz-server-side-encryption': 'AES256',
 'x-amz-storage-class': 'INTELLIGENT_TIERING'}
570

This header information is really important, so could it be added? Or is there an alternative way to download multiple parts at once and save each part to a different file?

julien-c commented 2 months ago

Cool idea to implement a lazy tar parser on top of HF Hub!! What's the context/goals there?

Re. support for multiple ranges in a single Range request, I think I remember @Kakulukian took a look at this at some point (was this you @Kakulukian?)

narugo1992 commented 2 months ago

@julien-c

In essence, the idea (detailed in the hfutils.index module) is to create an index for tar files, including offsets, sizes, and file hashes of all files within the tar. This enables downloading specific files and verifying integrity using Range: bytes=xxx-yyy during retrieval.
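
As a rough illustration of that kind of index (not the actual hfutils.index implementation, whose format is not shown here), the sketch below uses the standard `tarfile` module to record the data offset, size and SHA-256 of each regular file. It assumes an uncompressed tar; the function names and the dict layout are hypothetical.

```python
import hashlib
import io
import tarfile


def build_tar_index(tar_bytes: bytes) -> dict:
    """Map each regular file in an uncompressed tar to offset, size and SHA-256."""
    index = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode='r:') as tar:
        for member in tar:
            if not member.isfile():
                continue
            data = tar.extractfile(member).read()
            index[member.name] = {
                'offset': member.offset_data,  # where the file's bytes start in the archive
                'size': member.size,
                'sha256': hashlib.sha256(data).hexdigest(),
            }
    return index


def range_for(entry: dict) -> str:
    """Range header value that fetches exactly one indexed (non-empty) file."""
    if entry['size'] == 0:
        raise ValueError('an empty file has no bytes to request')
    return f"bytes={entry['offset']}-{entry['offset'] + entry['size'] - 1}"


# Small in-memory archive standing in for a real one.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w') as tar:
    for name, payload in [('a.txt', b'hello'), ('b.txt', b'world!!')]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
archive = buf.getvalue()

index = build_tar_index(archive)
entry = index['b.txt']
# Slicing the raw archive at the indexed offset yields the file content,
# which is what a Range request for the same bytes would return.
chunk = archive[entry['offset']: entry['offset'] + entry['size']]
print(range_for(entry), hashlib.sha256(chunk).hexdigest() == entry['sha256'])
```

The stored hash lets the client verify a downloaded range without ever holding the full archive.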

Our requirement is to swiftly retrieve a set of specific files from datasets on huggingface. These datasets typically comprise numerous (e.g., 1k) tar archives, each containing numerous image files. The archive in which an image resides depends on the image's id modulo 1000. Notably, one such dataset is nyanko7/danbooru2023, containing roughly 8 million images spread across 2k+ archive files.
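
To illustrate the modulo layout, here is a small sketch that groups requested image ids by the archive they would live in, so that all files from one archive can be fetched together. The four-digit `NNNN.tar` naming is an assumption made for the example; real datasets may name their archives differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_ids_by_archive(image_ids: Iterable[int], n_archives: int = 1000) -> Dict[str, List[int]]:
    """Group image ids by archive, where the archive is chosen by id modulo n_archives."""
    groups = defaultdict(list)
    for image_id in image_ids:
        archive = f'{image_id % n_archives:04d}.tar'  # hypothetical naming scheme
        groups[archive].append(image_id)
    return dict(groups)


print(group_ids_by_archive([8, 1008, 9]))
```

With consecutive ids almost every id lands in a different archive, while ids that share a remainder end up in the same group and could share one request.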

In our practical application, we often begin by querying images based on metadata like tags, obtaining a list of required image ids (often over 1k, sometimes exceeding 100k), then fetching all images based on these ids to make a dataset. For this purpose, we're developing a library called cheesechaser. Though still a work in progress, it already supports the aforementioned danbooru2023 dataset. Based on our current tests, downloading 10k specified images (with consecutive ids spread across 1000 archive files) totaling approximately 18 GB, using 12 threads took about 17 minutes, involving roughly 10k download requests. This performance is satisfactory, significantly faster than downloading and decompressing approximately 9 TB of complete tar archives, with minimal local disk usage.

However, we've identified areas for improvement in performance. Primarily, due to the large volume of download requests and relatively small file sizes, most time is spent establishing connections rather than downloading. Additionally, as the number of downloaded files increases, excessive requests strain Hugging Face's CDN resources. Therefore, supporting multi-part range requests could significantly boost performance and alleviate pressure on the CDN service by enabling simultaneous downloads of multiple files within the same archive.

Furthermore, after raising this issue and attempting to use multi-part range, we encountered some more problems:

Kakulukian commented 2 months ago

When you request multiple ranges, the response has a multipart/byteranges content type that includes a boundary. Each requested range is returned as a separate block, delimited by this boundary and carrying its own Content-Range header (https://www.rfc-editor.org/rfc/rfc7233#page-21).

For example, for your request:

GET https://huggingface.co/datasets/deepghs/yande_full/resolve/main/images/0008.tar HTTP/1.1
Range: bytes=0-99,1200-1369 

Response:

HTTP/1.1 206 Partial Content
Content-Length: 570
Content-Type: multipart/byteranges; boundary=CloudFront:725CE26A0B74DDB74002A7B61F84A558

--CloudFront:725CE26A0B74DDB74002A7B61F84A558
Content-Type: application/x-tar
Content-Range: bytes 0-99/2146662400

././@PaxHeader
--CloudFront:725CE26A0B74DDB74002A7B61F84A558
Content-Type: application/x-tar
Content-Range: bytes 1200-1369/2146662400

ustar00runnerdocker00000000000000
--CloudFront:725CE26A0B74DDB74002A7B61F84A558--
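
A body in this format could be split back into its ranges roughly as follows. This is only a sketch under the assumption that every part carries a `Content-Range: bytes start-end/total` header, as in the example above; `parse_byteranges` is a hypothetical helper, not an existing API.

```python
import re
from typing import Dict, Tuple


def parse_byteranges(content_type: str, body: bytes) -> Dict[Tuple[int, int], bytes]:
    """Split a multipart/byteranges body into {(start, end): payload}."""
    match = re.search(r'boundary="?([^";]+)"?', content_type)
    if not content_type.lower().startswith('multipart/byteranges') or match is None:
        raise ValueError(f'not a multipart/byteranges response: {content_type!r}')
    delimiter = b'--' + match.group(1).encode('ascii')

    parts = {}
    pos = body.find(delimiter)
    while pos != -1:
        pos += len(delimiter)
        if body.startswith(b'--', pos):  # closing delimiter: "--boundary--"
            break
        head_end = body.index(b'\r\n\r\n', pos)  # blank line ends the part headers
        head = body[pos:head_end].decode('latin-1')
        m = re.search(r'^content-range:\s*bytes\s+(\d+)-(\d+)/', head, re.I | re.M)
        if m is None:
            raise ValueError('part without a Content-Range header')
        start, end = int(m.group(1)), int(m.group(2))
        data_start = head_end + 4
        data_end = data_start + (end - start + 1)
        if data_end > len(body):
            raise ValueError('truncated part payload')
        # Slice by the announced length instead of searching for the delimiter,
        # so binary payloads that happen to contain "--" cannot confuse the parser.
        parts[(start, end)] = body[data_start:data_end]
        pos = body.find(delimiter, data_end)
    return parts
```

With requests, this would be called as `parse_byteranges(resp.headers['Content-Type'], resp.content)` after checking that the status code is 206.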

narugo1992 commented 2 months ago

When I read the response body, it does appear to have a certain internal format. However, the behavior varies across runtime environments for the same request, and sometimes the entire archive file is returned.

I just reproduced this.

Reproduction code

import time
from pprint import pprint

import requests

# ranges to get
ranges = [
    (0, 99),
    (1200, 1369),
    (2000, 2209),
    (2146660100, 2146660200),
]

# get ranges with standalone requests
datas = []
for i, (x, y) in enumerate(ranges):
    start_time = time.time()
    resp = requests.get(
        'https://huggingface.co/datasets/deepghs/yande_full/resolve/main/images/0008.tar',
        headers={
            'Range': f'bytes={x}-{y}'
        },
    )
    print(f'Range {i}, response: {resp!r}, length: {len(resp.content)}, time cost: {time.time() - start_time:.3f}s')
    datas.append(bytes(resp.content))
    assert resp.status_code == 206, f'Should be 206, but {resp.status_code} found!'

# get all the data with one request
start_time = time.time()
resp = requests.get(
    'https://huggingface.co/datasets/deepghs/yande_full/resolve/main/images/0008.tar',
    headers={
        'Range': f'bytes={",".join(map(lambda ix: f"{ix[0]}-{ix[1]}", ranges))}'
    },
)
print(f'Multipart response: {resp!r}')
print(f'Time cost: {time.time() - start_time:.3f}s')
print('Headers:')
pprint(dict(resp.headers))
print(f'Content length: {len(resp.content)}')
assert resp.status_code == 206, f'Should be 206, but {resp.status_code} found!'

full_bytes = resp.content

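# Walk the multipart body: each part is a block of header lines (boundary line
# included), then a blank line (\r\n\r\n), then exactly `length` payload bytes.
start_pos = 0
current_i = 0
while True:
    try:
        next_sep = full_bytes.index(b'\r\n\r\n', start_pos)
    except ValueError:
        break  # no further header block: only the closing boundary remains

    lines = list(filter(bool, full_bytes[start_pos: next_sep].decode().splitlines(keepends=False)))
    # The boundary line '--CloudFront:<id>' contains a colon, so it is split like a header too.
    pairs = [line.split(':', maxsplit=1) for line in lines]
    headers = {
        key.strip(): value.strip()
        for key, value in pairs
    }
    # 'bytes 0-99/2146662400' -> (0, 99)
    start_bytes, end_bytes = headers['Content-Range'].split(' ')[-1].split('/')[0].split('-', maxsplit=1)
    start_bytes, end_bytes = int(start_bytes), int(end_bytes)
    length = end_bytes - start_bytes + 1
    current_data = full_bytes[next_sep + 4: next_sep + 4 + length]
    start_pos = next_sep + 4 + length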

    print(f'Multipart, range {current_i}, headers: {headers!r}, byte-ranges: {(start_bytes, end_bytes)}')
    assert current_data == datas[current_i], f'Range {current_i} not match!'
    print(f'Range {current_i} matched!')
    current_i += 1

if current_i < len(datas):
    print(f'Range {list(range(current_i, len(datas)))} not matched!')
else:
    print('Match success!')

On my local machine

When I run this in my local environment, the result is as follows (the multipart request is very slow, but the result is correct and the status code is 206 as expected):

Range 0, response: <Response [206]>, length: 100, time cost: 2.709s               
Range 1, response: <Response [206]>, length: 170, time cost: 2.133s
Range 2, response: <Response [206]>, length: 210, time cost: 2.101s
Range 3, response: <Response [206]>, length: 101, time cost: 2.365s
Multipart response: <Response [206]>
Time cost: 23.916s                                                                                   
Headers:                                                                                             
{'Accept-Ranges': 'bytes',
 'Connection': 'keep-alive',                                                                         
 'Content-Disposition': "attachment; filename*=UTF-8''0008.tar; "
                        'filename="0008.tar";',
 'Content-Length': '1147',                                                                           
 'Content-Type': 'multipart/byteranges; '
                 'boundary=CloudFront:E5D729C94A500F62E0C8D8AF02F938EF',
 'Date': 'Thu, 25 Apr 2024 13:33:23 GMT',
 'ETag': '"820b63e3250678f8217c157c8b557712-135"',
 'Last-Modified': 'Sat, 20 Apr 2024 15:06:38 GMT', 
 'Server': 'AmazonS3',   
 'Vary': 'Origin',  
 'Via': '1.1 c1ff362c1118e059b545627964cd2e64.cloudfront.net (CloudFront)',
 'X-Amz-Cf-Id': 'I3Zj3t7Yn0ndSDNb7q9F3-_2700VGin-UGIZK-Ik9dkZmfkY5Um8Jw==',
 'X-Amz-Cf-Pop': 'SFO53-P1',
 'X-Cache': 'Miss from cloudfront',
 'x-amz-server-side-encryption': 'AES256',
 'x-amz-storage-class': 'INTELLIGENT_TIERING'}
Content length: 1147
Multipart, range 0, headers: {'--CloudFront': 'E5D729C94A500F62E0C8D8AF02F938EF', 'Content-Type': 'application/x-tar', 'Content-Range': 'bytes 0-99/2146662400'}, byte-ranges: (0, 99)
Range 0 matched!
Multipart, range 1, headers: {'--CloudFront': 'E5D729C94A500F62E0C8D8AF02F938EF', 'Content-Type': 'application/x-tar', 'Content-Range': 'bytes 1200-1369/2146662400'}, byte-ranges: (1200, 1369)
Range 1 matched!
Multipart, range 2, headers: {'--CloudFront': 'E5D729C94A500F62E0C8D8AF02F938EF', 'Content-Type': 'application/x-tar', 'Content-Range': 'bytes 2000-2209/2146662400'}, byte-ranges: (2000, 2209)
Range 2 matched!
Multipart, range 3, headers: {'--CloudFront': 'E5D729C94A500F62E0C8D8AF02F938EF', 'Content-Type': 'application/x-tar', 'Content-Range': 'bytes 2146660100-2146660200/2146662400'}, byte-ranges: (2146660100, 2146660200)
Range 3 matched!                  
Match success!

On huggingface space

When I run this code on a Hugging Face Space (I deployed JupyterLab in the Space), the output is as follows (it fails: the entire file is returned):

Range 0, response: <Response [206]>, length: 100, time cost: 0.291s
Range 1, response: <Response [206]>, length: 170, time cost: 0.190s
Range 2, response: <Response [206]>, length: 210, time cost: 0.119s
Range 3, response: <Response [206]>, length: 101, time cost: 0.281s
Multipart response: <Response [200]>
Time cost: 23.513s
Headers:
{'Accept-Ranges': 'bytes',
 'Content-Disposition': "attachment; filename*=UTF-8''0008.tar; "
                        'filename="0008.tar";',
 'Content-Length': '2146662400',
 'Content-Type': 'application/x-tar',
 'Date': 'Thu, 25 Apr 2024 13:33:19 GMT',
 'ETag': '"820b63e3250678f8217c157c8b557712-135"',
 'Last-Modified': 'Sat, 20 Apr 2024 15:06:38 GMT',
 'Server': 'AmazonS3',
 'x-amz-id-2': 'yGuW1BP+wVzZ6c6FgVvrvuBw2vkHDuqskpgpGHFW2t5y9sDFGRNGMi/29Ywf1t3t06aL3ma6MME=',
 'x-amz-request-id': 'HBC2WBTYWNDSXDP7',
 'x-amz-server-side-encryption': 'AES256',
 'x-amz-storage-class': 'INTELLIGENT_TIERING'}
Content length: 2146662400
Traceback (most recent call last):
  File "test_main.py", line 41, in <module>
    assert resp.status_code == 206, f'Should be 206, but {resp.status_code} found!'
AssertionError: Should be 206, but 200 found!
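
Since RFC 7233 allows a server to ignore the Range header and answer 200 with the full representation, a client may want to inspect the status code and Content-Type before reading the body. Below is a minimal sketch of such a check; `classify_range_response` is a hypothetical helper, and the requests usage in the comments is illustrative only.

```python
def classify_range_response(status_code: int, content_type: str) -> str:
    """Return 'multipart', 'single' or 'full' for a response to a Range request."""
    media_type = content_type.split(';', 1)[0].strip().lower()
    if status_code == 206 and media_type == 'multipart/byteranges':
        return 'multipart'  # one part per range, each with its own Content-Range
    if status_code == 206:
        return 'single'     # a single range, described by the top-level Content-Range
    if status_code == 200:
        return 'full'       # Range was ignored: the body is the entire file
    raise ValueError(f'unexpected status for a Range request: {status_code}')


# Hypothetical usage with requests (not executed here). With stream=True the
# body is not downloaded until it is accessed, so a 2 GB fallback can be avoided:
#   with requests.get(url, headers={'Range': 'bytes=0-99,1200-1369'}, stream=True) as resp:
#       kind = classify_range_response(resp.status_code, resp.headers.get('Content-Type', ''))
#       if kind == 'full':
#           raise RuntimeError('server ignored the Range header')
#       body = resp.content

print(classify_range_response(200, 'application/x-tar'))
```

On a 'full' result the caller could fall back to one request per range, which worked in both environments above.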

So, 2 problems: