source-cooperative / data.source.coop

Source Cooperative Data Proxy
https://data.source.coop
MIT License

[Bug] Large downloads through boto3 download incorrect data #7

Open kbgg opened 2 weeks ago

kbgg commented 2 weeks ago

**Description of Bug:**

When downloading a large file through boto3, the completed file is nearly twice the expected size and is corrupt.

**Steps to Reproduce:**

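    import boto3

    # Point the S3 client at the Source Cooperative data proxy.
    s3_client = boto3.client("s3", endpoint_url="https://data.source.coop")

    # Download the object (bucket "kerner-lab") into a local file.
    with open("boundaries_austria_2021_boto3.parquet", "wb") as f:
        s3_client.download_fileobj(
            "kerner-lab",
            "fields-of-the-world-austria/boundaries_austria_2021.parquet",
            f,
        )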

- Verify that the file downloaded through boto3 and the other downloaded copy of the same file (`boundaries_austria_2021.parquet`) differ in size
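One way to run that comparison is sketched below. It is only an illustration: the helper names `file_fingerprint` and `files_match` are made up for this example, and it assumes both copies already exist on local disk.

```python
import hashlib
import os


def file_fingerprint(path, chunk_size=1024 * 1024):
    """Return (size_in_bytes, md5_hex) for the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in blocks so a large file is never held in memory at once.
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return os.path.getsize(path), digest.hexdigest()


def files_match(path_a, path_b):
    """True only if both the size and the MD5 digest are identical."""
    return file_fingerprint(path_a) == file_fingerprint(path_b)
```

For this report, `files_match("boundaries_austria_2021.parquet", "boundaries_austria_2021_boto3.parquet")` would be expected to return `False`.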

**Expected Behavior:**

The file sizes should match and the checksums should match

**Actual Behavior:**

The file sizes do not match and the checksums do not match

**Additional Context:**

    MD5 (boundaries_austria_2021_boto3.parquet) = 7249b300347b14f13d6652c98b266350
    MD5 (boundaries_austria_2021.parquet) = e8f3dc1683acd316a0668d42802fa6a4
    -rw-r--r-- 1 kevin staff 43912147 Nov 6 07:41 boundaries_austria_2021.parquet
    -rw-r--r-- 1 kevin staff 85855187 Nov 6 07:43 boundaries_austria_2021_boto3.parquet

kbgg commented 2 weeks ago

It looks like the data proxy does not properly handle a range whose end is not specified. boto3 sends a request with the range `bytes=41943040-`, which should return the remaining bytes of the file; however, the data proxy returns the entire file.
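For reference, here is a minimal sketch of how a single-range `Range` header value is expected to resolve under the HTTP range-request rules. It is not the proxy's code: `resolve_byte_range` is a hypothetical helper written for this comment, and it deliberately ignores multi-range headers.

```python
def resolve_byte_range(header, size):
    """Resolve a single-range Range header value to inclusive (start, end).

    Returns None when the range cannot be satisfied for an object of
    `size` bytes. Multi-range headers are rejected in this sketch.
    """
    unit, _, spec = header.partition("=")
    if unit.strip() != "bytes" or "," in spec:
        raise ValueError(f"unsupported Range header: {header!r}")
    first, sep, last = spec.strip().partition("-")
    if not sep or (first == "" and last == ""):
        raise ValueError(f"malformed Range header: {header!r}")

    if first == "":
        # Suffix form "bytes=-N": the last N bytes of the object.
        length = int(last)
        if length == 0 or size == 0:
            return None
        return max(size - length, 0), size - 1

    start = int(first)
    if start >= size:
        return None
    # Open-ended form "bytes=N-": everything from N to the end of the object.
    end = size - 1 if last == "" else min(int(last), size - 1)
    if end < start:
        raise ValueError(f"malformed Range header: {header!r}")
    return start, end


start, end = resolve_byte_range("bytes=41943040-", 43912147)
print(start, end, end - start + 1)  # 41943040 43912146 1969107
```

So for this object the open-ended request should yield only the final 1969107 bytes, not all 43912147.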

This accounts for the size of the corrupt file: 41943040 + 43912147 = 85855187, which is exactly the size of the file downloaded through boto3.
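The arithmetic can be reproduced with a small simulation. This is a sketch under assumptions, not a trace of boto3 internals: it assumes the transfer is split into 8 MiB parts (boto3's default `multipart_chunksize`), that each part is written at its own offset, and that only the final part uses an open-ended range. The function name `downloaded_size` is made up for the example.

```python
def downloaded_size(total_size, chunk_size, proxy_ignores_open_range):
    """Return the final file size after every part is written at its offset."""
    final_size = 0
    for start in range(0, total_size, chunk_size):
        is_last = start + chunk_size >= total_size
        if is_last:
            # The last part is requested as "bytes=<start>-". A correct server
            # returns the tail; the behavior reported here returns the whole object.
            body_len = total_size if proxy_ignores_open_range else total_size - start
        else:
            body_len = chunk_size
        final_size = max(final_size, start + body_len)
    return final_size


CHUNK = 8 * 1024 * 1024  # assumed part size (boto3 default)
SIZE = 43912147          # size of boundaries_austria_2021.parquet

print(downloaded_size(SIZE, CHUNK, proxy_ignores_open_range=False))  # 43912147
print(downloaded_size(SIZE, CHUNK, proxy_ignores_open_range=True))   # 85855187
```

Under those assumptions the last part starts at offset 41943040 (five full 8 MiB parts), so writing the whole object there produces the 85855187-byte file seen above.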