peak / s5cmd

Parallel S3 and local filesystem execution tool.
MIT License

feature: cp support for `Range` header #756

Open mackenzie-grimes-noaa opened 2 months ago

mackenzie-grimes-noaa commented 2 months ago

Requested feature

An optional `--range` flag for cp that accepts a byte-range string, just like the standard, if unpopular, HTTP GET request Range header.

s5cmd cp s3://my-bucket/my-object.json . --range 'bytes=123-456'

When cp finishes, the destination object would contain only the data from that byte range of the source object.
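To make the expected result concrete, here is a rough sketch of which bytes a range string selects, following HTTP Range semantics where both offsets are inclusive (so `bytes=123-456` is 334 bytes, not 333). The helper names `resolve_range` and `slice_range` are made up for illustration and are not part of s5cmd; only single-range specs are handled.

```python
import re
from typing import Tuple

# Matches single-range specs: "bytes=123-456", "bytes=123-", "bytes=-456".
_RANGE_RE = re.compile(r"^bytes=(\d*)-(\d*)$")


def resolve_range(spec: str, size: int) -> Tuple[int, int]:
    """Return the inclusive (first, last) offsets `spec` selects in `size` bytes."""
    match = _RANGE_RE.match(spec)
    if match is None:
        raise ValueError(f"unsupported range spec: {spec!r}")
    first_s, last_s = match.groups()
    if not first_s and not last_s:
        raise ValueError(f"range spec has no offsets: {spec!r}")

    if not first_s:
        # Suffix form "bytes=-N": the last N bytes of the object.
        first = max(size - int(last_s), 0)
        last = size - 1
    else:
        first = int(first_s)
        # An open end ("bytes=N-") or an end past EOF is clamped to the last byte.
        last = min(int(last_s), size - 1) if last_s else size - 1

    if first > last:
        raise ValueError(f"range {spec!r} is not satisfiable for size {size}")
    return first, last


def slice_range(data: bytes, spec: str) -> bytes:
    """Return the bytes of `data` that `spec` selects (what the destination would hold)."""
    first, last = resolve_range(spec, len(data))
    return data[first:last + 1]
```

For a 1024-byte source, `slice_range(data, "bytes=123-456")` returns bytes 123 through 456 inclusive, which is what the destination file would contain under this proposal.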

Value

For reasons not worth getting into here, my organization often wants to download specific chunks from an otherwise unwieldy S3 object. These objects may be 10-20 GB, when we only want a specific 1 MB chunk at a known byte location inside that file.

S3's GetObject enables us to do that by accepting a Range header, which translates to huge network load reductions for us, plus storage savings on the destination volume.

Here's an example of including a Range header using the official AWS CLI:

aws s3api get-object --bucket my-bucket --key my-object.json --range "bytes=123-456" my-object-range.json

Or using boto3:

import boto3
boto3.client('s3').get_object(Bucket='my-bucket', Key='my-object.json', Range='bytes=123-456')
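For completeness, this is roughly how the ranged response can be streamed to a local file, which is the behavior the proposed flag would mirror. It is a minimal sketch: `download_range` and its parameters are hypothetical names, the client is passed in so any object with a boto3-compatible `get_object` works, and it assumes the response `Body` supports `iter_chunks` as botocore's `StreamingBody` does.

```python
def download_range(client, bucket, key, first, last, dest_path):
    """Write bytes first..last (inclusive) of s3://bucket/key to dest_path.

    `client` is assumed to behave like boto3.client("s3").
    Returns the number of bytes written.
    """
    response = client.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={first}-{last}",
    )
    written = 0
    with open(dest_path, "wb") as out:
        # Stream in 1 MiB pieces so the chunk is never held fully in memory.
        for chunk in response["Body"].iter_chunks(chunk_size=1024 * 1024):
            out.write(chunk)
            written += len(chunk)
    return written


# Example call (requires boto3 and AWS credentials):
#   import boto3
#   download_range(boto3.client("s3"), "my-bucket", "my-object.json",
#                  123, 456, "my-object-range.json")
```

Only the requested bytes cross the network and land on disk, which is where the network and storage savings come from.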

s5cmd does not appear to expose this header, which is the only thing keeping us from using s5cmd in production (in spite of its clear superiority over boto3 ☹️).

I'm happy to contribute, but I'm creating an Issue first in case someone knows of a good reason why s5cmd doesn't have this yet.

mackenzie-grimes-noaa commented 2 days ago

Resolved in #772