minio / minio-go

MinIO Go client SDK for S3 compatible object storage
https://docs.min.io/docs/golang-client-quickstart-guide.html
Apache License 2.0

api: Support CopyObject for all sizes #617

Closed donatello closed 7 years ago

donatello commented 7 years ago

High-level CopyObject requirements:

Currently, the library supports only copying objects <= 5GiB in size. Larger objects can also be copied via a multipart-copy-object strategy.

The multipart copy object operation consists of starting a new multipart upload, followed by 1 or more copy-object-part requests, and finally a complete-multipart request.

Note that copy object via a single PUT request does not support range headers, but copy-object-part does support this.
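
To illustrate the multipart-copy strategy described above, here is a minimal sketch in Go of how a large source object could be divided into byte ranges, one per copy-object-part request. It only computes the ranges and issues no requests. The names `byteRange` and `splitIntoParts` are hypothetical and not part of minio-go, and the 5 GiB per-part ceiling is an assumption taken from the single-request limit mentioned above.

```go
package main

import "fmt"

// byteRange is an inclusive [Start, End] range of bytes in the source object.
// Hypothetical type for illustration; not part of minio-go.
type byteRange struct {
	Start, End int64
}

// maxPartSize is assumed to be 5 GiB, matching the single-copy limit
// discussed in this issue.
const maxPartSize int64 = 5 * 1024 * 1024 * 1024

// splitIntoParts divides an object of the given size into consecutive
// inclusive byte ranges, each at most partSize bytes long. Each range
// would become the source range of one copy-object-part request.
func splitIntoParts(size, partSize int64) ([]byteRange, error) {
	if size <= 0 || partSize <= 0 {
		return nil, fmt.Errorf("size and partSize must be positive, got %d and %d", size, partSize)
	}
	var parts []byteRange
	for start := int64(0); start < size; start += partSize {
		end := start + partSize - 1
		if end > size-1 {
			end = size - 1 // the last part may be shorter
		}
		parts = append(parts, byteRange{Start: start, End: end})
	}
	return parts, nil
}

func main() {
	// A 12 GiB object split into parts of at most 5 GiB.
	parts, err := splitIntoParts(12*1024*1024*1024, maxPartSize)
	if err != nil {
		panic(err)
	}
	for i, p := range parts {
		fmt.Printf("part %d: bytes=%d-%d\n", i+1, p.Start, p.End)
	}
}
```

A real implementation would start a multipart upload, send one copy-object-part request per range, and then complete the upload.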

This feature was recently implemented in the Haskell SDK.

harshavardhana commented 7 years ago

Support source objects with an arbitrary range header (i.e. any valid start and end offset of a source object) via multipart copy object.

Why should this be supported? What benefit does this provide a user? Also, at what juncture does a user really know that they need to copy only a certain range of the object?

Since we cannot append on the destination, I don't see how this API behavior benefits anyone.

hashbackup commented 7 years ago

HashBackup could put a multipart ranged copy to good use. HB packs files into arc files during the backup. These default to 100MB but can be larger, like 4GB. Over time, arc files get "holes" poked in them as files are deleted due to retention policies.

For example, you back up a 75MB file and a 25MB file into 1 arc file and store it on S3. The first file is marked deleted. To actually recover space, the 100MB arc file has to be downloaded, packed, and uploaded. The download is where high costs are incurred.

By using a series of multipart copy requests, this packing operation could be done remotely without requiring a download. I think the only cost would be the request cost: I couldn't see where Amazon charges fees for copy based on the size of the data.
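
As a rough sketch of the packing idea above, the Go snippet below computes which byte ranges of an arc file are still live once the deleted "holes" are known. Those live ranges are what a series of ranged copy requests would need to cover. The names `span` and `liveSpans` are hypothetical, the holes are assumed to lie within the object, and the sizes use decimal megabytes purely for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// span is an inclusive [Start, End] byte range. Hypothetical type for
// illustration only.
type span struct{ Start, End int64 }

// mb is a decimal megabyte, used here only to mirror the example sizes.
const mb int64 = 1000 * 1000

// liveSpans returns the byte ranges of an object of the given size that are
// not covered by any deleted span. Deleted spans may be unsorted or overlap,
// but are assumed to lie within [0, size-1].
func liveSpans(size int64, deleted []span) []span {
	holes := append([]span(nil), deleted...) // copy so the caller's slice is not reordered
	sort.Slice(holes, func(i, j int) bool { return holes[i].Start < holes[j].Start })

	var live []span
	next := int64(0) // first byte not yet classified as live or deleted
	for _, h := range holes {
		if h.Start > next {
			live = append(live, span{next, h.Start - 1})
		}
		if h.End+1 > next {
			next = h.End + 1
		}
	}
	if next < size {
		live = append(live, span{next, size - 1})
	}
	return live
}

func main() {
	// 100MB arc file whose first 75MB file has been marked deleted.
	for _, s := range liveSpans(100*mb, []span{{0, 75*mb - 1}}) {
		fmt.Printf("keep bytes %d-%d\n", s.Start, s.End)
	}
}
```

This only identifies what to keep; whether the storage service accepts each range as a copy source is a separate question.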

(Just realized this is for the Go binding, and I'm using the Python binding)

harshavardhana commented 7 years ago

Yes, but the minio libraries are not meant to expose lower-level multipart operations. For that, you should use the AWS SDKs or copy the minio library source into your repo.

I don't see why we should explore range APIs while not exposing multipart APIs underneath.

hashbackup commented 7 years ago

A range list of start-end offsets could be added to copy_object without exposing multipart.

donatello commented 7 years ago

Why should this be supported? What benefit does this provide a user? Also, at what juncture does a user really know that they need to copy only a certain range of the object?

Since we cannot append on the destination, I don't see how this API behavior benefits anyone.

@harshavardhana Here is my reasoning about this:

When I discussed this with @balamurugana, he suggested an even more general API: one that accepts multiple source objects, each with one or more start-end offset pairs, and creates a single object on the server side using only copy-object. He believed this is a useful operation for working with related objects that are created separately and finally need to be stitched together (e.g. large video production/rendering applications, and @hashbackup's application above). This was going to be my next proposal.

hashbackup commented 7 years ago

A negative aspect of exposing ranges is that it might not actually work as expected. After reading about copy object with ranges on S3, it seems that each range must be at least 5 MiB, because it uses the multipart API. So if a user says to copy bytes 0-5 and bytes 20-30, what should happen? You could get very general and do a download, create a temp file with only the bytes needed, then upload it as a new file, but that seems to be way out of scope for minio, and whether/how to do that would be very dependent on the storage service's capabilities.
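
One way a library could surface this constraint is to validate the requested ranges up front and reject the call, rather than silently falling back to a download. The Go sketch below assumes the S3 multipart rule that every part except the last must be at least 5 MiB; `copyRange` and `checkRanges` are hypothetical names, not minio-go API.

```go
package main

import "fmt"

// minPartSize is the assumed S3 minimum size for every multipart part
// except the last one (5 MiB).
const minPartSize int64 = 5 * 1024 * 1024

// copyRange is an inclusive [Start, End] source byte range.
// Hypothetical type for illustration only.
type copyRange struct{ Start, End int64 }

// checkRanges reports an error if any range is malformed, or if any range
// other than the last is shorter than minPartSize.
func checkRanges(ranges []copyRange) error {
	for i, r := range ranges {
		if r.Start < 0 || r.End < r.Start {
			return fmt.Errorf("range %d (%d-%d) is invalid", i, r.Start, r.End)
		}
		length := r.End - r.Start + 1
		isLast := i == len(ranges)-1
		if !isLast && length < minPartSize {
			return fmt.Errorf("range %d (%d-%d) is %d bytes, below the %d-byte minimum for a non-final part",
				i, r.Start, r.End, length, minPartSize)
		}
	}
	return nil
}

func main() {
	// The example from the comment above: bytes 0-5 and bytes 20-30.
	err := checkRanges([]copyRange{{0, 5}, {20, 30}})
	fmt.Println("validation result:", err)
}
```

Under this assumption the 0-5 and 20-30 request is rejected because its first range is far too small to be a non-final part.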

harshavardhana commented 7 years ago

Moving this as blocked to discuss with @abperiasamy

harshavardhana commented 7 years ago

BTW this is not blocked anymore @deekoder

donatello commented 7 years ago

CopyObject now supports objects of all sizes, copy-conditions, source object ranges, server-side-encryption with decryption of source and encryption of destination, and copying/setting user-metadata on the destination.

In addition, the ComposeObject function is added, which enables creating objects from multiple source objects by providing a concatenation specification.

These changes are available in version 3.0.0 onwards.