amwolff opened 2 years ago
At the core we'd have to extend ObjectDownloadRequest / Endpoint.DownloadObject() to be capable of returning from only a specific part. This would guarantee that order limits are set correctly, etc.
It does seem that part numbers are currently being persisted: the `SegmentPosition` struct is serialized as `uint64(pos.Part)<<32 | uint64(pos.Index)` and saved in the metabase DB's `[segments].[position]` column.
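For illustration, the packing above can be sketched in Go. The type and field names here are assumptions loosely modeled on the metabase code, not the real definitions:

```go
package main

import "fmt"

// SegmentPosition mirrors how the metabase identifies a segment: the part
// number lives in the high 32 bits and the index within the part in the low
// 32 bits. (Illustrative names; see storj's metabase package for the real ones.)
type SegmentPosition struct {
	Part  uint32
	Index uint32
}

// Encode packs the position the way it is persisted in segments.position.
func (p SegmentPosition) Encode() uint64 {
	return uint64(p.Part)<<32 | uint64(p.Index)
}

// DecodeSegmentPosition reverses Encode.
func DecodeSegmentPosition(v uint64) SegmentPosition {
	return SegmentPosition{Part: uint32(v >> 32), Index: uint32(v)}
}

func main() {
	p := SegmentPosition{Part: 3, Index: 7}
	enc := p.Encode()
	fmt.Println(enc, DecodeSegmentPosition(enc) == p)
}
```

Because the part number occupies the high bits, segments of part N sort strictly before segments of part N+1, which is what makes a per-part range lookup feasible.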
Here are my findings from a quick test with the "real" S3:

- The `partNumber` parameter can be specified even for non-multipart-uploaded objects, but then `x-amz-mp-parts-count` is not present in the response. This scheme aligns well with the limits AWS S3 imposes, i.e., the maximum part size is 5 GB, and the largest object that can be uploaded in a single PUT is 5 GB (so objects like that will only have one part).
- If `partNumber` is specified for multipart-uploaded objects, then `x-amz-mp-parts-count` is present. I think this means that the client can effectively download an object aligned with how it was uploaded, part by part.

In short, per GetObject request with this mighty parameter, we will need to gather the appropriate range for the part specified and the number of available parts (if any).
Just an initial question. Is this our finding or is it's a real use case we have with one of external testsuites? I'm asking to understand the priority.
In regards to the solution, @wthorp is right: we would need to extend `metainfo.DownloadObject` to return the exact range for the specified part. That is the relatively easy part; usually, we have more trouble defining how to expose this in the uplink API. There I think we have a few options:

- extend `DownloadObjectOptions` to include `partNumber`
- add a dedicated method, e.g. `project.DownloadPart(ctx, bucket, key, partNumber, options)`
If we need this fix sooner rather than later, then we can also initially fix it only on the uplink side. The `DownloadObject` response at the moment should contain all the necessary information to figure out the part's range within the object, so we can use this range to download the part without changing the server side. The main issue is that we would allocate bandwidth for the full object, but that would only be a temporary solution.
If we agree that this is a high priority, we can think about scheduling time for it. As a task fully for the Metainfo team, I don't think we will have the capacity to do this next sprint, but if it were a cross-team effort, I'm sure we could figure something out.
Thanks so much for the thorough response! Our test suite and the external test suites we test against don't cover this case; this is our own finding at the moment. I don't think it's very high priority (certainly good to have, though); this probably needs confirmation, but I believe it could even be scheduled for Q2 (for our compatibility-improvement milestone).
Moved to backlog until the S3 Compatibility Improvement milestone work is underway.
I had thought that this feature (beyond being part of the standard) would result in more optimized parallel downloads.
Most parallel download clients will ask for some arbitrarily sized chunk. If these chunk sizes roughly, but not exactly, align with segment sizes, we'd effectively require two requests to download each segment. The story is worse if downloaders grab small chunks: it could take arbitrarily many metainfo requests to download a segment.
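The alignment problem can be quantified with a small helper; the 64 MiB segment size below is just an example figure, not Storj's actual configuration:

```go
package main

import "fmt"

// segmentsTouched returns how many fixed-size segments the byte range
// [offset, offset+length) overlaps, i.e., a lower bound on the number of
// per-segment requests needed to serve one chunk request.
func segmentsTouched(offset, length, segmentSize int64) int64 {
	if length <= 0 || segmentSize <= 0 {
		return 0
	}
	first := offset / segmentSize
	last := (offset + length - 1) / segmentSize
	return last - first + 1
}

func main() {
	const segment = 64 << 20 // example: 64 MiB segments

	// A chunk that starts at offset 0 and is smaller than a segment stays
	// within one segment...
	fmt.Println(segmentsTouched(0, 60<<20, segment)) // 1

	// ...but the very next chunk of the same size straddles a segment
	// boundary, and nearly every subsequent chunk does too.
	fmt.Println(segmentsTouched(60<<20, 60<<20, segment)) // 2
}
```

Once chunk and segment sizes drift out of phase, almost every chunk costs two segment fetches, which is the inefficiency described above.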
Seeing as how gaining efficiency here would require changing how clients work (which frankly won't happen), this now seems low priority. I can see a cache of some sort being the more effective solution to this problem.
`GetObjectNInfo` needs adjustments to handle the `partNumber` parameter. This way, we will fully support the GetObject action.

Reference: https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html#API_GetObject_RequestSyntax