treeverse / lakeFS

lakeFS - Data version control for your data lake | Git for data
https://docs.lakefs.io
Apache License 2.0

BUG: Multipart upload: mtime difference between storage and lakeFS can be substantial #8303

Closed N-o-Z closed 3 weeks ago

N-o-Z commented 4 weeks ago

For example: in S3 the mtime is determined when the multipart upload is created, while the lakeFS mtime is determined when the multipart upload completes. This can result in a very large difference between the S3 mtime and the lakeFS mtime.

We need a generic solution that is valid for all storage adapters. Possible solution: upon CompleteMultipartUpload, stat the object on the blockstore and use its mtime to create the lakeFS entry.

To test this properly, we should consider adding a head-object interface to our block adapter.

arielshaqed commented 3 weeks ago

Fortunately we can do this: GCS and Azure return this information. S3 does not, but we already headObject the generated object to get its ETag, after which the Last-Modified time is free (and guaranteed to be found).

Probably also want to straighten this out for put-object: any difference can be unpleasant for presigned, and generally confusing.