Closed: ozkatz closed this issue 1 year ago
@nopcoder @itaiad200 @N-o-Z see Suggested Solution above as a way of satisfying most (all?) requirements. WDYT?
@ozkatz This needs stateless operation with a continuation between calls, while the suggested API requires multiple listings that must be aligned into a single ordered listing; that will require either managing state or performing multiple passes. Handling it with a client callback will be a challenge. We also need to specify whether the commit will go into an import branch, and whether this API will support import to staging.
@nopcoder no import branch, no staging.
Import triggers a commit on the current branch.
Need to make sure the destination path is replaced by the imported objects.
@ozkatz maybe it will make more sense to separate the body to paths list and objects list, WDYT?
I was trying to stay consistent with how object listings typically work on object stores - I'm not sure of the benefit in functionality/coherency of either approach.
I guess a big ol' list is easier to assemble client-side, if you do have a mix of both, but if you only import ranges (or only import objects), it could be overly verbose. Interested in hearing more pros/cons though?
@ozkatz Somebody recently asked about pattern matching to import objects, e.g. importing all objects starting with `car*`.
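One way to support this without changing the API itself would be client-side expansion: list the source prefix, filter keys with a glob, and pass the matches as individual entries. A minimal sketch (the helper name and the sample listing are hypothetical, not part of any lakeFS API):

```python
from fnmatch import fnmatch

def expand_pattern(keys, pattern):
    """Filter a listing of object keys by a glob pattern on the file name."""
    return [k for k in keys if fnmatch(k.rsplit("/", 1)[-1], pattern)]

# Hypothetical listing of a source prefix:
keys = ["data/car1.parquet", "data/car2.parquet", "data/truck1.parquet"]
print(expand_pattern(keys, "car*"))  # ['data/car1.parquet', 'data/car2.parquet']
```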
There are currently two ways to create pointers to existing data in lakeFS:

- `createMetarange`
- `createRange`

These introduce a couple of usability challenges:
Definition of Done
Suggested Solution
A single API to import existing data from the object store:
`POST /api/v1/repositories/{repoId}/branches/{branchId}/import`
Body:
For simplicity, we can decide that the `paths[].destination` field is optional and will default to the path portion of the current `paths[].path` URI (excluding the bucket/host).

Calling this API endpoint will import the given paths by doing the following:
- A `prefix` path within the `paths` list will default to replacing the `destination` with the objects that appear under `path`
- An `object` path within the `paths` list will replace any existing object at `destination`
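To make the shape concrete, here is a sketch of what such a request body could look like, including the destination-defaulting rule. The field names (`paths`, `path`, `type`, `destination`) are inferred from the description above and are assumptions, not a published spec:

```python
from urllib.parse import urlparse

def default_destination(path_uri):
    """Default destination: the path portion of the URI, excluding bucket/host."""
    return urlparse(path_uri).path.lstrip("/")

body = {
    "paths": [
        # A prefix entry: replaces everything under `destination`
        # with the objects that appear under `path`.
        {"path": "s3://bucket/collections/", "type": "prefix",
         "destination": "collections/"},
        # An object entry with no explicit destination: it defaults
        # to the path portion of the source URI.
        {"path": "s3://bucket/metadata/file1.parquet", "type": "object"},
    ],
}
for p in body["paths"]:
    p.setdefault("destination", default_destination(p["path"]))

print(body["paths"][1]["destination"])  # metadata/file1.parquet
```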
Alternative: keeping it sync
Alternatively, we can keep this API synchronous by involving the client. This could be done by adding an opaque continuation payload in the request:
Every N objects imported, the server would return a continuation payload to the client, which, in turn, would call the server again with that payload. Import is complete once the response indicates no further payloads, probably with a different status code.
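Client-side, that loop could look roughly like the sketch below. The endpoint shape, the `continuation` field, and the stub standing in for the HTTP call are all hypothetical:

```python
def import_with_continuation(call_import, body):
    """Repeatedly call the import endpoint, passing the opaque
    continuation payload back until the server reports completion."""
    continuation = None
    while True:
        resp = call_import(body, continuation=continuation)
        continuation = resp.get("continuation")
        if continuation is None:  # e.g. signalled by a different status code
            return resp

# Stub server: "imports" 2 objects per call out of 5 total.
state = {"done": 0}
def fake_call(body, continuation=None):
    state["done"] = min(state["done"] + 2, 5)
    if state["done"] < 5:
        return {"continuation": {"next": state["done"]}}
    return {"imported": state["done"]}

print(import_with_continuation(fake_call, {"paths": []}))  # {'imported': 5}
```

Because each round trip carries the full continuation state, any individual call can be retried idempotently, which is the reliability benefit described below.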
This comes with a trade-off. On one hand, it's more reliable/scalable: it allows idempotent retries of parts of the import and requires less coordination/state between lakeFS instances.
It does, however, require a "smarter" SDK to abstract it away from callers (which in turn makes it harder to simply consume the OpenAPI interface verbatim without an understanding of this protocol).