treeverse / lakeFS

Improved import DX #5780

Closed · ozkatz closed this 1 year ago

ozkatz commented 1 year ago

There are currently 2 ways to create pointers to existing data in lakeFS:

  1. The import API (createMetarange, createRange)
  2. Stage Object API

These introduce several usability challenges:

  1. Creating ranges works by merging from a weird-looking branch, which is not coherent to users
  2. Creating ranges works well when importing many objects, but is inefficient when importing a few random objects. This is hard to explain.
  3. Staging an object requires the user to fetch additional metadata and pass it to lakeFS, while the import API does not. This is inconsistent (see the sketch after this list).
  4. Importing a bunch of random things (e.g. labeled data) requires many calls to the server
  5. Using the import API as exposed in the generated SDKs is very hard and requires quite a bit of knowledge of how lakeFS works
  6. Understanding the branching scheme required when importing different paths is confusing
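To make challenge 3 concrete, here is a rough sketch of staging a single object today (Python, using boto3 and requests; the staging field names here are assumed and may differ from the real API). The client has to HEAD the object itself and relay metadata that lakeFS could have fetched on its own:

import boto3
import requests

s3 = boto3.client("s3")
head = s3.head_object(Bucket="bucket", Key="bar/object.txt")

# the caller relays metadata that lakeFS could have looked up itself
requests.put(
    "https://lakefs.example.com/api/v1/repositories/my-repo/branches/main/objects",
    params={"path": "bar/object.txt"},
    auth=("ACCESS_KEY_ID", "SECRET_ACCESS_KEY"),
    json={
        "physical_address": "s3://bucket/bar/object.txt",  # assumed field names
        "checksum": head["ETag"].strip('"'),
        "size_bytes": head["ContentLength"],
    },
)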

Definition of Done

  1. A single coherent API for importing data
  2. Flexible enough to pull large ranges of objects efficiently from the object store
  3. Able to import a list of individual items
  4. Allows replacing a range with a current representation of it (a-la periodic import)
  5. Works on a given branch and does not depend on merging

Suggested Solution

A single API to import existing data from the object store:

POST /api/v1/repositories/{repoId}/branches/{branchId}/import

Body:

{
    "paths": [
        {
            "path": "s3://bucket/path/",
            "destination": "path/",
            "type": "prefix"
        },
        {
            "path": "s3://bucket/path2/object",
            "destination": "path2/object",
            "type": "object"
        },
        {
            "path": "s3://bucket/bar/object.txt",
            "destination": "bar/object.txt",
            "type": "object"
        },
        {
            "path": "s3://bucket/foo/",
            "destination": "foo/",
            "type": "prefix"
        }
    ],
    "commit": {
        "message": "imported things",
        "metadata": {}
    }
}

For simplicity, we can make the paths[].destination field optional; it would default to the path portion of the paths[].path URI (excluding the bucket/host).
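A minimal sketch of that defaulting rule (Python; the function name is mine, not part of the proposal):

from urllib.parse import urlparse

def default_destination(path: str) -> str:
    # "s3://bucket/foo/" -> "foo/": drop the scheme and the bucket/host
    return urlparse(path).path.lstrip("/")

assert default_destination("s3://bucket/foo/") == "foo/"
assert default_destination("s3://bucket/bar/object.txt") == "bar/object.txt"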

Calling this API endpoint will import the given paths by doing the following:

  1. Any prefix path within the paths list will replace the contents of destination with the objects that appear under path
  2. Any object path within the paths list will replace any existing object at destination
  3. The response would be a poll-able ID with a status (see the polling sketch after this list):
    1. in progress: returned while the operation is running
    2. failed: will include a structured error message
    3. success: will include the commit ID that resulted from the operation completing (nice to have: stats about the operation: # of objects, sum of bytes imported, etc.)
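A sketch of how a client might drive this async flow (Python; the polling endpoint, field names, and status strings are assumptions, since none of this exists yet):

import time
import requests

BASE = "https://lakefs.example.com/api/v1/repositories/my-repo/branches/main"
body = {
    "paths": [{"path": "s3://bucket/foo/", "destination": "foo/", "type": "prefix"}],
    "commit": {"message": "imported things", "metadata": {}},
}

import_id = requests.post(f"{BASE}/import", json=body).json()["id"]  # hypothetical field

while True:
    status = requests.get(f"{BASE}/import/{import_id}").json()  # hypothetical endpoint
    if status["status"] == "in progress":
        time.sleep(2)
        continue
    if status["status"] == "failed":
        raise RuntimeError(status["error"])
    print("import resulted in commit:", status["commit_id"])  # plus optional stats
    break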

Alternative: keeping it sync

Alternatively, we can keep this API synchronous by involving the client. This could be done by adding an opaque continuation payload in the request:

Every N objects imported, the server would return a continuation payload to the client, which in turn, would call the server again with that payload. Import is complete once the response indicates no further payloads, and probably a different status code.
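A sketch of the client side of that loop (Python; the 202 status code and the continuation field name are illustrative assumptions):

import requests

url = "https://lakefs.example.com/api/v1/repositories/my-repo/branches/main/import"
body = {
    "paths": [{"path": "s3://bucket/foo/", "destination": "foo/", "type": "prefix"}],
    "commit": {"message": "imported things", "metadata": {}},
}

resp = requests.post(url, json=body)
while resp.status_code == 202:  # "more to do": the server handed back a continuation
    token = resp.json()["continuation"]  # opaque payload, echoed back verbatim
    resp = requests.post(url, json={**body, "continuation": token})

resp.raise_for_status()  # any other success code means the import completed
print("import resulted in commit:", resp.json()["commit_id"])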

This approach comes with a trade-off. On the upside, it's more reliable and scalable: it allows idempotent retries of parts of the import, and it requires less coordination/state between lakeFS instances.

It does, however, require a "smarter" SDK to abstract it away from callers (which in turn makes it harder to simply consume the OpenAPI interface verbatim without an understanding of this protocol).

ozkatz commented 1 year ago

@nopcoder @itaiad200 @N-o-Z see Suggested Solution above as a way of satisfying most (all?) requirements. WDYT?

nopcoder commented 1 year ago

@ozkatz Keeping the server stateless with continuation between calls is tricky: the suggested API requires multiple listings, which need to be aligned into a single ordered listing, and that will require either managing state or performing multiple passes. Handling it with the client callback will be a challenge. We also need to specify whether the commit will go into an import branch, and whether this API will support import to staging.
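To illustrate the alignment problem (a toy Python sketch, not lakeFS code): each per-path listing arrives sorted, and building ranges means merging them into one ordered stream. Pausing that merge and resuming it from an opaque continuation token is where the state comes in:

import heapq

# per-prefix listings, each already sorted lexicographically
listing_foo = iter(["foo/a", "foo/m", "foo/z"])
listing_path = iter(["path/1", "path/2"])

# one ordered stream across all paths; suspending/resuming this merge
# mid-way requires state (or another full pass over the listings)
for key in heapq.merge(listing_foo, listing_path):
    print(key)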

ozkatz commented 1 year ago

We need to specify whether the commit will go into an import branch, and whether this API will support import to staging.

@nopcoder no import branch, no staging.

Import triggers a commit on the current branch.

Need to make sure the destination path is replaced by the imported objects.

N-o-Z commented 1 year ago

@ozkatz maybe it would make more sense to separate the body into a paths list and an objects list, WDYT?

ozkatz commented 1 year ago

I was trying to stay consistent with how object listings typically work on object stores - I'm not sure there's a benefit in functionality / coherency to either approach.

I guess a big ol' list is easier to assemble client-side if you do have a mix of both, but if you only import ranges (or only import objects), it could be overly verbose. Interested in hearing more pros/cons though?

kesarwam commented 1 year ago

@ozkatz Somebody recently asked about pattern matching to import objects, e.g. importing all objects starting with car*.
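Until something like that exists server-side, a client could expand the glob into explicit object entries for the proposed API. A sketch (Python with boto3; the bucket and pattern are illustrative):

import fnmatch
import boto3

s3 = boto3.client("s3")
# list with the literal prefix of the pattern to narrow the scan
pages = s3.get_paginator("list_objects_v2").paginate(Bucket="bucket", Prefix="car")

# filter client-side and submit matches as type=object entries
paths = [
    {"path": f"s3://bucket/{obj['Key']}", "type": "object"}
    for page in pages
    for obj in page.get("Contents", [])
    if fnmatch.fnmatch(obj["Key"], "car*")
]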