kaskada-ai / kaskada

Modern, open-source event-processing
https://kaskada.io/
Apache License 2.0

feat: Remove object store uses from Wren #485

Open bjchambers opened 1 year ago

bjchambers commented 1 year ago

Summary

Currently, Wren uses object stores for a variety of things, at least one of which is downloading a file locally before getting the schema via FileService. As of #479, Sparrow can get the schema for any URL, including s3:// and gs://, via object_store, and doing so only needs to retrieve the footer of Parquet files.
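To illustrate why only the footer is needed: a Parquet file ends with its metadata (which contains the schema), followed by a 4-byte little-endian footer length and the magic bytes `PAR1`. The sketch below is not Sparrow's code. It assumes a hypothetical `read_range(offset, length)` callable standing in for a ranged GET against an object store, and returns the raw metadata bytes using two small reads instead of a full download.

```python
import struct
from typing import Callable

PARQUET_MAGIC = b"PAR1"

def read_parquet_footer(size: int, read_range: Callable[[int, int], bytes]) -> bytes:
    """Return the raw (Thrift-encoded) footer metadata of a Parquet file.

    `size` is the object's total length in bytes. `read_range` is a
    hypothetical helper that returns `length` bytes starting at `offset`.
    """
    # Smallest possible file: leading magic (4) + footer length (4) + trailing magic (4).
    if size < 12:
        raise ValueError("too small to be a Parquet file")

    # First ranged read: the last 8 bytes hold the footer length and the magic.
    tail = read_range(size - 8, 8)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 trailer")
    footer_len = struct.unpack("<I", tail[:4])[0]
    if footer_len > size - 12:
        raise ValueError("footer length exceeds file size")

    # Second ranged read: only the footer itself, never the column data.
    return read_range(size - 8 - footer_len, footer_len)
```

Decoding the returned bytes into a schema would still require a Thrift/Parquet library; the point here is only that the number of bytes transferred depends on the footer size, not the file size.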

We should at least stop Wren from downloading files locally for this use case, and also consider removing all object store dependencies from Wren, so that Sparrow is the only component with object store credentials, etc.

epinzur commented 1 year ago

Wren does not download files locally for the FileService.

Currently, Wren only copies a file into the fileSystem owned by Wren (local or objectStore) on Table Load. The default behavior for this will be changed soon to leave the file in place.

epinzur commented 1 year ago

Wren also uses the object store to:

bjchambers commented 1 year ago

Ah. Great. So if I understand correctly, Wren uses object stores for:

  1. Signing output URLs before returning them, so the user can read them without credentials.
  2. Copying files into the "owned" object store on table load, if necessary.

With your change, item 2 is no longer the default. Item 1 could be conditionally disabled in cases where the user already has credentials for the file store (e.g., doesn't need a signed URL).
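A rough sketch of what conditionally disabling signing could look like is below. Everything in it is an assumption for illustration: the function name, the `caller_has_credentials` flag, and the HMAC-with-expiry scheme are hypothetical and are neither Wren's implementation nor the actual S3 or GCS signing protocols.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import urlencode

def output_url(
    url: str,
    caller_has_credentials: bool,
    secret: bytes,
    ttl_seconds: int = 3600,
    now: Optional[int] = None,
) -> str:
    """Return `url` unchanged or with a signature, depending on the caller.

    Hypothetical helper: a generic HMAC scheme stands in for a real
    object-store presigning call.
    """
    # Callers that can already read the store get the plain URL, so the
    # service needs no signing credentials on this path.
    if caller_has_credentials:
        return url

    # Otherwise attach an expiry time and an HMAC over the URL plus expiry.
    expires = (int(time.time()) if now is None else now) + ttl_seconds
    payload = f"{url}\n{expires}".encode()
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return f"{url}?{urlencode({'expires': expires, 'signature': signature})}"
```

With a flag like this, a deployment where every user has direct access to the file store could skip signing entirely, which removes one of the two remaining reasons for Wren to hold object store credentials.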

So it sounds like no work is needed here to avoid copying files to disk (beyond #472). We can keep this issue for tracking the remaining object store dependencies within Wren, but that can be handled later. Correct?

epinzur commented 1 year ago

That is correct. We can delay this work into the future.