kaskada-ai / kaskada

Modern, open-source event-processing
https://kaskada.io/
Apache License 2.0

feat: allow preparing & output to an object store when running locally in a notebook #484

Open · epinzur opened this issue 1 year ago

epinzur commented 1 year ago

Summary
If I'm running the Kaskada engine locally in a notebook and working with larger datasets, it would be better if I could use remote object storage for the prepare cache and for query results output. That way I wouldn't need to worry about filling up my local disk.

Is your feature request related to a problem? Please describe.
This is related to computing on large datasets (1+ TB) when the local storage available to my notebook is smaller than the dataset. This could come up when working on a local machine or from a hosted platform like Google Colab; the default local disk size for Google Colab is 80 GB.

Describe the solution you'd like
The manager and engine already support using remote object storage for the prepare cache and for query output.

The Python client should be updated to allow creating a local session with the relevant object-store ENV params specified on manager startup.
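
As an illustration only, here is a minimal sketch of what that could look like from the Python client. The environment variable names are hypothetical stand-ins (the real names come from the manager's configuration), and the LocalBuilder import path is an assumption about the legacy Python client:

```python
import os

from kaskada.api.session import LocalBuilder  # assumed import path

# Hypothetical variable names, used only for illustration; the real names
# are defined by the manager's configuration and may differ.
os.environ["OBJECT_STORE_TYPE"] = "s3"
os.environ["OBJECT_STORE_BUCKET"] = "my-bucket"
os.environ["OBJECT_STORE_PATH"] = "kaskada/"

# The locally spawned manager process would inherit the environment set above.
session = LocalBuilder().build()
```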

Describe alternatives you've considered

bjchambers commented 1 year ago

Two thoughts:

  1. I could see a case where the user wants to separate prepare and output, since they have different roles. Ideally these could be separate options.
  2. Why is it three environment variables? It seems like we've started moving towards specifying s3://<bucket>/<path> as a single option, allowing the user to provide all three with a single URL.

E.g., prepare_prefix_url='s3://<bucket>/prepare/' and output_prefix_url='file:///tmp/path/to/local/output' could be used to prepare to S3 but write output locally under /tmp.
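
A hedged sketch of how that proposed shape might look from the Python client. Neither option exists today; the builder method names (and the import path) are assumptions made up purely to illustrate one URL per role:

```python
from kaskada.api.session import LocalBuilder  # assumed import path

# Hypothetical builder options mirroring the URLs proposed above: prepared
# files go to S3, while query output stays on the local filesystem.
session = (
    LocalBuilder()
    .prepare_prefix_url("s3://my-bucket/prepare/")
    .output_prefix_url("file:///tmp/path/to/local/output")
    .build()
)
```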

bjchambers commented 1 year ago

Also, we'll probably eventually want an option for controlling where the snapshots (RocksDB) are written.

epinzur commented 1 year ago

Currently there is just "storage owned by Kaskada". This includes the prepare cache, RocksDB snapshots, output files, compute traces, etc.

I think separating this into multiple storage locations should be a separate issue.

bjchambers commented 1 year ago

I think it depends on whether it is possible for the user to specify this today (e.g., by passing extra arguments to Wren via the session builder). If so, we may want to defer making any API changes until we have a plan for what the API should be, and treat the extra arguments as a way to accomplish this in the meantime.
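
If such a pass-through does exist, it might look something like the sketch below. This is purely a hypothetical shape (the method name and flag spellings are not confirmed anywhere in this thread); it only illustrates "extra arguments via the session builder" as a stopgap:

```python
from kaskada.api.session import LocalBuilder  # assumed import path

# Hypothetical: forward extra startup flags to the manager (Wren) so the
# prepare cache can point at an object store without any new public API.
session = (
    LocalBuilder()
    .manager_args(["--object-store-type=s3", "--object-store-bucket=my-bucket"])
    .build()
)
```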

If API changes are necessary, then we should discuss further.