Open epinzur opened 1 year ago
Two thoughts:

1. We could accept a single URL of the form `s3://<bucket>/<path>`, allowing the user to provide all three settings with one option. E.g., `prepare_prefix_url='s3://<bucket>/prepare/'`, `output_prefix_url='file:///tmp/path/to/local/output'` could be used to prepare to S3 but write output locally to the `/tmp` directory.
2. We probably also eventually want an option for controlling where the snapshots (RocksDB) are written.
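To make the single-URL idea concrete, here is a minimal sketch of how such a prefix URL could be split back into the three settings the manager expects. The function name `parse_storage_url` and the returned keys are assumptions for illustration, not an existing client API:

```python
from urllib.parse import urlparse

def parse_storage_url(url: str) -> dict:
    """Split a storage URL like 's3://my-bucket/some/path' into the
    three settings (type, bucket, path) described in this issue.

    Hypothetical helper: the key names mirror the env params proposed
    below, but no such function exists in the client today.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("s3", "gs", "file"):
        raise ValueError(f"unsupported object store scheme: {parsed.scheme!r}")
    return {
        "OBJECT_STORE_TYPE": parsed.scheme,
        "OBJECT_STORE_BUCKET": parsed.netloc,
        "OBJECT_STORE_PATH": parsed.path.lstrip("/"),
    }

print(parse_storage_url("s3://my-bucket/prepare/"))
# {'OBJECT_STORE_TYPE': 's3', 'OBJECT_STORE_BUCKET': 'my-bucket', 'OBJECT_STORE_PATH': 'prepare/'}
```

One nice property of the URL form is that the scheme doubles as the store type, so `file://` URLs fall out naturally for the "prepare remotely, output locally" case.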
Currently there is just "storage owned by Kaskada". This includes the prepare cache, RocksDB snapshots, output files, compute traces, etc.
I think separating into multiple storage locations should be a separate issue.
I think it depends on whether it is possible for the user to specify this today (e.g., by passing extra arguments to Wren via the session builder). If that's the case, then we may want to defer making any API changes until we have a plan for what the API should be, and treat the extra arguments as a way to accomplish this in the meantime.
If API changes are necessary, then we should discuss further.
Summary

If I'm running the Kaskada engine locally in a notebook and working with larger datasets, it would be better if I could use remote object storage for the prepare cache and query results output. That way I wouldn't need to worry about filling up my local disk.
Is your feature request related to a problem? Please describe.

This is related to trying to compute on large datasets (1+ TB) when the available local storage for my notebook is smaller than the dataset size. This could be when working on a local machine or from a hosted platform like Google Colab. The default local disk size for Google Colab is 80 GB.
Describe the solution you'd like

The manager and engine already support using remote object storage for the prepare cache and query output storage. The Python client should be updated to allow creating a local session with the following ENV params specified on manager startup:

- OBJECT_STORE_TYPE: either s3 or gs
- OBJECT_STORE_BUCKET: the name of the bucket
- OBJECT_STORE_PATH: the path in the bucket to store all data in

Describe alternatives you've considered
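As a rough sketch of how a notebook user might set these params before starting a local session: the three variable names come from this issue, but the bucket/path values and the commented-out session-builder call are hypothetical placeholders for however the Python client actually launches the manager.

```python
import os

# Env params proposed in this issue; values here are illustrative only.
os.environ["OBJECT_STORE_TYPE"] = "s3"                  # "s3" or "gs"
os.environ["OBJECT_STORE_BUCKET"] = "my-kaskada-bucket" # hypothetical bucket
os.environ["OBJECT_STORE_PATH"] = "notebooks/session-1" # path within the bucket

# Hypothetical: start the local session after the env is configured, e.g.
# from kaskada.api.session import LocalBuilder
# session = LocalBuilder().build()
```

Because the manager reads these at startup, they would need to be set before the session is created rather than after.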