kaskada-ai / kaskada

Modern, open-source event-processing
https://kaskada.io/
Apache License 2.0

feat: Use object store and async, byte-range reads #465

Open bjchambers opened 1 year ago

bjchambers commented 1 year ago

Summary

Currently, compute uses the S3 client to download Parquet files before reading them. We have started a transition to object_store (https://docs.rs/object_store/latest/object_store/), which supports (a) reading from multiple object stores and (b) reading a byte range directly, without first fetching the file locally.

We should finish this migration to fully benefit from object_store.

bjchambers commented 1 year ago

Some of this may be done as part of building the new partitioned execution logic (as part of #409).

epinzur commented 1 year ago

I believe this work is complete for the getMetadata() and prepareData() methods, but still needs to be completed on the query execution and materialization code paths.

bjchambers commented 1 year ago

In general -- started working on this to allow operating on many and/or large files without filling up the disk. First PR(s) are ready for review.

@epinzur re getMetadata() and prepareData() -- it isn't really complete for them either. Specifically, they still rely on downloading the whole file. For getMetadata(), we should only need to fetch the bytes corresponding to the footer. For prepareData(), we should be able to use object_store to read bytes in chunks, never fetching the whole file. Similarly, object_store isn't used for uploading the file yet.
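
To make the footer-only and chunked reads concrete, here is a minimal sketch of the byte-range arithmetic, using only the Rust standard library. It relies on the Parquet file layout, where the last 8 bytes are a 4-byte little-endian metadata length followed by the magic `PAR1` (unencrypted footer). The function names (`footer_tail_range`, `metadata_range`, `chunk_ranges`) are hypothetical and are not taken from the Kaskada codebase; this is an illustration, not the actual implementation.

```rust
use std::ops::Range;

/// Length of the Parquet tail: 4-byte little-endian metadata length + 4-byte magic.
const FOOTER_TAIL_LEN: u64 = 8;
/// Magic bytes at the start and end of an unencrypted-footer Parquet file.
const PARQUET_MAGIC: &[u8; 4] = b"PAR1";

/// Byte range of the 8-byte tail, given the total object size.
/// Hypothetical helper: this is the first (tiny) ranged read to issue.
fn footer_tail_range(file_size: u64) -> Option<Range<u64>> {
    // Smallest plausible file: 4-byte header magic plus the 8-byte tail.
    if file_size < 12 {
        return None;
    }
    Some(file_size - FOOTER_TAIL_LEN..file_size)
}

/// Given the 8 tail bytes, compute the byte range of the serialized file metadata.
/// Returns None if the magic is wrong or the length does not fit in the file.
fn metadata_range(file_size: u64, tail: &[u8; 8]) -> Option<Range<u64>> {
    if tail[4..] != PARQUET_MAGIC[..] {
        return None;
    }
    let len = u64::from(u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]));
    let end = file_size.checked_sub(FOOTER_TAIL_LEN)?;
    let start = end.checked_sub(len)?;
    // The metadata cannot overlap the 4-byte magic at the start of the file.
    if start < 4 {
        return None;
    }
    Some(start..end)
}

/// Split [0, file_size) into consecutive ranges of at most `chunk_size` bytes,
/// so a file can be read piece by piece without ever holding all of it.
fn chunk_ranges(file_size: u64, chunk_size: u64) -> Vec<Range<u64>> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    let mut ranges = Vec::new();
    let mut start = 0u64;
    while start < file_size {
        let end = start.saturating_add(chunk_size).min(file_size);
        ranges.push(start..end);
        start = end;
    }
    ranges
}
```

Each computed range would then be passed to a ranged read against the store (object_store exposes one as `get_range`; the exact signature depends on the crate version). For metadata that means two small reads, the 8-byte tail and then the metadata range, instead of downloading the whole file.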

bjchambers commented 1 year ago

Capturing some links / thoughts: