Open bjchambers opened 1 year ago
Some of this may be done as part of building the new partitioned execution logic (as part of #409).
I believe this work is complete for the getMetadata() and prepareData() methods, but still needs to be completed on the query execution and materialization code paths.
In general -- I started working on this to allow operating on many and/or large files without filling up the disk. The first PR(s) are ready for review.
@epinzur re `getMetadata()` and `prepareData()` -- it isn't really complete for them either. Specifically, they still rely on downloading the whole file. For `getMetadata()`, we should only need to fetch the bytes corresponding to the footer; for `prepareData()`, we should be able to use object_store to read bytes in chunks, never fetching the whole thing. Similarly, object_store isn't used for uploading the file yet.
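To make the footer-only fetch concrete: a Parquet file ends with the serialized footer, a 4-byte little-endian footer length, and the `PAR1` magic. So `getMetadata()` only needs the last 8 bytes to learn how large the footer is, then one more ranged read for the footer itself. A minimal sketch of parsing that 8-byte tail (the function name `footer_length` is hypothetical, not from the codebase):

```rust
// A Parquet file ends with: [footer bytes][4-byte LE footer length]["PAR1"].
// Given the last 8 bytes of the file, return the footer length, or None if
// the magic doesn't match (i.e. the object is not a valid Parquet file).
fn footer_length(tail: &[u8; 8]) -> Option<u32> {
    if &tail[4..] != b"PAR1" {
        return None;
    }
    Some(u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]))
}

fn main() {
    // A tail claiming a 42-byte footer, followed by the magic.
    let tail = [0x2A, 0, 0, 0, b'P', b'A', b'R', b'1'];
    assert_eq!(footer_length(&tail), Some(42));

    // Wrong magic: not a Parquet file.
    assert_eq!(footer_length(&[0; 8]), None);
}
```

With the footer length in hand, metadata extraction needs exactly two ranged reads regardless of total file size.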
Capturing some links / thoughts:
## Summary
Currently, compute uses the S3 client to retrieve Parquet files before reading them. We have started a transition to using https://docs.rs/object_store/latest/object_store/ which supports (a) reading from multiple object stores and (b) doing a direct byte-range read without fetching the file locally first.
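The key capability is (b): the crate exposes an async ranged read (`get_range`) so callers fetch only the bytes they need, through one interface that works across S3, GCS, local files, etc. A minimal stand-in sketch of that shape over an in-memory buffer (the `ByteRangeRead` trait here is a hypothetical simplification, not the crate's actual API, which is async and addressed by object path):

```rust
use std::ops::Range;

// Hypothetical, simplified stand-in for object_store's ranged read: the point
// is that callers ask for a byte range instead of downloading the whole object.
trait ByteRangeRead {
    fn get_range(&self, range: Range<usize>) -> Vec<u8>;
}

// Back the trait with an in-memory "object" for illustration.
impl ByteRangeRead for Vec<u8> {
    fn get_range(&self, range: Range<usize>) -> Vec<u8> {
        self[range].to_vec()
    }
}

fn main() {
    let object: Vec<u8> = (0u8..100).collect();
    // Fetch only bytes 10..20 instead of all 100.
    let chunk = object.get_range(10..20);
    assert_eq!(chunk, (10u8..20).collect::<Vec<u8>>());
}
```

Reading in chunks this way is what lets `prepareData()` process arbitrarily large files without ever staging a full copy on local disk.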
We should finish up this migration to fully benefit from `object_store`:

- Use `ObjectStoreUrl` rather than `&str` or `String`
- Make the `key` method and `ObjectStoreCrate` private (#501)
- Convert the `URI` methods (https://github.com/kaskada-ai/kaskada/blob/main/wren/compute/helpers.go#L19-L23) (#503)
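On the first item: the win of a typed `ObjectStoreUrl` over bare `&str`/`String` is that validation happens once at the boundary, and every downstream function signature then guarantees it holds a parseable store URL. A rough sketch of the newtype pattern (the scheme list and method names here are illustrative assumptions, not the actual Kaskada definition):

```rust
/// Hypothetical sketch of a typed object-store URL. Construction validates the
/// scheme, so functions taking `ObjectStoreUrl` never see an arbitrary string.
#[derive(Debug, Clone, PartialEq)]
struct ObjectStoreUrl(String);

impl ObjectStoreUrl {
    fn parse(s: &str) -> Result<Self, String> {
        // Illustrative scheme check; the real type would do proper URL parsing.
        if ["s3://", "gs://", "file://"].iter().any(|p| s.starts_with(p)) {
            Ok(Self(s.to_string()))
        } else {
            Err(format!("unsupported object store scheme: {s}"))
        }
    }

    // Kept non-`pub`, matching the issue's goal of making `key` private.
    fn key(&self) -> &str {
        &self.0
    }
}

fn main() {
    let url = ObjectStoreUrl::parse("s3://bucket/path/file.parquet").unwrap();
    assert_eq!(url.key(), "s3://bucket/path/file.parquet");
    assert!(ObjectStoreUrl::parse("ftp://nope").is_err());
}
```

This also makes #501 natural: once callers hold an `ObjectStoreUrl`, the raw `key` accessor and the store-selection machinery can stay private to the module.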