Open lnicola opened 1 month ago
Hi, thanks for the suggestion. What exactly would be the feature description?
Being able to open e.g. Parquet datasets stored on S3 without downloading them first, or mounting the bucket with s3fs.
Opening a file from any remote source requires downloading it first, as the SQL engine requires the entire data frame at the time of registration. Anyway, I added your suggestion to the backlog, but I cannot promise anything any time soon.
> as the SQL engine requires the entire data frame at the time of registration
It does? But Arrow and Parquet are supposed to be great for random/remote access :confused:. DuckDB is relatively popular these days for that reason.
To the best of my knowledge, it does. However, downloading here does not necessarily mean writing to a file. The data could be downloaded to memory as well. In a later release, I am planning to add reading from stdin, which could help by piping the data from the remote file to Tabiew. Something like this:
curl -s <remote_file_url> | tw
@shshemi I think what @lnicola is referring to are optimizations like predicate pushdown: https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/. That way you can avoid loading the whole parquet file, even to memory.
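For readers unfamiliar with the term, here is a toy sketch of the idea behind predicate pushdown. It is not how Tabiew, Polars, or any Parquet reader is actually implemented. It only assumes what the linked post describes: a Parquet file keeps min/max statistics per row group, so a reader can decide from the metadata alone which row groups are worth fetching. `RowGroup` and `scan_greater_than` are hypothetical names made up for this illustration.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group: min/max statistics plus the values."""
    min_value: int
    max_value: int
    rows: list


def scan_greater_than(row_groups, threshold):
    """Evaluate `value > threshold`, skipping row groups the statistics rule out.

    Returns the matching values and the number of row groups actually read.
    """
    matches = []
    groups_read = 0
    for group in row_groups:
        # The statistics prove that no value in this group can match,
        # so a real reader would never fetch its bytes.
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.rows if v > threshold)
    return matches, groups_read


# Three hypothetical row groups covering 0..29.
groups = [
    RowGroup(0, 9, list(range(0, 10))),
    RowGroup(10, 19, list(range(10, 20))),
    RowGroup(20, 29, list(range(20, 30))),
]

rows, groups_read = scan_greater_than(groups, 24)
print(rows, groups_read)  # [25, 26, 27, 28, 29] 1
```

With a remote file, each skipped row group would be a byte range that never has to be requested, which is where the savings come from.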
I'm also referring to SQL LIMIT and OFFSET. If you're displaying a table or the result of a relatively simple query, you don't need to compute all the results -- just the ones that fit on the screen.
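A rough sketch of what I mean, using a plain Python generator as a stand-in for a lazy query result. `lazy_rows` and `viewport` are hypothetical helpers for illustration only, not anything Tabiew or Polars exposes.

```python
from itertools import islice


def lazy_rows(total, produced):
    """Hypothetical row source that records every row it is asked to produce."""
    for i in range(total):
        produced.append(i)
        yield {"id": i, "value": i * i}


def viewport(rows, offset, limit):
    """Rough equivalent of `LIMIT limit OFFSET offset` over a lazy row stream."""
    return list(islice(rows, offset, offset + limit))


produced = []
page = viewport(lazy_rows(1_000_000, produced), offset=40, limit=3)

print(page)           # the three rows with ids 40, 41 and 42
print(len(produced))  # 43: the remaining rows were never generated
```

A sequential stream still has to walk past the rows before the offset. A format with random access, such as Parquet with its row group metadata, could in principle jump much closer to the offset instead.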
This could be pretty useful, and I think the `opendal` crate would make it relatively easy to implement: https://crates.io/crates/opendal (unless `polars` can already do it, of course).