S3 etc. support? - Githubissues

shshemi / tabiew

A lightweight TUI application to view and query tabular data files, such as CSV, TSV, or parquet.

MIT License

541 stars, 14 forks

S3 etc. support? #15

Open lnicola opened 1 month ago

lnicola commented 1 month ago

This could be pretty useful, and I think opendal crate would make it relatively easy to implement: https://crates.io/crates/opendal (unless polars can already do it, of course).

shshemi commented 1 month ago

Hi, thanks for the suggestion. What exactly would the feature description be?

lnicola commented 1 month ago

Being able to open e.g. Parquet datasets stored on S3 without downloading them first or mounting the bucket with s3fs.

shshemi commented 1 month ago

Opening a file from any remote source requires downloading it first, as the SQL engine requires the entire data frame at the time of registration. Anyway, I have added your suggestion to the backlog, but I can't promise anything any time soon.

lnicola commented 1 month ago

> as the SQL engine requires the entire data frame at the time of registration

It does? But Arrow and Parquet are supposed to be great for random/remote access :confused:. DuckDB is relatively popular these days for that reason.

shshemi commented 1 month ago

To the best of my knowledge, it does. However, downloading here does not necessarily mean writing to a file; the data could be downloaded to memory as well. In a later release, I am planning to add reading from stdin, which could help by piping the data from the remote file to Tabiew. Something like this:

curl -s <remote_file_url> | tw
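To make "downloaded to memory" concrete, here is a minimal sketch of reading a whole CSV stream into memory without touching the disk. It is an illustration in Python, not Tabiew's code: `load_table` is a hypothetical helper, and an `io.StringIO` object stands in for the piped stdin.

```python
import csv
import io


def load_table(stream):
    """Read an entire CSV text stream into memory as (header, rows)."""
    reader = csv.reader(stream)
    header = next(reader, [])  # first line is treated as the header
    rows = list(reader)        # the remaining lines are fully materialized
    return header, rows


# Assumption: this buffer plays the role of data piped in by curl.
piped = io.StringIO("id,name\n1,alpha\n2,beta\n")
header, rows = load_table(piped)
```

In a real script, `sys.stdin` would be passed instead of the `StringIO` buffer. Note that the whole input is still held in memory, which is exactly the limitation discussed in this thread.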

muscar commented 1 month ago

@shshemi I think what @lnicola is referring to are optimizations like predicate pushdown: https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/. That way you can avoid loading the whole Parquet file, even into memory.
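A toy model of the idea, assuming each chunk of a file carries min/max statistics the way Parquet row groups do. The `RowGroup` class and `scan_greater_than` function are hypothetical names made up for this sketch; they are not part of Arrow, Parquet, or polars.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for a row group with min/max statistics."""
    min_value: int
    max_value: int
    rows: list


def scan_greater_than(row_groups, threshold):
    """Return (matching values, number of row groups actually read)."""
    matches = []
    groups_read = 0
    for group in row_groups:
        # The statistics prove that no value here can exceed the threshold,
        # so the group is skipped without looking at its rows.
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.rows if v > threshold)
    return matches, groups_read


groups = [
    RowGroup(0, 9, list(range(0, 10))),
    RowGroup(10, 19, list(range(10, 20))),
    RowGroup(20, 29, list(range(20, 30))),
]
result, groups_read = scan_greater_than(groups, 24)
```

Here only the last group is read. With a remote file, each skipped group would be a byte range that never has to be fetched.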

lnicola commented 1 month ago

I'm also referring to SQL LIMIT and OFFSET. If you're displaying a table or the result of a relatively simple query, you don't need to compute all the results -- just the ones that fit on the screen.
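A minimal sketch of that point, using a lazy Python generator in place of a query engine. The names `expensive_rows` and `visible_rows` are hypothetical, and the `computed` list exists only to show how many rows were actually produced.

```python
from itertools import islice

computed = []  # records which rows were actually produced


def expensive_rows():
    """Lazily produce an unbounded stream of (n, n squared) rows."""
    n = 0
    while True:
        computed.append(n)
        yield (n, n * n)
        n += 1


def visible_rows(rows, offset, limit):
    """Behave like LIMIT/OFFSET: skip `offset` rows, then take `limit` rows."""
    return list(islice(rows, offset, offset + limit))


page = visible_rows(expensive_rows(), offset=5, limit=3)
```

Although the source is unbounded, only the rows up to the end of the requested page are produced. In this sketch the skipped rows are still generated and discarded; an engine with indexed or columnar access could avoid even that.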