pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Improve public Rust API for `read_X` and `scan_X` functionality #8104

Open universalmind303 opened 1 year ago

universalmind303 commented 1 year ago

_Original title: read_X and scan_X macros for rust_

Problem description

The Rust library feels like it has a lot more "rough edges" than the Python lib, and it seems most of its design is oriented around serving the py-polars package. I think we could take some inspiration from the Python library and make a more user-friendly and intuitive library for Rust as well.

One common theme I've seen come up in issues and Discord conversations is that, unlike pandas, Polars favors a single consistent way of performing a task instead of many ways. This is reflected well in the Python code, but the Rust library seems to be the opposite, especially in the IO functions.

There is a lot of inconsistency in how DataFrames/LazyFrames are created in the Rust library. In Python, everything is pl.read_X(file_name, options) or pl.scan_X(file_name, options).

Most of the readers implement SerReader, so you can do ::new(buf).finish()?. But some extend that with a from_path or other custom factory methods, and many of them have slightly different signatures, as seen below.

// <P: Into<PathBuf>>(path: P)
CsvReader::from_path("test.csv")?.finish()?;

// <R: Read + Seek>(reader: R)
CsvReader::new(reader).finish()?;

// impl AsRef<Path>
LazyCsvReader::new("test").finish()?;

// <P: Into<PathBuf>>(path: P, args: ScanArgsParquet)
LazyFrame::scan_parquet("test.parquet".into(), ScanArgsParquet::default())?;

Some use *Options structs, some use *Args structs, others use builders, and others require all options to be specified as arguments. This makes it very unintuitive when working with different data sources, or even when going from eager to lazy.

Suggested

We would have to use macros to provide the same level of flexibility that is provided in the other languages.

Using macros, we can provide read and scan entry points that are very similar to the Python library:

polars::read_csv!(path);
polars::read_csv!(buf);
polars::read_csv!(path, sep = '-');
polars::read_csv!(path, has_header = true);
polars::read_csv!(path, has_header = true, sep = '-');
polars::scan_csv!(path);

polars::read_json!(path);

polars::read_ndjson!(path);
polars::scan_ndjson!(path);

polars::read_parquet!(path);
polars::scan_parquet!(path);

// and so on...
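A macro along these lines could be sketched with macro_rules!, folding optional named arguments into an options struct. This is an illustrative sketch only; CsvOptions and its fields are hypothetical stand-ins, not polars types.

```rust
// Hypothetical options struct standing in for whatever a real
// `read_csv!` would configure.
#[derive(Debug, Default, PartialEq)]
pub struct CsvOptions {
    pub has_header: bool,
    pub sep: char,
}

// Accept zero or more `name = value` pairs, starting from defaults.
macro_rules! csv_options {
    () => { CsvOptions::default() };
    ($($name:ident = $value:expr),+ $(,)?) => {{
        let mut opts = CsvOptions::default();
        $(opts.$name = $value;)+
        opts
    }};
}

fn main() {
    let defaults = csv_options!();
    let custom = csv_options!(has_header = true, sep = '-');
    println!("{defaults:?}");
    println!("{custom:?}");
}
```

A real read_csv!(path, sep = '-') would build such an options value and hand it, together with the path or buffer, to the underlying reader.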

Since polars is not yet at a stable version, I'd assume it would be safe to remove the existing readers from the public API and keep them as implementation details.

ritchie46 commented 1 year ago

Yes, I agree with this; I have also been thinking about it for a while.

The reason we can provide a much cleaner API in Python is duck typing/overloading, and the fact that we have an indirection between the user-facing API and the implementation (e.g. the polars crate).

I think we should add this indirection in the polars crate as well. How about adding a polars-api crate? We could then rename Series and DataFrame from core to SeriesCore and DataFrameCore to clearly indicate that they are lower level and should not be used directly by end users. Then, in polars-api, we can create wrapper structs and dispatch most calls to the query engine, as we do in Python.

Here we can then also create those scan macros. I believe we should be able to copy the Python structure and reference guide pretty much 1:1 that way.

stinodego commented 6 months ago

I don't think macros are the way to go here. I think we can look at the sink functionality for an example of how to do this nicely: an Options struct containing the various parameter settings, with a Default implementation. The same was suggested in https://github.com/pola-rs/polars/pull/8212#issuecomment-1600662799.
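The Options-struct pattern described here can be sketched in plain Rust using struct update syntax, so callers override only the fields they care about. CsvReadOptions and its fields below are hypothetical names for illustration, not the actual polars types.

```rust
// Hypothetical options struct with sensible defaults.
#[derive(Debug, Clone, PartialEq)]
pub struct CsvReadOptions {
    pub has_header: bool,
    pub separator: u8,
    pub n_rows: Option<usize>,
}

impl Default for CsvReadOptions {
    fn default() -> Self {
        Self {
            has_header: true,
            separator: b',',
            n_rows: None,
        }
    }
}

fn main() {
    // Struct update syntax: override one field, default the rest.
    let opts = CsvReadOptions {
        separator: b';',
        ..Default::default()
    };
    println!("{opts:?}");
}
```

A read_csv(path, CsvReadOptions::default()) style entry point would then give every format the same two-argument shape in both eager and lazy APIs.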

The public API for these functions definitely needs a makeover, though, so this issue is valid regardless. I would suggest a different solution, however. I'll update the title to reflect the core problem so we can discuss alternative solutions.

cartercanedy commented 2 weeks ago

I don't think macros are the way to go here. I think we can look at the sink functionality for an example of how to do this nicely: an Options struct containing the various parameter settings with a Default implementation.

I agree with the idea that enums are the way to go, handling the input specialization internally. Probably something that looks like this:

pub enum Input<'a> {
  Path(&'a std::path::Path),
  Buf(&'a mut dyn std::io::Read), // &mut: Read::read takes &mut self
  ...
}
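Expanded into something self-contained, that enum could dispatch like this. All names here are hypothetical sketches, not polars API; the reader variant holds &mut dyn Read because Read::read needs a mutable receiver.

```rust
use std::io::{Cursor, Read};
use std::path::Path;

// Hypothetical unified input type: either a filesystem path or an
// in-memory reader.
pub enum Input<'a> {
    Path(&'a Path),
    Reader(&'a mut dyn Read),
}

// Hypothetical entry point dispatching on the input kind. A real reader
// would return a DataFrame/LazyFrame; we just describe the input.
fn describe(input: Input) -> std::io::Result<String> {
    match input {
        Input::Path(p) => Ok(format!("would open {}", p.display())),
        Input::Reader(r) => {
            let mut buf = String::new();
            r.read_to_string(&mut buf)?;
            Ok(format!("read {} bytes from memory", buf.len()))
        }
    }
}

fn main() -> std::io::Result<()> {
    let mut cursor = Cursor::new("a,b\n1,2\n");
    println!("{}", describe(Input::Reader(&mut cursor))?);
    println!("{}", describe(Input::Path(Path::new("test.csv")))?);
    Ok(())
}
```

With such a type, the same scan_csv entry point could accept either a path or an in-memory buffer, which is exactly the friction point for the wasm use case below.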

I think the lazy API would benefit greatly from this. From what I can tell, the only way to construct a LazyFrame from an in-memory buffer is to first construct a DataFrame yourself: call the SerReader::new(buf) impl for whatever schema the buffered data has, then call SerReader::finish(self) to eagerly construct a DataFrame, and finally get a LazyFrame via the IntoLazy::lazy(self) impl.

I have to say that reducing the friction of using an in-memory buffer would be a big boon for embedded/wasm usability. My current use case is wasm (the same applies to embedded environments), where there is no filesystem and I really want to limit the memory footprint as much as possible.