pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Nice project #9

Closed alippai closed 2 years ago

alippai commented 4 years ago

This is a pretty interesting project, and it shows how powerful things can be built using Rust + Arrow. A few basic questions regarding your future plans:

  1. There is some overlap across the Polars, ndarray and DataFusion projects. E.g. sum() is implemented by all of them. Do you plan to converge in the long run, or is this the intended level of abstraction and separation?
  2. Compared to Pandas, will polars run multi-threaded, using all the cores, e.g. via rayon? Like Modin or Dask.
  3. Do you plan to create a Python API (e.g. pyo3)?
ritchie46 commented 4 years ago

Thanks! :)

  1. I've chosen Arrow as a backend, and want to utilize their implementation (e.g. Sum) as much as possible. I also could've opted for ndarray, indeed. But then we would need to solve the same pitfalls that pandas has encountered (like casting integer arrays to floats to be able to use NaN as null values). AFAIK DataFusion has the goal to be a distributed system (much like Apache Spark). There will probably be some overlap indeed.
  2. Yes, wherever multithreading seems to be a good fit, I will try to use that. At the moment the apply in the groupby operation is multithreaded with rayon.
  3. I probably will eventually. I hope to be able to make zero-copy transfers from Arrow memory to numpy. At the moment, the focus is Rust.
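
The zero-copy idea in point 3 can be illustrated with a stdlib-only toy. This is not polars or Arrow code, just a sketch of what "zero copy" means: two views sharing one underlying buffer, so the "transfer" moves no bytes.

```python
# Toy illustration of zero-copy sharing (stdlib only; not polars' mechanism).
from array import array

buf = array("q", [1, 2, 3, 4])  # contiguous int64 buffer, like an Arrow array
view = memoryview(buf)          # a zero-copy view over the same memory

buf[0] = 42                     # mutate through the original buffer...
assert view[0] == 42            # ...and the view sees it: no bytes were copied
```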
alippai commented 4 years ago

I see that the Python API is already WIP, you are fast :))

Since you mentioned that you already use rayon: are the pandas vs. polars benchmarks (groupby, join) run using multiple cores? I.e., is it, say, a 1-core vs. 4-core comparison in the README now?

ritchie46 commented 4 years ago

Yes, the benchmarks are indeed run on multiple cores. However, not everything is parallelizable, and the join and groupby algorithms are largely single-threaded. For groupby the obvious parallelization is the apply part. For the joins, I could only parallelize the selection of the rows after all the join tuples were computed.
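
The "parallelize the apply part" idea can be sketched in a few lines. This is a toy Python analogue for illustration only; polars does this in Rust with rayon. The point is that once rows are grouped, each group's function call is independent:

```python
# Toy sketch of a parallel groupby-apply (illustrative only).
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def groupby_apply(keys, values, func):
    # group values by key (this part is sequential)
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    # the apply step is embarrassingly parallel: one independent task per group
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(func, groups.values()))
    return dict(zip(groups.keys(), results))

out = groupby_apply(["a", "b", "a"], [1, 2, 3], sum)
# out == {"a": 4, "b": 2}
```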

B.t.w. you can already try out the Python bindings (though beware, they are probably still very buggy). If you run docker run --rm -p 8890:8890 ritchie46/py-polars, it will start a Jupyter notebook server for you. The Python public API can be found here: https://github.com/ritchie46/polars/tree/master/py-polars/polars

cvonsteg commented 3 years ago

Hi @ritchie46 - thought I'd jump on this thread. Firstly, I wanted to echo @alippai 's sentiments in saying that this looks like a really interesting/promising project! Are you keen to have other contributors, or would you prefer users just to raise Issues? If you'd like other contributors, it'd be great to get some more info on the Issues, e.g. reproducible errors, or more detailed info on the scope and acceptance criteria.

Keep up the great work!

ritchie46 commented 3 years ago

@cvonsteg thank you :). I'd prefer to start with issues and, when needed, some discussion, but people are free to implement these issues. It's good that you mention this. It encourages me to formalize things a bit and create a contribution guide.

cvonsteg commented 3 years ago

@ritchie46 - great, thanks for letting me know. Also, kudos on getting a contribution guide up and running so quickly! For now, I'll play with the project some more and start trying to identify bugs, improvements, and new features.

alippai commented 3 years ago

Speaking of reusing the compute kernels, will the improvements like https://github.com/apache/arrow/commit/0100121f92299d68b348206288f12c43c44110e4 automatically improve polars when Arrow 2.0 is released (and the polars dependency updated)?

ritchie46 commented 3 years ago

> @ritchie46 - great, thanks for letting me know. Also, kudos on getting a contribution guide up and running so quickly! For now, I'll play with the project some more and start trying to identify bugs, improvements, and new features.

Great. I am currently wrapping up the initial lazy API. This postpones your query until you actually require data and then tries to optimize the query. This optimization can be flawed though, so any feedback or bug reports on that part of the API is very much appreciated.
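
The idea behind a lazy API (record operations without executing them, then optimize and run the whole plan only when data is actually required) can be sketched with a toy in plain Python. This is not polars' API or implementation, just an illustration of the concept:

```python
# Toy lazy pipeline: operations are recorded, not executed, until .collect().
class LazyFrame:
    def __init__(self, rows, ops=()):
        self._rows, self._ops = rows, tuple(ops)

    def filter(self, pred):
        # returns a new plan; nothing runs yet
        return LazyFrame(self._rows, self._ops + (("filter", pred),))

    def select(self, func):
        return LazyFrame(self._rows, self._ops + (("select", func),))

    def collect(self):
        # a real engine would optimize the plan here (e.g. push filters down
        # before expensive projections); this toy just replays it in order
        rows = self._rows
        for kind, f in self._ops:
            rows = [r for r in rows if f(r)] if kind == "filter" else [f(r) for r in rows]
        return rows

lf = LazyFrame([1, 2, 3, 4]).filter(lambda x: x % 2 == 0).select(lambda x: x * 10)
# nothing has run yet; collect() materializes the result
assert lf.collect() == [20, 40]
```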

ritchie46 commented 3 years ago

> Speaking of reusing the compute kernels, will the improvements like apache/arrow@0100121 automatically improve polars when Arrow 2.0 is released (and the polars dependency updated)?

Definitely! Polars uses the arrow compute kernels whenever the chunks of the Series/ChunkedArray align (which is most of the time). Here is a snippet of the Add trait impl.

    fn add(self, rhs: Self) -> Self::Output {
        // arrow simd path
        if self.chunk_id == rhs.chunk_id {
            let expect_str = "Could not add, check data types and length";
            operand_on_primitive_arr![self, rhs, compute::add, expect_str]
        // broadcasting and fast path
        } else if rhs.len() == 1 {
            let opt_rhs = rhs.get(0);
            match opt_rhs {
                None => ChunkedArray::full_null(self.name(), self.len()),
                Some(rhs) => self.apply(|val| val + rhs),
            }
        // slower path
        } else {
            apply_operand_on_chunkedarray_by_iter!(self, rhs, +)
        }
    }
rubyfin commented 3 years ago

I love what you have done so far with this project. It looks very interesting and promising, and it is incredible to have this type of functionality being developed in the Rust ecosystem. I have a few questions / suggestions, if you don't mind:

  1. What is your attitude towards stable vs. nightly Rust? Are you building this library targeting PROD usage, or are you building it as a research project to explore Rust? Currently, as your project relies on Apache Arrow, it looks like you have to use nightly. Is the plan to use nightly until Apache Arrow Rust supports stable, and then migrate? It appears that there is a ticket to support stable; the only remaining dependency is portable SIMD: https://issues.apache.org/jira/browse/ARROW-6717, which I think might take some time.
  2. What are your general goals for this project? E.g. do you plan to build a faster pandas, or a more efficient and feature complete dask, or a mini Spark?
  3. As a long-time user of similar libraries in other languages, I use some functions a lot more often than others (you covered a lot of them already, it seems :). Just an idea: I think it could be useful to create polls/voting for which features users would want to see - I think it would generate some ideas for the roadmap and lead to increased adoption.

Thank you for the hard work!

ritchie46 commented 3 years ago

Hi @rubyfin,

Thanks, and good to hear you like it. Let me address your questions:

  1. This library is targeting production use. With regard to nightly features: the only nightly feature in Arrow, and therefore in Polars, is SIMD. SIMD is opt-in, so you only need nightly if you want maximal performance.
  2. The goals and non-goals are listed in the polars-book. To summarize: the goal of Polars is to be a blazingly fast in-memory DataFrame library for data that fits on a single machine (a server with up to 250GB of RAM or so). API-wise it sits somewhere between pandas and Spark. I want to encourage you to use the lazy API as much as possible (it allows for more optimizations), but the eager API, which is similar to pandas, is a lower-level entry point and often easier to use. Polars does not aim to mimic Spark in the sense of distributed compute.
  3. Feature requests are more than welcome. I've added the features that I use often, but ironically, I am not that big a DataFrame user, so I might have a blind spot for a lot of features.
rubyfin commented 3 years ago

> Hi @rubyfin,
>
> Thanks, and good to hear you like it. Let me address your questions:
>
>   1. This library is targeting production use. With regard to nightly features. The only nightly feature in Arrow and therefore Polars is SIMD. SIMD is an opt-in, so if you want maximal performance, you need nightly.
>   2. The goals and non-goals are added to the polars-book. To summarize; the goal of polars is being a blazing fast in-memory DataFrame library for data that fits on a single machine (server up to 250GB RAM or so). With regard to the API it is somewhat in the middle of pandas and spark. As I do want to encourage you to use the lazy API as much as possible (allowing for more optimizations), but the eager API that is similar to pandas is a more low-level entry and often easier to use. Polars does not want to mimic spark in the sense of distributed compute.
>   3. Feature requests are more than welcome. I've added the features that I use often, but ironically, I am not that big a DataFrame user, so I might have a blind spot for a lot of features.

Thanks for the replies - I will definitely try contributing.

CMCDragonkai commented 2 years ago

@ritchie46, if I want to use Polars in a distributed execution context (multiple machines), how should I do this? With pandas/numpy, I have dask, but what about Polars?

KlausGlueckert commented 1 year ago

@CMCDragonkai I think it's meant to be only for single machine use (in memory)

danielgafni commented 1 year ago

@ritchie46 could you please share your thoughts on adding distributed mode to polars?

As I understand it, you are not currently planning this; would you mind explaining why? Is it just too complicated? Are there other reasons? Could this happen in the distant future?

Honestly distributed Polars would be just amazing!

ritchie46 commented 1 year ago

> @ritchie46 could you please share your thoughts on adding distributed mode to polars?
>
> As I understand you are currently not planning this, would you mind explaining why? Is it just too complicated? Are there other reasons? Can this happen in the distant future?
>
> Honestly distributed Polars would be just amazing!

All things in their time. First, more plans for single-node Polars. 🙂

danielgafni commented 1 year ago

This sounds a little... promising! Thanks Ritchie, have a good day :)