Closed alippai closed 2 years ago
Thanks! :)
I see that the Python API is already WIP, you are fast :))
As you mentioned that you already use rayon: are the pandas vs. polars benchmarks (groupby, join) run using multiple cores? So is it, e.g., a 1-core vs. 4-core comparison in the README now?
Yes, the benchmarks do run on multiple cores. However, not everything is parallelizable, and the join and groupby algorithms are largely single-threaded. For groupby, the obvious parallelization is the apply part. For the joins, I could only parallelize the selection of the rows after all the join tuples were computed.
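The row-selection step described above can be sketched roughly like this. This is a toy illustration using only the standard library, not Polars' actual code: `parallel_take` is a hypothetical helper that splits the join-result indices across threads and gathers the matching rows in parallel.

```rust
use std::thread;

// Hypothetical sketch (not Polars internals): after the join tuples are
// computed, gathering ("taking") the matching rows is embarrassingly
// parallel, so the index list can be split across threads.
fn parallel_take(values: &[i64], indices: &[usize], n_threads: usize) -> Vec<i64> {
    // Split indices into roughly equal chunks, one per thread.
    let chunk_len = ((indices.len() + n_threads - 1) / n_threads).max(1);
    let mut out = Vec::with_capacity(indices.len());
    // Scoped threads may borrow `values` and `indices` without 'static bounds.
    thread::scope(|s| {
        let handles: Vec<_> = indices
            .chunks(chunk_len)
            .map(|chunk| {
                s.spawn(move || chunk.iter().map(|&i| values[i]).collect::<Vec<i64>>())
            })
            .collect();
        // Concatenate the per-thread results in order.
        for h in handles {
            out.extend(h.join().unwrap());
        }
    });
    out
}
```

In the real library this would be done with rayon rather than raw threads, but the shape of the parallelism is the same: the sequential part is computing the join tuples, the parallel part is materializing the rows.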
By the way, you can already try out the Python bindings (though beware, they are probably still very buggy). If you run

```
docker run --rm -p 8890:8890 ritchie46/py-polars
```

it will start a Jupyter notebook server for you. The Python public API can be found here: https://github.com/ritchie46/polars/tree/master/py-polars/polars
Hi @ritchie46 - thought I'd jump on this thread. Firstly, I wanted to echo @alippai's sentiments in saying that this looks like a really interesting/promising project! Are you keen to have other contributors, or would you prefer users just to raise issues? If you'd like other contributors, it'd be great to have more info on the issues, e.g. reproducible errors, or more detail on scope and acceptance criteria.
Keep up the great work!
@cvonsteg thank you :). I'd prefer to start with issues and, when needed, some discussion, but people are free to implement these issues. It's good that you mention this. It encourages me to formalize things a bit and create a contribution guide.
@ritchie46 - great, thanks for letting me know. Also, kudos on getting a contribution guide up and running so quickly! For now, I'll play with the project some more and start trying to identify bugs, improvements, and new features.
Speaking of reusing the compute kernels: will improvements like https://github.com/apache/arrow/commit/0100121f92299d68b348206288f12c43c44110e4 automatically improve polars when Arrow 2.0 is released (and the polars dependency is updated)?
> @ritchie46 - great, thanks for letting me know. Also, kudos on getting a contribution guide up and running so quickly! For now, I'll play with the project some more and start trying to identify bugs, improvements, and new features.
Great. I am currently wrapping up the initial lazy API. This postpones your query until you actually require the data, and then tries to optimize the query. This optimization can be flawed, though, so any feedback or bug reports on that part of the API are very much appreciated.
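The core idea of "postponing the query" can be illustrated with a toy sketch. This is not Polars' real API, just a minimal hypothetical `LazyVec` type: operations are recorded into a plan instead of being executed, and nothing runs until `collect()` is called, which is the point where an engine could reorder or fuse the plan.

```rust
// Toy illustration of lazy evaluation (hypothetical, not the Polars API):
// each call records an operation; work only happens on collect().
struct LazyVec {
    data: Vec<i64>,
    plan: Vec<Box<dyn Fn(Vec<i64>) -> Vec<i64>>>,
}

impl LazyVec {
    fn new(data: Vec<i64>) -> Self {
        LazyVec { data, plan: Vec::new() }
    }

    // Record a filter; nothing is evaluated yet.
    fn filter(mut self, pred: impl Fn(&i64) -> bool + 'static) -> Self {
        self.plan
            .push(Box::new(move |v| v.into_iter().filter(|x| pred(x)).collect()));
        self
    }

    // Record an element-wise transformation; still nothing is evaluated.
    fn map(mut self, f: impl Fn(i64) -> i64 + 'static) -> Self {
        self.plan
            .push(Box::new(move |v| v.into_iter().map(&f).collect()));
        self
    }

    // Only here does the recorded plan actually run, in order.
    fn collect(self) -> Vec<i64> {
        self.plan.into_iter().fold(self.data, |acc, op| op(acc))
    }
}
```

Because the whole plan is visible before anything executes, an optimizer can, for example, push a filter ahead of an expensive projection so less data flows through it.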
> Speaking of reusing the compute kernels, will the improvements like apache/arrow@0100121 automatically improve polars when Arrow 2.0 is released (and the polars dependency updated)?
Definitely! Polars uses the Arrow compute kernels whenever the chunks of the Series/ChunkedArray align (which is most of the time). Here is a snippet of the `Add` trait impl.
```rust
fn add(self, rhs: Self) -> Self::Output {
    // arrow simd path
    if self.chunk_id == rhs.chunk_id {
        let expect_str = "Could not add, check data types and length";
        operand_on_primitive_arr![self, rhs, compute::add, expect_str]
    // broadcasting and fast path
    } else if rhs.len() == 1 {
        let opt_rhs = rhs.get(0);
        match opt_rhs {
            None => ChunkedArray::full_null(self.name(), self.len()),
            Some(rhs) => self.apply(|val| val + rhs),
        }
    // slower path
    } else {
        apply_operand_on_chunkedarray_by_iter!(self, rhs, +)
    }
}
```
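The dispatch logic in that snippet can be mirrored with plain slices to make the paths concrete. This is a hypothetical, simplified illustration (`add_like` is not a Polars function, and the real fast path calls Arrow's SIMD kernel rather than a scalar loop):

```rust
// Simplified illustration of the dispatch above (hypothetical helper):
// broadcast when rhs has length 1, element-wise when lengths match.
fn add_like(lhs: &[i64], rhs: &[i64]) -> Option<Vec<i64>> {
    if rhs.len() == 1 {
        // broadcasting fast path: add the single rhs value to every element
        let r = rhs[0];
        Some(lhs.iter().map(|v| v + r).collect())
    } else if lhs.len() == rhs.len() {
        // element-wise path (the real code hits the Arrow kernel here)
        Some(lhs.iter().zip(rhs).map(|(a, b)| a + b).collect())
    } else {
        // incompatible lengths
        None
    }
}
```

The broadcasting branch matters because materializing a full-length rhs just to add a scalar would waste both memory and the chance to use the cheap single-pass loop.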
I love what you have done so far with this project. It looks very interesting and promising, and it is incredible to have this type of functionality being developed in the Rust ecosystem. I have a few questions/suggestions, if you don't mind:
Thank you for the hard work!
Hi @rubyfin,
Thanks, and good to hear you like it. Let me address your questions:
- This library is targeting production use. Regarding nightly features: the only nightly feature in Arrow, and therefore in Polars, is SIMD. SIMD is opt-in, so if you want maximal performance, you need nightly.
- The goals and non-goals are described in the polars-book. To summarize: the goal of Polars is to be a blazing fast in-memory DataFrame library for data that fits on a single machine (a server with up to 250 GB of RAM or so). API-wise, it sits somewhere between pandas and Spark. I want to encourage you to use the lazy API as much as possible (allowing for more optimizations), but the eager API, which is similar to pandas, is a more low-level entry point and often easier to use. Polars does not aim to mimic Spark in the sense of distributed compute.
- Feature requests are more than welcome. I've added the features that I use often, but ironically, I am not that big a DataFrame user, so I might have a blind spot for a lot of features.
Thanks for the replies - I will definitely try contributing.
@ritchie46, if I want to use Polars in a distributed execution context (multiple machines), how should I do this? With pandas/numpy I have Dask, but what about Polars?
@CMCDragonkai I think it's meant to be only for single machine use (in memory)
@ritchie46 could you please share your thoughts on adding distributed mode to polars?
As I understand you are currently not planning this, would you mind explaining why? Is it just too complicated? Are there other reasons? Can this happen in the distant future?
Honestly distributed Polars would be just amazing!
All things at their time. First more plans on single node polars. 🙂
This sounds a little... promising! Thanks Ritchie, have a good day :)
This is a pretty interesting project; it shows how powerful things can be built using Rust + Arrow. A few basic questions regarding your future plans: