rust-dataframe / discussion

WIP Roundup #4

Open jblondin opened 5 years ago

jblondin commented 5 years ago

I'm starting to work on a 'WIP Roundup' document which will provide an initial introduction to and analysis of the existing DataFrame WIPs.

Currently, I'm looking at:

Are there any more we'd like to add to this list?

jblondin commented 5 years ago

I've spent some time going through the libraries I mentioned and have written up a quick introduction to and analysis of each. I've uploaded it as PR #5. You can also view it rendered here.

Basically, the TL;DR is: none of the libraries really cover (or would be able to cover) all the use cases we mention in #3, but I feel like we should be able to come up with a design that hits everything we're looking for by pulling ideas from each.

milesgranger commented 5 years ago

Nice work! Thank you for taking the time to go through all of them!

I'm the maintainer of black-jack, and the DataFrame side of things does indeed need more love. I've focused a lot on the functionality of Series to start with, and now, with your write-up, I'm a bit confused myself about why it doesn't support arbitrary data types...

Regarding the statement:

When accessing data, the type must be known by the user, and providing an incorrect type will result in a panic.

Can you point me to where this is? I interpreted this as attempting to get a column from the dataframe without knowing its type. If this is the case, it's not correct. DataFrame::get_column returns an Option which will be None if the type is wrong, name doesn't exist, or both.
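To illustrate the behaviour described above, here is a minimal sketch of a typed column lookup that returns an Option. It is not black-jack's actual implementation: the `Frame` type, its `add_column` method, and the `sample_frame` helper are hypothetical names invented for this example, and storing columns as type-erased `Box<dyn Any>` values is an assumption made only to keep the sketch short.

```rust
use std::any::Any;
use std::collections::HashMap;

/// Hypothetical, simplified frame (an assumption for illustration):
/// each column is stored as a type-erased `Vec<T>`.
struct Frame {
    columns: HashMap<String, Box<dyn Any>>,
}

impl Frame {
    fn new() -> Self {
        Frame { columns: HashMap::new() }
    }

    /// Store a column of any `'static` element type under `name`.
    fn add_column<T: 'static>(&mut self, name: &str, values: Vec<T>) {
        self.columns.insert(name.to_string(), Box::new(values));
    }

    /// Returns `None` if the name does not exist, if the stored column
    /// is not a `Vec<T>`, or both. It never panics.
    fn get_column<T: 'static>(&self, name: &str) -> Option<&Vec<T>> {
        self.columns.get(name)?.downcast_ref::<Vec<T>>()
    }
}

/// Hypothetical sample data used by the checks below.
fn sample_frame() -> Frame {
    let mut df = Frame::new();
    df.add_column("age", vec![31_i32, 45]);
    df
}

fn main() {
    let df = sample_frame();

    // Right name, right type: the column is returned.
    assert_eq!(df.get_column::<i32>("age"), Some(&vec![31, 45]));

    // Right name, wrong type: None rather than a panic.
    assert!(df.get_column::<f64>("age").is_none());

    // Unknown name: also None.
    assert!(df.get_column::<i32>("height").is_none());
}
```

With a signature like this, the caller decides how to handle a wrong type or a missing name; the lookup itself cannot panic.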

Anyhow, good work again. I'm excited for how we can combine our efforts! :+1:

jblondin commented 5 years ago

DataFrame::get_column returns an Option which will be None if the type is wrong, name doesn't exist, or both.

You are correct; I was looking at the groupby implementation that does an unwrap() that would fail if you're using the wrong <T>. I'll update the document to clarify!
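The difference can be sketched as follows. This is not the groupby code from black-jack; the `Columns` alias and the `get_column`, `sum_unchecked`, `sum_checked`, and `sample_columns` functions are hypothetical stand-ins, assuming only an Option-returning typed lookup. The sketch shows how an `unwrap()` inside an aggregation turns a wrong `<T>` into a panic, while propagating the Option keeps the failure recoverable.

```rust
use std::any::Any;
use std::collections::HashMap;
use std::iter::Sum;
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Hypothetical column store (an assumption): name -> type-erased `Vec<T>`.
type Columns = HashMap<String, Box<dyn Any>>;

/// Option-returning typed lookup: `None` on unknown name or wrong `T`.
fn get_column<'a, T: 'static>(columns: &'a Columns, name: &str) -> Option<&'a Vec<T>> {
    columns.get(name)?.downcast_ref::<Vec<T>>()
}

/// Aggregation that unwraps internally: panics if `T` is wrong
/// or the column does not exist.
fn sum_unchecked<T>(columns: &Columns, name: &str) -> T
where
    T: 'static + Copy + Sum<T>,
{
    get_column::<T>(columns, name).unwrap().iter().copied().sum()
}

/// Same aggregation, but the lookup failure is passed on to the caller.
fn sum_checked<T>(columns: &Columns, name: &str) -> Option<T>
where
    T: 'static + Copy + Sum<T>,
{
    Some(get_column::<T>(columns, name)?.iter().copied().sum())
}

/// Hypothetical sample data: one i64 column named "n".
fn sample_columns() -> Columns {
    let mut columns: Columns = HashMap::new();
    columns.insert("n".to_string(), Box::new(vec![1_i64, 2, 3]));
    columns
}

fn main() {
    let columns = sample_columns();

    // Correct type: both variants agree.
    assert_eq!(sum_unchecked::<i64>(&columns, "n"), 6);
    assert_eq!(sum_checked::<i64>(&columns, "n"), Some(6));

    // Wrong type: the checked variant reports None...
    assert_eq!(sum_checked::<f64>(&columns, "n"), None);

    // ...while the unwrapping variant panics (caught here only to show it).
    let result = catch_unwind(AssertUnwindSafe(|| sum_unchecked::<f64>(&columns, "n")));
    assert!(result.is_err());
}
```

Returning a `Result` with a descriptive error would be another reasonable choice; `Option` is used here only to mirror the lookup's own return type.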

milesgranger commented 5 years ago

Ah ok, thanks! That should be addressed anyway. Thanks again!

paddyhoran commented 5 years ago

It might be of interest to this group that the C++ community within Apache Arrow is kicking off a data frame project within the larger Arrow project.

Some of the leaders in the Arrow C++ community have significant experience building such libraries, so we could build our data frame solution within Arrow and leverage this knowledge.

Just a thought...

davidB commented 5 years ago

Hi @jblondin,

Great work. Could you add to the page which Rust channel (stable or nightly) each library requires? For me, requiring nightly is a "no-go" for using a crate.

saritseal commented 4 years ago

The main issue with the pandas DataFrame is that it is not distributed. It is a great library for a single node. Anaconda did try to make it distributed through Dask, but adoption is still quite poor.