rust-dataframe / discussion

Use the issues for discussion

Welcome! #1

Open hwchen opened 5 years ago

hwchen commented 5 years ago

Hi! I’m excited to begin discussion of strategies for implementing a dataframe.

I imagine this repo as the main archive of discussions, with perhaps a discord channel for real-time chat.

I think discussion can be in the issues for now. We could do something more formal eventually, whether a wiki or md files, if we want to crystallize some directions.

Some topics I’m interested in:

LukeMathWalker commented 5 years ago

Hello!

I want to get back here to lay down some thoughts, but I thought it would be interesting as well to collect the scattered pieces I have seen floating around in my corner of Rust about dataframes:

nevi-me commented 5 years ago

Hi! I have https://github.com/nevi-me/rust-dataframe in addition to what @LukeMathWalker mentioned.

LukeMathWalker commented 5 years ago

Another interesting conversation concerning DataFrames: https://github.com/rust-ndarray/ndarray/issues/539

LukeMathWalker commented 5 years ago

Other food for thought on columnar storage: https://www.reddit.com/r/rust/comments/afo4ln/exploring_columnoriented_data_in_rust_with_frunk/
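To make the column-oriented idea concrete, here is a toy illustration in plain std Rust (not from frunk or any crate; all names are hypothetical) of the same records stored row-wise versus column-wise:

```rust
// Row-oriented: each record is contiguous in memory.
struct RowRecord {
    id: u32,
    value: f64,
}

// Column-oriented (struct-of-arrays): each field is its own
// contiguous column, which is cache-friendly for whole-column scans.
struct ColumnFrame {
    ids: Vec<u32>,
    values: Vec<f64>,
}

impl ColumnFrame {
    fn from_rows(rows: &[RowRecord]) -> Self {
        ColumnFrame {
            ids: rows.iter().map(|r| r.id).collect(),
            values: rows.iter().map(|r| r.value).collect(),
        }
    }

    // A whole-column aggregate touches only one contiguous buffer.
    fn sum_values(&self) -> f64 {
        self.values.iter().sum()
    }
}

fn main() {
    let rows = vec![
        RowRecord { id: 1, value: 1.5 },
        RowRecord { id: 2, value: 2.5 },
    ];
    let frame = ColumnFrame::from_rows(&rows);
    println!("{}", frame.sum_values()); // prints 4
}
```

Columnar layouts like this are the core of both Arrow and the frunk-based experiment in the linked post.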

hwchen commented 5 years ago

@LukeMathWalker thanks for continuing to add references. I'd like to start putting together a document with these sources and perhaps some commentary, like an annotated bibliography.

galuhsahid commented 5 years ago

Might be interesting to see Go's approach to this: https://github.com/go-gota/gota

jblondin commented 5 years ago

Hi everyone! I just wanted to mention my crate that I've been working on lately: https://github.com/jblondin/agnes.

I guess I should be on reddit more, since it's pretty similar to this (and I originally based the structure on frunk's HLists):

> Other food for thought on columnar storage: https://www.reddit.com/r/rust/comments/afo4ln/exploring_columnoriented_data_in_rust_with_frunk/

It's still early code, and I've kinda been working on it in a one-person echo chamber (never a great idea -- my cats are decent debuggers but horrible at calling out bad design decisions), but I think it has some potential. It is typesafe (columns are referred to by unit-like marker structs which are associated with that column's data type), avoids copies as much as possible, and has basic join, print, iteration, and serialization functionality. I wrote a user guide here. I probably need to write up a design document as well.
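The marker-struct idea described above can be sketched in a few lines of plain Rust. This is a minimal illustration of the general technique, not agnes's actual API; every name here is hypothetical:

```rust
use std::marker::PhantomData;

// Each column is identified by a zero-sized marker type that
// carries the column's element type at compile time.
trait Column {
    type Datum;
    const NAME: &'static str;
}

struct Age;
impl Column for Age {
    type Datum = u32;
    const NAME: &'static str = "age";
}

struct Name;
impl Column for Name {
    type Datum = String;
    const NAME: &'static str = "name";
}

// A typed column of data, tagged with its marker.
struct TypedCol<C: Column> {
    data: Vec<C::Datum>,
    _marker: PhantomData<C>,
}

impl<C: Column> TypedCol<C> {
    fn new(data: Vec<C::Datum>) -> Self {
        TypedCol { data, _marker: PhantomData }
    }
}

// Fields are typed columns: asking for the wrong element type is
// a compile error, not a runtime panic on a string key.
struct Frame {
    ages: TypedCol<Age>,
    names: TypedCol<Name>,
}

fn main() {
    let frame = Frame {
        ages: TypedCol::new(vec![30, 41]),
        names: TypedCol::new(vec!["ada".into(), "grace".into()]),
    };
    let total: u32 = frame.ages.data.iter().sum();
    assert_eq!(total, 71);
    assert_eq!(frame.names.data[0], "ada");
}
```

The zero-sized markers vanish at runtime, so the type safety costs nothing in memory or speed.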

I'm planning on most likely replacing the lowest-level data storage with ndarray for ease of interoperability (especially if ndarray is going to eventually interop with Apache Arrow, as Luca mentions here).

Let me know if there's anything I can do to help this initiative -- I'd love to see a stable dataframe library in Rust!

paddyhoran commented 5 years ago

Just my opinion...

ndarray and its ecosystem are gaining some good momentum, but I don't believe that a dataframe library in Rust should be based on ndarray. This is how pandas is built today, on top of numpy, and Apache Arrow is being developed in part to solve some of the issues that created.

I believe we should build a data frame library as a 'front end' to Apache Arrow. This library would serve the purpose of data access and "data wrangling" and could provide zero-copy conversion to ndarray data structures. ndarray could then focus on the computations you want to apply to "cleaned" data.
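The "front end plus zero-copy hand-off" split can be sketched in plain std Rust (deliberately avoiding the real Arrow and ndarray APIs; all names are hypothetical): the frame owns a contiguous buffer, and the compute layer borrows it as a view rather than copying it.

```rust
// A column whose buffer would, in a real implementation, live in
// Arrow-managed memory.
struct Float64Column {
    buffer: Vec<f64>,
}

impl Float64Column {
    // Zero-copy view: the caller gets a borrow into the same
    // allocation, analogous to viewing an Arrow buffer as an
    // ndarray view. No bytes are moved.
    fn as_slice(&self) -> &[f64] {
        &self.buffer
    }
}

// The "compute layer" only needs a borrowed view.
fn mean(view: &[f64]) -> f64 {
    view.iter().sum::<f64>() / view.len() as f64
}

fn main() {
    let col = Float64Column { buffer: vec![1.0, 2.0, 3.0] };
    let m = mean(col.as_slice()); // no data copied
    assert!((m - 2.0).abs() < 1e-12);
}
```

The point of the design is that ownership stays with the dataframe layer while computation libraries operate on borrowed views.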

Arrow is seeing adoption from a range of projects and adopting this underlying infrastructure would allow us to take advantage of the Arrow ecosystem.

I'm a committer to the Rust Arrow implementation along with a few others, and we would welcome input regarding the requirements of higher-level libraries. There are others focusing on lower-level details in Arrow; there is already a query execution engine called DataFusion in Arrow, as mentioned above. This group could then focus on API design and feedback to Arrow.

The key thing to gain consensus on is which project is the dataframe library. The Rust community is smaller and I think we all need to focus on one data frame project and drive it forward.

This probably requires someone to step forward and volunteer to drive such a project forward. I don't need such a library badly enough to do this, but I would contribute to such a project if it existed.

jblondin commented 5 years ago

> I believe we should build a data frame library as a 'front end' to Apache Arrow. This library would serve the purpose of data access and "data wrangling" and could provide zero-copy conversion to ndarray data structures. ndarray could then focus on the computations you want to apply to "cleaned" data.

I see your point and can agree with this -- using Rust as a data science language will require a lot of interoperability and Apache Arrow is the best way forward for this that I've seen. ndarray probably should be seen as a computation target (linalg, stats, etc) instead of a baseline data format.

nevi-me commented 5 years ago

I agree with @paddyhoran, using Arrow also benefits us with not having to worry about a lot of IO. I created https://github.com/nevi-me/rust-dataframe with the intention of bikeshedding a dataframe that relies on Arrow for both in-memory data, as well as some computation.

Although rust-dataframe looks stagnant, I'm still working on ideas around it on paper. I'm also contributing to Arrow with the things that I'd like to be able to do in the library (I'm mainly working on IO support for basic things like CSV and JSON).

I also think that if/when ndarray supports Arrow, it would make for a great UDF interface where one needs multi-dimensional data, and we could use ndarray's stats functionality in dataframes built in Rust.

The other effort I've been trying, though time is a huge constraint as I have a hectic work schedule + studying, is creating Arrow interfaces to SQL DBs in Rust. I've got a simple PostgreSQL one working, but haven't had time to put it on GH.

LukeMathWalker commented 5 years ago

I think that interoperability should be a core principle of whatever we decide to invest in: it's unreasonable to expect anyone to work in a Rust-only environment for domains such as Machine Learning or Data Engineering. I don't think it is in anyone's interest to create another isolated computational environment; it would just be a waste of time.

On the other side, though, I'd like to build an API that feels native and first-class in Rust. One point that I feel strongly about is using the compiler and the type system to their fullest extent. I'd love to see typed DataFrames, with compile-time checks on common manipulations (e.g. access to columns by name), steering as far away as possible from a "stringy" API. It should also be possible to use common Rust patterns (Enums, NewTypes, etc.) as first-class citizens, thus avoiding the "boundary" feeling that I often experience in Python when my Pandas code comes into contact with my business logic code. Something very similar to what I experience when working with databases/ORMs.
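As a concrete, hypothetical sketch of what "Rust patterns as first-class citizens" could look like (none of this is an existing crate's API): a column holding a domain enum directly, and a newtype id, so filters are typed comparisons instead of string matches.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Status {
    Active,
    Churned,
}

// A newtype keeps user ids from being mixed up with other u64s.
#[derive(Debug, Clone, Copy, PartialEq)]
struct UserId(u64);

struct Users {
    ids: Vec<UserId>,
    statuses: Vec<Status>,
}

impl Users {
    // Filtering on an enum variant: the compiler, not a string
    // comparison, guarantees the variant exists and is spelled right.
    fn active_ids(&self) -> Vec<UserId> {
        self.ids
            .iter()
            .zip(&self.statuses)
            .filter(|(_, s)| **s == Status::Active)
            .map(|(id, _)| *id)
            .collect()
    }
}

fn main() {
    let users = Users {
        ids: vec![UserId(1), UserId(2), UserId(3)],
        statuses: vec![Status::Active, Status::Churned, Status::Active],
    };
    assert_eq!(users.active_ids(), vec![UserId(1), UserId(3)]);
}
```

Compare this with a pandas-style `df[df["status"] == "active"]`, where a typo in either string surfaces only at runtime.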

ndarray can be a good target for computation-heavy workloads, but Apache Arrow looks like a much more apt solution for what we are trying to build. I don't have a lot of visibility over the project though, what is its state right now? @paddyhoran

nevi-me commented 5 years ago

Hi @LukeMathWalker, I'll answer the question that you've asked @paddyhoran.

Arrow is very usable, although we might make minor/breaking changes to the parts of the library that we're still working on (we don't support some data types that the CPP and other implementations support, and some might require some refactoring).

We have:

The foundational parts that one would rely on when using Arrow are sound and relatively stable.

jblondin commented 5 years ago

One possible concern with the Arrow implementation (please correct me if I'm wrong @nevi-me @paddyhoran) is that it seems to currently require the nightly toolchain. Specifically, a dependency on packed_simd and use of the specialization feature (perhaps more, this is just what I gathered from a quick look).

I personally don't see this as a huge problem as eventually these things will be stabilized and we're just starting this project, but I thought I'd point it out.

nevi-me commented 5 years ago

Yes, I suppose we could hide packed_simd behind a feature flag, but we would have to wait for specialisation to become stable.
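The feature-flag approach could look something like the following hypothetical Cargo.toml fragment (this is not Arrow's actual manifest, just a sketch of the standard pattern for gating an optional dependency):

```toml
# Hypothetical sketch: gate the SIMD dependency behind an
# off-by-default feature so stable-toolchain users can still build.
[features]
default = []
simd = ["packed_simd"]

[dependencies]
packed_simd = { version = "0.3", optional = true }
```

With this layout, `cargo build` works on stable, while nightly users opt in with `cargo build --features simd`.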

One thing I'm personally unsure of is what will happen after 0.15 is released in a few months, because the release after that might be 1.0.0. We don't follow semver, and because we are a multilingual library, some languages might still be behind when Arrow cpp/python/java are considered stable.

LukeMathWalker commented 5 years ago

I am not too worried by using the nightly toolchain to leverage specialization.

What does this versioning strategy imply, @nevi-me? Do we risk having breaking changes without a bump in the major version number? It wouldn't be a major problem if Arrow is a private dependency, but if we do happen to expose or use its types in our public API it becomes more troublesome.

nevi-me commented 5 years ago

It would likely be a private dependency. The IPC part of the format is versioned, so when reading Arrow data from, say, an external system, that system would declare its version. That helps with avoiding breakages.

If a library that uses Arrow doesn't stay far behind the latest version, small changes would theoretically be easy to handle.

One significant consideration, though, is that if we publish a crate that depends on Arrow, we'd likely have to either move at Arrow's cadence (we're aiming for a release every 2 months going forward) or fork it, as DataFusion did before it was donated to Arrow. This depends on how much we'd contribute upstream to Arrow, as I'd imagine some functionality might be better off upstream. A rising tide lifts all boats.

jesskfullwood commented 5 years ago

Hi all. I have created a library similar to @jblondin's, here: https://github.com/jesskfullwood/frames. It does maps, joins, groupby, and filter, all in a typesafe manner, and it allows arbitrary field types (e.g. you can have enums and structs in your columns). But while it is functional, it is much less polished, and I somewhat gave up on it when I decided I couldn't get the ergonomics that I wanted (something as intuitive as R's data.table but FAST (even for strings) and TYPESAFE). I decided I would revisit it when GATs and specialization have landed.

I am absolutely looking for something to use in production; at work we have an unmanageably complex series of R scripts, and I'm desperate to introduce some type-safety. Something like frameless, only.. not Spark.