rust-ml / linfa

A Rust machine learning framework.
Apache License 2.0

Add arrow data storage support for linfa pre-processing and training module #285

Open MrDataPsycho opened 1 year ago

MrDataPsycho commented 1 year ago

Preprocessing and transformation with data frames are heavily used for ML model training in scikit-learn. The two most popular DataFrame libraries written in Rust (Polars and DataFusion) are based on the Apache Arrow in-memory data format, not on ndarray. This also looks like the trend for any new data frame library in Rust: it does not look like there will be a data frame that wraps ndarray under the hood, the way pandas wraps NumPy.

With Arrow support in linfa, any Arrow-based data frame could be passed directly to linfa's preprocessing and training modules without going through ndarray, the way a pandas DataFrame can be passed to scikit-learn. Hence I propose adding direct Arrow support to linfa to make it a more general framework. Polars and DataFusion users could then use Rust for ML training out of the box.

YuhanLiin commented 1 year ago

I don't want to abandon ndarray completely, since all the downstream code relies on it. If we add Arrow support then we should support both, preferably via a generic trait that abstracts over the underlying data format. Unfortunately, direct integration between ndarray and Arrow doesn't exist yet, but they did talk about it here
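To make the "generic trait" idea concrete, here is a minimal sketch using only the standard library. Everything in it is an assumption for illustration: `Records`, `RowMajor`, `Columnar`, and `feature_means` are hypothetical names, not existing linfa, ndarray, or arrow APIs. `RowMajor` stands in for an ndarray-like dense matrix and `Columnar` for an Arrow-like column layout.

```rust
/// Hypothetical abstraction over tabular storage (not a real linfa trait).
pub trait Records {
    /// Number of rows (samples).
    fn nsamples(&self) -> usize;
    /// Number of columns (features).
    fn nfeatures(&self) -> usize;
    /// Value at the given row and column.
    fn value(&self, row: usize, col: usize) -> f64;
}

/// Row-major dense storage, standing in for an ndarray-like matrix.
pub struct RowMajor {
    pub data: Vec<f64>,
    pub ncols: usize,
}

impl Records for RowMajor {
    fn nsamples(&self) -> usize {
        if self.ncols == 0 { 0 } else { self.data.len() / self.ncols }
    }
    fn nfeatures(&self) -> usize {
        self.ncols
    }
    fn value(&self, row: usize, col: usize) -> f64 {
        self.data[row * self.ncols + col]
    }
}

/// Column-oriented storage, standing in for an Arrow-like record batch.
pub struct Columnar {
    pub columns: Vec<Vec<f64>>,
}

impl Records for Columnar {
    fn nsamples(&self) -> usize {
        self.columns.first().map_or(0, |c| c.len())
    }
    fn nfeatures(&self) -> usize {
        self.columns.len()
    }
    fn value(&self, row: usize, col: usize) -> f64 {
        self.columns[col][row]
    }
}

/// Per-feature mean, written once against the trait so it works for both layouts.
/// Returns NaN entries when there are no samples.
pub fn feature_means<R: Records>(records: &R) -> Vec<f64> {
    let n = records.nsamples();
    (0..records.nfeatures())
        .map(|col| {
            let sum: f64 = (0..n).map(|row| records.value(row, col)).sum();
            sum / n as f64
        })
        .collect()
}
```

Downstream code written against such a trait would not need to know which layout backs the data. A real design would likely expose whole columns or views rather than element-wise access, which is slow for columnar formats.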

MrDataPsycho commented 1 year ago

That makes total sense. The idea is just to add support for Arrow without abandoning ndarray. Thanks.

MrDataPsycho commented 1 year ago

I have done further investigation, and it looks like ndarray or an array-like data structure must be supported alongside Arrow. Some data, such as images and text, cannot be loaded as data frames; it needs to be loaded one by one or as a batch into an array/tensor for training, the way the datasets in PyTorch's torch.utils.data do.
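As a rough illustration of the batch-into-an-array case, the sketch below groups equally sized samples (for example, flattened images) into dense batches using only the standard library. `Batch` and `into_batches` are hypothetical names for this example and do not correspond to any linfa or PyTorch API.

```rust
/// Hypothetical dense batch: `data` holds `len` samples back to back,
/// each `sample_len` values long.
pub struct Batch {
    pub data: Vec<f32>,
    pub len: usize,
    pub sample_len: usize,
}

/// Groups samples into dense batches of at most `batch_size` samples.
/// Panics if `batch_size` is zero or if samples within a batch differ in length.
pub fn into_batches(samples: &[Vec<f32>], batch_size: usize) -> Vec<Batch> {
    samples
        .chunks(batch_size)
        .map(|chunk| {
            // `chunks` never yields an empty slice, so indexing the first sample is safe.
            let sample_len = chunk[0].len();
            let mut data = Vec::with_capacity(chunk.len() * sample_len);
            for sample in chunk {
                assert_eq!(sample.len(), sample_len, "samples must have equal length");
                data.extend_from_slice(sample);
            }
            Batch { data, len: chunk.len(), sample_len }
        })
        .collect()
}
```

The last batch may be smaller than `batch_size` when the number of samples is not a multiple of it.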