rust-dataframe / discussion

Use-cases #3

Closed · nevi-me closed 1 year ago

nevi-me commented 5 years ago

One of the reasons why I'm interested in dataframe libraries for Rust is that Rust could make for a good ETL tool.

What other use-cases do people have?

hwchen commented 5 years ago

My own use case is definitely on the ETL side. So the things that are important to me are:

I'm sure there are some I'm missing, but these come to mind first.

LukeMathWalker commented 5 years ago

My main use case concerns ML workloads:

galuhsahid commented 5 years ago

Hi, hope y'all don't mind me chiming in - I'm very interested in dataframe libraries for Rust, and agreed, I think Rust could make a great ETL tool!

I think my main use cases have been covered by the previous comments. Some other ones:

jblondin commented 5 years ago

I'm looking at ML / Data science use cases, as well. Basically, I want a library that can ETL some data and feed it into ndarray or various machine learning libraries, so interoperability is a big part of what I'm looking for.

Some other features beyond what's already been mentioned:

LukeMathWalker commented 5 years ago

I think that things like scaling, normalization, and feature encoding do not necessarily belong in the (core) dataframe library @jblondin. I see them more in a Scikit-learn-ish port that uses the dataframe as a first-class input type. What do you think?

jblondin commented 5 years ago

I think that things like scaling, normalization, and feature encoding do not necessarily belong in the (core) dataframe library @jblondin. I see them more in a Scikit-learn-ish port that uses the dataframe as a first-class input type. What do you think?

Good point. While I don't think we should be beholden to mimicking the Python way of doing things, in the interest of the 'prefer small crates' Rust philosophy most of my points would belong in a separate ML-focused preprocessing crate. Similarly, we might want to put the time-series-specific features @galuhsahid mentions in a separate crate as well.

Thank you for bringing that up! This thread may be useful for defining some crate boundaries as well as needed use cases.

jblondin commented 5 years ago
  • expressive manipulation (named axes);

@LukeMathWalker I'm not sure I understand exactly what this means. What would this entail?

LukeMathWalker commented 5 years ago

I have definitely been too concise there @jblondin, my fault. I meant

compile-time checks on common manipulations (e.g. access to columns by index name), steering as far away as possible from a "stringy" API.

quoting myself from #1.

jesskfullwood commented 5 years ago

compile-time checks on common manipulations (e.g. access to columns by index name), steering as far away as possible from a "stringy" API.

This is something I really struggled with. It would be lovely to do df["age"].mean() and have it not compile if "age" is not a valid column label, but there is no way to do this in Rust. The closest I got was to define a trait:

pub trait ColId: Copy {
    const NAME: &'static str;
    type Output;
}

Then use a macro

define_col!(Age, u16, "age")

which expands to something like

#[derive(Clone, Copy)]
struct Age;

impl ColId for Age {
    type Output = u16;
    const NAME: &'static str = "age";
}

Then do df[Age].mean(), which works but is pretty ugly and unintuitive.
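
For concreteness, here is a minimal sketch of how the df[Age] indexing can be wired up through std::ops::Index; Frame is a toy single-column type here, purely illustrative, not a real API:

#[derive(Clone, Copy)]
struct Age; // what define_col! above would generate

// Toy frame with a single, statically known column.
struct Frame {
    age: Vec<u16>,
}

// Indexing by the Age *type* rather than by a string means a misspelled
// column fails to compile instead of panicking at runtime.
impl std::ops::Index<Age> for Frame {
    type Output = Vec<u16>;
    fn index(&self, _col: Age) -> &Vec<u16> {
        &self.age
    }
}

fn main() {
    let df = Frame { age: vec![34, 58, 21] };
    let mean = df[Age].iter().map(|&a| f64::from(a)).sum::<f64>() / df[Age].len() as f64;
    println!("mean age: {}", mean);
}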

frameless has a "symbol" syntax which allows a much cleaner interface (quoting the docs):

case class Apartment(city: String, surface: Int, price: Double, bedrooms: Int)
val apartments = Seq(
  Apartment("Paris", 50,  300000.0, 2),
  Apartment("Nice",  74,  325000.0, 3)
)
val apts = TypedDataset.create(apartments)
apts.select(apts('city)).show() // select city column with symbol
apts.select(apts('surface) * 10, apts('surface) + 2).show()  // select two columns and manipulate

Time for an RFC? :smile:

jblondin commented 5 years ago

Then use a macro

define_col!(Age, u16, "age")

That's basically what the tablespace macro in agnes does:

tablespace![
  table example_table {
    Age: u16 = "age",
  }
]

I'd agree that a cleaner, simpler approach would be preferred, but I'm not exactly sure how to go about doing that :smile:

nevi-me commented 5 years ago

Hi @jesskfullwood, another solution that could work is to lazily evaluate your table/dataframe, though you might not get anything as ergonomic as df["age"].mean().

If you had a Column that carries its data type, you could:

pub struct Column {
    data: ArrayRef,
    data_type: DataType, // where this is an enum of different types
}

pub trait AggregationFn {
    fn mean(&self) -> Result<f64>;
    fn sum(&self) -> Result<f64>; // of course this can be any output result
}

impl AggregationFn for Column {
    fn mean(&self) -> Result<f64> {
        if self.data_type.is_numeric() {
            Ok(self.data.mean()) // assuming this is implemented somewhere as a kernel
        } else {
            Err(MyError("cannot calculate mean of non-numeric column"))
        }
    }
}

jesskfullwood commented 5 years ago

@nevi-me This is the way I originally did it, and it is basically the approach that Arrow takes. But it is quite limiting and largely negates the point of using Rust IMO. The DataFrame doesn't 'know' what it contains, so it cannot statically check whether a given operation (e.g. "fetch this column") is valid. This is the major problem I have with pandas et al.

You are also limited in the types a given Column can contain, because each type must be enumerated within the DataType enum ahead of time. Essentially this limits you to just primitive types. I think it would be much nicer to be able to have e.g. enums like

enum Sex { Male, Female, NotStated }

within a Column rather than falling back to

is_male: bool

Re lazy evaluation, I think that is a separate topic. If you had a hypothetical typesafe dataframe, one could imagine building up operations into a type a la how Futures work, e.g.

join(df1, df2, UserId1, UserId2) // frame1, frame2, join col 1, join col 2

could either directly evaluate the join, resulting in a new Frame<...>, or build up a Join<...> type which could be executed at a later point. The latter is how Frameless works.
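
To make that concrete, a rough sketch of what a deferred join could look like; Frame, Join, and UserId here are illustrative toy types, not Frameless or any existing API:

#[derive(Clone, Copy)]
struct UserId;

struct Frame {
    user_ids: Vec<u32>,
    // other columns elided
}

// The join is recorded as a value; no work happens on construction.
struct Join<'a> {
    left: &'a Frame,
    right: &'a Frame,
}

fn join<'a>(left: &'a Frame, right: &'a Frame, _l: UserId, _r: UserId) -> Join<'a> {
    Join { left, right }
}

impl<'a> Join<'a> {
    // A query planner could inspect and rewrite the operation tree
    // before this point; work only happens when execute is called.
    fn execute(&self) -> Vec<u32> {
        self.left
            .user_ids
            .iter()
            .filter(|id| self.right.user_ids.contains(id))
            .copied()
            .collect()
    }
}

fn main() {
    let df1 = Frame { user_ids: vec![1, 2, 3] };
    let df2 = Frame { user_ids: vec![2, 3, 4] };
    let plan = join(&df1, &df2, UserId, UserId); // nothing evaluated yet
    println!("{:?}", plan.execute()); // [2, 3]
}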

One benefit of lazy evaluation is that, in theory, the query can be optimized much like a database query, so that you only execute the parts strictly necessary to generate the result you ask for.

ETA: I should mention, the optimization layer has conveniently already been written for us: Weld.

jblondin commented 5 years ago

@jesskfullwood I do think it's possible to create a typesafe dataframe wrapper around Arrow (I'm currently working on it).

Adding custom data types (e.g. enums) might be a bit more difficult -- I'm not currently sure how to handle types outside of Arrow's (or at least the Rust Arrow implementation's) primitive data types. I think it should be theoretically possible, though, with Arrow's union, list, and struct frameworks.

As a more general use case question, what are our needs, datatype-wise, beyond the typical primitives / strings? @jesskfullwood brings up an interesting use case with enums (or really any arbitrary type), but we'd have to figure out how that would work with our interoperability goals.

LukeMathWalker commented 5 years ago

I strongly agree with @jesskfullwood - having a list/enum of acceptable/primitive types feels like an anti-pattern to me. We should be able to handle arbitrary Rust types. The question becomes: how can we make this play nicely with Apache Arrow?

A possible solution would be to use a trait, where a Rust struct/enum provides methods that convert it to a memory layout built from Apache Arrow primitives. It basically tells us how to lay the type down in memory using the capabilities offered by Apache Arrow. This might be a little tiresome to do at first, but we could probably get to the point where we can automate it for most types using a #[derive(ArrowCompatible)] macro.
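
A rough illustration of what that trait could look like; ArrowCompatible and the dictionary-style u8 encoding below are hypothetical sketches, not existing Arrow APIs:

// Describes how to lay a Rust type down using an Arrow-friendly primitive.
trait ArrowCompatible {
    type Primitive;
    fn to_primitive(&self) -> Self::Primitive;
    fn from_primitive(p: Self::Primitive) -> Self;
}

#[derive(Debug, PartialEq)]
enum Sex { Male, Female, NotStated }

// Stored as a dictionary-style index; a derive macro would just
// generate these two match expressions.
impl ArrowCompatible for Sex {
    type Primitive = u8;
    fn to_primitive(&self) -> u8 {
        match self {
            Sex::Male => 0,
            Sex::Female => 1,
            Sex::NotStated => 2,
        }
    }
    fn from_primitive(p: u8) -> Self {
        match p {
            0 => Sex::Male,
            1 => Sex::Female,
            _ => Sex::NotStated,
        }
    }
}

fn main() {
    let p = Sex::NotStated.to_primitive();
    assert_eq!(Sex::from_primitive(p), Sex::NotStated);
}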

LukeMathWalker commented 5 years ago

Btw, I didn't know about frameless - super cool! Thanks @jesskfullwood :smile: