rust-dataframe / discussion

Use the issues for discussion

Heterogeneous column types #6

Open jblondin opened 5 years ago

jblondin commented 5 years ago

So, I thought I'd start opening up issues to enable discussion of individual dataframe features we'd like to see. I'd like to start with 'heterogeneous column types': the ability to have a dataframe with columns of different types.

In looking through existing WIPs in #4, I came across a few different methods of implementing this:

  1. Using an enum for either a column or for individual values. utah (and really any arbitrarily-typed dataframe library) can house enums as values, which allows you to mix types however you want (even within the same column), at the cost of run-time type safety and some performance. I didn't see any library currently use column-based enums, but I could see having something like

    enum Column {
        Float(Vec<f64>),
        Int(Vec<i64>),
        Text(Vec<String>),
    }

    and in fact did it this way in an early version of agnes.

  2. Using Any-based storage, along with some metadata for relating columns to data types at run-time. Used by rust-dataframe and black-jack.

  3. Using cons-lists to provide compile-time type safety. Used by agnes and frames.

Each of these has its own advantages and disadvantages. For example, 1 and 2 lack compile-time type-checking, but have much cleaner type signatures and potentially cleaner error messages than 3 (where you have something like DataFrame<Cons<usize, Cons<f64, Cons<String, Nil>>>> for a relatively simple three-column dataframe).

You could also have a combination of the above techniques -- I could see something like cons-list type-checking column metadata structure while the data itself is stored in some sort of Any-based structure.
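For concreteness, here is a minimal sketch of the Any-based storage from option 2, with the kind of run-time metadata mentioned above (all names are assumed for illustration, not rust-dataframe's or black-jack's actual API); a cons-list layer could sit on top of such metadata to combine the techniques as described:

use std::any::Any;

#[derive(Clone, Copy, PartialEq, Debug)]
enum DType { Float, Int, Text }

struct AnyColumn {
    dtype: DType,        // run-time metadata relating the column to its type
    data: Box<dyn Any>,  // e.g. a boxed Vec<f64>, Vec<i64> or Vec<String>
}

impl AnyColumn {
    fn as_f64(&self) -> Option<&Vec<f64>> {
        // The downcast only succeeds if the stored value really is a Vec<f64>;
        // a mismatch is caught at run time, not at compile time.
        self.data.downcast_ref::<Vec<f64>>()
    }
}

fn main() {
    let col = AnyColumn { dtype: DType::Float, data: Box::new(vec![1.0f64, 2.0]) };
    assert_eq!(col.dtype, DType::Float);
    assert_eq!(col.as_f64().unwrap()[1], 2.0);
}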

I'm personally a fan of the compile-time type-checking that cons-lists provide, but they can be hard to work with for those unfamiliar with them. I've started work on a labeled cons-list library (which will replace the one I'm using in agnes) to hopefully help out with some of these issues.

What are everyone's thoughts / opinions? Are there other options we should consider? I'd love to hear what people think the best approach is!

jesskfullwood commented 5 years ago

If you don't have type-safety (option 3), then why use Rust? Pandas is good enough (actually I think data.table is the best dataframe library, shame about the rest of R).

On the point of bad type signatures, for frames I made some horrible macros to generate the types Frame1<T>, Frame2<T1, T2>, Frame3<T1, T2, T3> up to Frame32<...> as aliases for Frame<Cons<T1, Cons<T2, Cons<...>>>>.

This doesn't help when Rust reports errors, but it does make it possible to describe a Frame in a relatively succinct way. Especially since you can then alias like

type MyFrame = Frame6<...>

(Of course as soon as you do a join or a groupby or whatever, the type changes)

jblondin commented 5 years ago

On the point of bad type signatures, for frames I made some horrible macros to generate the types Frame1<T>, Frame2<T1, T2>, Frame3<T1, T2, T3> up to Frame32<...> as aliases for Frame<Cons<T1, Cons<T2, Cons<...>>>>.

That does seem pretty useful. Also, since macros can be used in type contexts, you could have something like:

type MyFrame = Frame![Col1, Col2, Col3];

I did something similar with creating label lists in agnes.
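A minimal sketch of how such a type-position macro could be written (the Cons/Nil/Frame types and the Cols! name below are stand-ins assumed for illustration, not agnes' or frames' actual API):

struct Nil;
struct Cons<H, T>(H, T);
struct Frame<C>(C);

// A macro usable in type position that expands a flat list of column types
// into the nested cons-list type.
macro_rules! Cols {
    [] => { Nil };
    [$head:ty $(, $tail:ty)*] => { Cons<$head, Cols![$($tail),*]> };
}

// Roughly what `Frame![Col1, Col2, Col3]` would expand to:
type MyFrame = Frame<Cols![usize, f64, String]>;

fn main() {
    let _frame: MyFrame =
        Frame(Cons(1usize, Cons(2.0f64, Cons(String::from("x"), Nil))));
}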

The type signatures in errors are still a problem, though.

LukeMathWalker commented 5 years ago

It looks like it's a bit of a trade-off between compile-time type-checking vs readability and usefulness of error messages. I have seen you have made some progress on lhlist @jblondin - what do you think of it? Does it provide a significant benefit over raw Cons lists?

jblondin commented 5 years ago

I'm pretty happy with how lhlist is coming along. It's intended to implement the 'extensible record' model from the Haskell HList paper (section 5), and it's getting fairly close: I've implemented creation, label-based lookup, and iteration. I need to add some set membership functionality (currently you can reuse a label in a list, but lookup will only access the first one), work on the error messages a bit, and add a little bit of useful functionality (e.g. concatenation).

I think it'll be useful for what we're doing in that it'll provide a framework for our goal of compile-time-checked non-stringy column indexing. It should make the column lookup easier to use and a lot less messy.

DeppLearning commented 5 years ago

(actually I think data.table is the best dataframe library, shame about the rest of R).

You are not alone! https://h2oai.github.io/db-benchmark/ FST, somewhat related, allows rapid serialization of data.tables to disk: https://github.com/fstpackage/fst. Maybe some of their tricks apply to Rust; I'm too much of a beginner in Rust to judge, so... sorry if this doesn't fit here.

jesskfullwood commented 5 years ago

We use data.table + fst a lot at my day job. We batch-process a large amount of data in-memory on a single beefy server (RSS around 100GB) and are in this annoying sweet spot where data.table is by far the fastest tool we have found. I say annoying because you really shouldn't be writing serious production code in R, but anything else would compromise performance.

data.table itself is highly optimized C with over a decade of development behind it. Still, I reckon a decent pure-Rust version could beat it handily. In particular, data.table uses R routines for string handling which are still super slow. I benchmarked a simple Rust version as around 2X slower on numerical workloads (SIMD?) but 12X faster for strings.

mamcx commented 4 years ago

If you don't have type-safety (option 3), then why use Rust?

Because that type-safety is for the language, but data CAN'T BE type-safe in a universal way. Imagine I open psql and start typing queries. It's interactive, and it can't all be type-checked ahead of time.

My opinion is that any library that needs to operate on arbitrary data MUST be dynamic; you could add a type-safe layer on top "easily", but not the other way around.

FelixBenning commented 4 years ago

Is this discussion/project still alive?

FelixBenning commented 4 years ago

I think the issue with the third proposal is that I still have not understood it. It seems complicated. What are cons-lists? And regarding the discussion of enums in #3: While it makes a lot of sense at first that this

enum Sex { Male, Female, NotStated }

is better than

is_male: bool

You still have the issue that you need to handle these enums in your eventual calculations. I mean it is nice that your dataframe can store arbitrary types. But you do not just want to store them. This is not supposed to be a database. And once you try to implement simple functions you could apply to these columns, like mean, you already need some knowledge about the type. Or at least assumptions you can make. And these assumptions could probably be encoded as a trait. And then you could have Columns be an enum of these traits.

I mean for enum types like the Sex enum proposed above, you would probably want the ability to convert it into a dummy variable. I.e. it would implement a trait like "dummyable", which would tell you how this enum would be converted into multiple columns of dummy variables (which are of bool type). E.g. Sex would be converted into

| conversion | is_female | is_na |
| --- | --- | --- |
| Male | 0 | 0 |
| Female | 1 | 0 |
| Not Stated | 0 | 1 |

It might be possible to provide a default implementation for arbitrary enums, by making the first element of the enum be the default, i.e. without a dummy variable. But in some cases you might want a different dummy conversion. E.g. in the case of non-smoker, occasional smoker and smoker, you might want:

| conversion | is_occ_smok | is_smoker |
| --- | --- | --- |
| non-smoker | 0 | 0 |
| occasional smoker | 1 | 0 |
| smoker | 1 | 1 |

Since (if you want to do a linear regression) the effects of regular smoking are added on top of the occasional smoking effects. On the other hand you would want a different conversion, when you want to get proportions (i.e. when you want to take the mean).
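A minimal sketch of the "dummyable" idea (the trait name and shape here are hypothetical, not an existing API): the enum itself describes how it expands into boolean dummy columns, keeping the cumulative smoker encoding described above.

trait Dummyable {
    // Names of the generated dummy columns, e.g. ["is_occ_smok", "is_smoker"].
    fn dummy_names() -> Vec<&'static str>;
    // The dummy values for one enum value, in the same order as dummy_names().
    fn dummy_row(&self) -> Vec<bool>;
}

enum Smoker { NonSmoker, Occasional, Regular }

impl Dummyable for Smoker {
    fn dummy_names() -> Vec<&'static str> {
        vec!["is_occ_smok", "is_smoker"]
    }
    fn dummy_row(&self) -> Vec<bool> {
        match self {
            // Cumulative encoding: regular smoking adds on top of occasional smoking.
            Smoker::NonSmoker => vec![false, false],
            Smoker::Occasional => vec![true, false],
            Smoker::Regular => vec![true, true],
        }
    }
}

fn main() {
    assert_eq!(Smoker::dummy_names(), vec!["is_occ_smok", "is_smoker"]);
    assert_eq!(Smoker::Regular.dummy_row(), vec![true, true]);
}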

Anyway, the point is: if you want to enable basic statistics/machine learning modules to be built on top of dataframes, then they need to be able to make some assumptions about the dataset. So an enum of allowed types for the columns might not be an antipattern after all.


And if you want to use truly unique types, then you would have to reimplement all the functionality the dataframe provides for this particular type, which at that point defeats the purpose of using a dataframe framework. And you would not want to build this functionality on top of this dataframe framework if the types are encoded as some strange cons-list, which you would first have to understand before you can build anything on top of it. Ideally you could extend the existing enums or have a default/custom element in the type enum which allows you to define functionality for these unique things.

But I think that it is unlikely that more types are needed than: integer, float, string, categorical.

dexterlemmer commented 4 years ago

[Edit: Added warning to skip some of my comments.]

WARNING: To any body who haven't yet suffered through the next several comments by me and @FelixBenning: I suggest you skip all of my comments starting from this one (inclusive) up to (and including) https://github.com/rust-dataframe/discussion/issues/6#issuecomment-640280140. The only comment in this mess that I feel may have a sufficiently high cost/benefit ratio for most readers is https://github.com/rust-dataframe/discussion/issues/6#issuecomment-640268640. Just watch out for @FelixBenning's comment in between. I personally don't think it's worth while reading it, but he might disagree. :wink:

@FelixBenning Regarding cons lists, I think @jblondin meant HLists. A cons list is just a special type of linked list. An HList is a cons list implemented in the type system. This makes it possible to define and reason about arbitrary length heterogeneous lists of types at compile time. So for example you can statically check the type of the i'th element in HList![T1, T2, ..., Tn] for any n: usize and i: 0..n. The disadvantage is that HList isn't actually a type. It's a kind and since Rust doesn't yet support HKT's, the ergonomics of working with HLists are terrible.
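A minimal hand-rolled sketch of the idea (the frunk crate provides a full-featured version of the same thing):

struct HNil;
struct HCons<H, T> { head: H, tail: T }

fn main() {
    // The compiler tracks the exact type of every element in the list.
    let list: HCons<i32, HCons<&str, HNil>> =
        HCons { head: 1, tail: HCons { head: "two", tail: HNil } };
    assert_eq!(list.head, 1);
    assert_eq!(list.tail.head, "two"); // statically known to be a &str
}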

Regarding enum Sex { Male, Female, NotStated } is better than is_male: bool:

  1. Isn't this off-topic here? Why didn't you post it on #3?
  2. I'll answer you anyway, because my answer gives further insight into what I think we are striving to figure out how to implement with this issue (#6): I've done what you did myself in Python. However, it sucks compared to actually being able to have columns of any type. There are multiple approaches in Rust (such as approach 3 (HLists) of this thread) that allow you to have your cake and eat it too, i.e. the implementer of the dataframe crate can get total support for all types for free since the compiler knows exactly what type the data in every column is. The following code should work just fine (barring any mistakes I might've made, since I last programmed in Rust before epoch 2018 was out, so my Rust is a bit... rusty):
use heterogeneous_dataframe::{DataFrame, DF};

// A trait the crate user just made up, so the imported crate cannot possibly
// know about it
trait Foo {
  fn foo(&self);
}
// A type implementing that trait (named `FooData` so the struct does not
// collide with the trait `Foo` in the type namespace)
struct FooData {}
impl Foo for FooData {
  fn foo(&self) { println!("Foo"); }
}

// Another type the imported crate knows and cares nothing about
#[derive(Debug)]
enum Sex { Male, Female, NotStated }

// A nice macro for easily constructing syntax sugar. It also generates some
// zero-sized types (ZSTs) for indexing specific columns and constructs
// the HList that enables statically typed heterogeneous columns:
let df = DF! {
  let name = vec!["John", "Mary", "Anon"];
  let age = vec![33u8, 44, 0];
  let sex_col = vec![Sex::Male, Sex::Female, Sex::NotStated];
  let foo_col = vec![FooData {}, FooData {}, FooData {}];
};

// Compiles since the compiler knows the type of the `name` column is `&str`.
assert_eq!(df[(Name, 1)], "Mary");

// Compiles and prints "NotStated" since the compiler knows the type of
// the `sex_col` column is `Sex`
println!("{:?}", df[(SexCol, 2)]);

// Compiles and prints "Foo" three times since the compiler knows the type of
// the `foo_col` column is `FooData`
for x in df[FooCol] {
  x.foo()
}

// This throws a type error at compile time if un-commented,
// since the compiler knows that the type of the `sex_col` column is not `u32`:
// let myval: u32 = df[(SexCol, 1)];

dexterlemmer commented 4 years ago

Edit: You might want to skip this...

Oh. Another disadvantage of HLists is that, since their types must be completely known at compile time, you have to specify the exact number and types of columns at compile time. Therefore, even though something like this might work

let df = DataFrame::read_csv!::<(f32, f32, f32, u32)>("/path/to/data.csv");

This won't work if we use HLists internally, since the read_csv macro cannot possibly construct the HList at compile time:

let df = DataFrame::read_csv!("/path/to/data.csv");
dexterlemmer commented 4 years ago

Edit: You might want to skip this...

A proposed fourth approach to heterogeneous column types

We seem to have a tradeoff: On the one hand it would be very nice if we can have compile time type checks and zero cost abstractions, but on the other hand we need the ability to construct dataframes based on data that might only be available at runtime. (Bonuses like fast and ergonomic indexing, slicing and iteration and consistently good error messages would also be great.)

Can we perhaps eliminate the trade-off (or at least give the user the option to choose) and yet have a single consistent type? We might be able to after all. Here's what I think:

Approach 4: A typed front-end for a byte-array (or usize-array) back-end

  1. API: Column Categories: We define four categories of columns:

    1. C1<T:Sized> Statically typed T, statically known n_cols: i.e. the number of columns in category C1 is known at compile time, and so is the type of each C1 column; the type of the column must implement Sized for the back-end to work. We store the columns' metadata (name, index, type) in an HList or in some other kind of heterogeneous type list based on generics and associated types.
    2. C2<ColumnEnum> Dynamically typed T, Statically known n_cols: We store the columns' metadata in an array of our own Enum (i.e. [Column]).
    3. C3<ColumnEnum> Dynamically typed T, dynamically computed n_cols, but still with the advantages of using our own enum, such as being Sized and stack-allocated: We store the columns' metadata in a Vec<ColumnEnum>.
    4. C4<dyn T:Any> Dynamically typed T, dynamically computed n_cols, and pretty much any type allowed as long as it implements certain traits we need like Copy, Clone and probably also Serialize/Deserialize, etc.: This would probably require an approach like black-jack's, which uses baggie internally to store its data. (Note that baggie itself won't work, since it's a HashMap, not some kind of array, but we can maybe store a similarly designed array of Box<dyn Any>'s or bind them to dyn Any+Sized or something (I don't quite know how Any works, it's after my time) to at least make accessing and iterating over either the pointers to them or the values themselves fast.) (A rough type-level sketch of these four categories is given after this list.)
  2. API: DataFrame Types: We define four types of DataFrames for ensuring that users don't pay for what they don't use and that they can know and control which features the DataFrames don't have:

    1. DF1: Contains only C1 columns. Stores column metadata such as strides, pointers to starts of columns, column indices, etc. in arrays for fast and ergonomic indexing. Supports indexing through a ZST, or label:Hash+Eq+Sized+Copy, or index:usize.
    2. DF2: The same as DF1, except that it contains a mixture of C1 and C2 columns (in an arbitrary order, so you can have columns arranged like [C2, C1, C1, C2, C2] and not just like [C1, C1, C2, C2, C2]).
    3. DF3: Adds C3 columns at the end of a DF2. Also replaces all metadata arrays (except for ZST indices) with vecs.
    4. DF4: Adds C4 columns at the end of DF3.
  3. API: Creating dataframes: We can obviously use macros. We can also use functions, despite the HLists if we use the builder pattern. (We have to construct an HList column by column with one function call per column because we need to recursively define associated types to compute the HList. After the HList, the rest of the construction can obviously easily be done in one function call for each category of columns.)

    • Actually, I think we will need macros. The reason for this is that the indexing ZST's will need to be uniquely defined and will need associated TypeNums for making it possible to compute pointers to cells (explained below).
  4. Back-end: Just a giant vector of bytes (or maybe usizes would make more sense). The real magic is in how we connect the API with the back-end, not in the back-end itself. That said, we'll probably need to make our own implementation to ensure, for example, an optimal resize strategy when adding capacity.

  5. Implementing the Index traits for each type of index to get ergonomic yet fast access to a single cell:

    1. First convert the index into an offset into the back-end array as fast as possible using the available metadata stored in the dataframe. Note that ZST indexes' offsets should be computed at compile time.
    2. If a ZST was used, we can do an unsafe memory access (with due diligence in keeping track of lifetimes) for indexing without bounds checks.
    3. If a label was used, we might still be capable of skipping bounds checks if we use a Label newtype and the Label type is designed to prove that the label is for this dataframe and cannot be shared with the dataframe while either is mutable.
      • Note that on C1 columns we might need to make the label a constant in order for it to even work (since otherwise we cannot compute the return type for the Index trait). Hmm. Can the return type be computed even then?
      • Note: We'll need disambiguation between usize labels and indexes, either by creating a Label newtype, storing the label in an Option or by using the Pandas approach of iloc and loc indexes (although I'm not certain loc and iloc can have the same ergonomic syntax in Rust as in Pandas)
    4. We cannot implement the Index trait for C1 columns for usize indexes since the return type cannot be computed at compile time. Otherwise these are normal.
    5. If the caller provides a const usize index, however, it'll work for all column categories.
      • Note: C3 and C4 columns will definitely need runtime bounds checking since we cannot know at compile time how many such columns there are.
  6. Implementing index slices:

    1. This is done similarly to implementing single cell indexing. However, notice that a dataframe acts like a smart pointer (or a vec). We'll need to create a new dataframe (or a type with a name like DataFrameSlice) with new metadata, but if the user asked for a reference, we won't need to copy, clone or move the back-end array.
  7. Implementing zero cost row-, column- and window iterators:

    1. A rows iterator (i.e. iterator iterating down columns, not across columns) would be often used and simple to design, implement and optimize. Notice that doing this across multiple columns or even multiple categories of columns doesn't make this any harder, so we also have our first window iterator.
    2. As long as a columns iterator (i.e. iterating across columns, not down them) is only iterating over either just C2/C3 columns or just C4 columns, it's again simple.
    3. To extend the columns iterator to iterate over both C2/3 columns and C4 columns simultaneously, one of three strategies can be followed:
      1. Coerce the C2/3 cells into the type of C4. Yuck! However it should work for most use cases.
      2. Return two iterators in a tuple. Provide an API that allows seamlessly switching over from the first to the second. I have a sort of idea how this can be done, but can't quite figure it out. Note if the iterator can roll over to the next row, it gets more interesting, but we can perhaps just use a single pair of iterators for all rows and ping pong work between them.
      3. Simply not implement the capability. :man_shrugging:
    4. To extend the columns iterator to handle C1 at all (whether on its own or together with other column categories), one of the following three approaches will have to be taken:
      1. Coerce every C1 cell into the type of C4. (Yuck! And I don't know if it's even possible and useful.)
      2. For each row: Create multiple iterators, put them in a tuple and somehow integrate them. Note that this will have to happen at compile time. At first glance that means, we cannot wrap to the next row, but we probably can if we keep the entire tuple of iterators in memory and just return back to running the first iterator in the tuple after the last (obviously incrementing the row index first).
      3. Again, simply skip implementing the capability.
    5. For the special case of a columns iterator constructed by the builder pattern rather than from a single function call, seamless integration should be easy across arbitrary column categories.
  8. Missing data?: I'm not going to solve this here. I'll merely note that the back-end can probably easily be extended with a bitmask.
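A rough type-level sketch of one way to read the four column categories from point 1 above (all names below are hypothetical illustrations of the intent, not the actual proposal):

use std::any::Any;

// The enum used by C2/C3 columns: a fixed set of supported element types.
enum ColumnEnum {
    Float(Vec<f64>),
    Int(Vec<i64>),
    Text(Vec<String>),
}

// C1: statically typed column; the element type is known at compile time.
struct C1<T: Sized>(Vec<T>);

// C2: dynamically typed elements, but the number of such columns is known at
// compile time, so their metadata can live in a fixed-size array.
struct C2<const N: usize>([ColumnEnum; N]);

// C3: same element representation, but the column count is only known at
// run time, so the metadata lives in a Vec.
struct C3(Vec<ColumnEnum>);

// C4: pretty much any type, boxed behind `Any` (black-jack-style storage).
struct C4(Vec<Box<dyn Any>>);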

PS: Apologies for walls of text.

Especially since my previous long comment was mostly just a somewhat off-topic reply and I don't know whether the idea presented in this comment is any good either. ;-)

FelixBenning commented 4 years ago
1. Isn't this off-topic here? Why didn't you post it on #3?

No, but I apparently did not manage to get my point across very well. I am not interested in this particular example, i.e. I am not interested in a certain use case, which is why this does not belong in #3. I instead tried to make a more general point using this as an example. So I'll take another shot at clarifying what I meant:

I am going to play dumb and claim there are only 5 types of objects in Rust:

So we can simply create an enum of these 5 types and have covered all possible types. This is of course not quite true. So what are the issues with this statement?

  1. Sizes (i8, i16, i32, i64, i128, i256, i512, f8,...), but since there are only finitely many sizes which are actually used, this is not a fundamental problem of this claim, and would still allow an enum
  2. Structs Of course you can create arbitrarily many more objects using structs. But the thing is: if you have struct data entries in a dataframe, you are arguably dataframing wrong. Every row is literally a list of all the properties of one "object"/measurement. If some property has sub-properties, you should arguably flatten this into additional columns. If you really want to retain some hierarchical structure of properties, you can achieve that with multi-indexes.

But multi indexes could simply be implemented by creating a multiIndexDataframe, which is simply a list of enums with the two options multiIndexDataframePtr and dataframePtr. This way you can have arbitrarily deep multi indexes.

So what is left is basically just 5 possible types (times possible sizes). Which is perfectly suited for an enum. The only thing this could not deal with is oversized types, if someone starts to implement them. But I very much doubt that anyone is really going to need larger than 512 bytes for any kind of numeric value. And for SIMD we only need to keep our enum up to date with the current chip generation, which is very little work.

The only thing which might possibly cause issues, is String lengths. But if someone tries to store books in dataframe cells, then they probably care so little about performance, that we can also store string pointers.


Another argument is practicability (the above should be completely sufficient, but I started with this argument, so I will try to clarify this a bit more in order to explain my past comments):

When people use a dataframe library, they want to use prebuilt methods and functions. They want to be able to use stuff like import_from_csv as you suggested, they want to use .mean(), they want to use .fit_least_squares_linear_regression(), etc. You cannot possibly provide these features if you do not know the type of objects they are supposed to apply to. And I tried to illustrate the issue of implementing a .fit_least_squares_linear_regression() by arguing how difficult it would be to implement this for an arbitrary enum (Sex, Smoker, etc.). For this reason you would actually need some dummyfication on enums. You could have a default implementation for this dummyfication, but in a lot of cases you would have to override this implementation as explained above. That would currently be a problem, since you cannot reimplement traits. But since specialization is coming, this would eventually solve itself. For this reason I would actually rewrite my original list of types into

cc @dexterlemmer

dexterlemmer commented 4 years ago

[Edit6: Added section impl4 is a superset of all of impl1/2/3] [Edit5: Added more column conversions for impl4.] [Edit4: A bunch of u8s should've been usizes] [Edit3: Edited in a new paragraph into ### 3. We can (and IMHO should) do better than impl1 (df_crate::ColEnum) by using impl2 (Any+metadata). It's marked.] [Edit2: Strike through text that isn't valid any more.] [Edit: Fixed a typo.]

Sorry everyone. I was an ass. I wrote 3 very long, badly structured and badly thought through comments above[^Note to self @dexterlemmer never write stuff on the Internet when you're tired! You knew that already.]. I'll delete my previous posts and write below what I should have from the start.

Terminology note: Implementations.

The OP talks about three implementation strategies for dataframes with heterogeneous column types. After this section, I'll just refer to them as impl1/2/3. Here's a recap of the implementation strategies:

  1. impl1: an enum of supported column types (either per column or per value).
  2. impl2: Any-based storage, plus run-time metadata relating columns to their data types.
  3. impl3: cons-lists (HLists) for full compile-time typing.

I'll also propose a new implementation (and API), impl4, which is laid out in section 4 below.

Overview

  1. Explanation of impl3.
  2. I feel impl3 (HLists) alone is insufficient.
  3. We can (and IMHO should) do better than implementation 1 by using implementation 2.
  4. Proposal: Impl4: Column<T>, AnyCol<T>, Row<S>, a bunch of DF types mirroring std's Tuple, Array and Vec, and finally some private buffers.
  5. I would be willing to try implementing impl4 myself, but...

1. Explanation of impl3 (HLists)

This is simply a way for us to fully statically type our heterogeneous columns. In Rust, the frunk crate implements HLists. Static typing is great. Most of what makes Rust great is that it's statically typed by default. Sure, many people probably won't claim they like Rust because it's statically typed, but I can guarantee you that almost anything most people might say is great about Rust either requires static typing or interacts with other great features in a way that requires static typing.

Here's an example of using HLists from frunk's README (note the compiler allows this to type check and would've raised a type error if the types of the elements didn't match):

use frunk::hlist;

// You can convert a tuple to an HList and vice-versa
let h2 = hlist![ 42f32, true, "hello" ];
let t: (f32, bool, &str) = h2.into();
assert_eq!(t, (42f32, true, "hello"));

The advantage of HLists over tuples is that they are lists, so we can do things that are impossible with tuples such as iterate over an HList, pushing and popping elements, etc.

2. I feel impl3 (HLists) alone is insufficient.

HLists' strength (full static typing) is also their weakness:

  1. We cannot dynamically change the types, number or order of columns. Furthermore, we also cannot implement abstractions that internally require dynamic columns. For example:
    // This will not compile unless the compiler can infer the type from other code
    let _ = dataframe::read_csv("/path/to/data.csv");
    // We need to do this instead
    let _ = dataframe::read_csv::<usize, f32, f32, f32, u8>("/path/to/data.csv");
  2. HList-based dataframes are un-ergonomic:
    1. Ugly type errors are common. The compiler might tell you
      HCons<bool, HCons<&str, HCons<Option<usize>, HNil>>> does not match HCons<bool, HCons<&str, HNil>>

      rather than simply

      (bool, &str, Option<usize>) does not match (bool, &str)
    2. You cannot implement any trait designed for runtime values rather than for static types. For example, the Index and Iterator traits are impossible to implement for dataframes based on impl3. Instead, users will have to use macros written by us or by the frunk crate to access their data.

3. We can (and IMHO should) do better than impl1 (df_crate::ColEnum) by using impl2 (Any+metadata)

My whole series of posts started in reaction to @FelixBenning, who stated that the restriction of impl1 (dataframes based on impl1 support only a specific limited set of types) isn't a problem since any data type can be encoded in dataframes with only a few types. He showed as an example that we can one-hot encode any enum with unit variants.

To some extent I agree with @FelixBenning: we can indeed encode any data type. We can, for example, also encode an array by splitting its elements across multiple columns, encode a vec by using an additional dataframe that encodes the items of the vec as separate rows, etc. Heck, since this is Rust, we can probably encode any type's digital representation (or that of a pointer pointing to it) and (with a considerable amount of effort) teach the borrow checker how to ensure accessing the data is still safe!

But just because we could, doesn't mean that we should! All encodings lose convenience and type safety. Most encodings also have significant performance and/or memory overhead -- quite an important consideration in data science!

Edit: I've said the following in one of my comments I've instructed people to skip:

Why would we arbitrarily not allow our user to use Column<NonStandardFloat1024>.median() if he wants to? And why should we arbitrarily force a user to wait until runtime before he gets an error informing him he accidentally tried to call Column<f32>.add::<Column<bool>>() in a situation where he already knows the datatypes of all his columns at compile time?

I assert that while dataframes based on impl1 are useful in separate crates, it would be a shame if we decide to focus the community's efforts on impl1. And as I understand it, focusing the community's efforts is what rust-dataframe -- and therefore this issue thread -- is all about.

4. Proposal: Impl4: Column<T>, AnyCol<T>, Row<S>, a bunch of DF types mirroring std's Tuple, Array and Vec, and finally some private buffers.

Types

We provide two column types:

We provide a single row type:

We provide four dataframe types:

Finally, none of the above column-, row- or dataframe types actually store any data. They're merely APIs over two hidden data buffers:

Indexing and slicing

For a start, we can probably simply provide the first 6 index types given below. Eventually we may support the rest of the index types below as well:

  1. usize for normal array-like indexing.
  2. Range<usize> for normal array-like slicing.
  3. (Range, Range) for slicing a DF conveniently. (It's not just about not requiring multiple sets of brackets. It's also about being able to index into a row at all (see below).)
  4. (usize, usize) for consistency given the first three index types.
  5. (usize, Range) for consistency given the first four index types.
  6. (Range, usize) for consistency given the first four index types.
  7. const usize or TypeNum for normal array indexing (with compile time bounds checking when used to index a ColsTuple).
  8. HLabel<H:Hash> for indexing by label, backed by a HashMap<HLabel, usize>.
  9. OLabel<O:Ord> for indexing by label in a DF or Column that is sorted by label.
  10. ColIndex, RowIndex and DFIndex. These are like &usizes. However they borrow from a Column or DF and are typed in such a manner that no bounds checking is ever necessary. (The borrow checker will ensure that they can never refer to a different Column or DF than the one we're using them to index and that we can never have shared mutable state between them and the Column or DF we're accessing.)
  11. Unique zero sized types (ZST's) for those who want statically typed labels. Note that these would only be available for a Row<Tuple<..>> and for a ColsTuple, but not for a Row<[T]>, ColsArray, Row<vec<T>> or ColsVec.

Indexing and slicing a Column<T> (and hence also an AnyCol) should be just like indexing into a vec from an API point of view. Internally, impl Index<T> for Column<T> should use std::mem::transmute to convert between T and &[u8] since they store their data in their internal ColumnBuffer, which stores only u8s.

Indexing into a DF returns the column at that index. Note that we can always compute the position of the column in the buffer if we store stride information.

We'll also need to provide row indexing, with AbstractDF<S, R:Row<S>>::index::<(Range, usize)>() -> &R and AbstractDF<S, R:Row<S>>::rows()[usize] -> &R.

Slicing a DF always returns a &DF, so for example df_tuple[..] simply returns a &DFTuple, etc.

WARNING: There's a little gotcha here: between how slicing of a DF works and the way that the Index trait works, we must use a tuple index if we want a row. Since df_tuple[..] returns a &ColsTuple, trying df_tuple[..][i] returns a &&Column<Ti>, i.e. the i'th column (not row). That said, letting a DF return a &DF when it's sliced is very important for ergonomic and safe windowing. It also allows future-proofing (for example making the buffer partitioned like in Dask). It's also consistent with the way arrays and vecs work, iirc.

Iterators

We should definitely implement both column-iterators and row-iterators. Note that implementing Row<(T1, T2, ..., Tn)>::into_iter() would be impossible in general. We could provide macros, but I think we should just refer the user to frunk and/or suggest he convert into a Row<[T]>. To make using frunk easy for our users, we should also support Row<HList<..>> even though none of the proposed dataframe types uses it directly.

Column conversions

Edit: We should provide at least the following conversions between our columns and other collections:

  • impl From<Column<T>> for &[T] and impl From<&[T]> for Column<T>
  • impl From<Column<T, D>> for ndarray::Array<T, (1,D)>, impl From<Column<T, D>> for ndarray::ArrayView<T, (1,D)> and impl From<ndarray::Array<T, (1,D)>> for Column<T, D>.
    • Note that here we have a dimension type parameter in column that I didn't include anywhere else. I just didn't want to unnecessarily complicate the rest of the code and furthermore, for a start we don't need the dimension parameter. For a start, we can just convert to/from ndarray's with dynamic dimensions and not bother with static dimensions ourselves.
    • Note that we may need to explicitly provide column conversions to/from AnyCol and ExternalCollection<T: impl NumericWhatever>. I'm not sure how well Any would work without that.
  • Note that we'll need to use std::mem::transmute to implement any of these conversions.

We should provide From<Column<T>> for AnyCol and From<AnyCol> for Column<T> or something similar. Note, I don't think the above mentioned conversion traits are legal, so we should come up with a good alternative.

Row conversions

We should impl From<S> for Row<S> and impl From<Row<S>> for S for all four types of Rows (i.e., tuple, hlist!, array and vec).

Converting between a Row<[T]> and a Row<Vec<T>> would be straightforward. So would converting between a Row<(T1, T2, ..., Tn)> and a Row<hlist![T1, T2, ..., Tn]>.

We should also (if possible) provide the ability to convert between a Row<[T]> and a Row<(T1, T2, ..., Tn)>. The conversion would require converting between Ti and dyn Any or something similar. Note that this will require macros, if it is possible at all.

impl4 is a superset of all of impl1/2/3

If anyone still thinks that impl1/2/3 has some advantage, do note that impl1/2/3 are all available to our users as a direct consequence of impl4's design:

I would be willing to try implementing impl4 myself, but...

But it would be in my free time and I'll first need to re-acquaint myself with Rust, because I've not really worked with Rust in about two years. Therefore, it'll probably take several weeks before I can even make a start. So if someone else agrees with me that we should go for impl4 and decides to run with it himself, that might be better.

dexterlemmer commented 4 years ago

Edit: You might want to skip this

OK. Since @FelixBenning has replied to one of my previous comments while I wrote the above one, I'd rather not delete my previous comments.

@FelixBenning. I'll read your reply and get back to you.

dexterlemmer commented 4 years ago

Edit: You might want to skip this if you're not @FelixBenning...

@FelixBenning I think I've answered you already, here:

3. We can (and IMHO should) do better than impl1 (df_crate::ColEnum) by using impl2 (Any+metadata)

...

But just because we could [encode literally any data type -- even those @FelixBenning thinks we cannot], doesn't mean that we should! All encodings lose convenience and type safety. Most encodings also have significant performance and/or memory overhead -- quite an important consideration in data science!

I assert that while dataframes based on impl1 are useful in separate crates, it would be a shame if we decide to focus the community's efforts on impl1. And as I understand it, focusing the community's efforts is what rust-dataframe -- and therefore this issue thread -- is all about.

(The text in brackets was not in the original.)

I'll just add that I can see a user actually needing a prebuilt function/method for some type we don't provide. Who says Sex has no prebuilt functions or even that Iterator<Item=Sex> has no prebuilt functions? I also see the user using all sorts of data types that may make perfect sense to put into a dataframe but that we arbitrarily do not allow (DNABasePair, Angle, Point, Color, ...).

Btw. You may have a misunderstanding about functions like mean in Rust. We shouldn't impl mean ourselves!!! We should just impl Iterator<T> for Column<T> and impl From<Column<T>> for &[T] and impl From<Column<T>> for ndarray::Array<T, _> and the user gets an efficient mean for free from ndarray::Array<T, _>. It's also trivial for the user to add SIMD, or whatever else he might want to do (or for us). The only reason we might want to provide our own mean is if it either isn't available on any container or iterable we provide conversion methods for, or we think we can provide better performance or additional functionality, or if we want to provide it as a convenience method. That said, we might very well want to impl Mean for AbstractDF<S, R:Row<S>>. We can't. But we can literally implement it for every concrete type of dataframe (perhaps not for ColsTuple) as long as the column types are all numeric.
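To make the "conversions instead of re-implementing mean" point concrete, here is a minimal sketch (Column is a hypothetical stand-in type, not an existing crate's API; only ndarray's Array1::from and mean are real):

use ndarray::Array1;

struct Column<T>(Vec<T>);

impl<T> From<Column<T>> for Array1<T> {
    fn from(col: Column<T>) -> Self {
        // Hand the column's buffer to ndarray without copying element by element.
        Array1::from(col.0)
    }
}

fn main() {
    let col = Column(vec![1.0f64, 2.0, 3.0]);
    let arr: Array1<f64> = col.into();
    // `mean` comes for free from ndarray; the dataframe side never implements it.
    assert_eq!(arr.mean(), Some(2.0));
}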

Furthermore, I see my impl4 as letting the user decide what functionality he needs for his use case: Static or dynamic typing, encodings or a single column storing weird data, etc. It's his choice. As it should be.

dexterlemmer commented 4 years ago

Edit: You might want to skip this if you're not @FelixBenning...

Oh. BTW, @FelixBenning. Just in case I still haven't gotten my point across very well:

There's no reason whatsoever why a user cannot use mean() without type parameters on any specific column in an impl3::dataframe (HLists), even though he cannot get an import_from_csv() without type parameters. Neither is there any reason at all the user cannot use a prebuilt .fit_least_squares_linear_regression() on any column of an impl3::dataframe or on an impl4::ColsTuple. Sure, it won't work if the column stores Sex. But who cares?! If the user wants to fit a least squares regression on Sex, he either one-hot encodes it, or he has a sensible conversion to a numeric datatype available, or he's crazy :wink:. Whichever the case may be, it's not our problem. (Admittedly, .fit_least_squares_linear_regression() may not be available for (Column<f32>, Column<f32>), even though it is available for [Column<f32>]. But the user might not need it, or he might be willing to manually convert the tuple to the array in return for the very real advantages of static typing.)

Let's use a shorter function name now: If .median() isn't implemented for Sex, the compiler won't allow him to even try it on an impl3::Column<Sex> or on an impl4::Column<Sex>. But the compiler will still allow him to use Column<NonStandardFloat1024>.median() on the column right next to the Column<Sex> in the same dataframe (for all dataframes defined in impl3 and impl4) -- assuming the crate defining NonStandardFloat1024 also implements Median for NonStandardFloat1024 and/or Ord for NonStandardFloat1024.

Why would we arbitrarily not allow our user to use Column<NonStandardFloat1024>.median() if he wants to? And why should we arbitrarily force a user to wait until runtime before he gets an error informing him he accidentally tried to call Column<f32>.add::<Column<bool>>() in a situation where he already knows the datatypes of all his columns at compile time?

All of the above said. Sometimes static typing just doesn't make sense and sometimes it does make sense but not enough sense to overcome the disadvantages of working with HLists (or tuples). However, I've never thought nor claimed that impl3 or impl4::ColsTuple is always the correct choice.

FelixBenning commented 4 years ago

[Edit 1: changed the multilevel dataframe implementation] [Edit 2: why I am so bold to claim that this row casting would work even though I do not know how to store type information dynamically]

@dexterlemmer I only skimmed your proposal since I am in my exam weeks currently. But from what I got you essentially propose to implement an enum of first class citizens and handle the rest with dynamic typing (any + metadata). That is fair as a compromise between efficiency and flexibility, but also means we have to do both, which is double the work. And then you propose some implementation detail which is unlikely to be relevant to the high level discussion we have. So I feel like I can still contribute meaningfully without having read that in detail.

I think we all agree that we want to avoid dynamic typing as much as possible. So I think it is worth discussing how far we can get with enums of types, since everything we cannot cover will end up in the fallback case.

I'll just add that I can see a user actually needing a prebuilt function/method for some type we don't provide.

Good point - I completely forgot about methods. I also realized that I forgot pointers (i.e. arrays, lists, etc.).

I still think that we can deal with structs though, even with methods. I'll use the Point object as an example for illustration: Let's say you make a measurement at a point. Then you would have

struct Point {
  x: f64,
  y: f64,
}
struct Measurement {
  value: f64,
  point: Point,
}

If we would not care about performance at all, we could just store any kind of data as a list of structs. Where the struct is the measurement and thus fundamentally heterogeneous. Issue is: most often you want to access the same field for all the measurements and not the entire measurement. For this reason dataframes want to store things as a struct of arrays and not an array of structs.
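To make the layout difference concrete, a small sketch (the ArrayOfStructs/StructOfArrays names are just illustrative, and the Point fields are flattened directly into the struct):

struct Measurement { value: f64, x: f64, y: f64 }

// Array of structs: one struct per row; scanning only `x` strides over
// `value` and `y` as well.
type ArrayOfStructs = Vec<Measurement>;

// Struct of arrays: one contiguous column per field; scanning only `x`
// touches contiguous memory, which is the layout a columnar dataframe wants.
struct StructOfArrays {
    value: Vec<f64>,
    x: Vec<f64>,
    y: Vec<f64>,
}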

So how would you store that in a dataframe? Well, you could store it as

| id | value | point |
| --- | --- | --- |
| 0 | 10 | (10, 20) |
| 1 | 12 | (12, 10) |
| 2 | 9 | (8, 15) |

But what if you suspected that only the x coordinate of the point determines the value? Well, then you would only query the x coordinate, and we would again run into the same speed-of-access discussion as above, which would probably result in a multi-index dataframe:

| id | value | point x | point y |
| --- | --- | --- | --- |
| 0 | 10 | 10 | 20 |
| 1 | 12 | 12 | 10 |
| 2 | 9 | 8 | 15 |

You could implement dataframes like this, and get multi-indices for free

enum MultiColumn{
  multiC(&Dataframe),
  staticC(&ColumnStatic),
  dynC(&ColumnDynamic)
}
struct Dataframe{
  columns : Vec<MultiColumn>
}

For the ColumnStatic type, you would have an "only simple datatypes" guarantee, hierarchical dataframes would be able to store structs, and ColumnDyn would deal with the rest (e.g. lists).

Method Problem

Fundamentally any Row is always an object (i.e. a measurement). So why not simply include a type field in the Dataframe?

struct Dataframe{
  structType: ?,
  columns : Vec<MultiColumn>
}

You could store the type Point in this type field, and the Row struct in the top-level dataframe. (I am actually unfamiliar with how you can store type information dynamically, so there is the question mark, but since you are proposing something like this for columns, it should also work for dataframes, right?)

Edit 2: Somewhere there must be a list of methods/method pointers for type X into which I can pass my initialized object of type X. These methods would receive a self pointer, which points to a region of memory where all the fields of the struct are placed in order. So you could just place the information in the right order and pass this pointer to the method, and it should apply the desired method.

So let's say you want to apply a method to every point in your dataset. In order to do that, you iterate through the rows of the lower-level dataframe, cast every row to Point, apply the method and store the result in your dataframe as a row. Since you iterate through your dataframe, you can still fetch aligned slices from memory. It might even be possible to avoid fetching the entire row, if it is possible to figure out what fields a method uses ahead of time. For example: a shift_point_to_the_right method would only need the x coordinate.

Edit 2: This efficiency gain is probably not possible right now, since methods probably do not have a header which tells us what fields they use. But maybe it is possible to build something like this into the compiler in the future, and you could activate this feature with a method decorator or so. Even better than a header: whenever the method uses the self pointer to fetch a variable, the compiler converts that into fetch self+i, where i is the number of the field where the variable is. So what if the compiler would convert that into fetch df[j][i], where df[j] is the row we want to treat like a struct? It could then just fetch df[j][i], ..., df[j+8][i], which is contiguous memory, since we are storing columns.

So there is the potential to gain a lot of efficiency, by explicitly not saving structs as structs.

Additionally, compatibility with Arrow would force us to implement something like this anyway, as we would need a method which converts our list of structs into a struct of lists in order for it to be saved in a format Arrow would accept.

Arrays, Lists, Vec, etc.

You could convert same-length iterables into columns. But beyond that we are probably forced to save a pointer and type information in an Any column. At least I cannot think of any better idea right now.

Built in Statistics

Neither is there any reason at all the user cannot use a prebuilt .fit_least_squares_linear_regression() on any column of an impl3::dataframe or on an impl4::ColsTuple. Sure, it won't work if the column stores Sex. But who cares?! If the user wants to fit a least squares regression on Sex, he either one-hot encodes it, or he has a sensible conversion to a numeric datatype available or he's crazy

I think the main types of data are either numeric, categorical or strings (and of course collections of such types - mostly structs). Categorical data (i.e. slim enums) should have sensible defaults for conversion into dummy variables. This would make statistics/machine learning so much easier, if you can just rely on the dataframe framework to do the necessary conversion if needed. Or specify the conversion with very simple syntax. (Similarly missing data should of course be treated, but bitmasks can be slapped on any data type so I am not that worried).

A dataframe without quality of life improvements for statistics is only a database. So prebuilt features are quite important. And dynamic types can make those really difficult. I mean, even if you think that this is not the task of the dataframe itself - libraries built on top of the dataframe have to deal with these issues. And if you make their life incredibly difficult by forcing them to deal with arbitrary types, then you won't get many takers.

mamcx commented 4 years ago

I'm building a relational language on the side and hit the same issue: how to actually store data and yet provide easy operations on it. For a language this looks harder to me, but a library can constrain the problem better. These are some ideas I have about this:

Instead of thinking in terms of method + traits, do method + arity (like kdb/J).

Rust makes things a little convoluted here: it is not possible to easily say "support + for all numeric types" without dancing with traits. I solve it kdb+/J style: I type all the ops to only care about their arity and define 2 possibilities:

This means that I can pass freely:

data.column(apply: std::ops::add,....)
data.column(apply: StrOps::concat,....)
... etc

This, I think, solves:

I'll just add that I can see a user actually needing a prebuilt function/method for some type we don't provide.

If the dataframe API only worries about arity, ANY function can be supplied.

This one is the harder to implement, but I think transducers provide the answer. Here, we *don't* supply the actual storage (or more exactly, we supply it later for convenience) and only provide transducers that operate on anything dataframe-like.

This aligns with the idea of Building Efficient Query Engines in a High-Level Language. Combined with the potential of transducers to support not only static storage, but also to be fed by vectors, channels, iterators, streams, etc., we could get the best of all worlds!

This opens the possibility to do magic like:

let titanic = pd.read_csv("data/titanic.csv") <-- user only care for dataframe op, give own source
let query = transducer::map(titanic,  select_field("age") | toi32 <-- dynamic parse csv somehow) <-- dataframe op
let mean = query.mean() <-- dataframe op

//But I wanna type!
struct Titanic{
  age: Vec<i32> <--columnar storage!
}
//implement dataframe....

let titanic: Titanic = pd.read_csv("data/titanic.csv").into() <-- now we want fast columnar ops
let query = transducer::map(titanic,  age <-- get statically typed, fast!)) <-- dataframe op, same
let mean = query.mean() <-- dataframe op, same

In other words: how about decoupling ops vs storage, so that instead of 2 camps fighting ("wanna static types!", "wanna dynamic data!") we marry them and allow picking the best backend?
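A rough iterator-based sketch of that decoupling (not real transducers, and every name below is made up for illustration): the same mean op is fed once by raw CSV-like lines and once by a typed columnar struct, echoing the pseudocode above.

// One op, written against any iterator of f64s -- it knows nothing about storage.
fn mean(values: impl Iterator<Item = f64>) -> Option<f64> {
    let (sum, count) = values.fold((0.0, 0u32), |(s, c), v| (s + v, c + 1));
    if count == 0 { None } else { Some(sum / count as f64) }
}

fn main() {
    // Dynamic backend: ages parsed straight out of CSV-like text lines.
    let csv_rows = ["age", "22", "38", "26"];
    let from_csv = csv_rows.iter().skip(1).filter_map(|l| l.parse::<f64>().ok());

    // Static backend: a typed columnar struct feeding the very same op.
    struct Titanic { age: Vec<i32> }
    let t = Titanic { age: vec![22, 38, 26] };
    let from_typed = t.age.iter().map(|&a| a as f64);

    // Same operation, two very different backends.
    assert_eq!(mean(from_csv), mean(from_typed));
}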

FelixBenning commented 4 years ago

I honestly do not understand how fixing the arity (number of arguments?) fixes our issues. I mean do you have types or not? If you have, then I do not see how arity changes anything. If you do not, I do not see how arity improves anything either. It just seems to introduce a limitation on operations you can use. But why? What do you gain from that?

mamcx commented 4 years ago

It helps in how to provide functionality BESIDES the types. It is only a part of the solution. Note that it was related to:

I'll just add that I can see a user actually needing a prebuilt function/method for some type we don't provide.

What I'm trying to say is that one problem is what functionality (ops) is provided (like .sum, .mean, etc.). If that depends on the internal types of the dataframe implementation, that is fixed. If I'm loading data from CSV, I might wish to do "sum" directly (mapping lines and converting the column to a number on the fly), without first loading into a vec of numbers and then doing the sum.

dexterlemmer commented 4 years ago

[Edit4: See below, it's not far down (relatively speaking).] [Edit3: Improved edit1/2 wording in the vicinity of edit2.] [Edit2: Fixed my code comment about FelixBenning's Multicolumn and how to implement it, to actually address how he has implemented it and not only the ways I show are truly broken and that would make MultiIndex a silly name as well. Oops. Edit2 also does: minor additions to edit1 near the main part of edit2.] [Edit1: A whole lot of updates marked in the text.]

[Edit 1: changed the multilevel dataframe implementation] [Edit 2: why I am so bold to claim that this row casting would work even though I do not know how to store type information dynamically]

Hehe. I actually do the same. At least in my case, I'm proposing what I think is very similar to how blackjack already does it (except that blackjack stores a hashmap of Any+metadata, rather than a vec/array/tuple of Any+metadata, like me).

@dexterlemmer I only skimmed your proposal since I am in my exam weeks currently. But from what I got you essentially propose to implement an enum of first class citizens and handle the rest with dynamic typing (any + metadata). That is fair as a compromise between efficiency and flexibility, but also means we have to do both, which is double the work. And then you propose some implementation detail which is unlikely to be relevant to the high level discussion we have. So I feel like I can still contribute meaningfully without having read that in detail.

OK. I definitely did not get my point across well. (Hopefully not because I was unclear but rather because you only skimmed my proposal. :smile: )

Edit1: We already have impl1/2/3 to handle some situations. I propose doing six times the work (in terms of functionality, two column types multiplied by three dataframe types), since it would be barely more than 1--2 times the work in practice, and we really do need a single dataframe crate that provides all the functionality necessary for others to build a data science stack on top of it and ndarray/nalgebra, IMHO.

The implementation detail is rather important. If you are referring to Column vs AnyCol, that's necessary, but I guess we could leave AnyCol for later. I'd rather not, though, since I want to make sure it works and we will probably eventually be asked to support it -- especially since impl4 needs to be competitive with impl2, which does provide AnyCol-like behavior. If you are referring to ColsTuple, ColsArray and ColsVec, all three are necessary for the full range of capabilities. (You might ask why we actually need ColsArray if we have ColsVec, but I'll ask you why we need both Array and Vec in std? Furthermore, adding ColsArray if we already have ColsVec would be virtually free.) If you are referring to DFBuffer, that's perhaps an implementation detail we can leave for later, but I didn't want ColsTuple, ColsArray and ColsVec, indexing, slicing, etc. to seem like magic. Also, you happen to force the user of our library to reinvent DFBuff himself, as I point out that you point out later. :stuck_out_tongue_winking_eye:

I think we all agree that we want to avoid dynamic typing as much as possible. So I think it is worth discussing how far we can get with enums of types. Since everything we can not cover will end up in the fallback case.

Actually in impl4 "enums of types" is a fallback case for when that is what the user wants. However, he gets to choose which enums (even his own) and he gets to choose on a per dataframe instance case.

Edit1: Additionally. Actually any one can choose. I'll show below how either we or another library crate building on top of ours or the user could implement dataframes for specific types.

I still think that we can deal with structs though, even with methods. I'll use the Point object as an example for illustration: Let's say you make a measurement at a point. Then you would have

struct Point {
  x: f64,
  y: f64,
}
struct Measurement {
  value: f64,
  point: Point,
}

If we would not care about performance at all, we could just store any kind of data as a list of structs. Where the struct is the measurement and thus fundamentally heterogeneous. Issue is: most often you want to access the same field for all the measurements and not the entire measurement. For this reason dataframes want to store things as a struct of arrays and not an array of structs.

You and I have very different understandings of the meaning of the word "heterogeneous" in the context of this discussion. That may be part of our other misunderstandings:

So how would you store that in a dataframe? Well you could store it as

| id | value | point |
| --- | --- | --- |
| 0 | 10 | (10, 20) |
| 1 | 12 | (12, 10) |
| 2 | 9 | (8, 15) |

Sure you can... In impl4::ColsTuple or impl4::ColsVec<IdValueTuple2Enum>. It's unlikely our own version of impl2 (which is what you suggest) would support the type (Int, Int), though. What's more likely is that it supports the type List<DType> (indeed arrow::datatypes::DataType does have a List variant), which I guess would also work :wink: . (Hey! I just noticed it has a Struct(Vec<Field>) variant as well. Cool! I wonder if it's as powerful as it seems. impl4 might just be overkill after all!)

But what if you suspected that only the x coordinate of the point determines the value? Well, then you would only query the x coordinate, and we would again run into the same speed-of-access discussion as above, which would probably result in a multi-index dataframe:

               point
id  value    x      y
0   10       10     20
1   12       12     10
2   9        8      15

You could implement dataframes like this, and get multi-indices for free

enum MultiColumn<'a> {
    MultiC(&'a Dataframe<'a>),
    StaticC(&'a ColumnStatic),
    DynC(&'a ColumnDynamic),
}

struct Dataframe<'a> {
    columns: Vec<MultiColumn<'a>>,
}

For the ColumnStatic type, you would have an "only simple datatypes" guarantee, hierarchical dataframes would be able to save structs, and ColumnDynamic would deal with the rest (e.g. lists).

Interesting proposal. I think impl4 is both more expressive and more intuitive, but I guess that would work. That said, I'd rather support all datatypes a user has any business putting in a dataframe, and save multi-indices for cases where the dataframe is truly hierarchical, rather than use them as a hack to let users encode unsupported datatypes.

Method Problem

Fundamentally any Row is always an object (i.e. a measurement). So why not simply include a type field in the Dataframe?

Not... quite. You're supposed to work with dataframes where every row is a measurement, but you might be handed dataframes by someone else that weren't built that way. It's often useful to be able to record data in formats where not every row is a measurement. Or so I'm told... And then I need to go "tidy up" (as R calls it) the mess. It would be a nightmare to tidy up the mess if my dataframe crate cannot even encode the mess (how do I even get the data imported?). Truthfully, the last time this happened to me was before I did any data science of my own, when I needed to turn a customer's Excel sheets into a normalized database. However, it's apparently still common enough that many R textbooks insist on spending a chapter or two teaching you how to do it with dataframes.

struct Dataframe{
  structType: ?,
  columns : Vec<MultiColumn>
}

You could store the type Point in this type field, and the Row struct in the top-level dataframe. (I am actually unfamiliar with how you can store type information dynamically, hence the question mark, but since you are proposing something like this for columns, it should also work for dataframes, right?)

So let's say you want to apply a method to every point in your dataset. In order to do that you iterate through the rows of the lower-level dataframe, cast every row to Point, apply the method and store the result in your dataframe as a row. Since you iterate through your dataframe, you can still fetch aligned slices from memory. It might even be possible to avoid fetching the entire row, if it is possible to figure out ahead of time which fields a method uses. For example: a shift_point_to_the_right method would only need the x coordinate.

Actually, you are starting to figure out what I did. Look here:

struct ColsTuple<S, R: Row<S>> {
  // [Edit1: `_marker: PhantomData` serves a similar purpose as your `structType` field,
  // except it allows the **compiler** to remember the type and the **program** to forget
  // the type.]
  _marker: PhantomData<(S, R)>,

  // [Edit1: `strides`, `buff` and `shape` (fields below) combine to reimplement
  // `Vec<R>`. The problem with trying to store `Vec<MultiColumn>` is as follows:
  // What is the datatype of `MultiColumn`? If it is a tuple, you're stuck with macros in
  // your **indexing**, i.e. you are restricted to impl3. Furthermore, if it is a tuple we
  // end up storing data row-wise, which is very slow in practice for typical use cases
  // of dataframes due to a lot of cache misses.
  // If it's a `Vec`, it won't compile since `Vec` isn't heterogeneous. Oh! And again, we're
  // storing the data row-wise.
  // [Edit2: And if `MultiIndex` works the way you've shown us above, then your proposal
  // becomes like my original (which was **more** complex than impl4), except that
  // you drop a lot of my original complexity and my original advantages. You don't say
  // how &ColumnStatic or &ColumnDynamic are implemented, but I'll give you a hint:
  // &ColumnStatic needing to be static and the recursive nature of your dataframe
  // (via MultiColumn) eventually rediscovers hlists as you try to make it work, and
  // &ColumnDynamic ends up working like my AnyCol. As you then battle the difficulties
  // involved in hlists and try to add fast and ergonomic indexing and the ability to
  // interface with anything that doesn't use hlists (including your &ColumnDynamic),
  // you find yourself needing to separate your storage from your API.
  // Additionally, you've now forced both us and our users to care about
  // multi-indexes, even if the user's index isn't naturally hierarchical and even
  // if we want to leave implementing multi-indexes for later.]
  // Therefore we are forced to implement our own heterogeneous vec-like thing
  // [Edit2: for the actual data storage. This vec-like thing should]
  // allow us to store our data column-wise. (Such a vec-like
  // thing [Edit2: (a vec-like thing capable of heterogeneous item types and
  // statically typed)] doesn't yet exist in Rust [Edit2: as far as I'm aware].)]

  // Note, we can rather use `strides: [usize]` if we also move this definition into the
  // builder macros below.
  strides: Vec<usize>,
  buff: *mut DFBuffer, // note the raw pointer
  shape: (usize, usize),
}

#[proc_macro] ColsTuple(...) -> ... {
  // Parse the user's code. Determine the types of his columns; for example,
  // if his column types are (respectively) T1, T2, ..., Tn, we do
  type S = (T1, T2, ..., Tn);
  // Check if S is already implemented; if not, do:
  impl ColsTuple<S, R: Row<S>> {
    // impl details
  }
  impl ColsTuple<S, R: Row<S>> for AbstractDF<R> {
    // impl details
  }
  // [Edit1: Also provide info necessary for the #DeriveGeneric*OnColsTuple macros
  // to know for which generic types they need to derive their trait impls.]
}

struct ColsArray<T, C: Column<T>> {
  _marker: PhantomData<([T], C)>,
  strides: usize, // note this is different from ColsTuple.strides
  buff: *mut DFBuffer, // note the raw pointer
  shape: (usize, usize),
}

struct ColsVec<T, C: Column<T>> {
  _marker: PhantomData<(Vec<T>, C)>,
  strides: usize,
  buff: *mut DFBuffer, // note the raw pointer
  shape: (usize, usize),
  capacity: (usize, usize), // This is new!
}

Note that all three Cols... structs in impl4 know their exact type at compile time. They zero-cost cast their types to &[u8] and back again using std::mem::transmute to read/write the buffer. (Though being able to do this may require all their types to impl Serialize and Deserialize, which is why I said above that my data may need to implement Serialize and Deserialize.) Also, the shift_point_to_right method you are talking about, and making it efficient, is why T: Sized and why either the dataframes or the buffer must store stride information. The user never gets to see any of this. He just gets a nice (and fast) API. I was inspired by how std's arrays, Vec and tuples, and array/vec slices are actually implemented (or how they work, if I understand correctly), hence the names ColsArray, ColsVec and ColsTuple.
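To make the &[u8] round trip concrete, here is a minimal sketch (the names are mine; it assumes the element type is plain-old-data with a stable layout, and it rebuilds the slice from a raw pointer rather than transmuting the slice reference directly, which is the safer way to reinterpret slice memory):

use std::mem::size_of;
use std::slice;

// View a column of plain-old-data values as raw bytes.
fn as_bytes<T>(col: &[T]) -> &[u8] {
    unsafe { slice::from_raw_parts(col.as_ptr() as *const u8, col.len() * size_of::<T>()) }
}

// Reinterpret raw bytes as a column of `T` again.
// Caller must guarantee alignment, that the length is a multiple of
// `size_of::<T>()`, and that the bytes are a valid bit pattern for `T`.
unsafe fn from_bytes<T>(bytes: &[u8]) -> &[T] {
    slice::from_raw_parts(bytes.as_ptr() as *const T, bytes.len() / size_of::<T>())
}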

Edit 2: This efficiency gain is probably not possible right now, since methods probably do not have a header which tells us what fields they use. But maybe it is possible to build something like this into the compiler in the future, and you could activate this feature with a method decorator or so. Even better than a header: whenever the method uses the self pointer to fetch a variable, the compiler converts that into fetch self+i, where i is the offset of the field the variable lives in. So what if the compiler would convert that into fetch df[j][i], where df[j] is the row we want to treat like a struct? It could then just fetch df[j][i], ..., df[j+8][i], which is contiguous memory, since we are storing columns.

You are on the right track. The compiler already does this. Like I said, my dataframe types work just like tuples, arrays and Vec internally: these types store pointers to an internal buffer, which get updated during indexing by compiler-generated code. Methods don't need to keep track of which fields they are operating on. They know which column, dataframe or &dataframe they are working on, and that type stores pointers to their start and computes strides at compile time, so that pointers for indexes can be computed efficiently at runtime.
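For illustration, the arithmetic boils down to something like this (a sketch with my own names, not impl4's actual code): in a columnar buffer, element i of a column is a fixed column offset plus i times the element size.

// `col_start` is the byte offset of the column inside the buffer and
// `elem_size` is `size_of::<T>()` for that column's element type; both are
// known from the dataframe's compile-time type information.
fn cell_offset(col_start: usize, elem_size: usize, i: usize) -> usize {
    col_start + i * elem_size
}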

So there is the potential to gain a lot of efficiency, by explicitly not saving structs as structs.

Yes... We save structs as &[u8]'s to gain efficiency, mostly so that all three representations of the same dataframe share a single identical buffer (i.e. ColsTuple, ColsVec and ColsArray would all use the very same DFBuffer if storing the same columns and data, so casting between them is zero cost). [Edit1: Actually, I realized that wasn't the only reason I ended up working this way. We really do need to store structs (or whatever types the user selects for our columns) as &[u8], because otherwise heterogeneous column types cannot be implemented without either dynamic typing (enums are dynamically typed, just less dynamically typed than AnyCol) or unergonomic, forced static typing (impl3, HLists).] However, the user doesn't know or care about any of this. He just gets a nice API which allows him to store structs of his choosing as structs of his choosing, and a nice back-end that's zero cost (or pretty darn close to zero cost). That's why it's called a zero-cost abstraction.

Additionally, compatibility with arrow would force us to implement something like this anyway, as we would have to have a method which converts our list of structs to a struct of lists in order for it to be saved in a format arrow would accept.

I don't understand what you are saying. Maybe I'm getting tired. Edit1: Oh, now I get it. I had forgotten what you said near the very top of your post. Nah! Your comment is not applicable to impl4, since it doesn't store lists of structs. It stores a heterogeneous list of columns (just like arrow), not a list of structs.

Arrays, Lists, Vec, etc.

You could convert same-length iterables into columns. But beyond that we are probably forced to save a pointer and type information in an any column. At least I cannot think of any better idea right now.

Basically, yes. Except that the type information is stored in a PhantomData, which simply type-erases it at runtime. We only care that the compiler knows the type, since it can then compute everything we need to implement the transmutations and work out the strides for us. What's more, you may not know this (but you probably do), but in Rust, enum values all have the same size even if their variants hold differently sized data, so...

enum DType { Int8(i8), Int32(i32) }

// Prints the size of the whole enum (the largest variant's payload plus the
// discriminant, rounded up for alignment), not 1 byte for the i8 payload.
println!("{}", std::mem::size_of_val(&DType::Int8(123)));
// Prints exactly the same amount as the previous line.
println!("{}", std::mem::size_of_val(&DType::Int32(1234)));

That's why enums can be accessed very fast in arrays (and in my own types). However, that also makes them very memory inefficient.

Built-in Statistics

Neither is there any reason at all the user cannot use a prebuilt .fit_least_squares_linear_regression() on any column of an impl3::dataframe or on an impl4::ColsTuple. Sure, it won't work if the column stores Sex. But who cares?! If the user wants to fit a least squares regression on Sex, he either one-hot encodes it, or he has a sensible conversion to a numeric datatype available, or he's crazy.

I think the main types of data are either numeric, categorical or strings (and of course collections of such types -- mostly structs). Categorical data (i.e. slim enums) should have sensible defaults for conversion into dummy variables. This would make statistics/machine learning so much easier, if you can just rely on the dataframe framework to do the necessary conversion when needed, or specify the conversion with very simple syntax. (Similarly, missing data should of course be handled, but bitmasks can be slapped on any data type, so I am not that worried.)
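The kind of default conversion being described would look roughly like this (a sketch only; the Sex enum and the free function are made up for illustration, not part of any of the proposals here):

#[derive(Clone, Copy)]
enum Sex {
    Female,
    Male,
}

// One-hot encode a categorical column into one f64 column per variant.
// Returns (column names, columns); a real dataframe API would wrap this.
fn one_hot(col: &[Sex]) -> (Vec<&'static str>, Vec<Vec<f64>>) {
    let names = vec!["sex_female", "sex_male"];
    let mut female = vec![0.0; col.len()];
    let mut male = vec![0.0; col.len()];
    for (i, s) in col.iter().enumerate() {
        match *s {
            Sex::Female => female[i] = 1.0,
            Sex::Male => male[i] = 1.0,
        }
    }
    (names, vec![female, male])
}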

Sure. But in your proposal, neither the numeric type NonStandardFloat1024 nor the numeric type Angle can reasonably be supported. I can see a user doing:

let mut df = ColsTuple::with_header! {
  obs: u8,
  angle: Angle,
  point: Point,
};
// This would work, even if `to_radians` isn't supported for `f32`,
// or whatever type your encoding uses.
println!("{}", df[1].to_radians().sum());
// This'll also work:
println!("{}", df[2].sum());
// But this won't work since points aren't ordered. It would likely (erroneously)
// work for an encoding of Point if `sum` works for the encoding.
df.sort_by_column(2);

Edit1: iirc, Rust does support packed enums -- which store their variants as small as possible per variant. However, I doubt packed enums are Sized, which means you cannot even store them in DFBuffer (unless our impl can somehow prove which variant is used in each column and that only that variant gets used throughout that column), let alone in a Vec (Vecs only store T: Sized). So that doesn't help us much in terms of memory efficiency.

A dataframe without quality of life improvements for statistics is only a database. So prebuilt features are quite important. And dynamic types can make those really difficult. I mean, even if you think that this is not the task of the dataframe itself - libraries built on top of the dataframe have to deal with these issues. And if you make their life incredibly difficult by forcing them to deal with arbitrary types, then you won't get many takers.

I couldn't agree more. :wink: That said, if we provide the right dataframe abstraction and some basic support for statistics, other crates can provide the rest, or we can add the rest later. For example, I've shown how we (or preferably the arrow crate) can build Arrow support on top of impl4. That said, arrow might want to do things differently and instead not bother with their own enum any more, and instead do this:

use impl4;

impl crate::io::ReadParquet for Column<u8> {
  /* impl details */
}
// And now implement every trait arrow wishes to support, for every type arrow wishes to support it for.

And the user can do this:

use impl4;
use arrow;

// This df has all dtypes coercible to the same type.
let df = ColsVec::read_parquet("path/to/parquet/dir");

// This df has incompatible types in different columns. The user is willing to
// accept runtime type checks and runtime dispatch overhead, i.e. dynamic typing,
// in return for convenience and memory efficiency.
let df2 = ColsVec::read_parquet::<AnyCol>("path/to/parquet/dir2");

// Here the user would prefer to use arrow's DType enum and pay the memory
// overhead for heterogeneous types that are fast and (partially) compile-time
// type checked, i.e. like arrow currently works and like you suggest we do.
let df3 = ColsVec::read_parquet::<DType, Column<DType>>("path/to/parquet/dir3");
// Or, since arrow probably added a type alias to their wrapper of impl4 for convenience:
let df3 = ArrowVec::read_parquet("path/to/parquet/dir3");

// And here the user is willing to accept some unergonomic code in
// exchange for a fully statically type checked dataframe with heterogeneous types.
// Note I assume arrow would actually be capable of supporting this, since they
// have an `arrow::datatypes::DataType::Struct` variant and, iirc,
// parquet does actually support saving columns of structs. But it might
// only work if the structs use only specific field types, which I think isn't
// an issue in this example.
let df4 = ColsTuple::read_parquet!((u8, Color, Point, Angle), "path/to/parquet/dir4");

Now isn't that nice!

PS. Hey, thanks. You make me really think hard about my proposed impl4, its capabilities and limitations, how to explain it all, and how I'll go about it when I finally get to write its code out. That said, I thought I was going to be able to actually code some Rust tonight to reacquaint myself with it... until I saw your post. Oh well.

PPS. Good luck with your exams.

dexterlemmer commented 4 years ago

[Edit4: Fixed bug in code.] [Edit3: Typo] [Edit2: Rewording. I think I should go sleep :smile: .] [Edit: Rewording.]

@mamcx Wow! Looks like you have given us an impl5 proposal. I think mine is better. But after my long answer to Felix (may I call you that, @FelixBenning?), I'm now too tired and hasty to really look into it. And it looks to me like you are still in the early figure-it-out stage that I was in when I wrote some of the posts I ended up editing to suggest people skip ahead.

I hope you can show us what you have in mind and write up a more complete and understandable proposal. Or that tomorrow I'm less tired and actually realize it's obvious and well written and I understand it. :smile:

On the other hand, I don't think the arity approach makes sense for a dataframe. Sure, it might make sense for databases, but if you have the full power of Rust and the full generality of a good Rust dataframe API, I don't think we should force our users (or ourselves) to worry about arity and ops. Let the compiler figure it out and simply (for example) make add work for, say, Point throughout our code, simply from the author of Point doing impl Add for Point and us doing

use std::ops::Add;

impl<T: Add<Output = T>> Add for Column<T> {
  type Output = Column<T>;
  fn add(self, other: Column<T>) -> Column<T> {
    // Assumes `Column<T>` implements `IntoIterator` and `FromIterator`.
    self.into_iter().zip(other).map(|(x, o)| x + o).collect()
  }
}

Now the user of our respective libs doesn't care about our impls at all, only that the compiler can find some impl of the functionality he uses somewhere.

use pointees; // made-up name for the crate that defines `Point`
use impl4;
...
df.append(df[1] + df[2]);

Now if it's some functionality that neither we nor the author of the type the user puts in his columns support, the user can write the missing impl himself, though he may end up needing to wrap the type in a newtype because of Rust's orphan rule (trait coherence).
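For reference, the wrapping is the standard newtype workaround for the orphan rule (a generic sketch; the module, trait and type names are made up, with local modules standing in for external crates):

// Stand-ins for items from two external crates. In a real project these would
// be foreign, and a direct `impl FancyOp for ForeignColumn` in the user's crate
// would then be rejected by the orphan rule.
mod upstream_df {
    pub struct ForeignColumn {
        pub values: Vec<f64>,
    }
}
mod stats_traits {
    pub trait FancyOp {
        fn fancy_op(&self) -> f64;
    }
}

use stats_traits::FancyOp;
use upstream_df::ForeignColumn;

// Wrapping the foreign type in a local newtype makes the impl legal.
struct MyColumn(ForeignColumn);

impl FancyOp for MyColumn {
    fn fancy_op(&self) -> f64 {
        self.0.values.iter().sum()
    }
}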

FelixBenning commented 4 years ago

@dexterlemmer Since I apparently have not understood what you are intending, and I am not quite sure you understood my intention either, I think that at this point it might be more productive if we just talk about it in person and clear up all misunderstandings - what about tomorrow, 11 June @ 20:00 Central European Summer Time (GMT+2)? Maybe the Rust Discord? (Actually, I'm not sure if it has channels.)

dexterlemmer commented 4 years ago

@FelixBenning I heartily agree that we should get this discussion off of this issue. We could try to talk in person. However, there isn't an on-topic channel on the Rust Discord yet. We could maybe friend each other and talk privately? I don't currently use any other suitable social media and I don't intend to either. Oh, we could also move the discussion to a new issue in this repo or on one of my or your repos. I made a possible start at https://github.com/dexterlemmer/impl4/issues/2#issue-636604559.

Note: it would make sense to still keep to the 20:00 GMT+2 time you've suggested even if we use GitHub instead of Discord. Whatever we decide to use, you might want to first read https://github.com/dexterlemmer/impl4/issues/2#issue-636604559, since I've already made a start there with two posts about (1) "What exactly are our goals with this discussion?" and (2) "Can we clean up the mess we've already made after we've decided we've reached our goal?"

etemesi254 commented 3 years ago

Hello!

TL;DR: Check out this GitHub repo: https://github.com/etemesi-ke/dami

An example: https://github.com/etemesi-ke/dami/blob/master/doc/10-minutes.ipynb

I'm quite late to this discussion, but I once needed a DataFrame implementation in Rust too, and came up with an implementation for heterogeneous columns using the following methods.

Adding a Series to a DataFrame then becomes

impl DataFrame {
    /// Add a series to the DataFrame
    pub fn add_to_df<T: Default + 'static + Clone>(&mut self, other: Series<T>) {
        // Maintain order of insertion
        self.names.push(other.get_name());
        self.values.insert(other.get_name(), other.get_dtype());
        self.get_appropriate_block(other);
    }
}

The get_appropriate_block() function finds an appropriate block for the Series.

The above-quoted GitHub repo uses an interesting way of grouping homogeneous columns together, because order doesn't matter except when printing.

Therefore, assuming we have 2 f32 columns and 2 String columns in the order (String, f32, String, f32), all Strings are grouped in one block and all f32s are grouped in another block.
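A rough guess at how such type-based block grouping could be organized (this is purely an illustration using std::any, not dami's actual implementation):

use std::any::{Any, TypeId};
use std::collections::HashMap;

// One block per element type; each block owns every column of that type.
struct Block<T> {
    columns: Vec<Vec<T>>,
}

// `Box<dyn Any>` lets blocks of different element types live in one map and be
// downcast back to `Block<T>` when a typed operation needs them.
struct BlockManager {
    blocks: HashMap<TypeId, Box<dyn Any>>,
}

impl BlockManager {
    fn add_column<T: 'static>(&mut self, col: Vec<T>) {
        let entry = self
            .blocks
            .entry(TypeId::of::<T>())
            .or_insert_with(|| Box::new(Block::<T> { columns: Vec::new() }) as Box<dyn Any>);
        // Cannot fail: the entry for this TypeId always holds a Block<T>.
        entry.downcast_mut::<Block<T>>().unwrap().columns.push(col);
    }

    // Apply `f` to every column whose element type is `T`, leaving other blocks untouched.
    fn apply<T: Copy + 'static>(&mut self, f: impl Fn(T) -> T) {
        if let Some(block) = self
            .blocks
            .get_mut(&TypeId::of::<T>())
            .and_then(|b| b.downcast_mut::<Block<T>>())
        {
            for col in &mut block.columns {
                for v in col.iter_mut() {
                    *v = f(*v);
                }
            }
        }
    }
}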

What about applying a function to columns in the DataFrame? Use the turbofish ::<Type, _>() syntax.

Consider a function that squares an f32:

/// Square a number
pub fn square(a: f32) -> f32 {
    a.powi(2)
}

To apply this function to a DataFrame, you could write

df.apply::<f32,_>(square);

It will find the block containing all f32 columns and apply the function to it, without touching any other block. This allows any type of function to be applied to the DataFrame, whether it's a transform (df.transform in this case) or a map.

Added Bonuses

Personally, I think that is quite amazing. What do you think?