go-gota / gota

Gota: DataFrames and data wrangling in Go (Golang)

Vector arithmetic #153

Open danielpcox opened 3 years ago

danielpcox commented 3 years ago

Adds a new Math method to dataframe.DataFrame that computes n-ary arithmetic functions over entire selected columns, storing the result in a new column (or replacing an existing one). It supports int and float64 types. The operator can be specified as a string (e.g., "+", "/", etc.) or as a unary, binary, or trinary int or float64 function (e.g., to supply a float64 function from Go's math package). For example:

/*  `df` is a 5x4 DataFrame:

   Strings  Floats   Primes Naturals
0: e        2.718000 1      1
1: Pi       3.142000 3      2
2: Phi      1.618000 5      3
3: Sqrt2    1.414000 7      4
4: Ln2      0.693000 11     5
   <string> <float>  <int>  <int>
*/
df := New(
    series.New([]string{"e", "Pi", "Phi", "Sqrt2", "Ln2"}, series.String, "Strings"),
    series.New([]float64{2.718, 3.142, 1.618, 1.414, 0.693}, series.Float, "Floats"),
    series.New([]int{1, 3, 5, 7, 11}, series.Int, "Primes"),
    series.New([]int{1, 2, 3, 4, 5}, series.Int, "Naturals"),
)

// New method `Math` takes a new column name, an operator (string or func), and at least one column name
withNewDiffColumn := df.Math("Diff", "-", "Floats", "Primes")

fmt.Println(withNewDiffColumn)

/* New DataFrame now has a column named "Diff" which is
    the result of subtracting Primes from Floats.

    Strings  Floats   Primes Naturals Diff
 0: e        2.718000 1      1        1.718000  
 1: Pi       3.142000 3      2        0.142000  
 2: Phi      1.618000 5      3        -3.382000 
 3: Sqrt2    1.414000 7      4        -5.586000 
 4: Ln2      0.693000 11     5        -10.307000
    <string> <float>  <int>  <int>    <float> 
*/

There are more examples in the docs and tests.
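
To illustrate the string-operator idea without depending on the PR's code, here is a minimal sketch that works on plain float64 slices rather than gota columns. The helper name `applyOp` is hypothetical and the left-to-right fold is an assumption about how an n-ary operation might be evaluated; it is not taken from the PR itself.

```go
package main

import (
	"fmt"
	"math"
)

// applyOp is a hypothetical helper, not part of gota. It folds the operator
// named by op from left to right across the operand columns, row by row.
func applyOp(op string, cols ...[]float64) ([]float64, error) {
	if len(cols) == 0 {
		return nil, fmt.Errorf("at least one operand column is required")
	}
	var f func(a, b float64) float64
	switch op {
	case "+":
		f = func(a, b float64) float64 { return a + b }
	case "-":
		f = func(a, b float64) float64 { return a - b }
	case "*":
		f = func(a, b float64) float64 { return a * b }
	case "/":
		f = func(a, b float64) float64 { return a / b }
	case "%":
		f = math.Mod
	default:
		return nil, fmt.Errorf("unknown operator %q", op)
	}
	n := len(cols[0])
	for _, c := range cols[1:] {
		if len(c) != n {
			return nil, fmt.Errorf("column length mismatch")
		}
	}
	out := make([]float64, n)
	for i := range out {
		acc := cols[0][i]
		for _, c := range cols[1:] {
			acc = f(acc, c[i])
		}
		out[i] = acc
	}
	return out, nil
}

func main() {
	// Same values as the Floats and Primes columns above, with Primes
	// already written as float64.
	floats := []float64{2.718, 3.142, 1.618, 1.414, 0.693}
	primes := []float64{1, 3, 5, 7, 11}
	diff, err := applyOp("-", floats, primes)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%.3f\n", diff)
}
```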

This PR also adds a new FindElem method to dataframe.DataFrame which lets a user pull a particular series.Element out of a DataFrame by specifying a column and value to select a row (assumed to be unique), and another column to find a particular value within that row. For example, the following line will search through the "Metric" column of each row for a value "envoy_cluster_upstream_rq_active", and then it will return the series.Element from that row corresponding to the "Value" column:

df.FindElem("Metric", "envoy_cluster_upstream_rq_active", "Value")

danielpcox commented 3 years ago

Thanks for your comments. I agree this can be reworked to accommodate more than math. However, I'd be very sorry to see the math-specific string flavors of op go, or to add more overhead to a simple operation such that it can no longer be performed succinctly in the 90% case.

Counter-proposal:

What if instead we split df.Math into three different new methods:

  1. The first would have signature Arithmetic(resultcol string, op string, operandcols ...string) DataFrame, which only takes the limited string ops ("+", "-", "*", "/", and "%"), and takes variadic operands like df.Math currently does. I'd also like Arithmetic to be allowed to coerce values, because its purpose is to make common operations as easy as possible, but more on that below.
  2. The second would have signature something like ElemMultiApply(resultcol string, op func(elements ...Element) Element, operandcols ...string) DataFrame, where the user passes a variadic function on Elements and however many columns, and it gets applied without coercion.
  3. The third would have signature something like FloatMultiApply(resultcol string, op interface{}, operandcols ...string) DataFrame and use the same techniques as in the current df.Math to support unary, binary, and trinary op functions on at least float64 values (and I'd really like to be able to automatically convert the ints in mixed operand columns to float64 as necessary to enable pleasant access to the math package on integers - but see below for coercion discussion). It would have to be able to support all three arities to be able to take any function from math directly.

No coercion?

Is the request not to do automatic coercion a gota policy set in stone? I thought there was already automatic coercion in gota. Capply and Rapply in the readme say "casting the types as necessary", and the function I used to figure out what the output should be (int or float64) was already there; I just moved it out of a function to make it accessible to Math.

Are you sure you wouldn't want it in, when it's only automatic coercion in one direction (int -> float64) and it only happens when the input columns are mismatched (at least one float64 column among the operands)? I would personally much prefer a concise API with a few well-documented potential gotchas to a verbose API that makes me do extra work in the most common cases. Coercion is also how I managed to make it possible without much ceremony to pass any function from Go's math package in as op and have it correctly apply to columns of mixed type (they get detected and cast to float64 to be compatible, and the output is always float64).
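
The one-directional rule described above (promote int to float64 only when at least one operand column is float64) might look roughly like the following. This is a sketch under assumptions: columns are modeled as `[]int` or `[]float64` values, and `sum` is a hypothetical function, not gota's implementation.

```go
package main

import "fmt"

// sum adds the operand columns row by row. Each column must be []int or
// []float64. The result is []int when every operand is []int; a single
// []float64 operand promotes the whole computation to []float64.
func sum(cols ...interface{}) (interface{}, error) {
	if len(cols) == 0 {
		return nil, fmt.Errorf("no operand columns")
	}
	anyFloat := false
	n := -1
	for _, c := range cols {
		var length int
		switch v := c.(type) {
		case []int:
			length = len(v)
		case []float64:
			length = len(v)
			anyFloat = true
		default:
			return nil, fmt.Errorf("unsupported column type %T", c)
		}
		if n >= 0 && length != n {
			return nil, fmt.Errorf("column length mismatch")
		}
		n = length
	}
	if !anyFloat {
		// All operands are int: no coercion happens.
		out := make([]int, n)
		for _, c := range cols {
			for i, x := range c.([]int) {
				out[i] += x
			}
		}
		return out, nil
	}
	out := make([]float64, n)
	for _, c := range cols {
		switch v := c.(type) {
		case []int:
			for i, x := range v {
				out[i] += float64(x) // the only coercion: int -> float64
			}
		case []float64:
			for i, x := range v {
				out[i] += x
			}
		}
	}
	return out, nil
}

func main() {
	ints, err := sum([]int{1, 2}, []int{10, 20})
	if err != nil {
		panic(err)
	}
	mixed, err := sum([]int{1, 2}, []float64{0.5, 0.25})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%T %v\n", ints, ints)
	fmt.Printf("%T %v\n", mixed, mixed)
}
```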

There's also type coercion in Pandas and R, and people seem to be able to handle it. I think a nice pile of warnings in the documentation would suffice, at least for me, and what we get for it is agility and API clarity. (And the reason I'm using gota in the first place is because idiomatic Go doesn't let me express a complex high-level thought succinctly enough to do it often.)

All of that said, I'm flexible here, and it's your project. :)

FindElem

As for FindElem, I think I can just remove that without sacrificing much. I currently perform the same operation in my existing code with df.Filter(...).Elem(0,1).Float() which is succinct enough. (If we do add something that gives you only the first match later though, I'd suggest First or FirstRow rather than Head, because df.head in Pandas shows the first n, defaulting to 5.)

As an aside, for when there are many rows, I was thinking of adding Index(columnname string) DataFrame, which would build an index mapping the values in that column to their row numbers. If a user chose to build such an index up front, anything that needed to search for a value (e.g., Filter) would make use of it to improve performance. That's still possible with the Filter-Elem-Float paradigm for looking up a value.
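
As a rough illustration of the index idea, a value-to-row map can be built once and reused for lookups. The sketch below uses plain slices; `buildIndex`, the second and third metric names, and all the numbers are made up for the example and do not come from the PR.

```go
package main

import "fmt"

// buildIndex is a hypothetical helper. It maps each distinct value in a
// column to the row numbers where that value occurs.
func buildIndex(col []string) map[string][]int {
	idx := make(map[string][]int, len(col))
	for row, v := range col {
		idx[v] = append(idx[v], row)
	}
	return idx
}

func main() {
	// Illustrative data only.
	metrics := []string{"metric_a", "envoy_cluster_upstream_rq_active", "metric_b"}
	values := []float64{1.5, 42, 7}

	idx := buildIndex(metrics) // one pass over the column

	// Each later lookup is a map access instead of a scan over every row.
	if rows := idx["envoy_cluster_upstream_rq_active"]; len(rows) > 0 {
		fmt.Println(values[rows[0]])
	}
}
```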

chrmang commented 2 years ago

Thank you for your detailed explanation. Let me explain my opinion about coercion.

In many cases it can be a great thing. It is especially useful when dealing with AI, because sooner or later all variables are floats and losing one or two digits of precision is not a problem. In other use cases this would not be acceptable; think of financial services. It can also cause bugs like https://github.com/go-gota/gota/issues/154 . That bug was in gota; other bugs could be in user code. It's all about the use case. Go is designed as a typed language, and we should use the benefits of compile-time type checking. You are right that gota is full of automatic coercion, but that is the preferred way in Python, R and JavaScript. What benefit would people have if we write a 1:1 copy of pandas in Go? Speed? pandas is written in C with a Python wrapper - it's already fast. What else if not type safety? In the future maybe I will find a way to replace interface{} with generics or generate or ... Nevertheless, we should move forward and improve the library's usability.
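
One concrete way an int -> float64 conversion can lose information in Go: float64 has a 53-bit significand, so not every 64-bit integer above 2^53 can be represented exactly. The small check below is a sketch of that general language behavior, not of the linked gota issue; `roundTrips` is a hypothetical helper.

```go
package main

import "fmt"

// roundTrips reports whether n survives a conversion to float64 and back.
func roundTrips(n int64) bool {
	return int64(float64(n)) == n
}

func main() {
	exact := int64(1) << 53 // 2^53 is exactly representable as a float64
	lossy := exact + 1      // 2^53 + 1 is not

	fmt.Println(roundTrips(exact), roundTrips(lossy)) // prints "true false"
}
```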

The idea of your counter-proposal is good. You know my opinion about coercion, but let's have a try. Go for it. Can you change your PR, please?

danielpcox commented 2 years ago

> What benefit would people have if we write a 1:1 copy of pandas in Go? Speed? pandas is written in C with a Python wrapper - it's already fast. What else if not type safety?

Hmm... I admit, in my particular case, the only reason I'm using Go is because I have to. So at least one benefit would be that people don't have to leave their preferred language (which is great for other purposes) to manipulate tabular data in a readable way. The company I work for has services written in Go, and we need a compact way to express quite a large collection of high-level operations on tables. I don't think idiomatic Go is the best language for doing that (partly because of static typing, but mostly because the for loop reigns supreme in Go, and because of inline error checking), but I can't rewrite someone's entire service just because I'd prefer to do the number crunching with Pandas. I was pleased to find gota because it bends a few of the laws that make Go especially painful for high-level data wrangling, and therefore finally made my code readable. It's not Go, it's a DSL, and that's my favorite thing about it.

> The idea of your counter-proposal is good. You know my opinion about coercion, but let's have a try. Go for it.

Do you mean "go for it" as-written, or without any automatic int->float64 coercion? If the latter, perhaps I should also add a fourth method to DataFrame that explicitly coerces a column's types, so there isn't a big drop in readability when the types don't match.