Open danielpcox opened 3 years ago
Thanks for your comments. I agree this can be reworked to accommodate more than math. However, I'd be very sorry to see the math-specific string flavors of op go, or to add so much overhead to a simple operation that it can no longer be performed succinctly in the 90% case.
What if instead we split df.Math into three different new methods:
1. Arithmetic(resultcol string, op string, operandcols ...string) DataFrame, which only takes the limited string ops ("+", "-", "*", "/", and "%"), and takes variadic operands like df.Math currently does. I'd also like Arithmetic to be allowed to coerce values, because its purpose is to make common operations as easy as possible, but more on that below.
2. ElemMultiApply(resultcol string, op func(elements ...Element) Element, operandcols ...string) DataFrame, where the user passes a variadic function on Elements and however many columns, and it gets applied without coercion.
3. FloatMultiApply(resultcol string, op interface{}, operandcols ...string) DataFrame, which would use the same techniques as the current df.Math to support unary, binary, and trinary op functions on at least float64 values (and I'd really like to be able to automatically convert the ints in mixed operand columns to float64 as necessary to enable pleasant access to the math package on integers, but see below for coercion discussion). It would have to support all three arities to be able to take any function from math directly.
Is the request not to do automatic coercion a gota policy set in stone? I thought there was already automatic coercion in gota. Capply and Rapply in the readme say "casting the types as necessary", and the function I used to figure out what the output type should be (int or float64) was already there; I just moved it out of a function to make it accessible to Math.
Are you sure you wouldn't want it in, when it's only automatic coercion in one direction (int -> float64) and it only happens when the input columns are mismatched (at least one float64 column among the operands)? I would personally much prefer a concise API with a few well-documented potential gotchas to a verbose API that makes me do extra work in the most common cases. Coercion is also how I managed to make it possible without much ceremony to pass any function from Go's math package in as op and have it correctly apply to columns of mixed type (they get detected and cast to float64 to be compatible, and the output is always float64).
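The one-directional rule described above (promote int to float64 only when at least one operand column is float64) can be sketched as follows. The column type and the anyFloat, asFloats, and add functions are hypothetical stand-ins invented for this illustration; they are not gota's series.Series or its actual detection function, and the sketch assumes both columns have the same length.

```go
package main

import "fmt"

// column is a hypothetical stand-in for a typed column (not gota's
// series.Series). Exactly one of ints or floats is populated, as
// flagged by isFloat.
type column struct {
	isFloat bool
	ints    []int
	floats  []float64
}

// anyFloat reports whether at least one operand column holds float64 values.
func anyFloat(cols ...column) bool {
	for _, c := range cols {
		if c.isFloat {
			return true
		}
	}
	return false
}

// asFloats returns the column's values as float64, converting ints if needed.
func asFloats(c column) []float64 {
	if c.isFloat {
		return c.floats
	}
	out := make([]float64, len(c.ints))
	for i, v := range c.ints {
		out[i] = float64(v)
	}
	return out
}

// add sums two equal-length columns element-wise. The result stays int
// when both inputs are int and becomes float64 otherwise.
func add(a, b column) column {
	if !anyFloat(a, b) {
		out := make([]int, len(a.ints))
		for i := range out {
			out[i] = a.ints[i] + b.ints[i]
		}
		return column{ints: out}
	}
	fa, fb := asFloats(a), asFloats(b)
	out := make([]float64, len(fa))
	for i := range out {
		out[i] = fa[i] + fb[i]
	}
	return column{isFloat: true, floats: out}
}

func main() {
	counts := column{ints: []int{1, 2}}
	weights := column{isFloat: true, floats: []float64{0.5, 0.5}}
	fmt.Println(add(counts, counts).ints)    // [2 4]
	fmt.Println(add(counts, weights).floats) // [1.5 2.5]
}
```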
There's also type coercion in Pandas and R, and people seem to be able to handle it. I think a nice pile of warnings in the documentation would suffice, at least for me, and what we get for it is agility and API clarity. (And the reason I'm using gota in the first place is because idiomatic Go doesn't let me express a complex high level thought succinctly enough to do it often.)
All of that said, I'm flexible here, and it's your project. :)
As for FindElem, I think I can just remove that without sacrificing much. I currently perform the same operation in my existing code with df.Filter(...).Elem(0,1).Float(), which is succinct enough. (If we do add something that gives you only the first match later though, I'd suggest First or FirstRow rather than Head, because df.head in Pandas shows the first n, defaulting to 5.)
As an aside, for when there are many rows, I was thinking of adding Index(columnname string) DataFrame, which would build an index from the values in that column to their row numbers, and if a user chose to build such an index up-front, anything that needed to search for a value (e.g., Filter) would make use of it to improve performance. That's still possible with the Filter-Elem-Float paradigm for looking up a value.
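As a rough illustration of the index idea, the sketch below maps each value in a column to the row numbers where it occurs, so a lookup becomes a map access instead of a scan. buildIndex and the sample column contents are hypothetical and operate on plain slices; nothing here is an existing gota API.

```go
package main

import "fmt"

// buildIndex maps each value in a column to the row numbers where it
// occurs. Hypothetical helper, not part of gota.
func buildIndex(col []string) map[string][]int {
	idx := make(map[string][]int, len(col))
	for row, v := range col {
		idx[v] = append(idx[v], row)
	}
	return idx
}

func main() {
	// Two parallel columns standing in for a small table.
	metric := []string{"rq_total", "rq_active", "rq_timeout"}
	value := []float64{120, 3, 1}

	idx := buildIndex(metric) // built once, up-front

	// Each later lookup is a map access rather than a pass over all rows.
	if rows, ok := idx["rq_active"]; ok {
		fmt.Println(value[rows[0]]) // 3
	}
}
```

Storing a slice of row numbers per value keeps the index usable when the column is not unique; a caller that expects a single match can take the first entry.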
Thank you for your detailed explanation. Let me explain my opinion about coercion.
In many cases it can be a great thing. It is especially useful when dealing with AI, because sooner or later all variables are floats and a loss of one or two digits of precision is not a problem. In other use cases, this would not be acceptable - think of financial services. And it can cause bugs like https://github.com/go-gota/gota/issues/154 . This bug was in gota; other bugs can be in user code. It's all about the use case.
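One concrete way the precision concern can show up: float64 has a 53-bit significand, so not every int64 survives a round trip through float64. The sketch below is only an illustration of that general Go behavior; roundTrip is a hypothetical helper and this is not a reproduction of the linked issue.

```go
package main

import "fmt"

// roundTrip converts n to float64 and back, as an implicit
// int -> float64 coercion followed by an integer read would.
func roundTrip(n int64) int64 {
	return int64(float64(n))
}

func main() {
	small := int64(1000000)
	big := int64(1<<53 + 1) // one past the largest exactly representable run of integers

	fmt.Println(roundTrip(small) == small) // true
	fmt.Println(roundTrip(big) == big)     // false
	fmt.Println(big - roundTrip(big))      // 1
}
```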
Go is designed as a typed language and we should use the benefits of compile-time type checking. You are right, gota is full of automatic coercion, but this is the preferred way in Python, R and JavaScript. What benefit would people have if we write a 1:1 copy of pandas in Go? Speed? pandas is written in C with a Python wrapper - it's already fast. What else if not type safety?
In the future maybe I will find a way to replace interface{} with generics or generate or ... Nevertheless we should move forward and improve the library's usability.
The idea of your counter-proposal is good. You know my opinion about coercion, but let's have a try. Go for it. Can you change your PR, please?
> What benefit would people have if we write a 1:1 copy of pandas in Go? Speed? pandas is written in C with a Python wrapper - it's already fast. What else if not type safety?
Hmm... I admit, in my particular case, the only reason I'm using Go is because I have to. So at least one benefit would be that people don't have to leave their preferred language (which is great for other purposes) to manipulate tabular data in a readable way. The company I work for has services written in Go, and we need a compact way to express quite a large collection of high level operations on tables. I don't think idiomatic Go is the best language for doing that (partly because of static typing, but mostly because the for loop reigns supreme in Go, and because of inline error checking), but I can't rewrite someone's entire service just because I'd prefer to do the number crunching with Pandas. I was pleased to find gota because it bends a few of the laws that make Go especially painful for high-level data wrangling, and therefore finally made my code readable. It's not Go, it's a DSL, and that's my favorite thing about it.
> The idea of your counter-proposal is good. You know my opinion about coercion, but let's have a try. Go for it.
Do you mean "go for it" as-written, or without any automatic int -> float64 coercion? If the latter, perhaps I should also add a fourth method to DataFrame that explicitly coerces a column's types, so there isn't a big drop in readability when the types don't match.
Adds a new Math method to dataframe.DataFrame capable of computing n-ary arithmetic functions against entire selected columns, storing the result in a new column (or replacing an existing one). Supports int and float64 types. Supports operator specification by string (e.g., "+", "/", etc.) or by a unary, binary, or trinary int or float64 function (e.g., for supplying a float64 function from Go's math package). For example:
There are more examples in the docs and tests.
This PR also adds a new FindElem method to dataframe.DataFrame, which lets a user pull a particular series.Element out of a DataFrame by specifying a column and value to select a row (assumed to be unique), and another column to find a particular value within that row. For example, the following line will search through the "Metric" column of each row for a value "envoy_cluster_upstream_rq_active", and then it will return the series.Element from that row corresponding to the "Value" column: