holgerbrandl / krangl

krangl is a {K}otlin DSL for data w{rangl}ing
MIT License
560 stars 50 forks source link

Rework API to refer to current row with `it` #122

Open LeandroC89 opened 3 years ago

LeandroC89 commented 3 years ago

Hello, maybe I'm just missing something but is there a way to reference current row?

For example during addColumn, is there a way to know which row is being populated at the time?

This would be useful for running sum like operations, and maybe to another problem I'm not being able to solve, which would be "How do we create a column that will copy the value from column A if it is not NA but copy it from column B if it is?", I know this is an entirely new question by itself but knowing what row I'm populating would allow me to solve it via workaround.

Thank you in advance!

holgerbrandl commented 3 years ago

What about using rowNumber?

On Wed, May 26, 2021 at 10:27 PM Leandro @.***> wrote:

Hello, maybe I'm just missing something but is there a way to refence current row?

For example during addColumn, is there a way to know which row is being populated at the time?

This would be useful for running sum like operations, and maybe to another problem I'm not being able to solve, which would be "How do we create a column that will copy the value from column A if it is not NA but copy it from column B if it is?", I know this is an entirely new question by itself but knowing what row I'm populating would allow me to solve it via workaround.

Thank you in advance!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/holgerbrandl/krangl/issues/122, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABRB6GARLIQTLZO2MXG6VTTPVKRJANCNFSM45SZ7VAQ .

LeandroC89 commented 3 years ago

Thank you, I knew the addRowNumber method but never thought about the variable.

It's not a direct solution (I may not be looking at this the right way) but since it is still an array and not a value per se, it can't easily be used.

Example would be data.addColumn("Row"){ rowNumber.toString() } would result in something like [1, 2, ...] (Just an example and not an useful test case)

If I wanted to retrieve one value from the previous row, I can't directly use rowNumber as it is an array.

The way I managed to do it was using the map() function

data.addColumn("Previous Value"){ rowNumber.map{entry: Int -> if (entry < 2 ) 0 else data["Value"][entry - 2] } } (It would also have worked with any column with unique values (and using filter) now that I think of it)

I'm leaving this open in case you'd like to add something to discussion (as like I said I may not be looking at this the right way). but please feel free to mark it solved as I was now able to solve my problem (using map() on the rowNumber).

Thank you for your help.

holgerbrandl commented 3 years ago

Regarding "How do we create a column that will copy the value from column A if it is not NA but copy it from column B if it is?": It's a pity that we can't implement vector valued operators in Kotlin. This would make the workflow so much more streamined (similar to R/python). Anyway, I guess you could solve this with

data.addColumn("AorB"){ it["A"].values().zip(it["B"].values()).map{ (a, b) -> a ?: b} }

Regarding your example, what about lead/lag (see https://github.com/holgerbrandl/krangl/blob/master/CHANGES.md#v013)

In general, I struggle myself with krangle's syntax very often, and don't find it intuitive in many situations. In particular when adding columns (which is the most frequent op imho). One way to overcome this would be to refer to just the row with it by default in addColumn and provide access to the entire data-frame via the lambda context. Similar to how filterByRow works

 df.filterByRow { (it["age"] as Int).rem(10) == 0 } // round birthdays :-)

// mock
// A or B if null
irisData.addColumn("equipment"){ it["A"] ?: it["B"]}
// mean subtraction
irisData.addColumn("equipment"){ it["A"] - df["A"].sum()!! }

Clearly, this would be a major breaking change, but I wouldn't mind anything that makes it more fun/convenient to use.

It would also come along with some new inconsistency in summarize where it must refer to the group/dataframe (or could be missing in favour of a df that comes via the extension-scope.

Any thoughts?

LeandroC89 commented 3 years ago

Regarding "How do we create a column that will copy the value from column A if it is not NA but copy it from column B if it is?" I had done it with a similar approach once I realized I could use map.

df = df.addColumn("NewColumn"){
    rowNumber.map{ entryNr: Int ->
            (df["A"][entryNr - 1]?: df["B"][entryNr - 1])
    }
}

Regarding the lead/lag, I was unaware of these functions and they would have been very useful in most of my use cases, I really need to spend more time reading the API. Thank you for pointing these out.

Regarding the it referring to the row, I feel like that would be more intuitive as I struggled a lot with simple string operations and ended up with the whole column contents on each cell of the new column, when I first started to explore krangl ( as I initially thought that it would be the current row being "processed" and not the whole dataframe ). I have used map ever since though, I just never thought of accessing previous row value using it before .

As much as I agree that it would make it more intuitive, I have no idea how that would affect what already works, like adding a column based on a formula or array. I will try to play with this a little once I have some time too.

holgerbrandl commented 3 years ago

thought that it would be the current row being "processed" and not the whole dataframe

I guess that would be the common intuition in kotlin. The reason why I designed the API to refer to the entire table is that this the way dplyr works in R.