holgerbrandl / krangl

krangl is a {K}otlin DSL for data w{rangl}ing
MIT License
560 stars 50 forks source link

How to append a total row #51

Closed cgrinds closed 6 years ago

cgrinds commented 6 years ago

What's the best way to total a column? Say you have a df like this:

| Name  | Duration | Color  |
-----------------------------
| Foo   | 100      | Blue   |
| Goo   | 200      | Red    |
| Bar   | 300      | Yellow |

I don't see a sum() or total() method on DataCol - only mean, min, etc. I can total the column myself like so: val total = df["duration"].asInts().sumBy { it -> it!! } but how to I append this to the data frame to end up with this:

| Name  | Duration | Color  |
-----------------------------
| Foo   | 100      | Blue   |
| Goo   | 200      | Red    |
| Bar   | 300      | Yellow |
| Total | 600      |        |
holgerbrandl commented 6 years ago

Good point. I've already added the missing sum.

Adding column totals is also not really modelled in other APIs like dplyr. Also for panas the accepted solution seems to summarize first and then to bind rows. See https://stackoverflow.com/questions/20804673/appending-column-totals-to-a-pandas-dataframe

With the current version 0.9 you could the same with:

val data = dataFrameOf(
    "Name", "Duration", "Color")(
    "Foo", 100, "Blue",
    "Goo", 200, "Red",
    "Bar", 300, "Yellow")

val columnTotalsV09 = data.cols.map {
    it.name to when (it) {
        is DoubleCol -> it.asInts().sumBy { it!! }
        is IntCol -> it.asInts().sumBy { it!! }
        else -> null
    }
}.toMap().run {
    dataFrameOf(*keys.toTypedArray())(values)
}

listOf(data, columnTotalsV09).bindRows().print()

I've also slightly polished the APi for next release, to shorten it to just:

val columnTotals = data.cols.map {
    it.name to when (it) {
        is IntCol -> it.sum()
        else -> null // ignored column types
    }
}.toMap().run {
    dataFrameOf(keys)(values)
}

bindRows(data, columnTotals).print()

I hope that this answers your question.

cgrinds commented 6 years ago

Thanks. It seems like a lot of code for something so common. Would it be better to encapsulate the type-checking when on the column class, perhaps in a sum method?

The addition of bindRows is what I was missing and allows me to append the total row. Thanks! It would be nice to have an appendRow, but I understand that works against your read-only dataframe model.

Checking the stackoverflow link, both of these APIs look pretty nice: df['Total'] = df.sum(axis=1) or df.loc['Total']= df.sum()

holgerbrandl commented 6 years ago

The python answers do not give a meaningful result, if there are non-numeric columns in the df. In the solution above we get at least an NA in such a case.

Indeed you could encapsulate it via`


fun DataFrame.addTotals(expr: (DataCol) -> Any?) =
    cols.map { it.name to expr(it) }.toMap().run {
        bindRows(this@addTotals, dataFrameOf(keys)(values))
    }

data.addTotals { it.sum() }.print()

Only issue here would be that added row would not say "Total" anywhere, but this is a general issue of adding Totals to a table. Where would you put it? There might not even be a StringCol in your dataset. panas solves it by using it's row index I think, but this feels odd to me.

We could still have appendRow and return a new data-frame instance. But what would it do what bindRows does not?

cgrinds commented 6 years ago

The appendRow comment was more about the API and tension between immutable and mutable data structures. When you go looking for the API to add a row to a dataframe, I don't think bindRow comes to mind.

At least not until you learn that dataframes are immutable. Then it makes more sense because you aren't appending a row, but instead cloning a table and copying a row into it. The clone-copy dance is normal for immutable datastructures, but the point was bindRow seems less natural than appendRow. Although perhaps bindRow is better since it makes it clear that this isn't a normal append...

I ran into a similar issue when trying to create a dataframe with data that already existed in memory. Krangl requires you to flatten the data so you can call dataFrameOf("name", "duration", "color")(flattenedRows) instead of using something like dataframe.addRow()

cgrinds commented 6 years ago

The other reason this came to mind is because I'm adapting some existing code to use Krangl and then rewriting that same code to use Tablesaw. It's interesting to compare/contrast the API differences.

holgerbrandl commented 6 years ago

I got used to immutability when using R, and now like the idea. It also seems to be an accepted good design pattern in many situations.

You don't have to flatten your data. You can use krangl.BuilderKt#dataFrameOf(krangl.DataCol...) to bind a list of columns directly. In case this does not solve your problem, what type of builders/constructors are missing?

Feel welcome to post more problems/examples. krangl is still evolving, so the more feedback the better.

holgerbrandl commented 6 years ago

What about adding a

fun dataFrameOf(rows: Iterable<DataFrameRow>): DataFrame {
    // reshape into columns, and infer types  
}

?

Here, DataFrameRow is just typealias DataFrameRow = Map<String, Any?>.,

This would allow to convert any rowwise records encoded as a map into a df.

cgrinds commented 6 years ago

I missed krangl.BuilderKt#dataFrameOf(krangl.DataCol...) since it was added four days ago and isn't in 0.9. That could work, but I like DataFrameRow better since that closely matches what I already do with Json. That would be a nice addition.

holgerbrandl commented 6 years ago

I've added the iterator-based dataFrameOf. See tests in commit for an usage example.

Concerning json, you could also try to bundled json support in krangl. It reads from file, url, json-string, and supported flattening. See krangl.test.JsonTests for examples.

cgrinds commented 6 years ago

Thanks! I'll check it out