Closed cgrinds closed 6 years ago
Good point. I've already added the missing sum
.
Adding column totals is also not really modelled in other APIs like dplyr. Also for panas the accepted solution seems to summarize first and then to bind rows. See https://stackoverflow.com/questions/20804673/appending-column-totals-to-a-pandas-dataframe
With the current version 0.9 you could the same with:
val data = dataFrameOf(
"Name", "Duration", "Color")(
"Foo", 100, "Blue",
"Goo", 200, "Red",
"Bar", 300, "Yellow")
val columnTotalsV09 = data.cols.map {
it.name to when (it) {
is DoubleCol -> it.asInts().sumBy { it!! }
is IntCol -> it.asInts().sumBy { it!! }
else -> null
}
}.toMap().run {
dataFrameOf(*keys.toTypedArray())(values)
}
listOf(data, columnTotalsV09).bindRows().print()
I've also slightly polished the APi for next release, to shorten it to just:
val columnTotals = data.cols.map {
it.name to when (it) {
is IntCol -> it.sum()
else -> null // ignored column types
}
}.toMap().run {
dataFrameOf(keys)(values)
}
bindRows(data, columnTotals).print()
I hope that this answers your question.
Thanks. It seems like a lot of code for something so common.
Would it be better to encapsulate the type-checking when
on the column class, perhaps in a sum
method?
The addition of bindRows is what I was missing and allows me to append the total row. Thanks! It would be nice to have an appendRow, but I understand that works against your read-only dataframe model.
Checking the stackoverflow link, both of these APIs look pretty nice:
df['Total'] = df.sum(axis=1)
or df.loc['Total']= df.sum()
The python answers do not give a meaningful result, if there are non-numeric columns in the df. In the solution above we get at least an NA in such a case.
Indeed you could encapsulate it via`
fun DataFrame.addTotals(expr: (DataCol) -> Any?) =
cols.map { it.name to expr(it) }.toMap().run {
bindRows(this@addTotals, dataFrameOf(keys)(values))
}
data.addTotals { it.sum() }.print()
Only issue here would be that added row would not say "Total" anywhere, but this is a general issue of adding Totals to a table. Where would you put it? There might not even be a StringCol in your dataset. panas solves it by using it's row index I think, but this feels odd to me.
We could still have appendRow
and return a new data-frame instance. But what would it do what bindRows
does not?
The appendRow
comment was more about the API and tension between immutable and mutable data structures. When you go looking for the API to add a row to a dataframe, I don't think bindRow
comes to mind.
At least not until you learn that dataframes are immutable. Then it makes more sense because you aren't appending a row, but instead cloning a table and copying a row into it. The clone-copy dance is normal for immutable datastructures, but the point was bindRow
seems less natural than appendRow
. Although perhaps bindRow
is better since it makes it clear that this isn't a normal append...
I ran into a similar issue when trying to create a dataframe with data that already existed in memory. Krangl requires you to flatten the data so you can call dataFrameOf("name", "duration", "color")(flattenedRows)
instead of using something like dataframe.addRow()
The other reason this came to mind is because I'm adapting some existing code to use Krangl and then rewriting that same code to use Tablesaw. It's interesting to compare/contrast the API differences.
I got used to immutability when using R, and now like the idea. It also seems to be an accepted good design pattern in many situations.
You don't have to flatten your data. You can use krangl.BuilderKt#dataFrameOf(krangl.DataCol...)
to bind a list of columns directly. In case this does not solve your problem, what type of builders/constructors are missing?
Feel welcome to post more problems/examples. krangl
is still evolving, so the more feedback the better.
What about adding a
fun dataFrameOf(rows: Iterable<DataFrameRow>): DataFrame {
// reshape into columns, and infer types
}
?
Here, DataFrameRow is just typealias DataFrameRow = Map<String, Any?>
.,
This would allow to convert any rowwise records encoded as a map into a df
.
I missed krangl.BuilderKt#dataFrameOf(krangl.DataCol...)
since it was added four days ago and isn't in 0.9
. That could work, but I like DataFrameRow
better since that closely matches what I already do with Json. That would be a nice addition.
I've added the iterator-based dataFrameOf
. See tests in commit for an usage example.
Concerning json, you could also try to bundled json support in krangl
. It reads from file, url, json-string, and supported flattening. See krangl.test.JsonTests for examples.
Thanks! I'll check it out
What's the best way to total a column? Say you have a df like this:
I don't see a sum() or total() method on DataCol - only mean, min, etc. I can total the column myself like so:
val total = df["duration"].asInts().sumBy { it -> it!! }
but how to I append this to the data frame to end up with this: