holgerbrandl / krangl

krangl is a {K}otlin DSL for data w{rangl}ing
MIT License

Consider adding a replaceColumn() method #110

Closed devdanke closed 3 years ago

devdanke commented 3 years ago

It would be nice to have a replaceColumn() method for cases where a numeric column gets typed as a string column. For instance, a CSV file I wrangle sometimes uses "" instead of zero for missing values. I agree with krangl's decision to classify that column as StringCol, but I'd like an easy way to clean up the values so that it becomes a numeric column.

Here's an extension method that lets me replace a column:

`fun DataFrame.replaceColumn(columnName: String, expression: TableExpression): DataFrame {
    // Locate the column to replace; fail early with a clear message if it is missing.
    val targetIndex = cols.indexOfFirst { it.name == columnName }
    require(targetIndex >= 0) { "No column named '$columnName'" }

    // Evaluate the expression under a temporary name and take the resulting column.
    val tempName = columnName + "_temp"
    val newCol = addColumn(tempName, expression).cols.last()

    // Put the new column in the old column's position, then restore the original name.
    val mutableCols = cols.toMutableList()
    mutableCols[targetIndex] = newCol
    return dataFrameOf(*mutableCols.toTypedArray()).rename(tempName to columnName)
}`

I call it like this:

df.replaceColumn("my-col") { it["my-col"].map<String>{ if (it == null || it.isEmpty()) 0.0 else it.toDouble() } }
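The per-cell rule in that call (treat a missing or empty string as zero, otherwise parse it) can be tried on its own without krangl. The following is a minimal sketch using only the Kotlin standard library; the helper name `blankToZero` is hypothetical and not part of krangl, and treating whitespace-only cells as missing is an added assumption.

```kotlin
// Hypothetical helper, not part of krangl: maps a raw CSV cell to a Double.
// Null, empty, and whitespace-only cells are treated as 0.0 (an assumption);
// anything else must be a valid number, otherwise toDouble() throws
// NumberFormatException.
fun blankToZero(raw: String?): Double =
    if (raw.isNullOrBlank()) 0.0 else raw.trim().toDouble()

fun main() {
    val cells = listOf("3.5", "", null, " 7 ")
    println(cells.map(::blankToZero)) // prints [3.5, 0.0, 0.0, 7.0]
}
```

Whether a missing value should really become 0.0 rather than stay missing depends on the data; the sketch only mirrors the rule used in the call above.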

holgerbrandl commented 3 years ago

By design, data frames are immutable, but you can simply do

var df = iris
df = iris.addColumn("Species"){ "bla"}

to replace an existing column.
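Applied to the empty-string case from the original post, this might look like the sketch below. It assumes, as stated above, that addColumn with an existing column name replaces that column. The sample data and the function names `sampleFrame` and `cleanAmounts` are hypothetical; only `dataFrameOf`, `addColumn`, column lookup by name, and `map<String>` come from krangl.

```kotlin
import krangl.*

// Hypothetical sample data: "amount" ends up as a string column because of the "" entry.
fun sampleFrame(): DataFrame = dataFrameOf("id", "amount")(
    1, "3.5",
    2, "",
    3, "7"
)

// Reuses the existing column name, so the column is replaced rather than a new one added.
fun cleanAmounts(df: DataFrame): DataFrame =
    df.addColumn("amount") { ec ->
        ec["amount"].map<String> { s -> if (s.isNullOrEmpty()) 0.0 else s.toDouble() }
    }

fun main() {
    println(cleanAmounts(sampleFrame()))
}
```

Because the data frame itself is never modified, `sampleFrame()` still holds the original strings; only the returned frame carries the cleaned column.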

For your use case, you could use the NA-aware conversions

sleepData.addColumn("foo") { it["vore"].toInts() }

which are provided for all supported types.

I still wonder how to make the API more approachable: more docs, a cheatsheet, an extended FAQ...

holgerbrandl commented 3 years ago

Closed because of inactivity. Feel welcome to reopen if needed.