holgerbrandl / krangl

krangl is a {K}otlin DSL for data w{rangl}ing
MIT License
560 stars, 50 forks

Reconsider element-wise verbs vs vectorized column operations #64

Open holgerbrandl opened 6 years ago

holgerbrandl commented 6 years ago

The API would be much more fluent, because we would no longer suffer from limited operator overloading.

It would also require users to learn fewer verbs. Currently, the vectorized helpers also seem more confusing than helpful.

Now:

df.addColumn("foo") { it["bar"] + 3 }
df.filter { it["weight"] gt 50 }
df.addColumn("with_anz") { it["first_name"].asStrings().map { it!!.contains("anz") } }

With element-wise operations:

df.addColumn("foo") { it["bar"] + 3 }
df.filter { it["weight"] > 50 }
df.addColumn("with_anz") { it["first_name"].s.contains("anz") }
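
For context on the `gt` versus `>` difference in the filter examples: Kotlin compiles `a > b` to `a.compareTo(b) > 0`, and `compareTo` must return `Int`, so a column-level `>` cannot produce a Boolean vector. The following is a minimal self-contained sketch of the two styles. `ToyColumn`, `vectorizedFilter` and `elementWiseFilter` are hypothetical names made up for illustration and are not krangl's actual API.

```kotlin
// Toy stand-in for a numeric column; hypothetical, not krangl's column type.
class ToyColumn(val values: List<Double>) {
    // A vectorized comparison has to be a named infix function, because
    // `operator fun compareTo` must return Int and cannot return a mask.
    infix fun gt(threshold: Double): List<Boolean> = values.map { it > threshold }
}

// Vectorized style: the predicate sees the whole column and returns a mask.
fun vectorizedFilter(col: ToyColumn, predicate: (ToyColumn) -> List<Boolean>): List<Double> {
    val mask = predicate(col)
    return col.values.filterIndexed { i, _ -> mask[i] }
}

// Element-wise style: the predicate sees a single value, so plain `>` works.
fun elementWiseFilter(col: ToyColumn, predicate: (Double) -> Boolean): List<Double> =
    col.values.filter(predicate)

fun main() {
    val weight = ToyColumn(listOf(42.0, 55.5, 80.0))
    println(vectorizedFilter(weight) { it gt 50.0 })  // [55.5, 80.0]
    println(elementWiseFilter(weight) { it > 50.0 })  // [55.5, 80.0]
}
```

Both calls select the same rows; only the element-wise variant can use the built-in comparison operator.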

However, this would not work for aggregation:

val sumDF = df.summarize(
    "mean_weight" to { it["weight"].mean(removeNA = true) },
    "num_persons" to { nrow }
)

Also, certain column operations would be harder to implement with an element-wise API, such as

val sumDF = df.addColumns(
    "proportion" to { it["weight"]/it["weight"].sum() }
)

Maybe the complete vector could be exposed as it.df["weight"]?
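
One way this idea might look, as a rough sketch only: a per-row context where `it["weight"]` yields a scalar while `it.df["weight"]` still gives access to the complete vector. `ToyFrame` and `RowContext` are hypothetical types invented here and restricted to `Double` columns; nothing below is taken from krangl.

```kotlin
// Hypothetical toy data frame holding only Double columns; not part of krangl.
class ToyFrame(private val columns: Map<String, List<Double>>) {
    val nrow: Int = columns.values.firstOrNull()?.size ?: 0

    // Full-column access, as in the proposed `it.df["weight"]`.
    operator fun get(name: String): List<Double> =
        columns[name] ?: error("no such column: $name")

    // Element-wise addColumn: the expression is evaluated once per row.
    fun addColumn(name: String, expr: (RowContext) -> Double): ToyFrame {
        val newValues = (0 until nrow).map { expr(RowContext(this, it)) }
        return ToyFrame(columns + (name to newValues))
    }
}

// Row-level view: `it["weight"]` is a scalar, `it.df["weight"]` the whole vector.
class RowContext(val df: ToyFrame, private val row: Int) {
    operator fun get(name: String): Double = df[name][row]
}

fun main() {
    val df = ToyFrame(mapOf("weight" to listOf(10.0, 30.0, 60.0)))
    val result = df.addColumn("proportion") { it["weight"] / it.df["weight"].sum() }
    println(result["proportion"]) // [0.1, 0.3, 0.6]
}
```

Note that in this naive form the column sum is recomputed for every row, so a real design would probably need to cache or hoist such whole-column aggregates.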