holgerbrandl / krangl

krangl is a {K}otlin DSL for data w{rangl}ing
MIT License

Vector/Array as value? #113

Closed WojtekPtak closed 3 years ago

WojtekPtak commented 3 years ago

Hello! I'm not sure if this place is only for ideas/bugs but I have a question...

I'm wondering whether it's possible to decode data like this, which I recently used in pandas. It's a simple tree I'm using to learn krangl, but it's the same format I will be reading from CSV files:

/*
A(1)
    AA(2)
        AAA(5)
    AB(3)
        ABA(6)
        ABB(7)
    AC(4)
*/

val treeDF: DataFrame = dataFrameOf(
    "node_name", "node_id", "path")(
    "A", 1, "[1]",
    "AA", 2, "[1,2]",
    "AB", 3, "[1,3]",
    "AC", 4, "[1,4]",
    "AAA", 5, "[1,2,5]",
    "ABA", 6, "[1,3,6]",
    "ABB", 7, "[1,3,7]"
)

How do I get the direct parent (the second-to-last element of "path", or null for the top-level node)? In Python I could use a lambda over the row value and then something like row['path'][-2] to get, e.g., 3 for the ABA node. As far as I can see, only String is available for this purpose, so I should probably strip the brackets from "path", split the rest on commas, map the pieces to integers, and then select the element at index size-2.
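The parsing steps described above can be sketched in plain Kotlin, without relying on any krangl call. The names `parsePath` and `parentIdOf` are hypothetical helpers invented for this illustration, and the sketch assumes every path has the form `[id,id,...]` with integer ids.

```kotlin
// Hypothetical helper (not part of krangl): turn a path string such as
// "[1,3,6]" into the list of node ids it contains.
fun parsePath(path: String): List<Int> =
    path.trim()
        .removeSurrounding("[", "]")
        .split(',')
        .filter { it.isNotBlank() }
        .map { it.trim().toInt() }

// Second-to-last id of the path, or null when the node has no parent.
// getOrNull returns null for an out-of-range (including negative) index.
fun parentIdOf(path: String): Int? {
    val ids = parsePath(path)
    return ids.getOrNull(ids.size - 2)
}
```

Under these assumptions, `parentIdOf("[1,3,6]")` evaluates to `3` and `parentIdOf("[1]")` to `null`. Returning `Int?` rather than a string keeps the parent id directly comparable with "node_id".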

It's certainly possible, but I wasn't able to do it. Let's say I already have the new "parent_id" column, as below:

"A", 1, "[1]", null
"AA", 2, "[1,2]", 1
"AB", 3, "[1,3]", 1
"AC", 4, "[1,4]", 1
"AAA", 5, "[1,2,5]", 2
"ABA", 6, "[1,3,6]", 3
"ABB", 7, "[1,3,7]", 3

But now I need a "parent_name" column. Again, in Python I did it using something like:

def find_node_by_id(df, node_id):
    row = df.loc[df['node_id'] == node_id]
    return row

to find the row with the parent definition, and then I was able to obtain node_name from that row. But it's a bit hard for me in Kotlin, maybe because I'm not very familiar with it :)
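A possible way to resolve parent names, again sketched in plain Kotlin rather than with krangl calls: build an id-to-name index once and look each parent up in it. The `Node` class and the `parentNames` function are hypothetical stand-ins for the table rows, not krangl types.

```kotlin
// Hypothetical row type standing in for one line of the tree table.
data class Node(val name: String, val id: Int, val parentId: Int?)

// Build the id -> name index once, then resolve every parent with a map
// lookup instead of scanning the whole table for each row.
fun parentNames(nodes: List<Node>): List<String?> {
    val nameById: Map<Int, String> = nodes.associate { it.id to it.name }
    // A null parentId (the root) yields a null parent name.
    return nodes.map { node -> node.parentId?.let { nameById[it] } }
}
```

Compared with filtering the table once per row, the index costs one pass to build and a constant-time lookup per node, which should matter more as the data grows.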

Any advice on which methods I should use to work with vectors/arrays? What about grouping, e.g. grouping such a tree by depth:

"depth", "nodes"
1, [1]
2, [2,3,4]
3, [5,6,7]

I would expect a vector as the "nodes" value. And what about searching for a row in a dataframe and using the result to get parent_name? I guess I can do it by filtering, but I'm not sure that's the optimal way for big data.
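The grouping by depth could look roughly like this in plain Kotlin, assuming that a node's depth equals the number of ids in its path and that the last id in the path is the node's own id. `nodesByDepth` is a hypothetical name, and the sketch works on the raw path strings rather than on a krangl DataFrame.

```kotlin
// Depth is the number of ids in the path, so "[1,3,6]" has depth 3 and
// belongs to node 6 (the last id in the path).
fun nodesByDepth(paths: List<String>): Map<Int, List<Int>> =
    paths
        .map { p -> p.removeSurrounding("[", "]").split(',').map { it.trim().toInt() } }
        .groupBy(keySelector = { it.size }, valueTransform = { it.last() })
```

For the seven paths of the example tree this gives depth 1 with `[1]`, depth 2 with `[2, 3, 4]`, and depth 3 with `[5, 6, 7]`, each value being an ordinary Kotlin `List<Int>`.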

I was searching for a DataFrame solution for Java. Krangl looks interesting and similar to pandas, but I'm not sure whether I should use Spark DataFrame instead :/

WojtekPtak commented 3 years ago

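The parent_id part works with

// For path "[1,3,6]" this returns "3": the second-to-last element of the list.
// The top-level node, whose path has a single element, gets an empty string.
fun decodeParentId(it: ExpressionContext): List<Any?> {
    return it["path"].map<String> {
        // Strip the surrounding brackets, then split on commas.
        val ids: List<String> = it.subSequence(1, it.length - 1).split(',')
        if (ids.size < 2) "" else ids[ids.size - 2]
    }
}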

and

val testDF1 = treeDF.addColumn("parent_id") {
    decodeParentId(it)
}

WojtekPtak commented 3 years ago

I'm closing - I found a solution - but I'm going to add some examples to repo :)

holgerbrandl commented 3 years ago

Sorry for being so slow here. Great that you could work out a solution yourself. Feel welcome to suggest doc-additions or to raise further questions.