microsoft / Kusto-Query-Language

Kusto Query Language is a simple and productive language for querying Big Data.

graph-match duplicated pattern results #145

Open JiTmun opened 4 days ago

JiTmun commented 4 days ago

Issue

A variable-length edge in a graph-match pattern yields every intermediate path between the topmost node and the bottommost node, so each branch is returned several times as partial sub-paths.

Table as input of make-graph

| process_parent_command_line | process_command_line | process_parent_id | process_id |
| -- | -- | -- | -- |
| root cmd | cmd lvl1 | 0 | 1 |
| cmd lvl1 | cmd lvl2 | 1 | 2 |
| cmd lvl1 | cmd lvl1.2 | 1 | 1.2 |
| cmd lvl2 | cmd lvl3 | 2 | 3 |

Table after graph-match with variable edge length

The pattern used is (parent)-[edge*1..10]->(child). The goal is to build a process tree in order to filter out branches that have certain parents.

| root_pid | root_cmd | intermediary_nodes | last_child_pid | last_child_cmd |
| -- | -- | -- | -- | -- |
| 0 | root cmd | ["cmd lvl1"] | 1 | cmd lvl1 |
| 2 | cmd lvl2 | ["cmd lvl3"] | 3 | cmd lvl3 |
| 1 | cmd lvl1 | ["cmd lvl2"] | 2 | cmd lvl2 |
| 1 | cmd lvl1 | ["cmd lvl1.2"] | 1.2 | cmd lvl1.2 |
| 1 | cmd lvl1 | ["cmd lvl2","cmd lvl3"] | 3 | cmd lvl3 |
| 0 | root cmd | ["cmd lvl1","cmd lvl2"] | 2 | cmd lvl2 |
| 0 | root cmd | ["cmd lvl1","cmd lvl1.2"] | 1.2 | cmd lvl1.2 |
| 0 | root cmd | ["cmd lvl1","cmd lvl2","cmd lvl3"] | 3 | cmd lvl3 |

Expected graph match output

| root_pid | root_cmd | intermediary_nodes | last_child_pid | last_child_cmd |
| -- | -- | -- | -- | -- |
| 0 | root cmd | ["cmd lvl1","cmd lvl1.2"] | 1.2 | cmd lvl1.2 |
| 0 | root cmd | ["cmd lvl1","cmd lvl2","cmd lvl3"] | 3 | cmd lvl3 |

Therefore, graph-match with variable-length edges should have an option to return only the longest branch and discard the intermediate paths. Another option would be a function that tests whether one node is an ancestor of another. Here, we want:

Kusto code related to example

datatable (process_parent_command_line:string, process_command_line:string, process_parent_id:string, process_id :string)[
        // ComputerA 1 branche with 
        "root cmd", "cmd lvl1", 0, 1,
        "cmd lvl1", "cmd lvl2", 1, 2,
        "cmd lvl1", "cmd lvl1.2",1, 1.2,
         "cmd lvl2", "cmd lvl3", 2, 3,
        ]
| as hint.materialized=true Data
| make-graph process_parent_id --> process_id
        with (union (Data
                    | distinct node_id = process_id,
                                process_command_line),
                    (Data
                    | distinct node_id = process_parent_id,
                                    process_command_line = process_parent_command_line
                    )
                | distinct node_id, process_command_line 
                ) on  node_id // with process_info on node_id
// build process tree
| graph-match (parent)-[edge*1..10]-> (child) 
    //where edge.
    project root_pid = parent.node_id,
            root_cmd = parent.process_command_line,
            intermediary_nodes = todynamic(edge.process_command_line),
            last_child_pid = child.node_id,
            last_child_cmd = child.process_command_line
| extend branch_length = array_length(intermediary_nodes)
royoMS commented 3 days ago

This is currently not possible as part of graph-match, but you can achieve it by tagging each node with whether it is a leaf and then using that tag as a constraint inside graph-match:

datatable (process_parent_command_line:string, process_command_line:string, process_parent_id:string, process_id :string)[
        "root cmd", "cmd lvl1", 0, 1,
        "cmd lvl1", "cmd lvl2", 1, 2,
        "cmd lvl1", "cmd lvl1.2",1, 1.2,
         "cmd lvl2", "cmd lvl3", 2, 3,
        ]
| as hint.materialized=true Data
| make-graph process_parent_id --> process_id
        with (union (Data | distinct node_id = process_id, process_command_line | extend is_leaf = node_id !in (Data | project process_parent_id)),
                    (Data | distinct node_id = process_parent_id, process_command_line = process_parent_command_line, is_leaf = false)
                | distinct node_id, process_command_line, is_leaf 
                ) on  node_id
// build process tree
| graph-match (parent)-[edge*1..10]-> (child) 
    where parent.process_command_line == "root cmd" and child.is_leaf
    project root_pid = parent.node_id,
            root_cmd = parent.process_command_line,
            intermediary_nodes = todynamic(edge.process_command_line),
            last_child_pid = child.node_id,
            last_child_cmd = child.process_command_line
| extend branch_length = array_length(intermediary_nodes)
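
The workaround above pins the start of the path to the literal "root cmd". A possible variation, sketched here and not tested against a cluster, tags roots the same way leaves are tagged, so that no command line has to be hardcoded. The is_root column is a hypothetical name introduced for this sketch and is not part of the thread. The ids are written as quoted strings to match the declared string columns, and the projection keeps the edge.process_command_line form used earlier in the thread, which is assumed to return one value per hop.

```kusto
// Sketch only: same sample data as above, with ids quoted as strings.
datatable (process_parent_command_line:string, process_command_line:string, process_parent_id:string, process_id:string)[
        "root cmd", "cmd lvl1",   "0", "1",
        "cmd lvl1", "cmd lvl2",   "1", "2",
        "cmd lvl1", "cmd lvl1.2", "1", "1.2",
        "cmd lvl2", "cmd lvl3",   "2", "3",
        ]
| as hint.materialized=true Data
| make-graph process_parent_id --> process_id
        with (union (Data | project node_id = process_id, process_command_line),
                    (Data | project node_id = process_parent_id, process_command_line = process_parent_command_line)
                | distinct node_id, process_command_line
                // a leaf never appears as a parent; a root never appears as a child
                // is_root is a hypothetical column name used only in this sketch
                | extend is_leaf = node_id !in (Data | project process_parent_id),
                         is_root = node_id !in (Data | project process_id)
                ) on node_id
// keep only paths that start at a root and end at a leaf
| graph-match (parent)-[edge*1..10]->(child)
    where parent.is_root and child.is_leaf
    project root_pid = parent.node_id,
            root_cmd = parent.process_command_line,
            intermediary_nodes = todynamic(edge.process_command_line),
            last_child_pid = child.node_id,
            last_child_cmd = child.process_command_line
| extend branch_length = array_length(intermediary_nodes)
```

With the sample data, only node 0 qualifies as a root and only nodes 1.2 and 3 qualify as leaves, so the match is intended to keep just the two full branches from the expected output. If the input contains several independent process trees, each tree's root would be picked up without further changes.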