usc-isi-i2 / kgtk

Knowledge Graph Toolkit
https://kgtk.readthedocs.io/en/latest/
MIT License
351 stars 57 forks source link

`kgtk flatten` - Simple Transitive Closure #425

Open CraigMiloRogers opened 3 years ago

CraigMiloRogers commented 3 years ago

Given a tree structure using one or more predicates (e.g., isa, subclass), compute the transitive closure and generate new KGTK edges that flatten the tree.

CraigMiloRogers commented 3 years ago

Conceptually, this is a transformation among (node1, label, node2). What to do about any non-empty id column values and additional columns?

CraigMiloRogers commented 3 years ago

--isa LABEL [LABEL ...] one or more label values that represent starting points (e.g., isa). --also LABEL [LABEL ...] additional relationships to track beyond the starting point (e.g., subclass).

CraigMiloRogers commented 3 years ago

ID generation: if an original edge has a non-empty ID, generate a -### numeric suffix for generated edges.

CraigMiloRogers commented 3 years ago

Additional columns: copy them from the original records to the newly generated records.

CraigMiloRogers commented 3 years ago

--only-new-edges: send only newly generated edges to the primary output.

CraigMiloRogers commented 3 years ago

--isa-edges PATH, --also-edges PATH: expert options, useful for debugging, that output the set of edges matching the --isa and --also criteria.

CraigMiloRogers commented 3 years ago

--new-edges PATH: expert option, useful for debugging, that receives only the new edges, independently of the primary output file.

CraigMiloRogers commented 3 years ago

How does this differ from kgtk reachable-nodes? Different focus (--isa, --also), retention of additional columns, ID generation philosophy, implementation will not depend upon a graph library. I expect kgtk flatten to perform more slowly than kgtk reachable-nodes for large edge sets, and to consume more memory.

CraigMiloRogers commented 3 years ago

After thinking about this longer, I can see two implementations. One would be in-memory. Another would be based on a sort-and-merge approach which should run on systems with constrained main memory, such as most laptops.