Open CraigMiloRogers opened 3 years ago
Conceptually, this is a transformation among (node1, label, node2) triples. What should be done about any non-empty id column values and additional columns?
--isa LABEL [LABEL ...]: one or more label values that represent starting points (e.g., isa).
--also LABEL [LABEL ...]: additional relationships to track beyond the starting point (e.g., subclass).
ID generation: if an original edge has a non-empty ID, generate a -### numeric suffix for generated edges.
Additional columns: copy them from the original records to the newly generated records.
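A minimal sketch of the ID and column handling described above, assuming each edge is held as a dict keyed by column name. The function name `derive_edge`, the `sequence` argument, and the three-digit zero padding are illustrative assumptions; the issue only says "-### numeric suffix".

```python
def derive_edge(original, new_node2, sequence):
    """Build a generated edge from an original edge record (hypothetical helper).

    original  : dict mapping column name to value for the source edge
    new_node2 : the ancestor node the generated edge should point at
    sequence  : 1-based counter of edges generated from this original edge
    """
    # Copying the whole record carries every additional column over unchanged.
    derived = dict(original)
    derived["node2"] = new_node2

    # Only edges that already have a non-empty ID get a suffixed ID.
    # The three-digit padding is an assumption about what "-###" means.
    if original.get("id"):
        derived["id"] = f"{original['id']}-{sequence:03d}"
    return derived
```

For example, an original edge with id `e1` and an extra `source` column would yield a generated edge with id `e1-001` and the same `source` value.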
--only-new-edges: send only newly generated edges to the primary output.
--isa-edges PATH, --also-edges PATH: expert options, useful for debugging, that output the set of edges matching the --isa and --also criteria.
--new-edges PATH: expert option, useful for debugging, that receives only the new edges, independently of the primary output file.
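To make the proposed option set concrete, here is a sketch of how these flags might be declared with Python's standard argparse module. This is not KGTK's actual parser. Making --isa required and defaulting --also to an empty list are assumptions, since the proposal does not state defaults.

```python
import argparse


def build_parser():
    """Declare the options proposed in this issue (illustrative only)."""
    parser = argparse.ArgumentParser(prog="kgtk flatten")
    # Starting-point labels; "required" is an assumption, not part of the proposal.
    parser.add_argument("--isa", nargs="+", metavar="LABEL", required=True,
                        help="label values that represent starting points")
    # Additional relationships to follow; the empty default is an assumption.
    parser.add_argument("--also", nargs="+", metavar="LABEL", default=[],
                        help="additional relationships to track")
    parser.add_argument("--only-new-edges", action="store_true",
                        help="send only newly generated edges to the primary output")
    # Expert/debugging outputs, each taking a file path.
    parser.add_argument("--isa-edges", metavar="PATH",
                        help="write the edges matching --isa")
    parser.add_argument("--also-edges", metavar="PATH",
                        help="write the edges matching --also")
    parser.add_argument("--new-edges", metavar="PATH",
                        help="write only the newly generated edges")
    return parser
```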
How does this differ from kgtk reachable-nodes? Different focus (--isa, --also), retention of additional columns, a different ID generation philosophy, and an implementation that will not depend upon a graph library. I expect kgtk flatten to perform more slowly than kgtk reachable-nodes for large edge sets, and to consume more memory.
After thinking about this longer, I can see two implementations. One would be in-memory. Another would be based on a sort-and-merge approach which should run on systems with constrained main memory, such as most laptops.
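As a rough illustration of the sort-and-merge idea, the sketch below repeatedly merge-joins a sorted "frontier" of (node1, node2) pairs against sorted (child, parent) pairs until no new pairs appear. The function names are hypothetical. The Python lists stand in for sorted files on disk, and the in-memory `known` set stands in for a sort-based duplicate check, so this shows only the shape of the algorithm, not a memory-constrained implementation.

```python
def merge_join(left, right):
    """Join left (node1, node2) pairs, sorted by node2, with right
    (child, parent) pairs, sorted by child, yielding (node1, parent)."""
    i = j = 0
    while i < len(left) and j < len(right):
        key = left[i][1]
        if key < right[j][0]:
            i += 1
        elif key > right[j][0]:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == key:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][1] == key:
                for _, parent in right[j:j_end]:
                    yield (left[i][0], parent)
                i += 1
            j = j_end


def flatten_by_merging(isa_pairs, also_pairs):
    """Return the new (node1, ancestor) pairs found by repeated merge passes."""
    also_sorted = sorted(set(also_pairs))
    known = set(isa_pairs)
    frontier = sorted(known, key=lambda pair: pair[1])
    new_pairs = []
    while frontier:
        produced = set(merge_join(frontier, also_sorted)) - known
        known |= produced
        new_pairs.extend(sorted(produced))
        # Only pairs discovered in this pass can lead to further ancestors.
        frontier = sorted(produced, key=lambda pair: pair[1])
    return new_pairs
```

Each pass extends every frontier pair by one --also step, so the number of passes is bounded by the depth of the hierarchy. Cycles terminate because a pair that is already known is never re-added to the frontier.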
Given a tree structure using one or more predicates (e.g., isa, subclass), compute the transitive closure and generate new KGTK edges that flatten the tree.
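A minimal in-memory sketch of this flattening, under one possible reading of the proposal: for each edge whose label is an --isa label, follow --also edges upward from its node2 and emit a new edge with the same node1 and label for every ancestor reached. The function name `flatten` and the plain-tuple edge representation are assumptions for illustration, and no graph library is used.

```python
from collections import defaultdict


def flatten(edges, isa_labels=("isa",), also_labels=("subclass",)):
    """Return new (node1, label, node2) tuples implied by the hierarchy.

    edges is a list of (node1, label, node2) tuples.
    """
    isa = set(isa_labels)
    also = set(also_labels)

    # Index the hierarchy: child -> set of direct parents via any "also" label.
    parents = defaultdict(set)
    for node1, label, node2 in edges:
        if label in also:
            parents[node1].add(node2)

    existing = set(edges)
    new_edges = []
    for node1, label, node2 in edges:
        if label not in isa:
            continue
        # Depth-first walk upward from the starting point's target.
        seen = {node2}
        stack = [node2]
        while stack:
            current = stack.pop()
            for parent in sorted(parents.get(current, ())):
                if parent in seen:
                    continue  # also guards against cycles
                seen.add(parent)
                stack.append(parent)
                candidate = (node1, label, parent)
                if candidate not in existing:
                    existing.add(candidate)
                    new_edges.append(candidate)
    return new_edges
```

Given `rex isa dog`, `dog subclass mammal`, and `mammal subclass animal`, this sketch would generate `rex isa mammal` and `rex isa animal`.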