Open ChrisKeefe opened 3 years ago
If the approach above doesn't cut it, we can further reduce IO in the parsers.
Rather than thinking in terms of Artifacts, we can think in terms of Results. If a given Result inside an Artifact has already been parsed (and so is in dag.nodes
), we can avoid parsing it entirely. The resulting dag will be smaller, but can probably be unioned into dag
. Because this will result in disjoint graphs, union is probably appropriate, but edges between these nodes and their parent nodes in the main dag
may need to be drawn manually.
Parsing behavior might be different depending on whether we're parsing a single Artifact or multiple Artifacts. For example, parse_many
requires we iterate over multiple Artifacts, and also requires that we union the results.
We could handle this in a separate method (parse_one
vs parse_many
), or we could pass a value in the cfg
, and perform additional steps as needed. Probably this looks like always iterating over a lists of e.g. zipfiles (even if there's only one in the list), and using the cfg
, or some counter variable to decide whether we need to union.
There is an important distinction between the optimization we can perform in this parse_many
scenario, and the stepwise make-dags-then-union-them approach. Here, I think we don't need to union at all, so long as we collect all nodes of interest along the way. If we have all graph endpoints, we can traverse all resulting disjoint graphs. Having the data upfront allows us to optimize, making this approach faster, while union
allows interactivity. Both good to have.
Update, #53 handled the basic implementation here. Keeping this open, in case we want to optimize further. Next step would probably look this:
IO is expensive, and Results will frequently be duplicated across multiple Artifacts (e.g. the same FeatureTable will show up in all downstream Artifacts).
When loading a bunch of Artifacts at once, we can reduce IO by generating a ProvDAG from one Artifact and checking whether the root UUID of the next artifact
is in first_provdag.nodes
, parsing and unioning any artifacts that aren't subsets of the already-loaded DAG.