qiime2 / provenance-lib

QIIME 2 Provenance Replay Tools
BSD 3-Clause "New" or "Revised" License
3 stars 4 forks source link

A performant approach to loading multiple Results #29

Open ChrisKeefe opened 3 years ago

ChrisKeefe commented 3 years ago

IO is expensive, and Results will frequently be duplicated across multiple Artifacts (e.g. the same FeatureTable will show up in all downstream Artifacts).

When loading a bunch of Artifacts at once, we can reduce IO by generating a ProvDAG from one Artifact and checking whether the root UUID of the next artifact is in first_provdag.nodes, parsing and unioning any artifacts that aren't subsets of the already-loaded DAG.

ChrisKeefe commented 3 years ago

If the approach above doesn't cut it, we can further reduce IO in the parsers. Rather than thinking in terms of Artifacts, we can think in terms of Results. If a given Result inside an Artifact has already been parsed (and so is in dag.nodes), we can avoid parsing it entirely. The resulting dag will be smaller, but can probably be unioned into dag. Because this will result in disjoint graphs, union is probably appropriate, but edges between these nodes and their parent nodes in the main dag may need to be drawn manually.

ChrisKeefe commented 3 years ago

Parsing behavior might be different depending on whether we're parsing a single Artifact or multiple Artifacts. For example, parse_many requires we iterate over multiple Artifacts, and also requires that we union the results.

We could handle this in a separate method (parse_one vs parse_many), or we could pass a value in the cfg, and perform additional steps as needed. Probably this looks like always iterating over a lists of e.g. zipfiles (even if there's only one in the list), and using the cfg, or some counter variable to decide whether we need to union.

There is an important distinction between the optimization we can perform in this parse_many scenario, and the stepwise make-dags-then-union-them approach. Here, I think we don't need to union at all, so long as we collect all nodes of interest along the way. If we have all graph endpoints, we can traverse all resulting disjoint graphs. Having the data upfront allows us to optimize, making this approach faster, while union allows interactivity. Both good to have.

ChrisKeefe commented 2 years ago

Some breadcrumbs from inline to make this less painful:

1 2 3

ChrisKeefe commented 2 years ago

Update, #53 handled the basic implementation here. Keeping this open, in case we want to optimize further. Next step would probably look this: