qiime2 / provenance-lib

QIIME 2 Provenance Replay Tools
BSD 3-Clause "New" or "Revised" License

Handle in-memory Results #22

Open ChrisKeefe opened 3 years ago

ChrisKeefe commented 3 years ago

As currently implemented (because we don't know anything about QIIME 2 Results objects), the provenance parser assumes a zipfile-based archive structure. If in the future we allow dependencies on the Framework, we might need to refactor significantly to allow the loading of Results from .qza/.qzv files, and the creation of ProvDAGs from those objects.

ChrisKeefe commented 3 years ago

If we define an API for working with different formats of Result (e.g. a DirectoryFormat in memory, a zipfile, etc.), then each Mounter/Handler/whatever can offer the necessary methods (e.g. get_file(), read_file(), can_handle()).

If we want to create ProvDAGs dynamically from whatever input format we are given, then ProvDAG can keep a list of commonly used Mounters, and something (__init__, for example) can iterate over that list, letting each Mounter determine whether it can_handle the data that was passed:

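import abc

import networkx as nx


class ProvDAG:
    # DirectoryMounter, ZipMounter, etc. would be Mounter subclasses
    mounters = [DirectoryMounter, ZipMounter, ...]

    def __init__(self, some_data, cfg):
        mounter = self.get_mounter(some_data)
        self.dag = nx.DiGraph()
        # if get_data were also to return the archive version number,
        # we might not need to pass `some_data` around as much
        data = mounter.get_data(some_data)
        handler = FormatHandler(cfg, some_data)
        handler.parse(some_data)

        ...

    def get_mounter(self, some_data):
        # the first mounter that claims the data wins
        for mounter in self.mounters:
            if mounter.can_handle(some_data):
                return mounter
        raise ValueError(f'No mounter can handle {some_data!r}')


class Mounter(abc.ABC):  # maybe an ABC?
    # maybe an @abstractmethod; called on the class, so a classmethod
    @classmethod
    @abc.abstractmethod
    def can_handle(cls, some_data):
        pass

    # maybe an @abstractmethod; some_data could be e.g. some pathname
    @classmethod
    @abc.abstractmethod
    def get_data(cls, some_data):
        pass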
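
The dispatch idea sketched above can be made runnable with only the standard library. This is a minimal sketch under stated assumptions, not existing provenance-lib code: the bodies of `ZipMounter` and `DirectoryMounter`, the `list_files` method, the module-level `MOUNTERS` list, and the `pick_mounter` helper are all hypothetical names chosen for illustration.

```python
import abc
import pathlib
import zipfile


class Mounter(abc.ABC):
    """Hypothetical interface: one subclass per Result representation."""

    def __init__(self, source):
        self.source = source

    @classmethod
    @abc.abstractmethod
    def can_handle(cls, source):
        """Return True if this mounter can read `source`."""

    @abc.abstractmethod
    def list_files(self):
        """Return the relative paths of all files in the Result."""


class ZipMounter(Mounter):
    @classmethod
    def can_handle(cls, source):
        path = pathlib.Path(source)
        return path.is_file() and zipfile.is_zipfile(path)

    def list_files(self):
        with zipfile.ZipFile(self.source) as zf:
            return sorted(n for n in zf.namelist() if not n.endswith('/'))


class DirectoryMounter(Mounter):
    @classmethod
    def can_handle(cls, source):
        return pathlib.Path(source).is_dir()

    def list_files(self):
        root = pathlib.Path(self.source)
        return sorted(p.relative_to(root).as_posix()
                      for p in root.rglob('*') if p.is_file())


# Order matters: the first mounter that claims the data wins.
MOUNTERS = [DirectoryMounter, ZipMounter]


def pick_mounter(source):
    """Return an instance of the first mounter that can handle `source`."""
    for mounter_cls in MOUNTERS:
        if mounter_cls.can_handle(source):
            return mounter_cls(source)
    raise ValueError(f'No mounter can handle {source!r}')
```

Raising when nothing matches, rather than returning None, keeps a bad input from surfacing later as an unrelated AttributeError inside the parser.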

Reimplementing the parsing logic to deal only with one expected in-memory format would spare us redundant parsing code, but I think we'll want to avoid reading full ZipArchives into memory, to keep IO costs down when the data is large (and unnecessary for our purposes).
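
To illustrate the IO point: the standard-library `zipfile` module reads the archive's central directory and decompresses only the members that are requested, so large data files can be skipped. A minimal sketch, assuming provenance files sit under a `provenance/` directory inside the archive; the function name `read_provenance_files` is hypothetical.

```python
import zipfile


def read_provenance_files(archive_path):
    """Return {member name: bytes} for provenance files only.

    Members outside a `provenance/` directory (e.g. large data files)
    are never decompressed.
    """
    contents = {}
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            parent_dirs = info.filename.split('/')[:-1]
            if info.is_dir() or 'provenance' not in parent_dirs:
                continue
            contents[info.filename] = zf.read(info)
    return contents
```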

ChrisKeefe commented 2 years ago

Random inline notes dump:

In future, this can decide whether it is dealing with a zip archive or an Artifact in memory, and can get the appropriate interface the Vx parsers should use when they interact with that artifact's data representation. If we want, we could probably pass the interface in when we return the instantiated parser object. This will slightly complicate tests that currently assume a parser is always dealing with a zip archive.

ChrisKeefe commented 2 years ago

We can hack this in the short term by .save-ing in-memory Results to a temp file and then reading them back in as Archives. Gross, but it gives us more flexibility for Alpha at zero cost.
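
A rough sketch of that workaround, written against a duck-typed `result` rather than the Framework itself. It assumes `result.save(path)` writes an archive and returns the path it actually wrote (which may differ from the requested one, e.g. if an extension is appended); the helper name `saved_to_temp_archive` is hypothetical.

```python
import pathlib
import tempfile
from contextlib import contextmanager


@contextmanager
def saved_to_temp_archive(result, filename='result.qza'):
    """Save an in-memory Result to a temporary file and yield its path.

    Assumes `result.save(path)` returns the path it wrote; if it returns
    nothing, the requested path is used instead. The temporary directory
    is removed when the context exits.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        requested = pathlib.Path(tmp_dir) / filename
        written = result.save(str(requested))
        yield pathlib.Path(written) if written else requested
```

While the context is open, the yielded path can be handed to the existing zipfile-based parser; once it closes, the file is gone, so any parsing has to finish inside the `with` block.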

ChrisKeefe commented 2 years ago

When this turns into a proper internal API, each mounter will need to implement the abstraction "what is a result called" for things like error messages: archive_identifier?

For files, that representation will likely be zf.filename. For artifacts in memory, we should probably revert to the archive root uuid.

Commit cdfc3139a0d53bd28fc691a91112e7b65ff7402a is a record of the switch from uuid to filenames.

Tests need to cover both cases, obviously.
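
One way the "what is a result called" abstraction might look, as a standalone sketch covering both cases. Only the `archive_identifier` name and the `zf.filename` / root uuid choices come from the notes above; the class names, constructors, and the `describe_failure` helper are assumptions for illustration.

```python
import uuid
import zipfile


class ZipMounter:
    """Hypothetical mounter for a Result stored as a zip archive."""

    def __init__(self, zf):
        self.zf = zf

    @property
    def archive_identifier(self):
        # Files are most recognizable to users by their filename.
        return self.zf.filename


class InMemoryMounter:
    """Hypothetical mounter for a Result that has no file on disk."""

    def __init__(self, root_uuid):
        self.root_uuid = root_uuid

    @property
    def archive_identifier(self):
        # With no filename available, use the archive root uuid.
        return str(self.root_uuid)


def describe_failure(mounter, problem):
    """Build an error message that names the Result consistently."""
    return f'{mounter.archive_identifier}: {problem}'
```

Because callers only touch `archive_identifier`, the same error-message tests can be parametrized over both mounters.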

ChrisKeefe commented 2 years ago

Support for archives stored in an Artifact Cache should be considered. @Oddant1 is probably a good resource for questions on this.