Open ChrisKeefe opened 3 years ago
If we define an API for working with different formats of Result (e.g. a DirectoryFormat in memory, a zipfile, etc.), then each Mounter/Handler/whatever can offer the necessary methods (e.g. get_file(), read_file(), can_handle()).
If we want to dynamically create ProvDAGs from whatever input format, then ProvDAG can keep a list of commonly used Mounters, and something (__init__, for example) can iterate over that list, allowing each mounter to determine whether it can_handle the data that was passed.
import networkx as nx


class ProvDag:
    # DirectoryMounter, ZipMounter, etc. are Mounter subclasses defined elsewhere
    mounters = [DirectoryMounter, ZipMounter, ...]

    def __init__(self, some_data, cfg):
        mounter = self.get_mounter(some_data)
        self.dag = nx.DiGraph()
        # if get_data were also to return the archive version number,
        # we might not need to pass `some_data` around as much
        data = mounter.get_data(some_data)
        handler = FormatHandler(cfg, some_data)
        handler.parse(some_data)
        ...

    def get_mounter(self, some_data):
        # ask each known mounter, in order, whether it recognizes the input
        for mounter in self.mounters:
            if mounter.can_handle(some_data):
                return mounter
        # fail loudly instead of silently returning None
        raise ValueError(f"No mounter can handle {some_data!r}")


class Mounter:  # maybe an ABC?
    # maybe @abstractmethod
    def can_handle(some_data):
        pass

    # maybe @abstractmethod
    def get_data(e_g_some_pathname):
        pass
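To make the dispatch idea concrete, here is a minimal runnable sketch. It assumes the inputs are filesystem paths, and it fills in `DirectoryMounter` and `ZipMounter` with hypothetical implementations. The `read_file(some_data, relpath)` signature and the module-level `MOUNTERS` list are illustrative choices, not a settled API.

```python
import os
import zipfile
from abc import ABC, abstractmethod


class Mounter(ABC):
    """Hypothetical base class: one subclass per supported input format."""

    @staticmethod
    @abstractmethod
    def can_handle(some_data):
        """Return True if this mounter knows how to read `some_data`."""

    @staticmethod
    @abstractmethod
    def read_file(some_data, relpath):
        """Return the bytes stored at `relpath` inside `some_data`."""


class DirectoryMounter(Mounter):
    @staticmethod
    def can_handle(some_data):
        return os.path.isdir(some_data)

    @staticmethod
    def read_file(some_data, relpath):
        with open(os.path.join(some_data, relpath), "rb") as fh:
            return fh.read()


class ZipMounter(Mounter):
    @staticmethod
    def can_handle(some_data):
        return os.path.isfile(some_data) and zipfile.is_zipfile(some_data)

    @staticmethod
    def read_file(some_data, relpath):
        # only the requested member is decompressed
        with zipfile.ZipFile(some_data) as zf:
            return zf.read(relpath)


# Order matters: the first mounter that claims the data wins.
MOUNTERS = [DirectoryMounter, ZipMounter]


def get_mounter(some_data):
    for mounter in MOUNTERS:
        if mounter.can_handle(some_data):
            return mounter
    raise ValueError(f"No mounter can handle {some_data!r}")
```

With this shape, the caller never checks the input type itself: `get_mounter(path).read_file(path, relpath)` works the same for a directory and for a zip archive.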
Reimplementing the parsing logic to deal only with one expected in-memory data format would save us from duplicating parsing logic, but I think we'll want to avoid reading full ZipArchives into memory, to keep IO costs down when the data is large (and unnecessary for our purposes).
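As a rough illustration of keeping IO down, the standard-library `zipfile` module reads the archive's central directory on open and decompresses a member only when asked for it. The helper below is a hypothetical sketch (the name `iter_matching_members` and the suffix-based filter are assumptions, not existing code) that pulls out only the members we care about and never touches the rest.

```python
import zipfile


def iter_matching_members(archive_path, suffix):
    """Yield (member_name, bytes) for members whose name ends with `suffix`.

    Other members are listed but never decompressed.
    """
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            if info.filename.endswith(suffix):
                with zf.open(info) as fh:
                    yield info.filename, fh.read()
```

Large data files in the archive cost only a directory entry here, so memory use scales with the files actually parsed rather than with the archive size.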
Random inline notes dump:
In future, this can decide whether it is dealing with a zip archive or an Artifact in memory, and can get the appropriate interface the Vx parsers should use when they interact with that artifact's data representation. If we want, we could probably pass the interface in when we return the instantiated parser object. This will slightly complicate tests that currently assume a parser is always dealing with a zip archive.
We can hack this in the short term by .save-ing results in memory to a temp file and then reading them in as Archives. Gross, but gives us more flexibility for Alpha at zero cost.
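The temp-file hack could look something like the sketch below. It assumes only that the in-memory result exposes a `save(path)` method that writes a zip archive to exactly that path. `FakeResult` is a stand-in invented for this example so the code runs without the Framework, and `as_temp_archive` is a hypothetical helper name.

```python
import contextlib
import os
import tempfile
import zipfile


class FakeResult:
    """Stand-in for an in-memory result; only `save(path)` is assumed."""

    def __init__(self, files):
        self.files = files

    def save(self, path):
        with zipfile.ZipFile(path, "w") as zf:
            for name, text in self.files.items():
                zf.writestr(name, text)


@contextlib.contextmanager
def as_temp_archive(result, filename="result.qza"):
    """Save `result` to a temp file and yield it as an open ZipFile.

    The temp directory is removed when the context exits.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, filename)
        result.save(path)
        with zipfile.ZipFile(path) as zf:
            yield zf
```

Everything downstream keeps treating the input as a zip archive, at the price of one extra write and read per in-memory result.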
When this turns into a proper internal API, each mounter will need to implement the abstraction "what is a result called" for things like error messages: archive_identifier?
For files that representation will likely be zf.filename. For artifacts in memory, we should probably revert to archive root uuid.
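A minimal sketch of what an `archive_identifier` abstraction might look like, assuming one small wrapper per representation. The class names `ZipArchiveHandle` and `InMemoryHandle` and the helper `missing_file_message` are hypothetical; the only real API used is the `filename` attribute of `zipfile.ZipFile`.

```python
import zipfile


class ZipArchiveHandle:
    """Wraps an open ZipFile; identified by the file name it was opened with."""

    def __init__(self, zf):
        self.zf = zf

    @property
    def archive_identifier(self):
        return self.zf.filename


class InMemoryHandle:
    """Wraps an in-memory result; identified by its archive root uuid."""

    def __init__(self, root_uuid):
        self.root_uuid = root_uuid

    @property
    def archive_identifier(self):
        return str(self.root_uuid)


def missing_file_message(handle, relpath):
    # error messages use the same attribute for either representation
    return f"File {relpath!r} not found in result {handle.archive_identifier}"
```

Tests for both cases then only need to assert on `archive_identifier`, rather than on how each handle stores its data.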
Commit cdfc3139a0d53bd28fc691a91112e7b65ff7402a is a record of the switch from uuid to filenames
Tests need to cover both cases, obviously.
Support for archives stored in an Artifact Cache should be considered. @Oddant1 is probably a good resource for questions on this.
As currently implemented (because we don't know anything about QIIME 2 Results objects), the provenance parser assumes a zipfile-based archive structure. If in the future we allow dependencies on the Framework, we might need to refactor significantly to allow the loading of Results from .qza/.qzv files, and the creation of ProvDAGs from those objects.