Open victorlin opened 3 days ago
Generally, I would love to have a way to use pathogen-specific Docker images in our workflows! That's been my dream since we added pango-learn to the base image back in the early pandemic.
For the specific candidates you mentioned for removal, I can make some specific notes:
fauna: Most of our flu workflows source data from S3 now, with a separate action that transfers data from fauna to S3. As long as that separate transfer action has fauna in its runtime, we don't strictly need fauna for the main workflows.
evofr and pathogen-embed: Although these tools are only publicly used by forecasts-ncov right now, we do plan to use them for other projects (e.g., adding t-SNE embeddings in the avian flu workflow). Providing these general tools to the public through our runtimes also encourages their use outside of our group. Maybe we could spend a little time trying to optimize the installation time/size of these tools in our runtimes? There could be some larger dependencies of these tools that we could replace with slimmer alternatives.
pango_aliasor: I can see why you'd want to keep this ncov-specific tool out of the base image, but it's also a great example of a tiny package that is helpful to any ncov analyses outside of forecasts-ncov, so it doesn't seem to hurt to keep it.
epiweeks: this package is also used by seasonal flu and could plausibly be used elsewhere (e.g., measles). It's also a pretty tiny package.
I would also recommend removing the pango-learn packages and their binary dependencies, gofasta and minimap2, since all of our pango annotations come from Nextclade now.
> For each pathogen/project that relies on tools that may be removed, create and use a custom runtime that installs the tools. Right now the process may be more involved than it should be, and we should provide a good path for extending the base runtimes (examples: docker, conda).
This is the other half of workflows as programs, namely "the artifacts/bundling (keyword: buildpacks) side of things", no?
(And yes, we should totally do this if it's at all feasible -- there are a number of times I haven't done something because I know it's going to be such a hassle / burden to make the needed dependency available to our runtimes.)
> This is the other half of workflows as programs, namely "the artifacts/bundling (keyword: buildpacks) side of things", no?
Yes, precisely. The whole idea there is that instead of having runtimes and pathogens separately, we have pathogens that are (or contain) their runtimes. We want to avoid having N pathogens and N×M pathogen-runtimes and making the user match them.
The implementation examples Victor gave (and things like ncov-ingest's image) are coming at this from what I'd call a more ad-hoc approach, and I do not think we should go down that path as a way to get to custom runtimes per pathogen. That way lies ecosystem fragmentation and incurs significant usability costs (to both users and developers, us and others).
There are lots of considerations for this work. For example, our runtimes are not small when installed on disk. We're going to want to be able to share a concrete, installed base across pathogens (not just a conceptual base).
We'll also want to consider the costs vs. benefits of moving something out of the base runtimes; it will have non-trivial overhead (both conceptual and actual) and we should only do it when it's worth it. I'm not convinced many of the candidates given above meet that threshold. What concretely are we gaining with the removal of each?
@jameshadfield
> there are a number of times I haven't done something because I know it's going to be such a hassle / burden to make the needed dependency available to our runtimes
Do you have examples? They would be very helpful to guide both eventual work on this topic but also suggest pain points we might be able to alleviate now with the current base runtimes.
> Do you have examples?
The one I was reminded of with this week's avian-flu work is https://github.com/nextstrain/avian-flu/issues/80. There have been a bunch of others along the lines of "can't use this pip dependency, not in our runtimes" but I managed to find an alternate solution so it wasn't a dealbreaker.
This applies to docker-base and conda-base.
Context
Our base image has accumulated various pathogen-specific tools over time, some of which significantly contribute to build time and image size. By removing these pathogen-specific tools, we can ensure the base image/environment reflects a continually updated version of Nextstrain tools and their dependencies. Using fauna as an example, more detailed reasoning is in https://github.com/nextstrain/fauna/issues/170.
Candidates
Tasks
For each pathogen/project that relies on tools that may be removed, create and use a custom runtime that installs the tools. Right now the process may be more involved than it should be, and we should provide a good path for extending the base runtimes (examples: docker, conda).
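To make the Docker path concrete, here is a minimal sketch of what a pathogen-specific image extending the base image might look like. It is an illustration under stated assumptions, not an agreed-upon convention: it assumes `pip3` is available in `nextstrain/base`, and it picks `epiweeks` only as an example of a small tool a project might install for itself.

```dockerfile
# Sketch only: extend the Nextstrain base image with one extra tool.
# Assumption: pip3 is available in the base image.
# In practice you would likely pin a specific base image tag for reproducibility.
FROM nextstrain/base

# epiweeks is used purely as an example of a tool that could live in a
# pathogen-specific image instead of the base image.
RUN pip3 install --no-cache-dir epiweeks
```

Such an image could then be built locally with `docker build -t my-pathogen-runtime .` and, assuming the Nextstrain CLI's `--image` option for the Docker runtime, used via `nextstrain build --docker --image my-pathogen-runtime .`. The name `my-pathogen-runtime` is hypothetical.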