Closed nickp60 closed 1 year ago
sample2markers is just used for strainphlan, right? Perhaps the sample2markers stuff could just be moved to the strainphlan workflow somehow?
Yeah I think that would be best; the only complication I see with completely disconnecting it from the biobakery workflow is that we would want a way to avoid regenerating the sample2markers pickles if we don't have to. The changes I made here add the metaphlan alignment file used to generate the pickle as an output, but there could be cases where we have some already processed. The options I see are:
Hi @funnell, I thought about this a bit more. Since both the sample markers files and the taxa markers files are tied to a specific version of the metaphlan reference database, I think it might be good to put both sets of markers in the resources directory. I have revamped the code to do the following:
- a strainphan_markers_dir config arg, which I think we should default to the metaphlan dir in the resources
- the strainphan_markers_dir directory allows on-demand creation or reuse of either strain or sample marker files
Let me know what you think!
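For illustration, the "create on demand or reuse" behaviour could look roughly like the sketch below. This is not the code in the PR: the function name `get_or_create_marker` and the `build` callback (standing in for whatever actually runs sample2markers or extracts taxa markers) are hypothetical, and the `.pkl` suffix is assumed from the discussion.

```python
from pathlib import Path
from typing import Callable


def get_or_create_marker(
    markers_dir: Path, name: str, build: Callable[[Path], None]
) -> Path:
    """Return the marker file for `name`, building it only if it is missing.

    `build` is a hypothetical callback that writes the marker file to the
    path it is given (e.g. a wrapper around the sample2markers step).
    """
    markers_dir.mkdir(parents=True, exist_ok=True)
    target = markers_dir / f"{name}.pkl"
    if not target.exists():
        # Cache miss: generate the marker file once.
        build(target)
    # Cache hit (or freshly built): reuse the existing file.
    return target
```

Note that a check based only on file existence cannot tell whether a cached file was produced by an older version of an upstream step.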
I like the idea of caching those sample marker pkl files, but my concern with storing them in an external (to Isabl) location is what happens if we make a change to sample preprocessing, or to some other upstream step? Wouldn't strainphlan then be running with outdated marker files?
That's a good point. The taxa markers are not sample dependent, so regardless of pipeline version those should be in a central location specific to a db release. How about I make separate path variables for strainphlan_sample_markers_dir and strainphlan_taxa_markers_dir? That way we could set strainphlan_taxa_markers_dir to the db release, and set strainphlan_sample_markers_dir separately.
Running this via Isabl will make any sort of per-sample caching difficult. Unless maybe we hardcode a symlink step to consolidate the sample pkls to a (versioned) location?
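A symlink consolidation step along those lines might be sketched as follows. The function name `consolidate_sample_pkls` and the layout of the destination directory are assumptions for illustration only; nothing here is part of Isabl or the existing workflow.

```python
from pathlib import Path
from typing import Iterable, List


def consolidate_sample_pkls(pkls: Iterable[Path], dest_dir: Path) -> List[Path]:
    """Symlink each sample pkl into `dest_dir`; return the links created.

    `dest_dir` is assumed to be a versioned location chosen by the caller.
    Existing entries are left alone, so the step is safe to re-run.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for pkl in pkls:
        link = dest_dir / pkl.name
        if link.is_symlink() or link.exists():
            continue  # already consolidated
        link.symlink_to(pkl.resolve())
        created.append(link)
    return created
```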
As far as typical use goes, I don't think it's going to be convenient to run strainphlan via Isabl, as both the samples and taxa are going to be very specific to a particular run. As an alternative, we could make a versioned directory under {strainphlan_markers_dir}/samples/biobakery_app_version_###, and have the expectation that this workflow is not run via Isabl. The sample pkls won't be tracked by Isabl, but at least the input sams are.
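As a rough sketch of how the versioned sample directory could be derived: the helper name `versioned_sample_markers_dir` is hypothetical, and it assumes the `###` placeholder stands for the biobakery app version string.

```python
from pathlib import Path
from typing import Union


def versioned_sample_markers_dir(
    strainphlan_markers_dir: Union[str, Path], app_version: str
) -> Path:
    """Build the per-version directory for sample marker pkls.

    Assumes the layout proposed above:
    {strainphlan_markers_dir}/samples/biobakery_app_version_<version>
    """
    return (
        Path(strainphlan_markers_dir)
        / "samples"
        / f"biobakery_app_version_{app_version}"
    )
```

Keying the directory on the app version means a change to preprocessing in a new release would write to a fresh directory instead of reusing pkls from the old one.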
The sample2markers step takes ages and is not needed on all runs. This adds an option to disable it.
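One way such an option could gate the step is sketched below. The flag name `run_sample2markers`, its default of enabled, and the output paths are all made up for illustration and may not match what the PR actually uses.

```python
from typing import Dict, List


def build_targets(samples: List[str], config: Dict) -> List[str]:
    """List the outputs to request, skipping marker pkls when disabled.

    `run_sample2markers` is a hypothetical config key; it is assumed to
    default to True so existing runs keep their current behaviour.
    """
    # Hypothetical path for the metaphlan alignment output of each sample.
    targets = [f"metaphlan/{s}.sam.bz2" for s in samples]
    if config.get("run_sample2markers", True):
        # Only request the slow sample2markers outputs when enabled.
        targets += [f"markers/{s}.pkl" for s in samples]
    return targets
```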