vdblab / vdblab-shotgun

Shotgun metagenomic sequencing processing pipeline
MIT License

Feature/skip sample2markers #47

Closed nickp60 closed 1 year ago

nickp60 commented 1 year ago

The sample2markers step takes ages and is not needed on all runs. This adds an option to disable it.

funnell commented 1 year ago

sample2markers is just used for strainphlan, right? Perhaps the sample2markers step could be moved into the strainphlan workflow somehow?

nickp60 commented 1 year ago

Yeah I think that would be best; the only complication I see with completely disconnecting it from the biobakery workflow is that we would want a way to avoid regenerating the sample2markers pickles if we don't have to. The changes I made here add the metaphlan alignment file used to generate the pickle as an output, but there could be cases where we have some already processed. The options I see are:

nickp60 commented 1 year ago

Hi @funnell, I thought about this a bit more. Since both the sample markers files and the taxa markers files are tied to a specific version of the metaphlan reference database, I think it might be good to put both sets of markers in the resources directory. I have revamped the code to do the following

Let me know what you think!

funnell commented 1 year ago

I like the idea of caching those sample marker pkl files, but my concern with storing them in an external (to Isabl) location is what happens if we make a change to sample preprocessing, or to some other upstream step? Wouldn't strainphlan then be running with outdated marker files?

nickp60 commented 1 year ago

That's a good point. The taxa markers are not sample dependent, so regardless of pipeline version those should be in a central location specific to a db release. How about I make separate path variables for strainphlan_sample_markers_dir and strainphlan_taxa_markers_dir? That way we could set strainphlan_taxa_markers_dir to the db release, and set strainphlan_sample_markers_dir separately.

Running this via Isabl will make any sort of per-sample caching difficult. Unless maybe we hardcode a symlink step to consolidate the sample pkls to a (versioned) location?
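For illustration, a symlink consolidation step along those lines might look roughly like the sketch below. This is not code from the PR: the function name `consolidate_sample_pkls`, the `.pkl` glob pattern, and the directory layout are all assumptions made for the example.

```python
from pathlib import Path


def consolidate_sample_pkls(run_dir: Path, cache_dir: Path) -> list[Path]:
    """Symlink every .pkl found under run_dir into cache_dir.

    Hypothetical helper: run_dir stands in for a per-analysis output
    directory and cache_dir for a versioned, shared location.
    Returns the links that were newly created.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for pkl in sorted(run_dir.rglob("*.pkl")):
        link = cache_dir / pkl.name
        # Keep whatever is already cached under this name, even a dangling link.
        if link.exists() or link.is_symlink():
            continue
        link.symlink_to(pkl.resolve())
        created.append(link)
    return created
```

Running it twice over the same directories is a no-op the second time, so it could be called at the end of every run without clobbering earlier entries.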

As far as typical use, I don't think it's going to be convenient to run strainphlan via Isabl, as both the samples and taxa are going to be very specific to a particular run. As an alternative, we could make a versioned directory under {strainphlan_markers_dir}/samples/biobakery_app_version_###, and have the expectation that this workflow is not run via Isabl. The sample pkls won't be tracked by Isabl, but at least the input sams are.
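A minimal sketch of how that versioned layout could be resolved and checked before regenerating anything. The helper names, the `<sample_id>.pkl` file naming, and the version string passed in are assumptions for illustration, not part of the pipeline.

```python
from pathlib import Path


def sample_markers_dir(strainphlan_markers_dir: str, app_version: str) -> Path:
    """Build {strainphlan_markers_dir}/samples/biobakery_app_version_<version>."""
    return (
        Path(strainphlan_markers_dir)
        / "samples"
        / f"biobakery_app_version_{app_version}"
    )


def missing_pkls(markers_dir: Path, sample_ids: list[str]) -> list[str]:
    """Return the sample IDs with no cached pickle in markers_dir.

    Assumes one file per sample named <sample_id>.pkl (hypothetical naming).
    """
    return [s for s in sample_ids if not (markers_dir / f"{s}.pkl").exists()]
```

Only the samples returned by `missing_pkls` would need sample2markers rerun; pickles produced by a different biobakery app version live in a different directory and are never picked up by accident.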