naobservatory / mgs-workflow

Try removing reference copying steps #22

Closed willbradshaw closed 1 week ago

willbradshaw commented 3 weeks ago

Previously, it was necessary to copy the relevant reference files over from the index directory for the pipeline to run and resume properly. This is wasteful of storage space and may no longer be necessary, especially with the move to Batch. I'd like to try removing these copying steps and reading from the index directory directly.
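
A minimal sketch of what reading from the index directory directly might look like, assuming a DSL2 process that takes the reference as a `path` input. The process name, parameter names, bucket path, and the bowtie2 command are hypothetical placeholders for illustration, not taken from the pipeline:

```nextflow
// Hypothetical sketch: all names and paths are illustrative, not from mgs-workflow.
params.ref_dir = "s3://example-bucket/index"   // assumed location of the index directory
params.reads   = "reads/*.fastq.gz"            // assumed input glob

process ALIGN_TO_REFERENCE {
    input:
    path reads
    path ref_index   // Nextflow stages this for each task; no separate copy step

    output:
    path "aligned.sam"

    script:
    """
    bowtie2 -x ${ref_index}/ref -U ${reads} -S aligned.sam
    """
}

workflow {
    reads = Channel.fromPath(params.reads)
    // Point at the index directory itself instead of a copy made by an earlier process
    ref_index = file("${params.ref_dir}/bowtie2-index")
    ALIGN_TO_REFERENCE(reads, ref_index)
}
```

Whether resuming still behaves properly with this arrangement is exactly what would need testing.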

mikemc commented 2 weeks ago

Are all of these steps in the PREPARE_REFERENCES workflow?

mikemc commented 2 weeks ago

I'm wondering if it also makes sense to change the pipeline so that the reference files are staged at the start of the specific workflow where they are needed, rather than all at once at the beginning. This would make it quicker to identify issues with paths, etc., and easier to run just parts of the workflow (which would also save on disk usage).

willbradshaw commented 2 weeks ago

> Are all of these steps in the PREPARE_REFERENCES workflow?

They are in the master branch. In the new branch I'm working on, they've been moved to their corresponding subworkflows.

This won't actually help with runtime, though, since Nextflow stages jobs as soon as their dependencies are satisfied, regardless of where they actually appear in the (sub)workflow files.

mikemc commented 2 weeks ago

> This won't actually help with runtime, though, since Nextflow stages jobs as soon as their dependencies are satisfied, regardless of where they actually appear in the (sub)workflow files.

This does help with how I've been interacting with the pipeline this week. Currently I'm running the pipeline only through rRNA deduplication (before any taxonomic assignment or human decontamination), which I do by commenting out the latter part of the main workflow. But I also need to comment out the relevant parts of the PREPARE_REFERENCES workflow to avoid unnecessarily fetching the refs. With the new structure, it sounds like I can just toggle the subworkflows within the main workflow and automatically get the refs as needed.
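
To illustrate the toggling idea, here is a rough sketch that assumes each subworkflow receives its own reference paths as inputs. The subworkflow names, include paths, file names, the `reads` emit, and the `run_taxonomy` parameter are all hypothetical:

```nextflow
// Hypothetical sketch: subworkflow names, parameters, and file names are illustrative.
params.ref_dir      = "s3://example-bucket/index"   // assumed index directory
params.reads        = "reads/*.fastq.gz"            // assumed input glob
params.run_taxonomy = false                         // assumed toggle for the later stage

include { DEDUP_RIBO } from './subworkflows/dedup_ribo'
include { TAXONOMY }   from './subworkflows/taxonomy'

workflow {
    reads = Channel.fromPath(params.reads)

    // Each subworkflow is handed only the references it needs
    DEDUP_RIBO(reads, file("${params.ref_dir}/ribo-ref.fasta.gz"))

    // Skipping this call also skips staging its references
    if (params.run_taxonomy) {
        // Assumes DEDUP_RIBO declares an output named `reads` in its emit block
        TAXONOMY(DEDUP_RIBO.out.reads, file("${params.ref_dir}/kraken-db"))
    }
}
```

With something like this, disabling the later stages would be a matter of running with `--run_taxonomy false` rather than commenting out code in two places.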

willbradshaw commented 1 week ago

Implemented in dev branch.