Develop lee - Githubissues

LeeBergstrand commented 6 months ago

Refactor many aspects of the rotary code base to support further extension and development.

Primary Changes:

Split the rotary.smk pipeline into multiple files that are specialized for each group of tasks done by the pipeline. For example, all the QC rules are placed in a qc.smk file. The rotary.smk file now includes these other files when running the pipeline.
Refactor and rewrite the code for handling the creation of sample TSV files to make them more extensible. Different types of sequencing files or file pairs (e.g., long or paired short reads) are represented as objects. These objects can be grouped into sample objects representing the multiple sequencing files used to represent one physical sample. Several sample objects represent different types of samples (e.g., long read sample, paired short read sample, or long with polishing reads samples). A dataset object represents multiple samples contained in a single sequencing run. Dataset objects containing multiple sample objects are used to either generate sample TSVs or can be generated by sample TSVs.
I refactored the CLI code for Rotary to use the above objects and moved the code for running SnakeMake as a subprocess into its own file.
Addressed: https://github.com/rotary-genomics/rotary/issues/116

LeeBergstrand commented 5 months ago

@jmtsuji Feel free to review when you get a chance. The eventual intent would be to move the following files to their software library and GitHub repository within the Rotary organization. The code could be used to build other snake make pipelines.

run.py
sample.py
dataset.py
utils.py

My intent would be to use this library for our own internal snake-make pipelines at my work.

jmtsuji commented 5 months ago

@jmtsuji Feel free to review when you get a chance. The eventual intent would be to move the following files to their software library and GitHub repository within the Rotary organization. The code could be used to build other snake make pipelines.

run.py

sample.py

dataset.py

utils.py

My intent would be to use this library for our own internal snake-make pipelines at my work.

Reviewing now -- thanks for your patience! Yes, that sounds good to plan to move the files you mentioned to their own software library / repo.

LeeBergstrand commented 5 months ago

@LeeBergstrand The improvements look great! Thanks for your hard work making the general library files for setting up a snakemake run, splitting the snakefile into smaller chunks, adding more temp files, and making various other minor enhancements/fixes. I checked the split snakefiles, and all the rules seem to be in the right places. I have added a few minor comments of things to revise, but otherwise I think this PR is OK to merge.

One bigger picture question, though, is how we should handle download-related rules. Right now, they are added into the same snakefile as the analysis rules that need them. Another option would be to make a separate download.smk file and to put all the download rules in there. We could potentially set up a new rule rule download that can be optionally called by the CLI (but otherwise goes uncalled) in that snakefile. This would allow users to download all needed DB files in a single command if they wanted to, as per #45 . If you like this idea to make a download-dedicated snakefile, we could go ahead and implement it now, while we are already re-organizing the snakefiles, or we could wait until some point in the future... either is OK for me. What do you think?

Thanks again!

P.S. Note that I have not run an end-to-end test yet on this branch... I can do so if you would like.

From a maintenance standpoint, it makes more sense to me to have the database download rules in the same snakemake file as the rule that uses the data. That way, if you need to update anything, such as database versions, everything is self-contained in the same place so that you can edit both rules simultaneously. For example, when you specify a new model version name in the download rule and that same model version name in the processing rule, you can add both as a variable at the top of the file. Also, if you need to port things between software programs, everything is in one package. You can copy the file over. I would still add a download.smk file with the rule rule download but it would be the only thing in that file. This rule would be pretty long because it would check for the output files of all the download rules in the other files. Perhaps you could also have sub-rules (e.g., qc_download) in download.smk that check individual download steps and checkpoints). Because of how Snakemake works, you don't need to have all the download rules in download.smk. You could either import download.smk into rotary.smk and call the rule out of rotary.smk or you could import all the other rules into 'download.smk' and call download.smk

LeeBergstrand commented 5 months ago

@LeeBergstrand The improvements look great! Thanks for your hard work making the general library files for setting up a snakemake run, splitting the snakefile into smaller chunks, adding more temp files, and making various other minor enhancements/fixes. I checked the split snakefiles, and all the rules seem to be in the right places. I have added a few minor comments of things to revise, but otherwise I think this PR is OK to merge. One bigger picture question, though, is how we should handle download-related rules. Right now, they are added into the same snakefile as the analysis rules that need them. Another option would be to make a separate download.smk file and to put all the download rules in there. We could potentially set up a new rule rule download that can be optionally called by the CLI (but otherwise goes uncalled) in that snakefile. This would allow users to download all needed DB files in a single command if they wanted to, as per #45 . If you like this idea to make a download-dedicated snakefile, we could go ahead and implement it now, while we are already re-organizing the snakefiles, or we could wait until some point in the future... either is OK for me. What do you think? Thanks again! P.S. Note that I have not run an end-to-end test yet on this branch... I can do so if you would like.

From a maintenance standpoint, it makes more sense to me to have the database download rules in the same snakemake file as the rule that uses the data. That way, if you need to update anything, such as database versions, everything is self-contained in the same place so that you can edit both rules simultaneously. For example, when you specify a new model version name in the download rule and that same model version name in the processing rule, you can add both as a variable at the top of the file. Also, if you need to port things between software programs, everything is in one package. You can copy the file over. I would still add a download.smk file with the rule rule download but it would be the only thing in that file. This rule would be pretty long because it would check for the output files of all the download rules in the other files. Perhaps you could also have sub-rules (e.g., qc_download) in download.smk that check individual download steps and checkpoints). Because of how Snakemake works, you don't need to have all the download rules in download.smk. You could either import download.smk into rotary.smk and call the rule out of rotary.smk or you could import all the other rules into 'download.smk' and call download.smk

# Include various modules.
include: './qc.smk'
include: './assembly.smk'
include: './polish.smk'
include: './circularize.smk'
include: './annotation.smk'

Snakemake uses the old-school way of importing that C or PHP uses. This command copy-pastes the file's contents of each child snakemake file into the parent snakemake file at the point of the input statement before it is run. That's why I can call variables in the parent file that aren't declared in the child file. Ultimately, it just looks like one big file for the DAG code.

LeeBergstrand commented 5 months ago

P.S. Note that I have not run an end-to-end test yet on this branch... I can do so if you would like.

I've been running end to end tests on this code for a while now. I think you can run it after we move it into dev.

LeeBergstrand commented 5 months ago

@jmtsuji, can we move your remaining suggestions to issues and merge the existing code we have right now? Because the code has changed so much, it's in our best interest to merge things ASAP so we can work off the same codebase and not drift too far apart. I can address your other issues on individual branches of primary development.

LeeBergstrand commented 5 months ago

@LeeBergstrand The improvements look great! Thanks for your hard work making the general library files for setting up a snakemake run, splitting the snakefile into smaller chunks, adding more temp files, and making various other minor enhancements/fixes. I checked the split snakefiles, and all the rules seem to be in the right places. I have added a few minor comments of things to revise, but otherwise I think this PR is OK to merge. One bigger picture question, though, is how we should handle download-related rules. Right now, they are added into the same snakefile as the analysis rules that need them. Another option would be to make a separate download.smk file and to put all the download rules in there. We could potentially set up a new rule rule download that can be optionally called by the CLI (but otherwise goes uncalled) in that snakefile. This would allow users to download all needed DB files in a single command if they wanted to, as per #45 . If you like this idea to make a download-dedicated snakefile, we could go ahead and implement it now, while we are already re-organizing the snakefiles, or we could wait until some point in the future... either is OK for me. What do you think? Thanks again! P.S. Note that I have not run an end-to-end test yet on this branch... I can do so if you would like.

From a maintenance standpoint, it makes more sense to me to have the database download rules in the same snakemake file as the rule that uses the data. That way, if you need to update anything, such as database versions, everything is self-contained in the same place so that you can edit both rules simultaneously. For example, when you specify a new model version name in the download rule and that same model version name in the processing rule, you can add both as a variable at the top of the file. Also, if you need to port things between software programs, everything is in one package. You can copy the file over. I would still add a download.smk file with the rule rule download but it would be the only thing in that file. This rule would be pretty long because it would check for the output files of all the download rules in the other files. Perhaps you could also have sub-rules (e.g., qc_download) in download.smk that check individual download steps and checkpoints). Because of how Snakemake works, you don't need to have all the download rules in download.smk. You could either import download.smk into rotary.smk and call the rule out of rotary.smk or you could import all the other rules into 'download.smk' and call download.smk

Moved my comments to: https://github.com/rotary-genomics/rotary/issues/45

jmtsuji commented 5 months ago

I've been running end to end tests on this code for a while now. I think you can run it after we move it into dev.

OK, sounds good!

@jmtsuji, can we move your remaining suggestions to issues and merge the existing code we have right now? Because the code has changed so much, it's in our best interest to merge things ASAP so we can work off the same codebase and not drift too far apart. I can address your other issues on individual branches of primary development.

Yes, that sounds fine to move my remaining suggestions to issues and merge this PR ASAP. I've moved the remaining discussions in this PR to issues. It looks like everything else is resolved. Feel free to merge once ready!

rotary-genomics / rotary

Develop lee #114