Closed LeeBergstrand closed 5 months ago
@jmtsuji Feel free to review when you get a chance. The eventual intent would be to move the following files to their software library and GitHub repository within the Rotary organization. The code could be used to build other snake make pipelines.
My intent would be to use this library for our own internal snake-make pipelines at my work.
@jmtsuji Feel free to review when you get a chance. The eventual intent would be to move the following files to their software library and GitHub repository within the Rotary organization. The code could be used to build other snake make pipelines.
- run.py
- sample.py
- dataset.py
- utils.py
My intent would be to use this library for our own internal snake-make pipelines at my work.
Reviewing now -- thanks for your patience! Yes, that sounds good to plan to move the files you mentioned to their own software library / repo.
@LeeBergstrand The improvements look great! Thanks for your hard work making the general library files for setting up a snakemake run, splitting the snakefile into smaller chunks, adding more temp files, and making various other minor enhancements/fixes. I checked the split snakefiles, and all the rules seem to be in the right places. I have added a few minor comments of things to revise, but otherwise I think this PR is OK to merge.
One bigger picture question, though, is how we should handle download-related rules. Right now, they are added into the same snakefile as the analysis rules that need them. Another option would be to make a separate download.smk file and to put all the download rules in there. We could potentially set up a new rule
rule download
that can be optionally called by the CLI (but otherwise goes uncalled) in that snakefile. This would allow users to download all needed DB files in a single command if they wanted to, as per #45 . If you like this idea to make a download-dedicated snakefile, we could go ahead and implement it now, while we are already re-organizing the snakefiles, or we could wait until some point in the future... either is OK for me. What do you think?Thanks again!
P.S. Note that I have not run an end-to-end test yet on this branch... I can do so if you would like.
From a maintenance standpoint, it makes more sense to me to have the database download rules in the same snakemake file as the rule that uses the data. That way, if you need to update anything, such as database versions, everything is self-contained in the same place so that you can edit both rules simultaneously. For example, when you specify a new model version name in the download rule and that same model version name in the processing rule, you can add both as a variable at the top of the file. Also, if you need to port things between software programs, everything is in one package. You can copy the file over. I would still add a download.smk
file with the rule rule download
but it would be the only thing in that file. This rule would be pretty long because it would check for the output files of all the download rules in the other files. Perhaps you could also have sub-rules (e.g., qc_download
) in download.smk
that check individual download steps and checkpoints). Because of how Snakemake works, you don't need to have all the download rules in download.smk
. You could either import download.smk
into rotary.smk
and call the rule out of rotary.smk
or you could import all the other rules into 'download.smk' and call download.smk
@LeeBergstrand The improvements look great! Thanks for your hard work making the general library files for setting up a snakemake run, splitting the snakefile into smaller chunks, adding more temp files, and making various other minor enhancements/fixes. I checked the split snakefiles, and all the rules seem to be in the right places. I have added a few minor comments of things to revise, but otherwise I think this PR is OK to merge. One bigger picture question, though, is how we should handle download-related rules. Right now, they are added into the same snakefile as the analysis rules that need them. Another option would be to make a separate download.smk file and to put all the download rules in there. We could potentially set up a new rule
rule download
that can be optionally called by the CLI (but otherwise goes uncalled) in that snakefile. This would allow users to download all needed DB files in a single command if they wanted to, as per #45 . If you like this idea to make a download-dedicated snakefile, we could go ahead and implement it now, while we are already re-organizing the snakefiles, or we could wait until some point in the future... either is OK for me. What do you think? Thanks again! P.S. Note that I have not run an end-to-end test yet on this branch... I can do so if you would like.From a maintenance standpoint, it makes more sense to me to have the database download rules in the same snakemake file as the rule that uses the data. That way, if you need to update anything, such as database versions, everything is self-contained in the same place so that you can edit both rules simultaneously. For example, when you specify a new model version name in the download rule and that same model version name in the processing rule, you can add both as a variable at the top of the file. Also, if you need to port things between software programs, everything is in one package. You can copy the file over. I would still add a
download.smk
file with the rulerule download
but it would be the only thing in that file. This rule would be pretty long because it would check for the output files of all the download rules in the other files. Perhaps you could also have sub-rules (e.g.,qc_download
) indownload.smk
that check individual download steps and checkpoints). Because of how Snakemake works, you don't need to have all the download rules indownload.smk
. You could either importdownload.smk
intorotary.smk
and call the rule out ofrotary.smk
or you could import all the other rules into 'download.smk' and calldownload.smk
# Include various modules.
include: './qc.smk'
include: './assembly.smk'
include: './polish.smk'
include: './circularize.smk'
include: './annotation.smk'
Snakemake uses the old-school way of importing that C or PHP uses. This command copy-pastes the file's contents of each child snakemake file into the parent snakemake file at the point of the input statement before it is run. That's why I can call variables in the parent file that aren't declared in the child file. Ultimately, it just looks like one big file for the DAG code.
P.S. Note that I have not run an end-to-end test yet on this branch... I can do so if you would like.
I've been running end to end tests on this code for a while now. I think you can run it after we move it into dev.
@jmtsuji, can we move your remaining suggestions to issues and merge the existing code we have right now? Because the code has changed so much, it's in our best interest to merge things ASAP so we can work off the same codebase and not drift too far apart. I can address your other issues on individual branches of primary development.
@LeeBergstrand The improvements look great! Thanks for your hard work making the general library files for setting up a snakemake run, splitting the snakefile into smaller chunks, adding more temp files, and making various other minor enhancements/fixes. I checked the split snakefiles, and all the rules seem to be in the right places. I have added a few minor comments of things to revise, but otherwise I think this PR is OK to merge. One bigger picture question, though, is how we should handle download-related rules. Right now, they are added into the same snakefile as the analysis rules that need them. Another option would be to make a separate download.smk file and to put all the download rules in there. We could potentially set up a new rule
rule download
that can be optionally called by the CLI (but otherwise goes uncalled) in that snakefile. This would allow users to download all needed DB files in a single command if they wanted to, as per #45 . If you like this idea to make a download-dedicated snakefile, we could go ahead and implement it now, while we are already re-organizing the snakefiles, or we could wait until some point in the future... either is OK for me. What do you think? Thanks again! P.S. Note that I have not run an end-to-end test yet on this branch... I can do so if you would like.From a maintenance standpoint, it makes more sense to me to have the database download rules in the same snakemake file as the rule that uses the data. That way, if you need to update anything, such as database versions, everything is self-contained in the same place so that you can edit both rules simultaneously. For example, when you specify a new model version name in the download rule and that same model version name in the processing rule, you can add both as a variable at the top of the file. Also, if you need to port things between software programs, everything is in one package. You can copy the file over. I would still add a
download.smk
file with the rulerule download
but it would be the only thing in that file. This rule would be pretty long because it would check for the output files of all the download rules in the other files. Perhaps you could also have sub-rules (e.g.,qc_download
) indownload.smk
that check individual download steps and checkpoints). Because of how Snakemake works, you don't need to have all the download rules indownload.smk
. You could either importdownload.smk
intorotary.smk
and call the rule out ofrotary.smk
or you could import all the other rules into 'download.smk' and calldownload.smk
Moved my comments to: https://github.com/rotary-genomics/rotary/issues/45
I've been running end to end tests on this code for a while now. I think you can run it after we move it into dev.
OK, sounds good!
@jmtsuji, can we move your remaining suggestions to issues and merge the existing code we have right now? Because the code has changed so much, it's in our best interest to merge things ASAP so we can work off the same codebase and not drift too far apart. I can address your other issues on individual branches of primary development.
Yes, that sounds fine to move my remaining suggestions to issues and merge this PR ASAP. I've moved the remaining discussions in this PR to issues. It looks like everything else is resolved. Feel free to merge once ready!
Refactor many aspects of the rotary code base to support further extension and development.
Primary Changes: