Thanks for the comments @danich1 - I was not aware of `subcommand` before, and I am not sure I understand its best use case completely. My apologies if I'm not following exactly, but I am not sure that this specific workflow will work given each command's requirements.
Specifically, the `# FastQC` step written on lines 15-17 of `wes_pipeline.sh` acts on each of the 120 samples independently; essentially, the script runs `fastqc` on every raw input file. The `MultiQC` step, by contrast, runs on all samples in a given directory.
Actually, maybe `subcommands` would work here. The `FastQC` script would remain more or less as is, but I would create a `MultiQC` subcommand that takes a different input (a directory rather than individual sample files). Does this make sense?
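A minimal sketch of what I'm picturing, assuming we dispatch to `fastqc`/`multiqc` via `subprocess` (the flag names, argument names, and output directories here are placeholders, not the pipeline's actual commands):

```python
import argparse
import os
import subprocess

def main():
    parser = argparse.ArgumentParser(description="WES pipeline wrapper")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # The fastqc subcommand acts on each raw input file independently.
    fastqc_parser = subparsers.add_parser("fastqc", help="run FastQC on raw fastq files")
    fastqc_parser.add_argument("fastq_files", nargs="+", help="raw fastq file(s), one or more samples")
    fastqc_parser.add_argument("--out-dir", default="fastqc_output")

    # The multiqc subcommand takes a different input: a directory of FastQC results.
    multiqc_parser = subparsers.add_parser("multiqc", help="aggregate FastQC reports")
    multiqc_parser.add_argument("results_dir", help="directory containing per-sample FastQC output")

    args = parser.parse_args()
    if args.command == "fastqc":
        os.makedirs(args.out_dir, exist_ok=True)
        for fastq in args.fastq_files:
            subprocess.check_call(["fastqc", fastq, "-o", args.out_dir])
    elif args.command == "multiqc":
        subprocess.check_call(["multiqc", args.results_dir])

if __name__ == "__main__":
    main()
```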
I am also brainstorming other ways of cutting down on redundancy in this pipeline.
Maybe we can chat next to a whiteboard briefly today?
Yeah sounds good to me. I'm free anytime before 12pm or after 1:30pm. Just let me know.
@danich1 and I brainstormed pipeline solutions :point_up_2:
We determined that the best way forward at the moment is to migrate the functionality of the individual `Scripts` shown above into the `command_wrapper` utility using argparse subcommands. The subcommand definitions will live in a separate Python script that is imported into `command_wrapper` to reduce file complexity; a sketch of that split is below.
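As a rough sketch of that layout (file names, function names, and the single example subcommand are placeholders; the real set of subcommands would mirror the steps in `wes_pipeline.sh`):

```python
# subcommands.py -- hypothetical module holding the subcommand definitions
import os
import subprocess

def run_fastqc(args):
    # run FastQC on each raw input file independently
    os.makedirs(args.out_dir, exist_ok=True)
    for fastq in args.fastq_files:
        subprocess.check_call(["fastqc", fastq, "-o", args.out_dir])

def register(subparsers):
    fastqc = subparsers.add_parser("fastqc", help="run FastQC per sample")
    fastqc.add_argument("fastq_files", nargs="+")
    fastqc.add_argument("--out-dir", default="fastqc_output")
    fastqc.set_defaults(func=run_fastqc)
    # ...additional steps (multiqc, alignment, etc.) registered the same way
```

```python
# command_wrapper.py -- thin entry point that just wires up and dispatches
import argparse
import subcommands

parser = argparse.ArgumentParser(description="WES pipeline wrapper")
subcommands.register(parser.add_subparsers(dest="command", required=True))
args = parser.parse_args()
args.func(args)  # call the function registered by the chosen subcommand
```

The main script stays tiny, and each new pipeline step only touches the subcommands module.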
We also discussed potentially writing a script that generates a bash script for each step of the pipeline for each individual sample (a rough sketch follows below). The upside would be less manual feeding of commands to the cluster; the downside would be that it is harder to pause the pipeline and view intermediate files. That upside is clear for this project, but a new project would likely require viewing intermediate files.
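For later reference, a sketch of what that generator might look like, with made-up sample names and paths, and submission left commented out since the scheduler command would need to be confirmed:

```python
from pathlib import Path

samples = ["sample_01", "sample_02"]  # placeholder sample names
script_dir = Path("generated_scripts")
script_dir.mkdir(exist_ok=True)

for sample in samples:
    # write a one-step bash script for this sample
    script = script_dir / f"fastqc_{sample}.sh"
    script.write_text(
        "#!/bin/bash\n"
        f"fastqc raw_data/{sample}.fastq.gz -o fastqc_output/\n"
    )
    # submission to the cluster would go here (e.g. via subprocess),
    # once we confirm the scheduler command (qsub vs. sbatch)
```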
For now, I will implement the first `command_wrapper` option and shelve the other idea for a later date.
I think the design of this pipeline could be more modular by utilizing argparse's subcommand feature. Through the use of this functionality, the amount of code needed could drop drastically. Ideally, my vision is to have one main file like `command_wrapper.py` that handles running the entire pipeline, i.e., making multiple subprocess/system calls in the desired order to guarantee everything runs smoothly. An example of what I mean is provided below.

Current Method: The following command has to be manually inserted to run the fastqc program:
Next, in order to run the `multiqc` command, the following has to be manually entered:
Within the new `command_wrapper.py` will be:
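Roughly, and just as an illustrative sketch (the exact flags, directory names, and which pipeline steps get chained are placeholders for now):

```python
import argparse
import subprocess
from pathlib import Path

parser = argparse.ArgumentParser(description="Run the WES QC steps in order")
parser.add_argument("fastq_dir", help="directory of raw fastq files")
parser.add_argument("--out-dir", default="qc_output")
args = parser.parse_args()

Path(args.out_dir).mkdir(exist_ok=True)

# Step 1: FastQC on every raw fastq file (previously one manual command per sample)
for fastq in sorted(Path(args.fastq_dir).glob("*.fastq.gz")):
    subprocess.check_call(["fastqc", str(fastq), "-o", args.out_dir])

# Step 2: aggregate the FastQC reports with MultiQC (previously a second manual command)
subprocess.check_call(["multiqc", args.out_dir, "-o", args.out_dir])
```

A single cluster job would then boil down to something like `python command_wrapper.py <fastq_dir>`.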
Through this new method, one would just submit a job with the `command_wrapper.py` script and the correct arguments, and let Python handle the heavy work of calling jobs, etc. I'm not sure of Dartmouth's cluster policy on system commands, but I'd be surprised if there were an issue with this. Caution: if we move forward with this new design, the number of command-line arguments could grow to an undesirable level, but with a little brainstorming I'm sure this potential issue won't arise.