greenelab / pdx_exomeseq

Pipeline analysis for whole exome sequencing of pancreatic cancer PDX models
MIT License

Modularizing the Pipeline #25

Closed danich1 closed 6 years ago

danich1 commented 6 years ago

I think the design of this pipeline could be made more modular by using argparse subcommands. With this functionality, the amount of code needed could drop drastically. Ideally, my vision is to have one main file, e.g. command_wrapper.py, that handles running the entire pipeline, i.e., issuing multiple subprocess/system calls in the desired order to guarantee everything runs smoothly. An example of what I mean is provided below:

Current method: the following command has to be entered manually to run the FastQC program:

```shell
# FastQC
python scripts/1.run_fastqc.py --command 'fastqc' --data_dir 'data' --output_dir 'results/fastqc_raw' \
    --walltime '01:30:00' --nodes 1 --cores 4
```

Next, in order to run the MultiQC command, this has to be entered manually:

```shell
multiqc results/fastqc_raw/ \
    --force --filename 'results/multiqc_report_raw.html'
```

Proposed method: I suggest combining the above lines into the Python file as a subcommand (quality_control). Then all one has to do is call the program: `python command_wrapper.py quality_control <insert desired args here>`

Within the new command_wrapper.py will be:

```python
# <code to process the arguments>
...
subprocess.call(FASTQC_COMMAND)
subprocess.call(MULTIQC_COMMAND)
```

With this new method, one would just submit a job with the command_wrapper.py script and the correct arguments and let Python handle the heavy work of calling jobs, etc. I'm not sure about Dartmouth's cluster policy on system commands, but I'd be surprised if it were an issue. One caution: if we move forward with this new design, the number of command line arguments could grow to an undesirable level, but with a little brainstorming I'm sure we can keep that from happening.
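To make the proposal concrete, here is a minimal sketch of what the quality_control subcommand in command_wrapper.py could look like. The flag names, defaults, and a `--dry_run` option are illustrative assumptions, not the project's actual interface:

```python
# Hypothetical sketch of command_wrapper.py: a single "quality_control"
# subcommand that chains the FastQC and MultiQC calls shown above.
import argparse
import subprocess


def build_quality_control_commands(args):
    """Translate parsed arguments into the two shell commands to run, in order."""
    fastqc_cmd = ['fastqc', args.data_dir, '--outdir', args.output_dir]
    multiqc_cmd = ['multiqc', args.output_dir, '--force',
                   '--filename', args.report]
    return [fastqc_cmd, multiqc_cmd]


def main(argv=None):
    parser = argparse.ArgumentParser(description='pipeline wrapper sketch')
    subparsers = parser.add_subparsers(dest='command', required=True)

    qc = subparsers.add_parser('quality_control', help='run FastQC then MultiQC')
    qc.add_argument('--data_dir', default='data')
    qc.add_argument('--output_dir', default='results/fastqc_raw')
    qc.add_argument('--report', default='results/multiqc_report_raw.html')
    qc.add_argument('--dry_run', action='store_true',
                    help='print commands instead of executing them')

    args = parser.parse_args(argv)
    for cmd in build_quality_control_commands(args):
        if args.dry_run:
            print(' '.join(cmd))
        else:
            subprocess.check_call(cmd)


if __name__ == '__main__':
    main()
```

Separating command construction from execution also makes the wrapper easy to test without the tools installed.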

gwaybio commented 6 years ago

Thanks for the comments @danich1 - I was not aware of subcommands before, and I'm not sure I completely understand their best use case. My apologies if I'm not exactly following, but I am not sure this specific workflow will work given each command's requirements.

Specifically, the # FastQC step written on lines 15-17 of wes_pipeline.sh acts on each of 120 samples independently. Essentially, the script runs fastqc on all raw input files. The MultiQC step, in contrast, runs once on all samples in a given directory.

Actually, maybe subcommands would work here. The FastQC script would remain more-or-less as is, but I would create a MultiQC subcommand that would take in a different input. Does this make sense?
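The distinction above fits subcommands naturally, since each subparser can declare its own input shape. A rough sketch (argument names are hypothetical): the fastqc subcommand takes many sample files and implies one invocation per sample, while the multiqc subcommand takes a single directory:

```python
# Sketch: two subcommands with different input shapes.
# FastQC runs once per sample file; MultiQC runs once per results directory.
import argparse


def build_commands(args):
    """Return the list of shell commands implied by the parsed subcommand."""
    if args.command == 'fastqc':
        # One fastqc invocation per raw input file
        return [['fastqc', sample, '--outdir', args.output_dir]
                for sample in args.samples]
    if args.command == 'multiqc':
        # A single multiqc invocation over the whole directory
        return [['multiqc', args.input_dir, '--filename', args.report]]
    raise ValueError('unknown subcommand: {}'.format(args.command))


parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers(dest='command', required=True)

fastqc = subparsers.add_parser('fastqc')
fastqc.add_argument('samples', nargs='+', help='raw fastq files, one job each')
fastqc.add_argument('--output_dir', default='results/fastqc_raw')

multiqc = subparsers.add_parser('multiqc')
multiqc.add_argument('input_dir', help='directory of FastQC results')
multiqc.add_argument('--report', default='results/multiqc_report_raw.html')
```

So the FastQC script can stay per-sample while MultiQC gets its own directory-level subcommand, as suggested.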

gwaybio commented 6 years ago

I am brainstorming other ways of cutting down on redundancy in this pipeline as well.

Maybe we can chat next to a whiteboard briefly today?

danich1 commented 6 years ago

Yeah sounds good to me. I'm free anytime before 12pm or after 1:30pm. Just let me know.

gwaybio commented 6 years ago

[image: danich1_whiteboard_brainstorm]

@danich1 and I brainstormed pipeline solutions :point_up_2:

gwaybio commented 6 years ago

We determined that the best way forward at the moment is to migrate the functionality of the individual scripts shown above into the command_wrapper utility using argparse subcommands. The subcommand definitions will live in a separate Python script that is imported into command_wrapper to reduce file complexity.
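The split-file layout could look roughly like the sketch below: one module registers subcommands on a subparsers object, and the wrapper only builds the parser and dispatches. The module name `subcommands.py` and all function names are hypothetical:

```python
# --- subcommands.py (hypothetical module holding subcommand definitions) ---
def run_quality_control(args):
    """Handler for the quality_control subcommand."""
    return 'quality_control on {}'.format(args.data_dir)


def add_quality_control(subparsers):
    """Register the quality_control subcommand on an existing subparsers object."""
    qc = subparsers.add_parser('quality_control')
    qc.add_argument('--data_dir', default='data')
    qc.set_defaults(func=run_quality_control)
    return qc


# --- command_wrapper.py (only wires the pieces together) ---
import argparse
# from subcommands import add_quality_control  # in the real two-file layout


def build_parser():
    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers(dest='command', required=True)
    add_quality_control(subparsers)
    return parser


if __name__ == '__main__':
    args = build_parser().parse_args()
    args.func(args)  # dispatch to whichever handler was registered
```

The `set_defaults(func=...)` pattern keeps the wrapper free of per-command if/else logic as more subcommands are added.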

We also discussed potentially writing a script that generates a bash script for each step of the pipeline for each individual sample. The upside would be less manual feeding of commands to the cluster; the downside would be the difficulty of pausing the pipeline to view intermediate results. While the upside is clear for this project, a new project would still require viewing intermediate files.
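For reference, the shelved generator idea could be sketched as below. The PBS header fields mirror the walltime/nodes/cores flags used earlier in the thread; the file layout and function names are assumptions for illustration:

```python
# Sketch: write one bash submission script per sample for a given pipeline step,
# so jobs can be submitted without typing each command by hand.
import os


def render_job_script(step, sample, command, walltime='01:30:00', cores=4):
    """Return the text of a single-sample bash submission script."""
    return '\n'.join([
        '#!/bin/bash',
        '#PBS -l walltime={}'.format(walltime),
        '#PBS -l nodes=1:ppn={}'.format(cores),
        '# step: {}  sample: {}'.format(step, sample),
        command,
        '',
    ])


def write_job_scripts(step, samples, command_template, out_dir):
    """Write one script per sample; return the paths written."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for sample in samples:
        path = os.path.join(out_dir, '{}_{}.sh'.format(step, sample))
        with open(path, 'w') as handle:
            handle.write(render_job_script(
                step, sample, command_template.format(sample=sample)))
        paths.append(path)
    return paths
```

Because each step writes its own batch of scripts, the pipeline can still be stopped between steps to inspect intermediate files before submitting the next batch.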

For now, I will implement the first command_wrapper option and shelve the other idea for a later date.

gwaybio commented 6 years ago

Hi @danich1 - I've updated the command_wrapper script as we previously discussed. The most recent update is here.

I will also add this update to the pull request discussion.