mskcc / tempo

CCS research pipeline to process WES and WGS TN pairs
https://cmotempo.netlify.app/

BRASS Pipeline Adoption #851

Closed: gongyixiao closed this issue 1 year ago

gongyixiao commented 4 years ago

BRASS (Sanger pipeline): https://dockstore.org/containers/registry.hub.docker.com/sevenbridges/pcawg_sanger_sbg_modified/pcawg_sanger_vc_sbg_modified

anoronh4 commented 3 years ago

I was unable to find a command-line invocation of BRASS through the links in the first comment of this issue. Instead I found https://dockstore.org/containers/quay.io/wtsicgp/dockstore-cgpwgs:2.1.0?tab=info and https://github.com/cancerit/dockstore-cgpwgs

BRASS is structured as a sequence of steps:

  1. input
  2. cover
  3. merge
  4. normcn
  5. group
  6. isize
  7. filter
  8. split
  9. assemble
  10. grass
  11. tabix

The steps can be run one by one (using -p to indicate which step) or all together (no -p). Only input, cover, and assemble can be run in parallel.
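For context, a rough sketch of what those invocations might look like (the -p/-i flags follow the -process/-index convention used across the Sanger cgpit tools; reference-file arguments are abbreviated to `<reference args>` here, and the exact usage should be checked against `brass.pl` inside the wtsicgp image):

```sh
# all steps in one go (no -p): brass.pl iterates input..tabix internally
brass.pl -o brass_out -t tumor.bam -n normal.bam -g genome.fa <reference args>

# or one step at a time with -p; -i selects one unit of work within a step,
# which is how input/cover/assemble can be fanned out
brass.pl -o brass_out -t tumor.bam -n normal.bam -g genome.fa <reference args> -p input -i 1
brass.pl -o brass_out -t tumor.bam -n normal.bam -g genome.fa <reference args> -p input -i 2
brass.pl -o brass_out -t tumor.bam -n normal.bam -g genome.fa <reference args> -p cover -i 1
```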

The dockstore-cgpwgs implementation runs the input and cover steps in parallel first, then runs the rest of the steps in a single invocation, omitting the -p parameter and using an output file from ascat. ascat from ascatNGS is also installed in the wtsicgp/dockstore-cgpwgs:2.1.0 image.

We might break this up into three processes, all run with wtsicgp/dockstore-cgpwgs:2.1.0 (a rough Nextflow sketch follows below):

  1. BRASSprep - run input and cover steps simultaneously as in dockstore-cgpwgs
  2. ascat - this may be reused by other processes we might like to incorporate, including HRDetect. Run as in the dockstore-cgpwgs implementation
  3. BRASS - run remaining steps as in dockstore-cgpwgs

There might be some reference files which we will have to incorporate; I'm not sure whether they are stored inside the docker image or not.
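A minimal Nextflow (DSL1-style) sketch of that three-process layout; process and channel names are hypothetical, the brass.pl/ascat.pl arguments and flag names are heavily abbreviated and not guaranteed exact, and the real processes would still need the reference bundles wired in and the BRASS working directory shared between steps:

```nextflow
// hypothetical sketch, not the final tempo processes
process BRASSprep {
  container "wtsicgp/dockstore-cgpwgs:2.1.0"
  input:
    set val(idSample), file(tumorBam), file(normalBam) from bamPairs4BrassPrep
  output:
    set val(idSample), file("brass_prep") into brassPrepOut
  script:
  """
  # early, parallel-safe steps only, as in dockstore-cgpwgs
  brass.pl -o brass_prep -t ${tumorBam} -n ${normalBam} <reference args> -p input
  brass.pl -o brass_prep -t ${tumorBam} -n ${normalBam} <reference args> -p cover
  """
}

process ascat {
  container "wtsicgp/dockstore-cgpwgs:2.1.0"
  input:
    set val(idSample), file(tumorBam), file(normalBam) from bamPairs4Ascat
  output:
    set val(idSample), file("ascat_out") into ascatOut
  script:
  """
  ascat.pl -o ascat_out -t ${tumorBam} -n ${normalBam} <reference args>
  """
}

process BRASS {
  container "wtsicgp/dockstore-cgpwgs:2.1.0"
  input:
    set val(idSample), file("brass_prep"), file("ascat_out") from brassPrepOut.join(ascatOut)
  output:
    set val(idSample), file("brass_prep/*") into brassOut
  script:
  """
  # no -p: brass.pl picks up where prep left off and runs the remaining steps,
  # pointing at the ASCAT samplestatistics output
  brass.pl -o brass_prep -ss ascat_out/*.samplestatistics.txt <other args>
  """
}
```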

gongyixiao commented 3 years ago

> There might be some reference files which we will have to incorporate; I'm not sure whether they are stored inside the docker image or not.

This might (not) be relevant: https://dockstore.org/containers/quay.io/wtsicgp/dockstore-cgpwgs:1.1.5?tab=files

anoronh4 commented 3 years ago

Update:

| Process | Duration (pcawg_updated) | Duration (pcawg_versionControl) |
| --- | --- | --- |
| runPCAP | 25m | 31m |
| runPCAP | 34m | 1h 2m |
| runAscat | 2h 18m | 1h 49m |
| runBRASS | 7h 12m | 8h 58m |

Memory and CPUs are equal for each docker configuration, and both were run on the same sample BAMs. runPCAP appears twice because it is run on two BAMs in parallel. For the purposes of this test I did not break BRASS up into multiple steps, but I think this gives us an idea of the runtimes anyway.

anoronh4 commented 3 years ago

I was able to dramatically cut down the total elapsed time for BRASS by parallelizing jobs for the input and cover steps (see commit fb9a68e8c7f2c795a6c9c19ce9e3d5fc5eca3f1d), and I ran the remaining steps in a single job (see attached timeline.html). However, I could break the remaining steps in runBRASS up further to get even more parallelization. At this point I'm wondering whether that is overkill, and how much we value a "clean"-looking DAG for presentation purposes. With runBRASSInput and runBRASSCover already taken care of, I could break runBRASS down into 7 Nextflow processes:

- merge
- group
- isize
- normcn (group, isize, and normcn can run in parallel after merge)
- filter
- split (filter and split can be run one after the other in the same process)
- assemble (can be broken up into 24 parallel processes, similar to the way `input` and `cover` were optimized; see the sketch below)
- grass
- tabix (grass and tabix can be run one after the other in the same process)

@gongyixiao @stevekm , do you think it's worth it to further break down the steps? timeline.html.zip
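If we did go further, the assemble fan-out might look something like the sketch below (names are hypothetical; this assumes -p assemble honours -i the same way input and cover do, which is worth confirming, and it glosses over how the shared BRASS working directory gets staged between processes):

```nextflow
// hypothetical sketch: fan the assemble step out over 24 indices
assembleIndices = Channel.from(1..24)

process runBRASSAssemble {
  container "wtsicgp/dockstore-cgpwgs:2.1.0"
  input:
    set val(idSample), file("brass_dir") from brassAfterSplit
    each idx from assembleIndices
  output:
    val(idSample) into assembleDone
  script:
  """
  brass.pl -o brass_dir -p assemble -i ${idx} <other args>
  """
}

// a downstream process would collect all 24 completions per sample,
// then run grass and tabix together in one job
```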

Sidenote: ascat is also made up of three steps, and I was able to parallelize the first step, which also improves processing time.

gongyixiao commented 3 years ago

I think this can be left for further optimization. Let's get everything running and then do the optimization.
