mskcc / tempo

CCS research pipeline to process WES and WGS TN pairs
https://cmotempo.netlify.app/

BRASS Pipeline Adoption #851

Closed: gongyixiao closed this issue 1 year ago

gongyixiao commented 4 years ago

BRASS (Sanger pipeline): https://dockstore.org/containers/registry.hub.docker.com/sevenbridges/pcawg_sanger_sbg_modified/pcawg_sanger_vc_sbg_modified

anoronh4 commented 3 years ago

I was unable to find a command-line invocation of BRASS through the links in the first comment of this issue. Instead I found https://dockstore.org/containers/quay.io/wtsicgp/dockstore-cgpwgs:2.1.0?tab=info and https://github.com/cancerit/dockstore-cgpwgs

BRASS is structured as a sequence of steps:

  1. input
  2. cover
  3. merge
  4. normcn
  5. group
  6. isize
  7. filter
  8. split
  9. assemble
  10. grass
  11. tabix

The steps can be run one by one (using -p to indicate which step) or all together (no -p). Only input, cover, and assemble can be run in parallel.
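For context, a rough sketch of what those invocations might look like (the -p/-i flags follow the -process/-index convention used across the Sanger cgpit tools; reference-file arguments are abbreviated to `<reference args>` here, and the exact usage should be checked against `brass.pl` inside the wtsicgp image):

```sh
# all steps in one go (no -p): brass.pl iterates input..tabix internally
brass.pl -o brass_out -t tumor.bam -n normal.bam -g genome.fa <reference args>

# or one step at a time with -p; -i selects one unit of work within a step,
# which is how input/cover/assemble can be fanned out
brass.pl -o brass_out -t tumor.bam -n normal.bam -g genome.fa <reference args> -p input -i 1
brass.pl -o brass_out -t tumor.bam -n normal.bam -g genome.fa <reference args> -p input -i 2
brass.pl -o brass_out -t tumor.bam -n normal.bam -g genome.fa <reference args> -p cover -i 1
```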

The dockstore-cgpwgs implementation runs the input and cover steps in parallel first, then runs the rest of the steps in a single invocation, omitting the -p parameter and using an output file from ascat. ascat from ascatNGS is also installed in the wtsicgp/dockstore-cgpwgs:2.1.0 image.

We might break this up into three processes, all run with wtsicgp/dockstore-cgpwgs:2.1.0 (a rough Nextflow sketch follows below):

  1. BRASSprep - run input and cover steps simultaneously as in dockstore-cgpwgs
  2. ascat - this may be reused by other processes we might like to incorporate, including HRDetect. Run as in the dockstore-cgpwgs implementation
  3. BRASS - run remaining steps as in dockstore-cgpwgs

There might be some reference files which we will have to incorporate; I'm not sure whether they are stored inside the docker image or not.
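A minimal Nextflow (DSL1-style) sketch of that three-process layout; process and channel names are hypothetical, the brass.pl/ascat.pl arguments and flag names are heavily abbreviated and not guaranteed exact, and the real processes would still need the reference bundles wired in and the BRASS working directory shared between steps:

```nextflow
// hypothetical sketch, not the final tempo processes
process BRASSprep {
  container "wtsicgp/dockstore-cgpwgs:2.1.0"
  input:
    set val(idSample), file(tumorBam), file(normalBam) from bamPairs4BrassPrep
  output:
    set val(idSample), file("brass_prep") into brassPrepOut
  script:
  """
  # early, parallel-safe steps only, as in dockstore-cgpwgs
  brass.pl -o brass_prep -t ${tumorBam} -n ${normalBam} <reference args> -p input
  brass.pl -o brass_prep -t ${tumorBam} -n ${normalBam} <reference args> -p cover
  """
}

process ascat {
  container "wtsicgp/dockstore-cgpwgs:2.1.0"
  input:
    set val(idSample), file(tumorBam), file(normalBam) from bamPairs4Ascat
  output:
    set val(idSample), file("ascat_out") into ascatOut
  script:
  """
  ascat.pl -o ascat_out -t ${tumorBam} -n ${normalBam} <reference args>
  """
}

process BRASS {
  container "wtsicgp/dockstore-cgpwgs:2.1.0"
  input:
    set val(idSample), file("brass_prep"), file("ascat_out") from brassPrepOut.join(ascatOut)
  output:
    set val(idSample), file("brass_prep/*") into brassOut
  script:
  """
  # no -p: brass.pl picks up where prep left off and runs the remaining steps,
  # pointing at the ASCAT samplestatistics output
  brass.pl -o brass_prep -ss ascat_out/*.samplestatistics.txt <other args>
  """
}
```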

gongyixiao commented 3 years ago

> There might be some reference files which we will have to incorporate; I'm not sure whether they are stored inside the docker image or not.

This might (not) be relevant: https://dockstore.org/containers/quay.io/wtsicgp/dockstore-cgpwgs:1.1.5?tab=files

anoronh4 commented 3 years ago

Update:

| Process | Duration (pcawg_updated) | Duration (pcawg_versionControl) |
| --- | --- | --- |
| runPCAP | 25m | 31m |
| runPCAP | 34m | 1h 2m |
| runAscat | 2h 18m | 1h 49m |
| runBRASS | 7h 12m | 8h 58m |

Memory and CPUs are equal for each docker configuration, and both were run on the same sample BAMs. runPCAP appears twice because it is run on two BAMs in parallel. For the purposes of this test I did not break BRASS up into multiple steps, but I think this gives us an idea of the runtimes anyway.

anoronh4 commented 3 years ago

I was able to dramatically cut down the total elapsed time for BRASS by parallelizing jobs for the input and cover steps (see commit fb9a68e8c7f2c795a6c9c19ce9e3d5fc5eca3f1d), and I ran the remaining steps in a single job (see attached timeline.html). However, I could break the remaining steps in runBRASS up further to get even more parallelization. At this point I'm wondering whether that is overkill, and how much we value a "clean"-looking DAG for presentation purposes. With runBRASSInput and runBRASSCover already taken care of, I could break runBRASS down into 7 Nextflow processes:

- merge
- group
- isize
- normcn (group, isize, and normcn can run in parallel after merge)
- filter
- split (filter and split can be run one after the other in the same process)
- assemble (can be broken up into 24 parallel processes, similar to the way `input` and `cover` were optimized; see the sketch below)
- grass
- tabix (grass and tabix can be run one after the other in the same process)

@gongyixiao @stevekm , do you think it's worth it to further break down the steps? timeline.html.zip
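If we did go further, the assemble fan-out might look something like the sketch below (names are hypothetical; this assumes -p assemble honours -i the same way input and cover do, which is worth confirming, and it glosses over how the shared BRASS working directory gets staged between processes):

```nextflow
// hypothetical sketch: fan the assemble step out over 24 indices
assembleIndices = Channel.from(1..24)

process runBRASSAssemble {
  container "wtsicgp/dockstore-cgpwgs:2.1.0"
  input:
    set val(idSample), file("brass_dir") from brassAfterSplit
    each idx from assembleIndices
  output:
    val(idSample) into assembleDone
  script:
  """
  brass.pl -o brass_dir -p assemble -i ${idx} <other args>
  """
}

// a downstream process would collect all 24 completions per sample,
// then run grass and tabix together in one job
```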

Sidenote: ascat is also made up of three steps, and I was able to parallelize the first step, which also improves processing time.

gongyixiao commented 3 years ago

I think this can be left for further optimization. Let's get everything running and then do the optimization.
