luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License

Joint somatic variant calling very slow #162

Closed: SebastianHollizeck closed this issue 3 years ago

SebastianHollizeck commented 3 years ago

Hey,

I am very interested in the joint somatic variant calling capabilities of octopus, as there is currently a significant lack of methods in this space.

I have ten roughly 160x WGS cancer samples with one corresponding normal, and I would like to call variants on them jointly.

I just used the command I saw in the user guide:

octopus -R /data/reference/dawson_labs/genomes/GRCh38/GCA_000001405.15_GRCh38_full_analysis_set.fa -I /path/to/crams/*.postprocessed.sorted.cram --normal-sample normal --threads 40 -o ~/test_octopus/out.vcf.gz

This gave me the following start of the log:

[2021-04-08 15:47:35] <INFO> ------------------------------------------------------------------------
[2021-04-08 15:47:35] <INFO> octopus v0.7.1
[2021-04-08 15:47:35] <INFO> Copyright (c) 2015-2020 University of Oxford
[2021-04-08 15:47:35] <INFO> ------------------------------------------------------------------------
[2021-04-08 16:01:14] <INFO> Done initialising calling components in 13m 39s
[2021-04-08 16:01:14] <INFO> Detected 11 samples: "tumour1" "tumour2" "tumour3" "tumour4" "tumour5" "tumour6" "tumour7" "tumour8" "tumour9" "tumour10" "normal"
[2021-04-08 16:01:14] <INFO> Invoked calling model: cancer
[2021-04-08 16:01:14] <INFO> Processing 3,209,457,928bp with 40 threads (40 cores detected)
[2021-04-08 16:01:14] <INFO> Writing filtered calls to "/home/shollizeck/test_octopus/out.vcf.gz"
[2021-04-08 16:04:17] <INFO> ------------------------------------------------------------------------------------
[2021-04-08 16:04:17] <INFO>            current            |                   |     time      |     estimated   
[2021-04-08 16:04:17] <INFO>            position           |     completed     |     taken     |     ttc         
[2021-04-08 16:04:17] <INFO> ------------------------------------------------------------------------------------
[2021-04-08 16:35:09] <INFO>                chr1:2551629             0.1%          30m 52s                3w
[2021-04-08 17:15:42] <INFO>                chr1:6360683             0.2%           1h 11m                4w

Now I understand this is a lot of data, but I really don't want to wait 3 weeks for the results (and my HPC will not even allow default jobs to run that long :D).

Is there anything else I could modify, apart from "parallelising" the calling over the individual chromosomes or even smaller regions? And if I do end up subsetting the chromosomes further, are there any artifacts to be expected at the borders of the call regions?

I would really love to have another option in this space.

Cheers, Sebastian

dancooke commented 3 years ago

I have ten roughly 160x WGS cancer samples with one corresponding normal, and I would like to call variants on them jointly. Now I understand this is a lot of data, but I really don't want to wait 3 weeks for the results (and my HPC will not even allow default jobs to run that long :D).

That is a lot of data! However, I wouldn't be too concerned about the estimated ttc from the first few ticks; these tend to be overestimates because the thread load balancing is suboptimal early in the run. At around 10% completion I'd expect the estimated ttc to be more accurate. Having said that, if you're not seeing the estimated ttc decrease by around 2% completion, then I'd be more worried.

Some things to consider to improve runtime:

- Check how the octopus binary was built: pre-built binaries may not target your cluster's CPU architecture, so building from source on the machines you run on can help.
- Set the somatic VAF thresholds (--min-expected-somatic-frequency and --min-credible-somatic-frequency) according to the somatic allele frequencies you actually expect, rather than leaving them at the defaults.

If none of the above helps, then you can parallelise the run by chromosome (see the sketch below). This shouldn't produce any more windowing artefacts than doing a single run, but it may not produce identical calls, as the algorithm uses read statistics derived from all input regions at some points, which can affect calling. However, the differences will likely be minor.
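For reference, a per-chromosome split could look something like the sketch below (untested; it reuses the paths from your command, and the chromosome list, thread count, and merge step are placeholders). In practice each iteration would be submitted as its own cluster job, and the per-chromosome call sets concatenated afterwards with bcftools:

# One octopus run per chromosome via -T/--regions; submit each as a separate job in practice.
for CHR in chr{1..22} chrX chrY; do
    octopus \
        -R /data/reference/dawson_labs/genomes/GRCh38/GCA_000001405.15_GRCh38_full_analysis_set.fa \
        -I /path/to/crams/*.postprocessed.sorted.cram \
        --normal-sample normal \
        -T ${CHR} \
        --threads 40 \
        -o ~/test_octopus/out.${CHR}.vcf.gz
done
# Concatenate the per-chromosome call sets in karyotype order once all runs have finished.
bcftools concat -O z -o ~/test_octopus/out.merged.vcf.gz \
    $(for CHR in chr{1..22} chrX chrY; do echo ~/test_octopus/out.${CHR}.vcf.gz; done)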

SebastianHollizeck commented 3 years ago

Hey,

Thanks so much for your quick response. The binary is actually in a Singularity container built from a Docker container, which should technically take the architecture issue out of the equation?!

Anyway, here is the continuation of the log:

[2021-04-08 17:52:05] <INFO>               chr1:10039053             0.3%           1h 47m             3w 6d
[2021-04-08 18:28:06] <INFO>               chr1:12023355             0.4%           2h 23m             3w 5d
[2021-04-08 19:10:37] <INFO>               chr1:15487428             0.5%            3h 6m             3w 6d
[2021-04-08 19:53:26] <INFO>               chr1:19661808             0.6%           3h 49m             3w 6d
[2021-04-08 20:29:16] <INFO>               chr1:22842014             0.7%           4h 24m             3w 6d
[2021-04-08 21:05:18] <INFO>               chr1:26121093             0.8%            5h 1m             3w 5d
[2021-04-08 21:40:50] <INFO>               chr1:28364361             0.9%           5h 36m             3w 5d
[2021-04-08 22:20:11] <INFO>               chr1:32345255             1.0%           6h 15m             3w 5d
[2021-04-08 23:01:54] <INFO>               chr1:35511408             1.1%           6h 57m             3w 5d
[2021-04-08 23:42:56] <INFO>               chr1:38382515             1.2%           7h 38m             3w 5d
[2021-04-09 00:21:48] <INFO>               chr1:41449127             1.3%           8h 17m             3w 5d
[2021-04-09 01:04:05] <INFO>               chr1:44977241             1.4%           8h 59m             3w 5d
[2021-04-09 01:41:43] <INFO>               chr1:47986856             1.5%           9h 37m             3w 5d
[2021-04-09 02:25:25] <INFO>               chr1:50929307             1.6%          10h 21m             3w 5d
[2021-04-09 03:04:20] <INFO>               chr1:55082033             1.7%              11h             3w 5d
[2021-04-09 03:51:13] <INFO>               chr1:58091344             1.8%          11h 46m             3w 6d
[2021-04-09 04:41:18] <INFO>               chr1:61115513             1.9%          12h 37m             3w 6d
[2021-04-09 05:23:21] <INFO>               chr1:64317330             2.0%          13h 19m             3w 6d
[2021-04-09 06:07:23] <INFO>               chr1:67198669             2.1%           14h 3m             3w 6d
[2021-04-09 06:50:50] <INFO>               chr1:70528613             2.2%          14h 46m             3w 6d
[2021-04-09 07:31:58] <INFO>               chr1:73631048             2.3%          15h 27m             3w 6d
[2021-04-09 08:16:45] <INFO>               chr1:76824466             2.4%          16h 12m             3w 6d
[2021-04-09 09:00:04] <INFO>               chr1:79556375             2.5%          16h 55m             3w 6d

Now that we are 17h in at 2.5%, the estimate does not seem to be dropping.

I will try increasing the VAF thresholds as you suggested and report back.

dancooke commented 3 years ago

The binary is actually in a Singularity container built from a Docker container, which should technically take the architecture issue out of the equation?!

The Docker image is built for the Haswell architecture, so you'll get AVX2. If your cluster has more modern CPUs with AVX-512, then you will likely see some speedup by compiling from source. I'm also not sure what (if any) overhead there is from using Singularity over the raw binary.
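If you want to check what your compute nodes actually support before going down that route, something along these lines (run on a compute node rather than the login node) will list the available AVX extensions:

# Lists the AVX flags advertised by the CPU (e.g. avx2, avx512f, avx512bw, ...).
grep -o 'avx[0-9a-z_]*' /proc/cpuinfo | sort -u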

I was thinking that an alternative strategy might be to call each tumour independently and then do the joint call using only the passing variants from each tumour. Also, you should be using random forest filtering (the latest forests can be found here). So basically something like:

# First pass: call each tumour against the normal on its own, with random forest filtering.
for N in `seq 1 10`; do
    octopus \
        -R /data/reference/dawson_labs/genomes/GRCh38/GCA_000001405.15_GRCh38_full_analysis_set.fa \
        -I /path/to/crams/normal.postprocessed.sorted.cram \
           /path/to/crams/tumour${N}.postprocessed.sorted.cram \
        --normal-sample normal \
        --forest germline.v0.7.2.forest \
        --somatic-forest somatic.v0.7.2.forest \
        --threads 40 \
        -o ~/test_octopus/tumour${N}_out.vcf.gz
done

# Second pass: joint call across all samples, restricted to the candidates found above.
octopus \
    -R /data/reference/dawson_labs/genomes/GRCh38/GCA_000001405.15_GRCh38_full_analysis_set.fa \
    -I /path/to/crams/*.postprocessed.sorted.cram \
    --normal-sample normal \
    --forest germline.v0.7.2.forest \
    --somatic-forest somatic.v0.7.2.forest \
    --disable-denovo-variant-discovery \
    --source-candidates ~/test_octopus/tumour*_out.vcf.gz \
    --threads 40 \
    -o ~/test_octopus/out.vcf.gz

Obviously you can submit the single-tumour runs as separate jobs to your cluster (see the sketch below). I would probably try both approaches on a small test region first and compare outputs and runtimes.
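For the cluster side, the single-tumour runs map naturally onto a job array. A rough Slurm sketch (the resource requests and the script name are placeholders, not something I've tested):

#!/bin/bash
# single_tumour_octopus.sbatch (hypothetical name) - one array task per tumour sample.
#SBATCH --job-name=octopus_tumour
#SBATCH --array=1-10
#SBATCH --cpus-per-task=40
#SBATCH --mem=60G
#SBATCH --time=2-00:00:00

N=${SLURM_ARRAY_TASK_ID}
octopus \
    -R /data/reference/dawson_labs/genomes/GRCh38/GCA_000001405.15_GRCh38_full_analysis_set.fa \
    -I /path/to/crams/normal.postprocessed.sorted.cram \
       /path/to/crams/tumour${N}.postprocessed.sorted.cram \
    --normal-sample normal \
    --forest germline.v0.7.2.forest \
    --somatic-forest somatic.v0.7.2.forest \
    --threads ${SLURM_CPUS_PER_TASK} \
    -o ~/test_octopus/tumour${N}_out.vcf.gz

Submit it with sbatch single_tumour_octopus.sbatch and run the joint --source-candidates step once all array tasks have finished.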

SebastianHollizeck commented 3 years ago

Hey,

I am back with some results. As you suggested, I added the forest parameters and reduced the expected VAF.

That gives a command line like this:

singularity run octopus_0.7.2.sif \
    -R /data/reference/dawson_labs/genomes/GRCh38/GCA_000001405.15_GRCh38_full_analysis_set.fa \
    -I /dawson_genomics/Other/syntheticSequencing/wgs/GRCh38/mutatedBams/sim/Bam/*_mutated.postprocessed.sorted.cram \
       /dawson_genomics/Other/syntheticSequencing/wgs/GRCh38/mutatedBams/sim-*/Bam/*_mutated.postprocessed.sorted.cram \
    --normal-sample sim_mutated \
    --threads 40 \
    -o ~/test_octopus/out.vcf.gz \
    --min-expected-somatic-frequency 0.05 \
    --min-credible-somatic-frequency 0.005 \
    --forest-model /data/reference/dawson_labs/octopus/germline.v0.7.2.forest.gz \
    --somatic-forest-model /data/reference/dawson_labs/octopus/somatic.v0.7.2.forest.gz

This actually sped the computation up significantly; however, the program ended with an error:

[2021-04-09 09:54:23] <INFO> ------------------------------------------------------------------------
[2021-04-09 09:54:23] <INFO> octopus v0.7.1
[2021-04-09 09:54:23] <INFO> Copyright (c) 2015-2020 University of Oxford
[2021-04-09 09:54:23] <INFO> ------------------------------------------------------------------------
[2021-04-09 10:12:11] <INFO> Done initialising calling components in 17m 47s
[2021-04-09 10:12:11] <INFO> Detected 11 samples: "sim-a_mutated" "sim-b_mutated" "sim-c_mutated" "sim-d_mutated" "sim-e_mutated" "sim-f_mutated" "sim-g_mutated" "sim-h_mutated" "sim-i_mutated" "sim-j_mutated" "sim_mutated"
[2021-04-09 10:12:11] <INFO> Invoked calling model: cancer
[2021-04-09 10:12:11] <INFO> Processing 3,209,457,928bp with 40 threads (40 cores detected)
[2021-04-09 10:12:11] <INFO> Writing filtered calls to "/home/shollizeck/test_octopus/out.vcf.gz"
[2021-04-09 10:15:15] <INFO> ------------------------------------------------------------------------------------
[2021-04-09 10:15:15] <INFO>            current            |                   |     time      |     estimated   
[2021-04-09 10:15:15] <INFO>            position           |     completed     |     taken     |     ttc         
[2021-04-09 10:15:15] <INFO> ------------------------------------------------------------------------------------
[2021-04-09 10:19:03] <INFO>                chr1:3289546             0.1%           3m 47s            2d 14h
[2021-04-09 10:25:39] <INFO>                chr1:6310920             0.2%          10m 23s            4d 13h
[2021-04-09 10:32:20] <INFO>                chr1:9474721             0.3%           17m 4s            4d 14h
[2021-04-09 10:40:17] <INFO>               chr1:13081291             0.4%           25m 1s            5d 11h
[2021-04-09 10:47:02] <INFO>               chr1:16063082             0.5%          31m 46s             5d 5h
[2021-04-09 10:53:42] <INFO>               chr1:19294270             0.6%          38m 26s             5d 1h
[2021-04-09 11:00:24] <INFO>               chr1:22578855             0.7%           45m 8s            4d 23h
[2021-04-09 11:07:03] <INFO>               chr1:25757099             0.8%          51m 47s            4d 21h
[2021-04-09 11:13:46] <INFO>               chr1:28880549             0.9%          58m 30s            4d 20h
[2021-04-09 11:20:27] <INFO>               chr1:32185181             1.0%            1h 5m            4d 19h
[2021-04-09 11:27:04] <INFO>               chr1:35335123             1.1%           1h 11m            4d 18h
[2021-04-09 11:33:45] <INFO>               chr1:38718814             1.2%           1h 18m            4d 18h
....................................................................................................................................................................
[2021-04-13 12:47:30] <INFO>                chrY:9871073            94.7%            4d 2h            5h 33m
[2021-04-13 12:52:46] <INFO>               chrY:12590501            94.8%            4d 2h            5h 27m
[2021-04-13 12:59:18] <INFO>               chrY:15744702            94.9%            4d 2h            5h 20m
[2021-04-13 13:05:22] <INFO>               chrY:19093600            95.0%            4d 2h            5h 14m
[2021-04-13 13:11:40] <INFO>               chrY:22317854            95.1%            4d 2h             5h 8m
[2021-04-13 13:19:00] <INFO>               chrY:25659938            95.2%            4d 3h             5h 2m
[2021-04-13 13:19:47] <INFO>               chrY:51668024            96.0%            4d 3h             4h 9m
[2021-04-13 13:19:48] <INFO>               chrY:56693245            96.1%            4d 3h            3h 56m
[2021-04-13 13:22:04] <INFO> chr14_KI270722v1_random:182982         96.2%            4d 3h            3h 50m
[2021-04-13 13:27:00] <INFO> chr22_KI270733v1_random:79226          96.3%            4d 3h            3h 44m
[2021-04-13 13:29:57] <INFO>     chrUn_KI270743v1:152613            96.4%            4d 3h            3h 37m
[2021-04-13 13:31:37] <INFO>                           -             100%            4d 3h                 -
[2021-04-13 13:32:07] <INFO> Starting Call Set Refinement (CSR) filtering
[2021-04-13 13:32:10] <INFO> Removed 914 temporary files
[2021-04-13 13:32:10] <EROR> A program error has occurred:
[2021-04-13 13:32:10] <EROR> 
[2021-04-13 13:32:10] <EROR>     Encountered an exception during calling 'std::bad_alloc'. This means
[2021-04-13 13:32:10] <EROR>     there is a bug and your results are untrustworthy.
[2021-04-13 13:32:10] <EROR> 
[2021-04-13 13:32:10] <EROR> To help resolve this error run in debug mode and send the log file to
[2021-04-13 13:32:10] <EROR> https://github.com/luntergroup/octopus/issues.
[2021-04-13 13:32:10] <INFO> ------------------------------------------------------------------------

And in contrast to the first run, the efficiency of the Slurm job was much worse (lots of unused CPU):

Job ID: 7357880
Cluster: rosalind
User/Group: shollizeck@petermac.org.au/shollizeck
State: FAILED (exit code 1)
Nodes: 1
Cores per node: 40
CPU Utilized: 31-06:40:32
CPU Efficiency: 18.84% of 166-01:19:20 core-walltime
Job Wall-clock time: 4-03:37:59
Memory Utilized: 18.46 GB
Memory Efficiency: 30.76% of 60.00 GB

I don't know whether this is relevant for debugging or not.

I also have to say that this is a simulated dataset, which we designed to develop workflows for joint somatic variant calling. Could this be interfering with the method?

I am willing to let it run again with debugging enabled, but I am worried that the log file might be too big with that much data. Is there anything I can adjust so it doesn't create gigabytes worth of logs? I can obviously run this on a subset of samples first, but as we have real data of a similar size, the end goal is to run at least 8 tumour samples at the same time.

Cheers, Sebastian

dancooke commented 3 years ago

This actually sped the computation up significantly; however, the program ended with an error

Ah, this looks to be a problem specific to the Docker/Singularity version, as others (#158, #163) have reported the same issue on different, much smaller, datasets. I haven't had time to investigate yet, but my first guess would be that insufficient memory has been allocated to the container.

And in contrast to the first run, the efficiency of the Slurm job was much worse (lots of unused CPU)

Hmm, yes, that is pretty poor CPU usage. It may be worth increasing the read buffer memory available to each thread; try setting --target-read-buffer-memory to 30GB.
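i.e. something along these lines (a sketch only: keep the forests and VAF thresholds from your previous command as well, and make sure the Slurm job/container is actually allocated at least that much memory):

# Sketch: larger read buffer to keep the worker threads busy; the other options stay as before.
singularity run octopus_0.7.2.sif \
    -R /data/reference/dawson_labs/genomes/GRCh38/GCA_000001405.15_GRCh38_full_analysis_set.fa \
    -I /dawson_genomics/Other/syntheticSequencing/wgs/GRCh38/mutatedBams/sim/Bam/*_mutated.postprocessed.sorted.cram \
       /dawson_genomics/Other/syntheticSequencing/wgs/GRCh38/mutatedBams/sim-*/Bam/*_mutated.postprocessed.sorted.cram \
    --normal-sample sim_mutated \
    --target-read-buffer-memory 30GB \
    --threads 40 \
    -o ~/test_octopus/out.vcf.gz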

I also have to say that this is a simulated dataset, which we designed to develop workflows for joint somatic variant calling. Could this be interfering with the method?

It can, if your spike-in method doesn't account for germline haplotypes or alignment error, as you can end up with an unrealistic haplotype structure, which Octopus tries to model. These were the two main issues addressed with the synthetic tumours in the Octopus paper.

I am willing to let it run again with debugging enabled

I don't think that's necessary, as this problem appears to be specific to the Docker/Singularity build; I'll try to investigate on smaller datasets. In the meantime, I'd recommend installing Octopus from source if possible - this may also improve runtimes. In addition, for future runs you can add the --keep-unfiltered-calls option. This saves an unfiltered copy of the calls, which is helpful if the run fails during filtering for whatever reason (as is the case here).
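For reference, a from-source build is roughly a standard CMake build. This is only a sketch, assuming a recent compiler plus Boost and htslib are available on your cluster; the repository also has an install script that wraps these steps, so check the README for the exact invocation:

# Generic CMake build sketch; building on the machine that will run octopus lets the
# compiler use the architecture-specific instructions discussed above.
git clone https://github.com/luntergroup/octopus.git
cd octopus
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j 8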

dancooke commented 3 years ago

The exception is due to the forest files being provided in compressed form - they need to be decompressed first.
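For example, using the paths from your earlier command:

# Decompress the forests once, then point octopus at the uncompressed files.
gunzip /data/reference/dawson_labs/octopus/germline.v0.7.2.forest.gz
gunzip /data/reference/dawson_labs/octopus/somatic.v0.7.2.forest.gz
# ...then in the octopus command:
#   --forest-model /data/reference/dawson_labs/octopus/germline.v0.7.2.forest
#   --somatic-forest-model /data/reference/dawson_labs/octopus/somatic.v0.7.2.forest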