Closed: SebastianHollizeck closed this issue 3 years ago
I have ten roughly 160x WGS cancer samples with one corresponding normal and would like to variant call them. Now I understand this is a lot of data, but I really don't want to wait 3 weeks for the results (and my HPC will not even allow default jobs to run that long :D )
That is a lot of data! However, I wouldn't be too concerned with the estimated ttc from the first few ticks - these tend to be over-estimates due to the thread load balancing being suboptimal early in the run. At around 10% completion I'd expect the estimated ttc to be more accurate. Having said that, if you're not seeing the estimated ttc decrease by around 2% completion then I'd be more worried.
Some things to consider to improve runtime:
--min-expected-somatic-frequency and --min-credible-somatic-frequency. The defaults for these are 0.01 and 0.005, respectively. Basically, this means that the algorithm will try to call somatic mutations down to 0.5% VAF. As a result, the candidate generator will produce a lot of spurious candidates due to sequencing/mapping error, increasing runtime. For multi-focal tumour sequencing, you may not need this level of sensitivity, so increasing these options to, say, 0.05 and 0.01 will probably save a lot in terms of runtime but not hurt accuracy too much. Note that these thresholds apply to each tumour sample independently for candidate discovery, and all discovered mutations are considered in all samples. So if a somatic mutation is present <1% in one tumour sample but >1% in another, and --min-expected-somatic-frequency=0.01, then the algorithm will consider it in all tumour samples.

If none of the above helps, then you can parallelise the run by chromosome (sketched below). This shouldn't produce any more windowing artefacts than doing a single run, but may not produce identical calls as the algorithm uses read statistics derived from all input regions at some points that can affect calling. However, the difference will likely be minor.
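For concreteness, a rough sketch of what parallelising by chromosome could look like - the reference, CRAM paths, thread count, and output names below are placeholders rather than anything from this thread:

# One octopus job per chromosome via the -T/--regions option; each iteration
# could equally be submitted as its own cluster job rather than run in a loop.
for CHR in chr{1..22} chrX chrY; do
    octopus \
        -R reference.fa \
        -I normal.cram tumour1.cram tumour2.cram \
        --normal-sample normal \
        --min-expected-somatic-frequency 0.05 \
        --min-credible-somatic-frequency 0.01 \
        -T ${CHR} \
        --threads 8 \
        -o octopus_${CHR}.vcf.gz
done
# Concatenate the per-chromosome VCFs in genomic order, e.g. with bcftools
bcftools concat -O z -o octopus_all.vcf.gz \
    $(for CHR in chr{1..22} chrX chrY; do echo octopus_${CHR}.vcf.gz; done)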
Hey,
Thanks so much for your quick response. The binary is actually in a Singularity container built from a Docker container, which technically should take the architecture issue out of it?!
Anyway, this is the continuation of the log:
[2021-04-08 17:52:05] <INFO> chr1:10039053 0.3% 1h 47m 3w 6d
[2021-04-08 18:28:06] <INFO> chr1:12023355 0.4% 2h 23m 3w 5d
[2021-04-08 19:10:37] <INFO> chr1:15487428 0.5% 3h 6m 3w 6d
[2021-04-08 19:53:26] <INFO> chr1:19661808 0.6% 3h 49m 3w 6d
[2021-04-08 20:29:16] <INFO> chr1:22842014 0.7% 4h 24m 3w 6d
[2021-04-08 21:05:18] <INFO> chr1:26121093 0.8% 5h 1m 3w 5d
[2021-04-08 21:40:50] <INFO> chr1:28364361 0.9% 5h 36m 3w 5d
[2021-04-08 22:20:11] <INFO> chr1:32345255 1.0% 6h 15m 3w 5d
[2021-04-08 23:01:54] <INFO> chr1:35511408 1.1% 6h 57m 3w 5d
[2021-04-08 23:42:56] <INFO> chr1:38382515 1.2% 7h 38m 3w 5d
[2021-04-09 00:21:48] <INFO> chr1:41449127 1.3% 8h 17m 3w 5d
[2021-04-09 01:04:05] <INFO> chr1:44977241 1.4% 8h 59m 3w 5d
[2021-04-09 01:41:43] <INFO> chr1:47986856 1.5% 9h 37m 3w 5d
[2021-04-09 02:25:25] <INFO> chr1:50929307 1.6% 10h 21m 3w 5d
[2021-04-09 03:04:20] <INFO> chr1:55082033 1.7% 11h 3w 5d
[2021-04-09 03:51:13] <INFO> chr1:58091344 1.8% 11h 46m 3w 6d
[2021-04-09 04:41:18] <INFO> chr1:61115513 1.9% 12h 37m 3w 6d
[2021-04-09 05:23:21] <INFO> chr1:64317330 2.0% 13h 19m 3w 6d
[2021-04-09 06:07:23] <INFO> chr1:67198669 2.1% 14h 3m 3w 6d
[2021-04-09 06:50:50] <INFO> chr1:70528613 2.2% 14h 46m 3w 6d
[2021-04-09 07:31:58] <INFO> chr1:73631048 2.3% 15h 27m 3w 6d
[2021-04-09 08:16:45] <INFO> chr1:76824466 2.4% 16h 12m 3w 6d
[2021-04-09 09:00:04] <INFO> chr1:79556375 2.5% 16h 55m 3w 6d
Now that we are 17h in at 2.5%, it does not seem like the estimate actually drops.
I will try to increase the VAF thresholds as you suggested and report back.
The binary is actually in a Singularity container built from a Docker container, which technically should take the architecture issue out of it?!
The Docker image is built targeting a Haswell architecture, so you'll get AVX2. If your cluster has more modern CPUs with AVX-512 then you will likely see some speedup by compiling from source. I'm also not sure what (if any) latency there is from using Singularity over the raw binary.
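If you do compile from source, the general shape is something like the following; the install script name is as I recall from the project README, so treat it as an assumption and check the current installation docs for the exact steps and options:

git clone https://github.com/luntergroup/octopus.git
cd octopus
./scripts/install.py   # see the installation docs for build options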
I was thinking an alternative strategy might be to call each tumour independently and then do the joint call using only passing variants from each tumour. Also, you should be using random forest filtering (the latest forests can be found here). So basically something like:
for N in `seq 1 10`; do
    octopus \
        -R /data/reference/dawson_labs/genomes/GRCh38/GCA_000001405.15_GRCh38_full_analysis_set.fa \
        -I /path/to/crams/normal.postprocessed.sorted.cram \
           /path/to/crams/tumour${N}.postprocessed.sorted.cram \
        --normal-sample normal \
        --forest germline.v0.7.2.forest \
        --somatic-forest somatic.v0.7.2.forest \
        --threads 40 \
        -o ~/test_octopus/tumour${N}_out.vcf.gz
done
$ octopus \
    -R /data/reference/dawson_labs/genomes/GRCh38/GCA_000001405.15_GRCh38_full_analysis_set.fa \
    -I /path/to/crams/*.postprocessed.sorted.cram \
    --normal-sample normal \
    --forest germline.v0.7.2.forest \
    --somatic-forest somatic.v0.7.2.forest \
    --disable-denovo-variant-discovery \
    --source-candidates ~/test_octopus/tumour*_out.vcf.gz \
    --threads 40 \
    -o ~/test_octopus/out.vcf.gz
Obviously you can submit the single tumour runs as separate jobs to your cluster. I would probably try both approaches on a small test region and compare outputs and runtimes.
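As an illustration, the single-tumour runs could go out as a SLURM array job; the resource requests below are made-up placeholders and the paths are the same illustrative ones as in the commands above:

#!/bin/bash
#SBATCH --array=1-10
#SBATCH --cpus-per-task=16
#SBATCH --mem=60G
#SBATCH --time=4-00:00:00

# One array task per tumour; SLURM sets SLURM_ARRAY_TASK_ID to 1..10
N=${SLURM_ARRAY_TASK_ID}
octopus \
    -R /data/reference/dawson_labs/genomes/GRCh38/GCA_000001405.15_GRCh38_full_analysis_set.fa \
    -I /path/to/crams/normal.postprocessed.sorted.cram \
       /path/to/crams/tumour${N}.postprocessed.sorted.cram \
    --normal-sample normal \
    --forest germline.v0.7.2.forest \
    --somatic-forest somatic.v0.7.2.forest \
    --threads ${SLURM_CPUS_PER_TASK} \
    -o ~/test_octopus/tumour${N}_out.vcf.gz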
Hey,
I am back with some results. As you suggested, I added the forest parameters and reduced the expected VAF, to get a command line like:
singularity run octopus_0.7.2.sif \
    -R /data/reference/dawson_labs/genomes/GRCh38/GCA_000001405.15_GRCh38_full_analysis_set.fa \
    -I /dawson_genomics/Other/syntheticSequencing/wgs/GRCh38/mutatedBams/sim/Bam/*_mutated.postprocessed.sorted.cram \
       /dawson_genomics/Other/syntheticSequencing/wgs/GRCh38/mutatedBams/sim-*/Bam/*_mutated.postprocessed.sorted.cram \
    --normal-sample sim_mutated \
    --threads 40 \
    -o ~/test_octopus/out.vcf.gz \
    --min-expected-somatic-frequency 0.05 \
    --min-credible-somatic-frequency 0.005 \
    --forest-model /data/reference/dawson_labs/octopus/germline.v0.7.2.forest.gz \
    --somatic-forest-model /data/reference/dawson_labs/octopus/somatic.v0.7.2.forest.gz
This actually sped the computation up significantly; however, the program ended with an error:
[2021-04-09 09:54:23] <INFO> ------------------------------------------------------------------------
[2021-04-09 09:54:23] <INFO> octopus v0.7.1
[2021-04-09 09:54:23] <INFO> Copyright (c) 2015-2020 University of Oxford
[2021-04-09 09:54:23] <INFO> ------------------------------------------------------------------------
[2021-04-09 10:12:11] <INFO> Done initialising calling components in 17m 47s
[2021-04-09 10:12:11] <INFO> Detected 11 samples: "sim-a_mutated" "sim-b_mutated" "sim-c_mutated" "sim-d_mutated" "sim-e_mutated" "sim-f_mutated" "sim-g_mutated" "sim-h_mutated" "sim-i_mutated" "sim-j_mutated" "sim_mutated"
[2021-04-09 10:12:11] <INFO> Invoked calling model: cancer
[2021-04-09 10:12:11] <INFO> Processing 3,209,457,928bp with 40 threads (40 cores detected)
[2021-04-09 10:12:11] <INFO> Writing filtered calls to "/home/shollizeck/test_octopus/out.vcf.gz"
[2021-04-09 10:15:15] <INFO> ------------------------------------------------------------------------------------
[2021-04-09 10:15:15] <INFO> current | | time | estimated
[2021-04-09 10:15:15] <INFO> position | completed | taken | ttc
[2021-04-09 10:15:15] <INFO> ------------------------------------------------------------------------------------
[2021-04-09 10:19:03] <INFO> chr1:3289546 0.1% 3m 47s 2d 14h
[2021-04-09 10:25:39] <INFO> chr1:6310920 0.2% 10m 23s 4d 13h
[2021-04-09 10:32:20] <INFO> chr1:9474721 0.3% 17m 4s 4d 14h
[2021-04-09 10:40:17] <INFO> chr1:13081291 0.4% 25m 1s 5d 11h
[2021-04-09 10:47:02] <INFO> chr1:16063082 0.5% 31m 46s 5d 5h
[2021-04-09 10:53:42] <INFO> chr1:19294270 0.6% 38m 26s 5d 1h
[2021-04-09 11:00:24] <INFO> chr1:22578855 0.7% 45m 8s 4d 23h
[2021-04-09 11:07:03] <INFO> chr1:25757099 0.8% 51m 47s 4d 21h
[2021-04-09 11:13:46] <INFO> chr1:28880549 0.9% 58m 30s 4d 20h
[2021-04-09 11:20:27] <INFO> chr1:32185181 1.0% 1h 5m 4d 19h
[2021-04-09 11:27:04] <INFO> chr1:35335123 1.1% 1h 11m 4d 18h
[2021-04-09 11:33:45] <INFO> chr1:38718814 1.2% 1h 18m 4d 18h
....................................................................................................................................................................
[2021-04-13 12:47:30] <INFO> chrY:9871073 94.7% 4d 2h 5h 33m
[2021-04-13 12:52:46] <INFO> chrY:12590501 94.8% 4d 2h 5h 27m
[2021-04-13 12:59:18] <INFO> chrY:15744702 94.9% 4d 2h 5h 20m
[2021-04-13 13:05:22] <INFO> chrY:19093600 95.0% 4d 2h 5h 14m
[2021-04-13 13:11:40] <INFO> chrY:22317854 95.1% 4d 2h 5h 8m
[2021-04-13 13:19:00] <INFO> chrY:25659938 95.2% 4d 3h 5h 2m
[2021-04-13 13:19:47] <INFO> chrY:51668024 96.0% 4d 3h 4h 9m
[2021-04-13 13:19:48] <INFO> chrY:56693245 96.1% 4d 3h 3h 56m
[2021-04-13 13:22:04] <INFO> chr14_KI270722v1_random:182982 96.2% 4d 3h 3h 50m
[2021-04-13 13:27:00] <INFO> chr22_KI270733v1_random:79226 96.3% 4d 3h 3h 44m
[2021-04-13 13:29:57] <INFO> chrUn_KI270743v1:152613 96.4% 4d 3h 3h 37m
[2021-04-13 13:31:37] <INFO> - 100% 4d 3h -
[2021-04-13 13:32:07] <INFO> Starting Call Set Refinement (CSR) filtering
[2021-04-13 13:32:10] <INFO> Removed 914 temporary files
[2021-04-13 13:32:10] <EROR> A program error has occurred:
[2021-04-13 13:32:10] <EROR>
[2021-04-13 13:32:10] <EROR> Encountered an exception during calling 'std::bad_alloc'. This means
[2021-04-13 13:32:10] <EROR> there is a bug and your results are untrustworthy.
[2021-04-13 13:32:10] <EROR>
[2021-04-13 13:32:10] <EROR> To help resolve this error run in debug mode and send the log file to
[2021-04-13 13:32:10] <EROR> https://github.com/luntergroup/octopus/issues.
[2021-04-13 13:32:10] <INFO> ------------------------------------------------------------------------
And in contrast to the first time, the efficiency of the SLURM job was much worse (lots of unused CPU):
Job ID: 7357880
Cluster: rosalind
User/Group: shollizeck@petermac.org.au/shollizeck
State: FAILED (exit code 1)
Nodes: 1
Cores per node: 40
CPU Utilized: 31-06:40:32
CPU Efficiency: 18.84% of 166-01:19:20 core-walltime
Job Wall-clock time: 4-03:37:59
Memory Utilized: 18.46 GB
Memory Efficiency: 30.76% of 60.00 GB
I don't know if this is relevant for debugging or not.
I also have to say that this is a simulated dataset, which we designed to develop workflows for joint somatic variant calling. Could this be interfering with the method?
I am willing to let it run again with debugging enabled, but I am worried that the log file might be too big with that much data. Is there anything I can adjust so it doesn't create gigabytes' worth of log? I can obviously run this on a subset of samples first, but as we have real data of a similar size, the end goal is to actually run at least 8 tumour samples at the same time.
Cheers, Sebastian
This actually sped the computation up significantly; however, the program ended with an error
Ah, this looks to be a problem specific to the Docker/Singularity version, as others (#158, #163) have reported the same issue on different - much smaller - datasets. I haven't had time to investigate yet, but my first guess would be that insufficient memory has been allocated to the container.
And in contrast to the first time, the efficiency of the SLURM job was much worse (lots of unused CPU)
Hmm, yes, that is pretty poor CPU usage. It may be worth increasing the read buffer memory available to each thread; try setting --target-read-buffer-memory to 30GB.
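For example, a minimal sketch with that option added - the inputs here are placeholders rather than your actual files:

# Placeholder inputs; the relevant addition is --target-read-buffer-memory
octopus \
    -R reference.fa \
    -I normal.cram tumour1.cram tumour2.cram \
    --normal-sample normal \
    --target-read-buffer-memory 30GB \
    --threads 40 \
    -o out.vcf.gz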
I also have to say that this is a simulated dataset, which we designed to develop workflows for joint somatic variant calling. Could this be interfering with the method?
It can, if your spike-in method doesn't account for germline haplotypes or alignment error, as you can end up with unrealistic haplotype structure, which Octopus tries to model. These were the two main issues that were addressed with the synthetic tumours in the Octopus paper.
I am willing to let it run again with debugging enabled
I don't think that's necessary as this problem appears to be specific to the Docker/Singularity build. I'll try to investigate on smaller datasets. In the meantime, I'd recommend trying to install Octopus from source if possible - this may also improve runtimes. In addition, for future runs you can add the --keep-unfiltered-calls option. This will save an unfiltered copy of the calls, which can be helpful if the run fails during filtering for whatever reason (as is the case here).
The exception is due to the forest files being provided in compressed form - they need to be decompressed.
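For example, decompressing them in place (using the paths from your command above) and then pointing --forest-model and --somatic-forest-model at the resulting .forest files:

gunzip /data/reference/dawson_labs/octopus/germline.v0.7.2.forest.gz \
       /data/reference/dawson_labs/octopus/somatic.v0.7.2.forest.gz
# This leaves germline.v0.7.2.forest and somatic.v0.7.2.forest in the same directory.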
Hey,
I am very interested in the joint somatic variant calling capabilities of octopus, as there is currently a significant lack of methods in this space.
I have ten roughly 160x WGS cancer samples with one corresponding normal and would like to variant call them.
I just used the command I saw in the user guide, which gave me this start of the log.
Now I understand this is a lot of data, but I really don't want to wait 3 weeks for the results (and my HPC will not even allow default jobs to run that long :D )
Is there anything else I could modify apart from "parallelising" the calling over the individual chromosomes or even smaller regions? And if I do end up subsetting the chromosomes further, are there any artifacts to be expected in the border regions of the call regions?
I would really love to have another option in this space.
Cheers, Sebastian