COBALT behaviour depends on reference genome used for alignment

bobrad98 commented 1 year ago

Hi,

When running COBALT with the alignment files which are aligned to the reference genome that contains ALT contigs, the program fails. To be more precise, for command line input

java -jar -Xmx8G /opt/cobalt-1.13.jar
 -reference HCC1143-CCLE-WGS-Normal-full -reference_bam ./HCC1143-CCLE-WGS-Normal-full.processed_aligned.bam 
-tumor HCC1143-CCLE-WGS-Tumor-full -tumor_bam ./HCC1143-CCLE-WGS-Tumor-full.processed_aligned.bam 
-output_dir ./cobalt_large 
-gc_profile ./GC_profile.hg38.1000bp.cnp 
-ref_genome ./GRCh38ERCC.ensembl.fasta
> cobalt.log

Stdout reports an error:

Exception in thread "main" java.util.concurrent.ExecutionException: java.io.IOException: R execution failed. Unable to complete segmentation. 2023-01-12T17:55:34.703251697Z: at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122) 2023-01-12T17:55:34.703255705Z: at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191) 2023-01-12T17:55:34.703257871Z: at com.hartwig.hmftools.cobalt.RatioSegmentation.applyRatioSegmentation(RatioSegmentation.java:34) 2023-01-12T17:55:34.703260069Z: at com.hartwig.hmftools.cobalt.CobaltApplication.run(CobaltApplication.java:133) 2023-01-12T17:55:34.703262159Z: at com.hartwig.hmftools.cobalt.CobaltApplication.main(CobaltApplication.java:69) 2023-01-12T17:55:34.703266442Z: Caused by: java.io.IOException: R execution failed. Unable to complete segmentation. 2023-01-12T17:55:34.703268593Z: at com.hartwig.hmftools.cobalt.RatioSegmentation.ratioSegmentation(RatioSegmentation.java:48) 2023-01-12T17:55:34.703270633Z: at com.hartwig.hmftools.cobalt.RatioSegmentation.lambda$applyRatioSegmentation$0(RatioSegmentation.java:25) 2023-01-12T17:55:34.703272784Z: at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) 2023-01-12T17:55:34.703274756Z: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 2023-01-12T17:55:34.703276867Z: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 2023-01-12T17:55:34.703278922Z: at java.base/java.lang.Thread.run(Thread.java:829)

while the internal log says that the error is regarding the execution of the ratioSegmentation R script:

[37m17:54:57.294[m [32mINFO [m [36m[main] -[m Applying ratio gc normalization [37m17:55:01.038[m [32mINFO [m [36m[main] -[m Persisting HCC1143-CCLE-WGS-Tumor-subset05 gc read count to ./cobalt_medium [37m17:55:01.050[m [32mINFO [m [36m[main] -[m Applying ratio gc normalization [37m17:55:04.692[m [32mINFO [m [36m[main] -[m Persisting HCC1143-CCLE-WGS-Normal-subset05 gc read count to ./cobalt_medium [37m17:55:05.574[m [32mINFO [m [36m[main] -[m Persisting HCC1143-CCLE-WGS-Normal-subset05 gc ratio medians to ./cobalt_medium [37m17:55:05.577[m [32mINFO [m [36m[main] -[m Applying ratio diploid normalization [37m17:55:08.206[m [32mINFO [m [36m[main] -[m Persisting cobalt ratios to ./cobalt_medium/HCC1143-CCLE-WGS-Tumor-subset05.cobalt.ratio.tsv.gz [37m17:55:23.959[m [32mINFO [m [36m[worker-23] -[m Executing R script via command: Rscript /tmp/script1585828808180542943.R ./cobalt_medium/HCC1143-CCLE-WGS-Tumor-subset05.cobalt.ratio.tsv.gz referenceGCDiploidRatio ./cobalt_medium/HCC1143-CCLE-WGS-Normal-subset05.cobalt.ratio.pcf 100 [37m17:55:23.960[m [32mINFO [m [36m[worker-47] -[m Executing R script via command: Rscript /tmp/script3891011340808371821.R ./cobalt_medium/HCC1143-CCLE-WGS-Tumor-subset05.cobalt.ratio.tsv.gz tumorGCRatio ./cobalt_medium/HCC1143-CCLE-WGS-Tumor-subset05.cobalt.ratio.pcf 100 [37m17:55:34.697[m [1;31mFATAL[m [36m[worker-23] -[m Error executing R script. Examine error file /tmp/ratioSegmentation.R4578502221624741866.error for details.

This doesn’t occur when running COBALT with the alignment files which are aligned to the reference genome that doesn’t contain ALT contigs. Following command line input finishes successfully (only difference is in the input files):

java -jar -Xmx8G /opt/cobalt-1.13.jar 
-reference HCC1143BL 
-reference_bam ./_2_HCC1143BL.sorted.dedup.recal.bam 
-tumor HCC1143 
-tumor_bam ./_2_HCC1143.sorted.dedup.recal.bam 
-output_dir ./cobalt_medium_large 
-gc_profile ./GC_profile.hg38.1000bp.cnp 
-ref_genome ./GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta 
> cobalt.log

Why is this happening and how can I solve it? Thanks in advance!

p-priestley commented 1 year ago

We have not tested on hg38 alt. Do you have the error file from the R script?

ie. /tmp/ratioSegmentation.R4578502221624741866.error

The R script is pretty simple (see https://github.com/hartwigmedical/hmftools/blob/master/cobalt/src/main/resources/r/ratioSegmentation.R)

and just reads the ratiofile and pcffile (not the ref genome), so it must not like one of the chromosome outputs written by cobalt to these files.

bobrad98 commented 1 year ago

Hi,

For now, unfortunately, I'm not able to get the error file from the script. When I get more info, I'll share it here.

p-priestley commented 1 year ago

I would recommend to try to run that script line by line using the ratiofile and pcffile as inputs and see where it errors out.

Otherwise, if you are happy to share those 2 files, either here or via p.priestley@hartwigmedicalfoundation.nl then i can do it for you.

bobrad98 commented 1 year ago

I managed to see the state of the tool when the errors happen for the following command line input:

java -jar -Xmx8G cobalt-1.13.jar  /
  -reference NOR /
  -reference_bam $NORMAL /
  -tumor TUM /
  -tumor_bam $TUMOR  /
  -gc_profile $GC /
  -output_dir /sbgenomics/workspace/cobalt /
  -threads 36

where $NORMAL and $TUMOR are paths to the respective files.

The output directory contains the following files:

cobalt.version
NOR.cobalt.gc.median.tsv
NOR.cobalt.ratio.median.tsv
TUM.cobalt.gc.median.tsv
TUM.cobalt.ratio.pcf
TUM.cobalt.ratio.tsv.gz (file is too big to be attached, will email you)

so the error is coming from the execution of the ratioSegmentation script on the reference data.

The error file says:

Attaching package: ‘dplyr’ The following objects are masked from ‘package:stats’: filter, lag The following objects are masked from ‘package:base’: intersect, setdiff, setequal, union Loading required package: BiocGenerics Attaching package: ‘BiocGenerics’ The following objects are masked from ‘package:dplyr’: combine, intersect, setdiff, union The following objects are masked from ‘package:stats’: IQR, mad, sd, var, xtabs The following objects are masked from ‘package:base’: anyDuplicated, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply, union, unique, unsplit, which.max, which.min Warning: pcf is not run for sample 1 on chromosome arm because all observations are missing. NA is returned. Error in data.frame(rep(sampleid[i], nSeg), seg.chrom, seg.arm, pos.start, : arguments imply differing number of rows: 1, 0 Calls: pcf -> data.frame Execution halted

p-priestley commented 1 year ago

Hi there,

I took a look at the file you sent me (cobalt.ratio.tsv) and I noticed 2 issues with it

The proximate cause is that the referenceGCDiploidRatio is -1 for all records. This causes the segmentation to fail and indicates that the diploid normalisation process has failed earlier in PURPLE
None of the chromosomes in the ratio file appear to have the chr prefix and may be the reason why the diploid normalisation failed. I thought that this was present in all hg38 ref genomes? Can you confirm whether you expect chr prefix in your ref genome / bam?

bobrad98 commented 1 year ago

In this case, I used ref genome without the chr notation - the idea is to be able to use any ref genome file, regardless of the notation used. Does this mean that COBALT works only for the files with the chr notation?

charlesshale commented 1 year ago

All HMF tools require that GRCh37 has no 'chr' prefix and GRCh38 does have it.

bobrad98 commented 1 year ago

thanks!

hartwigmedical / hmftools

COBALT behaviour depends on reference genome used for alignment #361