Error in seq_len(nrow(seg)) during segmentation step

boyangzhao commented 3 years ago

Describe the issue I am encountering an error during the segmentation step when running PureCN. I can finish running PureCN when using the CNVkit coverage and segmentation files, but when I tried to use the PureCN coverage and segmentation, I see this error below,

2021-08-01T15:47:15.207558055Z Error in seq_len(nrow(seg)) : 
2021-08-01T15:47:15.207587198Z   argument must be coercible to non-negative integer
2021-08-01T15:47:15.207591973Z Calls: runAbsoluteCN ... do.call -> <Anonymous> -> .pruneByHclust -> sapply -> lapply
2021-08-01T15:47:15.207596841Z In addition: Warning message:
2021-08-01T15:47:15.207600327Z In seq_len(nrow(seg)) : first element used of 'length.out' argument
2021-08-01T15:47:15.207604164Z Execution halted

To Reproduce PureCN 1.22.2 (also tried on dev latest PureCN 1.23.6, with same error)

Rscript $PURECN/PureCN.R --out . --funsegmentation Hclust --genome hg38 --intervals S31285117_Covered_GRCh38_liftover_intervals.txt --normal normal.bam_coverage_loess.txt.gz --tumor tumor.bam_coverage_loess.txt.gz --sampleid tumor_test --vcf mutect.annot.vcf

Prior to running PureCN.R, I have also done the following,

Interval file was created with $PURECN/IntervalFile.R
VCF was created with Mutect2 (v4.2.0.0), further annotated vcfanno (to add flag DB to the INFO column, for presence of dbSNP; as Mutect2 has removed --dbsnp from their tool)
Coverage files for tumor and normal was created with $PURECN/Coverage.R, as Rscript $PURECN/Coverage.R --outdir . --bam tumor.bam --intervals S31285117_Covered_GRCh38_liftover_intervals.txt Rscript $PURECN/Coverage.R --outdir . --bam normal.bam --intervals S31285117_Covered_GRCh38_liftover_intervals.txt

Log file Attached is the log file: tumor_test.log

Session Info

R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.1.0

lima1 commented 3 years ago

Hmmm, that's a new one. Before we start debugging, you mind adding a few things to bring it closer to my recommended workflow?

Do you have multiple normal samples? Create a normal coverage database using NormalDB.R
Similarly, install the GenomicsDB-R package to import the GATK4 GenomicsDB (this isn't critical with matched normals, but still highly recommended). This requires installation of GenomicsDB.
Install my patched PSCBS R package (see Best practices vignette)
Do not provide the matched normal coverage, use the NormalDB.R generated pool of normal instead.
Mutect2 VCFs should work out of the box, no further annotation necessary (I need to update the documentation here if I misled you there).

boyangzhao commented 3 years ago

Thanks! Regarding Mutect2, yes indeed. I think I did an early test with Mutect2 without using --germline_resources, which led to 0 germline variants error (0 (0.0%) variants annotated as likely germline (DB INFO flag)) when running PureCN. Initially I thought it had to do with DB INFO flag and a missing dbSNP database annotation. But actually with germline_resources (e.g. using somatic-af-only-gnomad vcf), I confirm PureCN works fine with the VCF as is. Just to confirm, it was mentioned in documentation to stick with GATK best practices, does that mean we also need to apply FilterMutectCalls? This was not mentioned in the examples in the PureCN best practices, so I presume we stop at Mutect2 call itself?

Regarding the normal database, etc, I have now installed PSCBS, and also reran PureCN with normaldb (and no matched normal coverage). The normal was generated using two normal samples (for now; others are in other sources and will take more time to gather all together). I'm still working to get the GenomicsDB and GenomicsDB-R installed, and not yet installed for this test rerun. Now the error I see (also at the segmentation step) is,

2021-08-02T17:59:26.736037216Z Error in round(seg$loc.start) : 
2021-08-02T17:59:26.736049900Z   non-numeric argument to mathematical function
2021-08-02T17:59:26.736056072Z Calls: runAbsoluteCN -> GRanges -> IRanges
2021-08-02T17:59:26.736062869Z In addition: Warning message:
2021-08-02T17:59:26.736067763Z In mean.default(x$seg.mean, na.rm = TRUE) :
2021-08-02T17:59:26.736072005Z   argument is not numeric or logical: returning NA
2021-08-02T17:59:26.736075792Z Execution halted

lima1 commented 3 years ago

I would run the complete GATK somatic best practices. Our pipelines are still Mutect 1.1.7 for now, but in my test runs I used the complete workflow, including the FilterMutectCalls.

Do you get the crash also without providing the VCF?

boyangzhao commented 3 years ago

Ah ok - I will also make use of full Mutect2 workflow for the vcf. But for the error itself, I see the error with or without the vcf.

lima1 commented 3 years ago

Good. Makes it much easier. Would you be able to share a minimal example without the VCF by email?

boyangzhao commented 3 years ago

I've now also tested this with GenomicsDB and GenomicsDB-R installed, same errors. It seemed that with different funsegmentation values, I get different kinds of errors. Maybe it's with the input files. I will send you an email with the coverage/interval inputs.

I see the following,

with PSCBS

Error in (function (classes, fdef, mtable)  : 
unable to find an inherited method for function ‘findOverlaps’ for signature ‘"NULL", "GRanges"’
Calls: runAbsoluteCN ... <Anonymous> -> .PSCBSinput -> findOverlaps -> <Anonymous>
Execution halted

with CBS

Error in h(simpleError(msg, call)) : 
error in evaluating the argument 'x' in selecting a method for function 'median': Only tiny segments. 

This runtime error might be caused by invalid input data or parameters. 
Please report bug (PureCN 1.22.2). 
Calls: runAbsoluteCN ... median -> vapply -> .getExonLrs -> .stopRuntimeError
Execution halted

with Hclust

Error in round(seg$loc.start) : 
  non-numeric argument to mathematical function
Calls: runAbsoluteCN -> GRanges -> IRanges
In addition: Warning message:
In mean.default(x$seg.mean, na.rm = TRUE) :
  argument is not numeric or logical: returning NA
Execution halted

with none

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'median': Only tiny segments. 

This runtime error might be caused by invalid input data or parameters. 
Please report bug (PureCN 1.22.2). 
Calls: runAbsoluteCN ... median -> vapply -> .getExonLrs -> .stopRuntimeError
Execution halted

lima1 commented 3 years ago

Looks like only a tiny number of off-target regions makes it through the filters for some reason. Would you mind trying IntervalFile.R again without --offtarget. It's not that useful for whole exome anyways.

If you are at it, try sorting the BED file before running IntervalFile.R. Looks like IntervalFile.R is not sorting the chromosome order, only within chromosomes. I will fix this, does not be related to this crash though.

lima1 commented 3 years ago

I also just added a fix that ignores off-target intervals when less than 5% of intervals are off-target.

boyangzhao commented 3 years ago

Thanks! - I've tried it with a sorted interval and without --offtarget; and separately also with your latest commits (v1.23.7), in both cases, I'm still seeing the same errors.

lima1 commented 3 years ago

Your example data runs through for me now. PSCBS does not work without VCF, I now also check for this. (none and hclust are made for third-party segmentations). Can you try again with latest commit and post log file if it does not work?

boyangzhao commented 3 years ago

Thanks! I've now tested with the different options of segmentation (with or without VCF accordingly) and confirm it's all working!

Perhaps unrelated to this, when I tried to run PureCN with the same tumor coverage and interval files, but with normaldb.rds instead of normal coverage (I've only created normaldb based on two normals, as a test), I hit the following error. And this seems to happen only when I do not specify a vcf and has defined normaldb. This is not related to PSCBS as I've either didn't specify funsegmentation or used CBS. Was specifying VCF a requirement for usage with normaldb as input?

Error in changepoints(genomdati[chromi == ic], data.type, alpha, wghts,  : 
  NA/NaN/Inf in foreign function call (arg 2)
Calls: runAbsoluteCN ... <Anonymous> -> .CNV.analyze2 -> segment -> changepoints
Execution halted

lima1 commented 3 years ago

Wow, you're good at finding corner cases :-) Feel free to add to the Dropbox folder and I'll see what causes this. But specifying a VCF is critical to get usable results, so don't worry about it.

boyangzhao commented 3 years ago

haha, I've shared with you the rds file. Ok cool, I have vcf as always part of the workflow, so it'll be ok. Thanks!

lima1 commented 3 years ago

Hm, this is again related to your off-target reads. I'm not sure what happened, but those look really terrible. Looks like a normalization issue with your small number of normals. I'll close for now since this seems to be a data issue and I wouldn't recommend off-target reads in your case anyways.

Feel free to open a new issue when something comes up.

boyangzhao commented 3 years ago

Thanks! yea with off-target reads turned-off, the generated interval files together with the normaldb ran without crash. 👍

lima1 commented 3 years ago

Feel free to post a log file once you think you have a final setup and I'll double check that everything looks good.

boyangzhao commented 3 years ago

For the normaldb, we have many normals where the sequencing is done using the same platform, but the exome capture step upstream may be using different capture kits (albeit very similar) (some are SureSelect all exon v7, some are SureSelect all exons + UTRs, some SureSelect all exons + COSMIC). Shall the normaldb be created separately for each distinct capture kit used, or they're similar enough that they can be all pooled together to generate one normal reference?

Also is there an issue if the normal is done at one sequencing depth, while the tumors are done at a higher sequencing depth?

lima1 commented 3 years ago

Create one for each kit, especially if you have many normals. Getting them to work nicely with the same normaldb is probably not worth the work.

Should be safe to put all the normal VCFs into a single GenomicsDB and mapping bias file. Mapping biases are more a function of reference, read lengths (to much lesser extend insert size distributions), and pipelines - not capture kit. If those are pretty much identical, no brainer to combine.

boyangzhao commented 3 years ago

Thanks. I uploaded here three log files corresponding two three 'finalized' configurations, comparing 1) PureCN matched sample with CNVkit as input, 2) PureCN matched sample with its coverage/segmentation, 3) PureCN unmatched with normaldb (based on 21 normals). Does the params used make sense/ideal?

PureCN matched with CNVkit.log PureCN matched with cov_seg.log PureCN unmatched with pooled normal.log

They all converged to a solution of ~0.8 purity and ~1.86 ploidy. A few follow-up questions 1) mapping bias rds was also created, was this necessary for all three of these approaches? I don't see noticeable differences in results. It's mentioned in Best Practices that mapping bias is recommended for without matched normal, so does that mean for (1) and (2) where I had matched, it's optional? 2) It seemed the run with normaldb was noisier than the other approach (looking at the segmentation plots, and by the SD of log-ratios). In fact, in order of greater noise: matched with PureCN cov/seg (0.23) > matched with CNVkit (0.45) > PureCN unmatched with normaldb (0.7).

lima1 commented 3 years ago

Thanks for sharing. 99% sure there is something wrong with your normaldb. Did you use the same set of normals in CNVkit and PureCN? Are you sure all of them are from the same capture kit as tumor?

For low coverage data and matched normals, yes, I don't think you see a dramatic improvement. Otherwise, it's pretty much essential. You need some normal information to flag SNPs in problematic regions. Otherwise likelihoods can quickly become dominated by artifacts - artifacts you can easily spot with either a matched normal or multiple processed matched ones.

lima1 commented 3 years ago

I added some debug info in the latest commit that you can use to cluster your normal samples:

x <- readRDS("normalDB_assay_hg38.rds")
plot(hclust(x$groups[[1]]$dist.t), labels=gsub("_coverage*.txt.gz$","", basename(x[[1]])))

If you see a very strong outlier batch or something else weird, this might help you clean it up.

boyangzhao commented 3 years ago

Thanks! It's very handy! Just ran it on the normals, and see one sample stands out quite far from the rest. I'll still need to follow-up confirm, but at least I can exclude this one for time being. I ran the normaldb on another sample I know for sure was with the same capture kit, the log ratios SD was in the 0.15s. So likely it was a discrepancy in the kit of sample and reference normaldb built.

boyangzhao commented 3 years ago

I see in the JCO Clin Cancer Inform paper that LOH was also analyzed on the HLAs (class I). As there are plenty other tools specifically for looking at LOH of HLA regions, how robust are the results for HLA LOH by PureCN (for class I and/or class II)? Are additional localized seq alignments upstream of the workflow and/or other special preprocessing needed?

lima1 commented 3 years ago

Specialized tools are probably needed for focal LOH. Most LOH is not focal, but you probably want a combination of tools. You also need reliable purity and copy number to distinguish LOH from allelic imbalance. I never had time to do a benchmark of tools, so curious to learn from your experience.

lima1 / PureCN

Error in seq_len(nrow(seg)) during segmentation step #190