imgag / ClinCNV

Detection of copy number changes in Germline/Trio/Somatic contexts in NGS data
MIT License

Sensitivity issues #50

Open valeandri opened 1 week ago

valeandri commented 1 week ago

Hi @GermanDemidov,

Thanks for the precious tool and the great documentation on it.

I am considering using the tool for germline (and possibly mosaic) CNV calling in panel and WES scenarios. To allow a fair comparison with the tool we used previously, I am benchmarking the algorithm on the ICR639 dataset, which contains 74 CNVs across 72 samples; I use the 72 samples as the "reference". With this approach I obtain a sensitivity of 81%, compared to around 98% with the previous tool.
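To make the metric explicit, the sensitivity quoted here is the fraction of truth-set CNVs recovered. A minimal Python sketch (the count of 60 detected events is an assumption derived from 81% of 74, not stated in the thread):

```python
# Sensitivity as used in this benchmark: fraction of known CNVs recovered.
# 74 truth-set CNVs are from the thread; "60 detected" is inferred from ~81%.
def sensitivity(tp: int, fn: int) -> float:
    """TP / (TP + FN); each known CNV counts as one event."""
    return tp / (tp + fn)

total = 74
detected = 60  # assumed: 60/74 matches the ~81% reported
print(round(sensitivity(detected, total - detected), 2))
```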

Here are the steps and commands used for the benchmark:

  1. bed file annotation and preparation
    BedAnnotateGC -in $inBed -ref $fasta > annotated.bed
  2. bam coverage extraction
    BedCoverage -bam $inBam -in annotated.bed -min_mapq 3 -out $outCov

    The merged coverage file then has the following structure:

    
    #chr    start   end 10990   11030   11035   11116   11199   11256   11294   11301   11321   11488   11504   12574   12683   16693   17296   17297   17302   17303   17304   17305   17307   17316   17317   17318   17319   17320   17322   17324   17326   17331   17333   17336   17337   17338   17340   17342   17343   17357   17358   17359   17364   17365   17366   17369   17371   17375   17377   17378   17379   17380   17384   17385   17386   17388   17390   17392   17393   17395   17396   17398   17402   17403   17568   17588   18891   19292   21506   21541   21902   21964   22332   9954
    chr1    10325412    10325414    126.00  126.50  136.00  195.50  122.50  144.50  154.50  211.00  215.00  123.50  119.00  86.50   103.50  232.00  161.50  186.50  185.50  191.00  200.00  201.00  217.00  185.00  215.50  200.50  201.50  232.00  274.50  271.50  220.00  183.00  318.00  197.50  165.50  257.50  194.50  121.00  241.50  209.50  178.50  168.50  212.50  209.00  150.00  115.50  199.50  180.00  190.00  170.50  180.00  181.00  127.50  173.50  191.50  143.50  186.00  170.50  165.50  196.50  186.50  154.00  163.00  175.50  159.00  162.00  176.00  147.50  352.00  348.50  179.50  176.00  214.00  159.50
  3. cnv calling

    Rscript /ClinCNV/clinCNV.R \
      --normal $outCovMerged \
      --out $outputFolder \
      --bed annotated.bed \
      --folderWithScript /ClinCNV/
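The three steps above can be scripted end to end. A rough Python sketch that only assembles the command lines (tool names are from ngs-bits/ClinCNV as shown in the thread; all paths are placeholders, and the per-sample coverage merge step is assumed):

```python
# Sketch: build the shell command strings for the three pipeline steps.
# Paths and the merged-coverage filename are placeholders/assumptions.
def build_pipeline(bed, fasta, bams, out_dir):
    cmds = [f"BedAnnotateGC -in {bed} -ref {fasta} > {out_dir}/annotated.bed"]
    for bam in bams:  # one coverage extraction per BAM
        cmds.append(f"BedCoverage -bam {bam} -in {out_dir}/annotated.bed "
                    f"-min_mapq 3 -out {out_dir}/{bam}.cov")
    # assumes the per-sample .cov files were merged into merged.cov
    cmds.append(f"Rscript /ClinCNV/clinCNV.R --normal {out_dir}/merged.cov "
                f"--out {out_dir} --bed {out_dir}/annotated.bed "
                f"--folderWithScript /ClinCNV/")
    return cmds

cmds = build_pipeline("panel.bed", "hg38.fa", ["s1.bam"], "work")
```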



Am I missing any crucial points?

## Tool versions

ngs-bits 2019_09 (installed using bioconda)
clinCNV 1.18.3 (latest)
GermanDemidov commented 1 week ago

Hi Valentina,

in general everything looks good! For the human genome hg38 I think you also need to turn on the --hg38 flag, although the tool usually throws an error when the wrong reference genome is used. What are the variants that were not called, are they very short? Are they mosaic? Has this dataset been described somewhere, so I could take a look at the panel design?

valeandri commented 1 week ago

Thanks for the quick answer!

Here is the link to the dataset publication, where you can find more about it: https://pubmed.ncbi.nlm.nih.gov/30175241/. To give you a quick overview, the samples were analyzed with the Illumina TruSight Cancer Panel, comprising ~100 cancer-related genes.

I did not use --hg38 (even though the tool did not throw any error).

The missed variants are germline, and most of them are small, spanning 1 or 2 exons.

If you need any other information, just ask ;)

Valentina

GermanDemidov commented 1 week ago

I see =) So the variants could have been discarded if they fell into a "centromere" of hg38 (which may be a normal region in hg19). Also set --lengthG 0, which means even the smallest variants are called. Run this and let me know the sensitivity; this is not the maximum of what can be done =) (I hope you have this test run and the sensitivity check scripted, so you can repeat it several times without much manual effort.)

100 genes, in particular, is not a lot of genetic material, and some regions may be filtered out as, e.g., GC-extreme. This additional trick will improve the situation: https://github.com/GermanDemidov/segmentation_before_CNV_calling. But first try to run without it, just with --hg38 --lengthG 0 --scoreG 10 (the last option lowers the quality threshold for called variants; --lengthG, as I said, is the minimum length in units, which are exons here).
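Conceptually, the two parameters act as thresholds on each candidate call. A hedged Python sketch of that filtering logic (the call dictionaries and field names are hypothetical, not ClinCNV's internal representation):

```python
# Sketch of what --lengthG / --scoreG conceptually control: each candidate
# call must span at least `min_exons` targets and reach log-likelihood
# `min_score`. Call records and field names here are illustrative only.
def pass_thresholds(call, min_exons=0, min_score=10):
    return call["n_exons"] >= min_exons and call["loglik"] >= min_score

calls = [
    {"gene": "SDHB",  "n_exons": 1, "loglik": 12.0},
    {"gene": "BRCA1", "n_exons": 2, "loglik": 8.5},
]
kept = [c for c in calls if pass_thresholds(c)]
print([c["gene"] for c in kept])  # -> ['SDHB']
```

With --lengthG 0 even single-exon events survive the length check, so the score threshold becomes the main filter.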

valeandri commented 1 week ago

Thank you!!!

By running the script with your suggestions, I got a sensitivity of 94.4%! A great improvement ;)

I also checked the FP rate and it looks fine, with an average of 2.1 calls per sample.

Do you have any other suggestions, besides trying out the segmentation as well? (https://github.com/GermanDemidov/segmentation_before_CNV_calling)

Again thanks, valentina

GermanDemidov commented 1 week ago

Hi Valentina,

great news! You can probably also check whether the true positive calls have higher log-likelihood scores than the false positives, and tune your threshold accordingly.
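One simple way to do that tuning: sweep the observed scores as candidate cutoffs and keep the one that best separates TPs from FPs. A minimal sketch with made-up scores (the objective here, TPs kept minus FPs kept, is just one reasonable choice):

```python
# Sketch: choose a log-likelihood cutoff by sweeping candidate thresholds
# and maximizing (TPs retained - FPs retained). Scores are illustrative.
def best_threshold(tp_scores, fp_scores):
    candidates = sorted(set(tp_scores + fp_scores))
    def gain(t):
        return sum(s >= t for s in tp_scores) - sum(s >= t for s in fp_scores)
    return max(candidates, key=gain)

tp = [25.0, 40.0, 18.0, 33.0]
fp = [6.0, 9.0, 15.0]
print(best_threshold(tp, fp))  # -> 18.0 (keeps all 4 TPs, drops all 3 FPs)
```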

What are the missing CNVs? Are they in the PMS2 gene? Can you open the .seg files from the false-negative samples in IGV and check the regions of the missing CNVs? What do you see?

valeandri commented 1 week ago

Hi German,

I have the results of the FN analysis, and it turns out that PMS2 is not the issue. The benchmark was performed using the exon as the fundamental unit, so the total TP+FN equals the total number of events tested.

As you can see from the attached file, most of the uncalled exons are at the end or the beginning of a larger variant and are annotated as "TooShortOrNA" or "GCnormFailed" in the _cov.seg file. I understand that exons <50 bp are excluded, so I will fix this in the BED design. How can I improve the other ones? Is there a way to avoid the exclusion of such exons?

The attached file has the following columns:

Thank you, Valentina

GermanDemidov commented 1 week ago

Hi Valentina,

Yes, I see. I think you can expand the small exons to be around 100 base pairs. I doubt the actual enrichment target was 50 bp; I would use the original enrichment BED file, not the list of exons.
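The suggested expansion can be done as a small BED preprocessing step. A sketch assuming symmetric padding around intervals shorter than the minimum (coordinates clipped at zero; exact padding policy is my assumption, not a ClinCNV requirement):

```python
# Sketch: pad BED intervals shorter than `min_len` symmetrically so they
# reach at least that length; the start coordinate is clipped at 0.
def pad_interval(start, end, min_len=100):
    length = end - start
    if length >= min_len:
        return start, end  # already long enough, leave untouched
    extra = min_len - length
    new_start = max(0, start - extra // 2)
    return new_start, new_start + min_len

print(pad_interval(100, 140))  # -> (70, 170): 40 bp exon padded to 100 bp
```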

You can change the 50bp threshold here: https://github.com/imgag/ClinCNV/blob/master/clinCNV.R, line 293.

For GC normalization, you may relax the criteria here https://github.com/imgag/ClinCNV/blob/master/generalHelpers.R#L132 or on line 134: put 10 instead of 25 and 50. It means "if there are fewer than 10 exons with this GC value, remove them; otherwise, perform GC normalization".
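The rule being relaxed can be pictured as bucketing targets by GC content and discarding sparse buckets. A rough Python sketch of that idea (not ClinCNV's R code; the target records and two-decimal GC binning are my assumptions):

```python
# Sketch of the described GC-bin rule: bucket targets by rounded GC
# content and drop bins with fewer than `min_count` targets; the survivors
# would then go on to GC normalization. Data layout is illustrative.
from collections import defaultdict

def filter_gc_bins(targets, min_count=10):
    bins = defaultdict(list)
    for t in targets:
        bins[round(t["gc"], 2)].append(t)
    # keep only targets from sufficiently populated GC bins
    return [t for b in bins.values() if len(b) >= min_count for t in b]

targets = [{"gc": 0.45}] * 12 + [{"gc": 0.80}] * 3
print(len(filter_gc_bins(targets)))  # -> 12: the sparse 0.80 bin is dropped
```

Lowering min_count from 25/50 to 10 keeps more extreme-GC exons in play, at the cost of noisier normalization in those bins.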

Try running this modified code and let me know if it works better.

valeandri commented 1 week ago

Hi German,

Thank you as always for the helpful suggestions!

I applied those changes and now get a sensitivity of 96.4%, missing 13 exons out of a total of 357. Most of these exons are at the border of a larger variant; only 3 variants are completely missed (all small). Here are those variants together with the _cov.seg VALUE of the corresponding exons:

SDHB exon 1 deletion: 1.56
PMS2 exon 11 duplication: 2.73
BRCA1 exons 18-19 deletion: 0.88-1.61

It is a really good result! Of course, I now call double the number of variants, but I can manage filtering using the log-likelihood.

Do you think those variants could be recovered?

Thanks! Valentina

GermanDemidov commented 1 week ago

Hi Valentina,

the answer actually depends. Can you check the plots produced? There should be a clustering picture. Do you see any clustering of the samples?

1.56 is actually closer to diploid than to a deletion (2 vs 1), so this one is impossible to rescue, unless you have clustering of the samples.
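Reading the .seg VALUE as an estimated copy number makes this concrete: assigning the nearest integer state shows why 1.56 looks diploid. A small sketch (the nearest-integer assignment is a simplification of what ClinCNV's likelihood model actually does):

```python
# Sketch: treat the .seg VALUE as an estimated copy number and assign the
# nearest integer state. This is a simplification of a likelihood-based call.
def nearest_state(value, states=(0, 1, 2, 3, 4)):
    return min(states, key=lambda s: abs(s - value))

print(nearest_state(1.56))  # -> 2: closer to diploid than to a deletion
print(nearest_state(2.73))  # -> 3: consistent with a duplication
```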

valeandri commented 6 days ago

Hi German,

Here is the cluster image, with the FN samples highlighted. We can see two separate clusters, even if the distance between them is not huge. In other situations I do see more defined clusters.

[attached: clustering plot with FN samples highlighted]

GermanDemidov commented 6 days ago

Hi Valentina,

yes, I suspected it might be like this! Can you add the parameter --minimumNumOfElemsInCluster 20, run it, and then let me know if anything changed (for better or for worse)?

valeandri commented 6 days ago

Hi German,

Thanks again for the suggestion! By applying this parameter I recover the PMS2 call, even if the cov value decreases (2.66 vs 2.7). So now I obtain a sensitivity of 97.2%.

How does this parameter influence the call? If I am guessing right, it should depend on the distribution of (normalized?) coverage in the baseline samples. Is it possible to infer it automatically from the distribution of the normals and then apply it during calling? I was thinking about how to implement this correction in a routine.

P.S. I am experimenting with calling on homologous genes after mapping the reads to a masked reference (i.e. the reads mapping to PMS2 and PMS2CL all map to PMS2); should I open a new thread to explore this?

Thanks!!

GermanDemidov commented 6 days ago

Hi Valentina,

that's actually not the effect I was hoping for =) I expected the SDHB exon to be rescued. What happened to it? Did it remain 1.56? How long is it? SDHB is a perfectly normal gene and it should have been called...

> If I am guessing right, it should depend on the distribution of (normalized?) coverage in the baseline samples.

Yes, I think the variance in the clustered cohort decreased, so 2.66 became more "statistically significant" despite the change of means.

> Is it possible to infer it automatically from the distribution of the normals and then apply it during calling? I was thinking about how to implement this correction in a routine.

In-house we simply keep the correlation values between the coverage profiles of our samples. When we add a new sample, we calculate its correlation with all the others, so at any time we know which 100 samples are the most correlated with a particular one. We don't actually use the clustering provided by ClinCNV, except in research settings; in clinical settings we just take the 100 most similar samples as controls.
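That selection strategy is straightforward to implement. A self-contained Python sketch (pure-stdlib Pearson correlation; sample names and coverage vectors are illustrative):

```python
# Sketch of the described control-selection strategy: rank cohort samples
# by Pearson correlation with the new sample's coverage profile and keep
# the top k as the reference set.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def top_k_controls(new_profile, cohort, k=100):
    """cohort: dict of sample name -> coverage vector over the same targets."""
    ranked = sorted(cohort, key=lambda n: pearson(new_profile, cohort[n]),
                    reverse=True)
    return ranked[:k]

cohort = {"s1": [10, 20, 30, 40], "s2": [40, 30, 20, 10], "s3": [11, 19, 31, 42]}
print(top_k_controls([10, 20, 30, 40], cohort, k=2))  # -> ['s1', 's3']
```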

> P.S. I am experimenting with calling on homologous genes after mapping the reads to a masked reference (i.e. the reads mapping to PMS2 and PMS2CL all map to PMS2); should I open a new thread to explore this?

I am afraid that's not going to work =) when you see a deletion, you won't be able to say whether it happened in PMS2 or PMS2CL. That's the limitation of short reads. It is very similar to SMA gene CNV detection; special tools are required, such as https://www.nature.com/articles/s41436-020-0754-0

valeandri commented 6 days ago

Hi German,

> that's actually not the effect I was hoping for =) I expected the SDHB exon to be rescued. What happened to it? Did it remain 1.56? How long is it? SDHB is a perfectly normal gene and it should have been called...

The SDHB variant is a single-exon deletion of 78 bp, really small. The exon now has a value of 1.44.

> In-house we simply keep the correlation values between the coverage profiles of our samples. When we add a new sample, we calculate its correlation with all the others, so at any time we know which 100 samples are the most correlated with a particular one. We don't actually use the clustering provided by ClinCNV, except in research settings; in clinical settings we just take the 100 most similar samples as controls.

OK, that makes perfect sense. In our case we would probably go for a similar approach, selecting the most correlated samples during the "baseline" creation. Thank you.

> I am afraid that's not going to work =) when you see a deletion, you won't be able to say whether it happened in PMS2 or PMS2CL. That's the limitation of short reads. It is very similar to SMA gene CNV detection; special tools are required, such as https://www.nature.com/articles/s41436-020-0754-0

Yes, you are totally right, and we are aware of this big limitation. Our goal is to determine whether we have a coverage variation in such a region compatible with a scheme of "total CN = gene max CN + homolog(s) max CN". Put simply, we want to call with ploidy=4 (in this case) to suggest a balanced CNV in the gene or its homolog(s). My idea was to use somatic or mosaic calling to mimic what GATK does with ploidy=4 calls. I hope I explained the concept clearly.

Valentina

GermanDemidov commented 6 days ago

Hi Valentina,

is it a whole-exon deletion in SDHB? If it is a half-exon deletion, then I understand why the value is so different from 1, the expected value. You can try running the tool with the --superRecall parameter, but I doubt even this will help.

Yes, we also do a similar thing, but instead of calling we load the .seg files into IGV and check the known paralogous disease-causing genes by eye =) since actual calling is impossible in this sense.