Weird genotype calls based on RNAseq

raphaelbetschart commented 1 year ago

Describe the issue: I am observing some weird genotypes calls, when I call variants from RNA-seq data. I've followed the nicely written tutorial, the only thing I changed was a minimum coverage of 5X (instead of 3X). Below I have some examples (GT, AD and PL).

GT	AD	PL	QUAL	GQ
1/1	117,86	58,42,0	42	42
0/1	88,13	2,0,13	4	4

Why is the first SNP called as homozygous ALT, even if I have more reads for the REF compared to ALT (117 vs 86)?

From what I've read, the AD values is calculated by chunks.

Setup

Operating system: CentOS 8
DeepVariant version: 1.5.0, with 1.4.0 RNA model
Installation method (Docker, built from source, etc.): Singularity
Type of data: (sequencing instrument, reference genome, anything special that is unlike the case studies?) RNA-seq

pgrosu commented 1 year ago

Hi @raphaelbetschart,

Just wondering on a few things:

1) What are your QUAL and GQ scores for that call site? 2) Is that SNP at an exon boundary? 3) What tissue is this from? (Different tissue types can have an impact with DeepVariant RNA-Seq.) 4) Is this site in the REDIportal?

Based on this curve, at 5x coverage you should still get a good majority of the CDS label variants that you saw at 3x coverage (this figure is from the Supplementary data available in the paper listed in the Reference section):

Thanks, Paul

References

[1] A deep-learning-based RNA-seq germline variant caller

raphaelbetschart commented 1 year ago

Hi @pgrosu,

I've updated the table in my original post with QUAL and GQ values (the QUAL/GQ of the second variant are low)
How can I check this? I mainly work with WGS data
Should be skin stanza
No, these two sites are not in this portal

pgrosu commented 1 year ago

Hi @raphaelbetschart,

You can get exon information from the GTF-formatted GENCODE files, which will label exon regions like this (including their start and end sites):

chr1    HAVANA  exon    12613   12721   .       +       .       gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";

With this you can determine where in the exon your variant falls in, and if it is near the end or beginning.

I will focus on the high quality one variant, as the low quality one can be problematic. Skin tissue should be fine based on this figure:

The only other thing I can think of is that given that your number of reads is large, DeepVariant would downsample them before going into the model. So your supporting reads are picked by an allele counter, and it uses them to generate a matrix (image) that gets fed into the model generating the GT and GQ values. The height of these matrices is usually 100 rows. If it is greater it will randomly downsample from these reads, and usually use 95 of them as 5 are used for representing the reference sequence.

I'm assuming you've updated the model as denoted in the tutorial and not used the regular WGS one. I know it's obvious, but as noted in the paper there is a difference between a RNA-seq model versus the WGS/WES one provided by DeepVariant. Other than that is there anything special around this site in IGV? Do you see this as a singular variant without anything surrounding it? Is there anything special of the sequences surrounding the variant (i.e. repeats/etc.)? Does it align uniquely or are there other alignments it can occur at? Do you see anything problematic with the reference-representing reads? I'm assuming your sample is germline diploid, and in the recombinant regions.

Paul

AndrewCarroll commented 1 year ago

Hi @raphaelbetschart

In addition to what @pgrosu said. We do (rarely) observe genotype calls that differ substantially from the standard observations about HET/HOM/REF. We have made some prior observations about this for Germline data as a section in our FAQ and you can see a poster describing this observation as well and relating it to situations with a segmental duplication.

A few other questions/observations -

Have you looked at the site in IGV (as @pgrosu mentions)?

What are the MAPQ values for reads? DeepVariant with default parameters won't see reads with MAPQ<5. If there is a clear bias in MAPQ where the REF allele has a much lower MAPQ, DeepVariant may think the REF alleles are mismapped or might not see them in the pileup.

The coverage of this region is pretty high (e.g. as an abundant transcript or as a gene family with many isoforms). Do you know which of these two is the case?

raphaelbetschart commented 1 year ago

Hi @pgrosu and @AndrewCarroll

My variant doesn't fall close to an exon border.
I've used the RNA-model from here: RNA-model.
You are correct, I have a diplod human germline sample.
The variant is neither in the ENCODE blacklist nor in RepeatMasker, however the Umap M24 track has a value of 0.375.
The MAPQ value is 28.5, which shouldn't be a problem
The coverage of the region is high, because the transcript is abundant (TPM is 1100).

My guess would be that this problem arises from downsampling? The depth at this site is almost 4000X.

pgrosu commented 1 year ago

Hi @raphaelbetschart,

It's beginning to feel more and more there are some issues with this site. With a mappability score so low for that region, and a 4000x depth recovering only 203 reads, suggests multiple scenarios one of which could be as Andrew mentioned of segmental duplication. This would require more analysis of the region and the reads, one of which could be as Andrew suggested varied isoforms. Keep in mind DeepVariant will also cap the reads at 1500 if the depth is too high.

Another thing about low mappability is that for the WGS model there will be local realignment triggered, and that can result also in the lower number of reads. To get a BAM of your realigned reads for that region, just add the following to your DeepVariant script:

--make_examples_extra_args="emit_realigned_reads=true,realigner_diagnostics=/output/realigned_reads

as described in the FAQ section or in the following comment. Make sure to create the realigned_reads directory first, by adding the following line to your script (assuming you have a defined OUTPUT_DIR variable):

mkdir -p "${OUTPUT_DIR}/realigned_reads"

After you relaunch the script, the BAM file(s) representing that region would be the one you would use in IGV.

What percentage of your call sites exhibit this behavior? Were you able to look closer at the reads to compare among each other as Andrew suggested to see if they are either high expression or varied isoforms?

Hope it helps, Paul

raphaelbetschart commented 1 year ago

Hi @pgrosu

In the meantime I was running DRAGEN 4.2.4 RNA pipeline to check the variants, and I have some interesting findings.

GT (DV)	GT(D)	AD (DV)	AD (D)
1/1	0/1	67,67	715,697

samtools coverage at this position reports the following:

numreads	covbases	coverage	meandepth	meanbaseq	meanmapq
3009	1	100	2997	36.7	30.3

I will run DeepVariant again to inspect the realigned BAM.

pgrosu commented 1 year ago

Hi @raphaelbetschart,

Absolutely it is interesting, but if we can reason out what specific internal steps produced those results then I get very excited :) The thing is that this all happens based how the reads get processed with respect to the reference. Once we understand that deeply, then the rest is just a variation on a theme. You probably could get similar results if you used the flag --norealign_reads (or --realign_reads=false), to turn off the realigner, but does that mean the pre-aligned reads are optimal for DeepVariant's matrix interpretation by the model?

Looking forward to what you see with the realigned reads, ~p

raphaelbetschart commented 1 year ago

Hi @pgrosu,

I've realigned the reads and noticed that I have only 198 reads at this position. The input BAM reports 3154 reads at this position. I guess my region of interest is indeed located in a low mappability region.

pgrosu commented 1 year ago

Hi @raphaelbetschart,

That gets us closer to what might be happening, but it still does not provide good insight into the properties of the reads that don't align to the region. Why did they align initially here and where else could they align?

Regarding the reads that do align, how does the window of 200 bases (50 on each side of the variant) x 100 sampled reads support the reference versus the alternate? Is there something special regarding the reference-supporting ones vs the ones supporting the alternate that provide a 1/1 genotype? If you run the same 198 reads through Dragen does the genotype switch to 1/1? Are they all the same expression or do they have variation among them (isoforms)? If they have the same expression what is the quality of the reads, and if they don't what is the variation in the 200 x 100 window? Is there a bias among the alternate and if so what characteristics in the reads become prevalent as compared to the reference-supporting ones?

Again many questions until the root cause is discovered.

Hope it helps, Paul

raphaelbetschart commented 1 year ago

Hi @pgrosu,

After checking my region of interest, I came to the conclusion that is doesn't make sense to perform variant calling at all. Most of my SNVs fail several filters. Maybe at a later point I can get my hands on WGS data from the same samples to perform some comparisons.

pgrosu commented 1 year ago

Hi @raphaelbetschart,

That's a cool insight! Sometimes a negative result can be just as important as a positive one, pointing you towards a new way of looking at your experiment.

Hope you uncover the underlying mechanism, ~p

danielecook commented 1 year ago

@raphaelbetschart if you do have any opportunity to perform a comparison between RNA-seq and WGS data, we would be very interested in the results.

pgrosu commented 1 year ago

Hi Daniel (@danielecook),

Would the following comparison experiment help:

https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP253177&o=acc_s%3Aa

Paul

danielecook commented 1 year ago

@pgrosu - yes that dataset is helpful. It's just a matter of finding the time to do the analysis (:

I will close this for now. @raphaelbetschart if you have any further questions please feel free to reopen or create a new issue.

pgrosu commented 1 year ago

@danielecook If you have available resources through which I can help you out to perform this analysis, feel free to let me know.

google / deepvariant

Weird genotype calls based on RNAseq #701

References