nanoporetech / medaka

Sequence correction provided by ONT Research
https://nanoporetech.com

Medaka variant calling, annotation, classification: zygosity determination; supporting reads; #244

Closed MJMCarvalho closed 3 years ago

MJMCarvalho commented 3 years ago

I am calling variants on C. albicans strains genomes using the following workflow:

  1. Read correction with Canu (v2.0)
  2. Mapping reads against the reference with Minimap2 (v2.17)
  3. Variant calling (medaka_variant -f -i ), variant annotation (medaka tools annotate variants.vcf reference.fasta alignments.bam variants.annot.vcf) and variant classification (medaka tools classify_variants variants.annot.vcf) with Medaka v1.2.1
  4. Variant annotation and functional effect prediction with SnpEff (v4.3)

I came across the following situation:

Ca22chr3A_C_albicans_SC5314  3038      .               G             A             26.3085                PASS                AR=0,1;DP=189;DPS=121,68;DPSP=199;SC=14307,8962,13935,8754;SR=116,67,8,7;pos1=3038;pos2=3038;q1=26.819;q2=25.798;type=snp;ANN=A|upstream_gene_variant|MODIFIER|C3_00020W_A|C3_00020W_A|transcript|C3_00020W_A-T|protein_coding||c.-961G>A|||||961|,A|upstream_gene_variant|MODIFIER|C3_00040W_A|C3_00040W_A|transcript|C3_00040W_A-T|protein_coding||c.-2933G>A|||||2933|,A|downstream_gene_variant|MODIFIER|C3_00010C_A|C3_00010C_A|transcript|C3_00010C_A-T|protein_coding||c.790C>T|||||790|,A|downstream_gene_variant|MODIFIER|C3_00030C_A|C3_00030C_A|transcript|C3_00030C_A-T|protein_coding||c.1865C>T|||||1865|,A|intergenic_region|MODIFIER|CHR_START-C3_00010C_A|CHR_START-C3_00010C_A|intergenic_region|CHR_START-C3_00010C_A|||n.3038G>A||||||              GT:GQ   1|1:26  

  1. DPSP = Depth of reads spanning pos +-25; I now understand that these reads may not necessarily cover the variant position. In this case, we have 199 reads which span position 3038 with 25bp upstream and downstream of the position. I understand this is useful to phase the haplotypes; I don’t understand how we can assume these reads support the alteration.
  2. We have 183 reads supporting the reference (SR fwd+rev; 116+67) and 15 reads supporting the alteration (SR fwd+rev; 8+7); in total, 198. I suppose these are only reads spanning the position +-25bp (DPSP); one read was ambiguous (AR). However, the GT of the alteration is 1|1, which means it’s homozygous; I don’t understand how the program classifies this variant as homozygous. Additionally, looking at this position in IGV (screenshot attached), it showed a total count of 290: 183 reads for the alteration (A) and 107 reads for the reference (G). This is not in agreement with the vcf (I understand this is the quality filtering in medaka, but I’d like to understand more about the quality filtering done throughout medaka) and does not support a homozygous alteration. Would you, please, let me know how medaka calculates the zygosity of the alteration? Is it possible that medaka has switched the number of reads supporting the reference and the alteration?

[Screenshot 2020-12-21 at 17.42.41 — IGV view of the variant position]

Thank you. Kind regards. Maria

mwykes commented 3 years ago

Hi Maria, Thanks for your questions. One thing before that though - you mention that you are using canu for read-correction; are you using corrected reads for medaka variant calling? If so, I would not recommend doing this; Medaka is trained on uncorrected reads and giving medaka corrected reads may result in lower accuracy variant calls unless you retrain a medaka model on corrected reads. Furthermore, our medaka variant calling models are only trained on depth up to 60X, so may not work as well at the very high depth you are using.

Regarding your question 1) about the annotation tool, there is a known bug in the annotation tool affecting the DP field, but not the DPS field. As you say, the DPS field is a count of reads spanning the variant pos -25 -> pos + 25. Hence they do overlap the variant and have sufficient anchoring sequence either side of the variant to allow reliable realignment to the reference and alternative sequence(s). Indeed one cannot assume that all of these reads support the alternative allele - this is what the SR field is for, which brings us to question 2).
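To make the "spanning" definition concrete: a read counts towards this field only if its alignment covers the entire window pos-25 to pos+25. A minimal sketch of that check (illustrative only, not medaka's actual implementation; coordinates and read intervals are made up):

```python
# Sketch: counting reads that span a variant position +/- a pad, in the
# sense described above. Illustrative only -- not medaka's actual code.
PAD = 25

def spans(read_start, read_end, pos, pad=PAD):
    """True if the alignment covers [pos - pad, pos + pad] entirely.

    Coordinates are 0-based, end-exclusive (BAM/pysam convention).
    """
    return read_start <= pos - pad and read_end >= pos + pad + 1

# Hypothetical example: variant at 0-based position 3037 (VCF pos 3038)
pos = 3037
reads = [(2900, 3200),   # spans the whole window
         (3030, 3300),   # overlaps the variant but starts inside the window
         (2000, 3050)]   # ends inside the window
spanning = sum(spans(s, e, pos) for s, e in reads)  # -> 1
```

Reads that overlap the variant but start or end inside the +/-25 bp window are excluded, which is why this count can be lower than the raw pileup depth.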

With regards to how medaka calculates zygosity, medaka factors the diploid variant calling problem into two independent sets of consensus and variant calls on each haplotype (see the docs for more information). Whatshap is used to partition reads by haplotype. Reads which can't be haplotyped (e.g. because they don't overlap with a heterozygous SNP) will be used in forming the consensus for both haplotypes. Variant calls on each haplotype are finally combined into a diploid VCF. If a variant is found on both haplotypes it is homozygous; if it is found on just one it will be heterozygous. It is possible that your variant call is homozygous because the reads supporting either A/G are spread across both haplotypes, and medaka predicts A on both. This is consistent with the medaka quality scores for the consensus call on each haplotype (q1 and q2), which are both around 26. What were the genotypes of the neighboring variants (e.g. C->T and C->A)? It is also worth mentioning that medaka's neural network model can take into account several other factors such as local sequence context and strand bias when predicting the consensus sequence. This means it can make predictions which differ from the result of a naive base-counting consensus. Whilst this does make its predictions harder to interpret, it is also the reason why medaka provides higher accuracy overall.
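The combination step described above can be sketched as follows (a simplified illustration of the scheme, not medaka's actual code; function and variable names are made up):

```python
# Sketch: combining two per-haplotype consensus calls into a diploid
# VCF-style genotype, following the scheme described above.
# Illustrative only -- not medaka's actual implementation.
def combine_genotype(hap1_allele, hap2_allele, ref):
    """Return (phased GT string, list of ALT alleles) for two haplotype calls."""
    alts, gt = [], []
    for allele in (hap1_allele, hap2_allele):
        if allele == ref:
            gt.append(0)                      # haplotype matches the reference
        else:
            if allele not in alts:
                alts.append(allele)           # register a new ALT allele
            gt.append(alts.index(allele) + 1)  # 1-based ALT index
    return "|".join(map(str, gt)), alts

# A called on both haplotypes -> homozygous 1|1, as in the VCF line above;
# A on only one haplotype would give the heterozygous 0|1.
```

Under this scheme the genotype depends only on the per-haplotype consensus calls, not directly on raw read counts, which is why a 1|1 call can coexist with a mixed pileup.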

It's worth mentioning that medaka variant calls can be further refined by regenotyping them with DeepVariant. We have not tested this on non-human datasets, but it might be worth trying. A tutorial on how to do this can be found here.

Regarding the read counts, if you are able to share a bam containing the reads around that variant and your reference and vcf I can look into this and see if there is a bug in the code.

mwykes commented 3 years ago

In case you are able to share these files and would prefer to do so privately, I've just emailed you a secure file-sharing link.

MJMCarvalho commented 3 years ago

Dear Mike,

I will get back to you as soon as possible with a few comments and questions. In the meantime I have sent you the relevant files via the upload link you shared.

Thank you for your help. Kind regards. Maria


mwykes commented 3 years ago

Thanks Maria.

I ran the annotation tool on the latest pre-release of medaka in which a bug affecting the DP field has been fixed. This shows that the DP is 199 at this variant.

Filtering your bam to retain only primary alignments (which are the only alignments medaka considers) and checking the depth with bedtools confirms that the total depth of primary alignments is indeed 199 at this position.

bedtools coverage -b aln.sorted.bam -a test_variant.vcf 
Ca22chr3A_C_albicans_SC5314     3038    .       G       A       26.3085 PASS    type=snp        GT:GQ   1|1:26  313
samtools view aln.sorted.bam -F 2308 -b > aln.sorted.filt.bam
bedtools coverage -b aln.sorted.filt.bam -a test_variant.vcf                                                 
Ca22chr3A_C_albicans_SC5314     3038    .       G       A       26.3085 PASS    type=snp        GT:GQ   1|1:26  199

You can also get IGV to do this filtering via the view->preferences->alignments menu by ticking the "Filter secondary alignments" and "Filter supplementary alignments" boxes.
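For reference, the 2308 in the samtools command above is the bitwise OR of three SAM flag bits; a quick sketch of the equivalent per-record check:

```python
# SAM flag bits excluded by `samtools view -F 2308` in the command above.
UNMAPPED      = 0x4    # 4:    read is unmapped
SECONDARY     = 0x100  # 256:  secondary alignment
SUPPLEMENTARY = 0x800  # 2048: supplementary alignment

EXCLUDE = UNMAPPED | SECONDARY | SUPPLEMENTARY  # 4 + 256 + 2048 == 2308

def is_primary(flag):
    """True for the records samtools view -F 2308 keeps: mapped,
    primary, non-supplementary alignments."""
    return flag & EXCLUDE == 0

# is_primary(0)    -> True   (mapped forward-strand primary alignment)
# is_primary(16)   -> True   (reverse strand is fine; 0x10 is not excluded)
# is_primary(256)  -> False  (secondary alignment)
# is_primary(2048) -> False  (supplementary alignment)
```

Flags not in the mask, such as the reverse-strand bit (0x10), do not affect the filter, so both strands are retained.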

This shows the total depth as 189, with 129 A's, 60 G's and 10 deletions (the 10 deletions accounting for the difference between DP=199 and the depth reported by IGV).

Hence the variant is supported by a majority of reads, and a majority of reads on each strand.

I also took a look at the phasing around the variant. The bam you sent did not contain HP tags, so I reran medaka_variant (medaka_variant -i aln.sorted.filt.bam -f C_albicans_SC5314_A22_current_chromosomes.fasta -r Ca22chr3A_C_albicans_SC5314) and loaded the bam round_0_hap_mixed_phased.bam into IGV.

The final variant call was the same as in your vcf (despite the fact that I used the filtered bam, illustrating that medaka is doing this filtering internally): Ca22chr3A_C_albicans_SC5314 3038 . G A 26.308 PASS pos1=3038;q1=26.818;pos2=3038;q2=25.798 GT:GQ 1/1:26

Colouring and sorting alignments by their HP tag (haplotype assignment) shows that the reads on both haplotypes support the variant, explaining why it was called as a homozygous variant.

Hopefully that all makes sense. Please let me know if not.

Mike

mwykes commented 3 years ago

Hi Maria,

In my last comment, I forgot to address your question about the SR field ("Is it possible that medaka has switched the number of reads supporting the reference and the alteration?"). Yes, indeed, something does seem to be wrong here; I will look into this further and update the issue.

Mike

MJMCarvalho commented 3 years ago

Dear Mike,

Thank you for your help. I need some time to have a look at this and will get back to you.

Kind regards. Maria


mwykes commented 3 years ago

Hi Maria,

I have checked the SR field. The problem is related to reads aligning in multiple ways around that SNP: one set of reads has multiple ~50 bp insertions within 200 bp of the SNP, while another set has a single larger 462 bp insertion at ~3130. The 47 bp insertion within the 25 bp padding of the SNP seems to trip up the annotation tool, resulting in a different alignment of the trimmed read sequences around the SNP, and hence affecting the SR counts.

It is possible that this could be an artifact of the read correction - I would recommend repeating the analysis on the uncorrected reads. It may also be preferable not to error-correct reads when dealing with a diploid genome; unless you are using ploidy-aware methods, you run the risk of correcting "errors" in reads which are actually heterozygous variants.

The impact of the 47-base insertion can be avoided by setting --pad 3, which yields SR values which favour the alternative allele. These are, however, still inconsistent with the IGV base counts.

Ca22chr3A_C_albicans_SC5314 3038 . G A 26.3085 PASS AR=3,5;DP=199;DPS=124,75;DPSP=197;SC=2081,1265,2624,1510;SR=29,18,92,50;type=snp GT:GQ 1|1:26

Having seen the limited use of these SR annotations, we have already made the read-realignment part of the annotation optional in the latest pre-release of medaka and are considering removing it from the next release, such that only depth-related annotations are added.

Let me know if you have any other questions about this.

Mike

MJMCarvalho commented 3 years ago

Dear Mike,

Thank you for your help and patience.

I have re-run the analysis without Canu read correction before mapping the reads against the reference genome with minimap2, and there is no longer a variant called at the location we have been discussing (3038).

Below are the screenshots of IGV showing the bam and phased.bam files.

With Canu-corrected reads, medaka called 95,167 variants, versus 110,716 variants with uncorrected reads. This is, of course, raw output, before applying the filters I used in the rest of the analysis, which gave 53,512 SNPs with Canu-corrected reads and 78,228 SNPs with uncorrected reads. For this organism, we expect around 60,000 SNPs. I am not sure what happened here. Perhaps using Canu does not result in lower accuracy; in fact it reduces the number of variants called, which may indicate higher accuracy?

Also, regarding the heterozygous/homozygous genotype issue: the majority of SNPs in this organism are heterozygous, which is not what we saw using medaka with Canu-corrected reads, if we consider the reads supporting the alteration versus the reference. In fact, it was the other way around: the large majority of SNPs called were homozygous.

In conclusion, I am not sure what to think.

bam file (filtering for secondary and supplementary alignments)

[Screenshot: IGV view of the bam file]

phased.bam file (filtering for secondary and supplementary alignments)

[Screenshot: IGV view of the phased.bam file]

Thanks. Kind regards. Maria

MJMCarvalho commented 3 years ago

Dear Mike

I have Candida albicans isolates sequenced with both ONT and Illumina. I would like to call variants on these genomes, hence I used medaka. I would like to improve on my previous analysis and to be able to interpret medaka's outputs so that I can filter for high-quality SNP calls. I am wondering if you have seen my latest question and if you could help me.

Thanks Maria

mwykes commented 3 years ago

Hi Maria,

Apologies for the slow reply.

I don't think I can offer any advice other than that I have already given, namely how we would recommend you use medaka, and that DeepVariant genotyping of medaka or PEPPER variants can improve accuracy.

It's worth noting that Medaka is supported as a "Research Release" only. Research releases are provided as technology demonstrators to provide early access to features or stimulate Community development of tools. Support for this software is minimal and is only provided directly by the developers.

Kind regards, Mike