nextstrain / nextclade

Viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement
https://clades.nextstrain.org
MIT License

How to decide if the reversionSubstitutions are valid variants or not and whether to keep them? #1267

Closed Rohit-Satyam closed 6 months ago

Rohit-Satyam commented 11 months ago

Hi

I have a small query. When I first process my sample with the wf-artic pipeline and then upload the consensus FASTA to Nextclade, I do not obtain any private mutations.

However, when I filter some of these VCFs and rebuild the consensus sequence based on variant allele fraction (vafator_af), variant allele count (vafator_ac) and variant depth (vafator_dp), some variants, such as the one listed below, are filtered out by the expression "INFO/vafator_af < 0.5 || INFO/vafator_dp < 100 || INFO/vafator_ac < 50". We use vafator_dp < 100 because the allele fraction, being a ratio, could still be 0.5 even when merely 10 out of 20 reads support the variant.

Now, since there is no guideline other than a minimum of 20X coverage per base (even when we get more than 100X coverage in amplicon data), most people might submit whatever comes out of wf-artic pipeline to GISAID and if such submissions are part of routine sequencing, Nextclade might pick up these sequences to build its set of private mutations. If such variants are then absent from assemblies generated with the above-mentioned filtering criteria, they are flagged as reversionSubstitutions. So how do I decide whether this is an actual reversionSubstitution, and whether to keep it? What would you do if you had this additional information about AF, DP and AC?

Thresholds are based on recommendations in this best practices paper

MN908947.3  9344    .   C   T   35.401  POOR_CALLS  DP=88;DPS=50,38;Pool=1;OLD_CLUMPED=MN908947.3|9344|C|T|1;vafator_af=0.81818;vafator_ac=72;vafator_n=0;vafator_dp=88;vafator_eaf=0.5;vafator_pu=1;vafator_pw=1;vafator_k=4;vafator_bq=11,15.5;vafator_mq=60,60;vafator_pos=129,153;vafator_rsmq=-0.2;vafator_rsmq_pv=0.84148;vafator_rsbq=1.067;vafator_rsbq_pv=0.28612;vafator_rspos=-0.133;vafator_rspos_pv=0.89393    GT:GQ   1:35
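To make the effect of the thresholds concrete, here is a minimal Python sketch that applies the same three conditions to a VCF INFO string. The helper names (`parse_info`, `fails_filter`) are hypothetical and only for illustration; the thresholds are the ones quoted above, and the INFO string is a shortened copy of the record shown.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO column into a key -> string-value dict (flags map to "")."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def fails_filter(info: str, min_af: float = 0.5, min_dp: int = 100, min_ac: int = 50) -> bool:
    """Return True if the variant would be filtered out by any of the three thresholds."""
    fields = parse_info(info)
    af = float(fields["vafator_af"])
    dp = int(fields["vafator_dp"])
    ac = int(fields["vafator_ac"])
    return af < min_af or dp < min_dp or ac < min_ac


# Shortened INFO column of the C9344T record above.
record_info = "DP=88;vafator_af=0.81818;vafator_ac=72;vafator_dp=88"
print(fails_filter(record_info))  # True: AF and AC pass, but depth 88 is below 100

# The 10-out-of-20-reads case: AF alone (0.5) would not filter it, depth and count do.
print(fails_filter("vafator_af=0.5;vafator_ac=10;vafator_dp=20"))  # True
```

So this particular variant is dropped only because of the depth threshold, even though about 82% of reads support it.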

ammaraziz commented 7 months ago

most people might submit whatever comes out of wf-artic pipeline to GISAID and if such submissions are part of routine sequencing,

Unfortunately, we do exactly this :( It's something that has bugged me. @Rohit-Satyam Could you describe how you perform the reconstruction of the consensus? Do you use the BAM file output of the wf-artic pipeline, and what do you use for variant calling?

Something to note for anyone not familiar with wf-artic: the pipeline will mask (N) columns that (from memory) are ambiguous, e.g. >1 possible SNP.

Question to the Nextclade folks: what is the reference in this context?

Reversions: Private mutations that go back to the reference sequence, i.e. a mutation with respect to reference is present on the attachment node but not on the query sequence.

From https://docs.nextstrain.org/projects/nextclade/en/stable/user/algorithm/07-quality-control.html#private-mutations-p

rneher commented 7 months ago

thanks for chiming in, @ammaraziz .

My take is that whether a diverse site or a reversion to reference is valid depends on a number of parameters, and there is no simple criterion that will always give you the right answer. The coverage and diversity thresholds you mention above are useful guides.

The reason for flagging these reversions is that it used to be quite common that, when a new variant popped up, many people submitted sequences that confidently called reference alleles in drop-out regions (either because their pipeline equated low coverage with reference, or because of contamination). This tends to be less of an issue nowadays.
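As a rough illustration of the drop-out problem described above, the following Python sketch shows a per-site consensus call that emits N at low depth instead of falling back to the reference base. The function name and the default thresholds (20X depth, as mentioned earlier in the thread, and an allele fraction of 0.5) are assumptions for the example, not the behaviour of any particular pipeline.

```python
def consensus_base(ref_base: str, alt_base: str, depth: int, alt_count: int,
                   min_depth: int = 20, min_af: float = 0.5) -> str:
    """Call one consensus base from read counts at a single site (hypothetical helper)."""
    if depth < min_depth:
        # Too few reads to say anything: mask rather than assume the reference allele.
        return "N"
    if alt_count / depth >= min_af:
        return alt_base
    return ref_base


print(consensus_base("C", "T", depth=5, alt_count=0))    # N
print(consensus_base("C", "T", depth=88, alt_count=72))  # T
```

A pipeline that returned `ref_base` in the first branch would produce exactly the confident reference calls in drop-out regions that later show up as reversions.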

To Ammar's question: these are mutations that map to terminal branches of query sequences and make the sequence closer to the reference. The reference used to always be the root of the tree, but we now also allow non-root sequences to serve as reference. Reference in this context refers to the sequence we align to initially.
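A toy Python sketch of that definition, for readers who find it easier to follow in code. This is only an illustration of the idea, not Nextclade's implementation: the function name, the dict-based representation of mutations, and the label format are all assumptions made for the example. Positions are 1-based, and mutations are given relative to the reference.

```python
def find_reversions(reference: str, parent_muts: dict, query_muts: dict,
                    query_missing=frozenset()) -> list:
    """List sites mutated on the attachment node where the query carries the reference base."""
    reversions = []
    for pos in sorted(parent_muts):
        if pos in query_missing:
            # Masked or uncovered in the query (N): no evidence for a reversion.
            continue
        if pos not in query_muts:
            ref_nuc = reference[pos - 1]
            reversions.append(f"{parent_muts[pos]}{pos}{ref_nuc}")
    return reversions


reference = "ACGTACGT"
parent_muts = {3: "T", 6: "A"}  # attachment node vs. reference
query_muts = {6: "A"}           # query vs. reference

print(find_reversions(reference, parent_muts, query_muts))                     # ['T3G']
print(find_reversions(reference, parent_muts, query_muts, query_missing={3}))  # []
```

The second call shows why masking matters: if the site is N in the query rather than a confident reference call, nothing is reported as a reversion in this sketch.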