iqbal-lab-org / make_prg

Code to create a PRG from a Multiple Sequence Alignment file
Other
21 stars 7 forks source link

Indels can cause spurious site production #27

Open bricoletc opened 2 years ago

bricoletc commented 2 years ago

This is linked to #15 as it leads to production of ambiguous prgs (multiple paths give rise to same sequence)

Clustering code can conclude that there is no meaningful clustering of a set of sequences- e.g. puts each sequence in one cluster.

However the code that calls clustering can re-run prg-building for sets of sequences that only differ in alignment (i.e. gap positioning), not sequence (here)

This leads to spurious 'nested variants' and the following pathological example case: msp6_ambig.pdf

Of the 4 paths between nodes labeled 55 and 56, two are identical. This cause gramtools to mis-genotype down the line.

I have a fix that I'm implementing and will PR in

bricoletc commented 2 years ago

Some good news, at the point of commit 68e032c39f6ed4475f035656fe3a1238500b6022 (branch fix_27) I obtain the following number of AMBIG calls in a set of 14 P. falciparum validation samples:

../pf6_26_genes_7_13/Pf7G8/final.vcf.gz 4
../pf6_26_genes_7_13/PfCD01/final.vcf.gz        4
../pf6_26_genes_7_13/PfDd2/final.vcf.gz 3
../pf6_26_genes_7_13/PfGA01/final.vcf.gz        5
../pf6_26_genes_7_13/PfGB4/final.vcf.gz 4
../pf6_26_genes_7_13/PfGN01/final.vcf.gz        4
../pf6_26_genes_7_13/PfHB3/final.vcf.gz 4
../pf6_26_genes_7_13/PfIT/final.vcf.gz  4
../pf6_26_genes_7_13/PfKE01/final.vcf.gz        3
../pf6_26_genes_7_13/PfKH01/final.vcf.gz        5
../pf6_26_genes_7_13/PfKH02/final.vcf.gz        4
../pf6_26_genes_7_13/PfSD01/final.vcf.gz        3
../pf6_26_genes_7_13/PfSN01/final.vcf.gz        4
../pf6_26_genes_7_13/PfTG01/final.vcf.gz        4

Compared to tip of master:

./Pf7G8/final.vcf.gz    410              
./PfCD01/final.vcf.gz   396                      
./PfDd2/final.vcf.gz    379              
./PfGA01/final.vcf.gz   410                      
./PfGB4/final.vcf.gz    401              
./PfGN01/final.vcf.gz   390                      
./PfHB3/final.vcf.gz    362              
./PfIT/final.vcf.gz     383              
./PfKE01/final.vcf.gz   398                      
./PfKH01/final.vcf.gz   393                      
./PfKH02/final.vcf.gz   408                      
./PfSD01/final.vcf.gz   397                      
./PfSN01/final.vcf.gz   403                      
./PfTG01/final.vcf.gz   416 

I'm still evaluating the accuracy of calls on the new prgs, currently it looks like it improves some of my genes, but does worse in others. I'm investigating the worse ones

bricoletc commented 2 years ago

Logging here another fix implemented in #28

imagine the msa program creates following alignment:

ATTTTTTA
A--TTTTA
ATTTT--A

the last two sequences are the same, but the msa program somehow aligned them differently, the two alignments are equally valid i believe. then the attached graph can be created, in which paths 0-3 and 1-2 are the same seq (ATTTTA): dotgraph.pdf

NB: this is a MWE, but its not a theoretical point, this happens in practice in my pf data, albeit on longer super indel and repe at rich alignments (in gene MSP6)

My fix is this: if in the alignment, you find at least one sequence (here, ATTTTA) with more than one gapped alignment (here the last two gapped seqs in the alignment), do not attempt seq clustering/recursive prg construction. just output the sequences as variants (here, ATTTTA vs ATTTTTTA)

The results in the comment above, reducing AMBIG calls, include this fix.