sars-cov-2-variants / lineage-proposals

Repository to propose and discuss lineages
42 stars 2 forks source link

Del 21997_22000 issue still unfixed #63

Closed aviczhl2 closed 1 year ago

aviczhl2 commented 1 year ago

Discussed in https://github.com/cov-lineages/pango-designation/issues/1882 we found that del21997_22000 on GISAID is caused by IonTorrent and Nanopore wrongly base called with fewer number of nucleotide.

However, this issue is still unfixed, GISAID remains calling this del21997_22000, while usher still assign S:Y145H (T21994C, T21995C) to it. (for all XBBs)

This is a recent XBB.1.16(just an example, all XBBs have such branches) branch with this issue, all seqs are quite new. usher

Screen Shot 2023-05-16 at 10 21 19

As long as normal sequences without this artefact will be attracted to such branches( shown in https://github.com/cov-lineages/pango-designation/issues/1882), I suggest a fix.

Insert an A at position 22000 for all seqs with del21997_22000 on GISAID profile when analysing, or just remove those seqs on usher trees.

FedeGueli commented 1 year ago

cc @angiehinrichs

AngieHinrichs commented 1 year ago

If this seems limited to XBB.1.16 (or XBB.1?), and it's more annoying than the possibility of missing actual mutations at those positions, then I can add some branch-specific masking in the affected range. Which specific positions to mask depends on where the alignment tool that I'm using places the deletion and substitutions and how consistent it is across different sequences.

Some notes about how inconsistently different aligners treat this follow -- maybe more detail than strictly necessary for this issue, but it is an interesting case.

I have been using mafft --keeplength to align sequences so far, but recently have been considering nextalign (because I get it for free when running nextclade on all sequences) and minimap2 (because it's also fast, configurable and widely used).

I picked one of the sequences from @aviczhl2's usher subtree, Japan/PG-470993/2023|EPI_ISL_17556447|2023-04-06, and compared different aligners' results and the differences listed on GISAID, and wow, it is a great illustration of how many equally poor choices aligners are faced with because of the sequencing error, and how inconsistent their choices are with each other:

extra space = insertion from either minimap2 with default options or gisaid's aligner:
           22222222222 222 22
           11111111111 122 22
           99999999999 900 00
           88999999999 900 00
           89012345678 901 23
ref        TGTTTATTACC ACA AA
mafft      TGTTTACCA-- AAA AA  <-- T21994C, T21995C, del_21997_21998, C22000A
nextalign  TGTTTAC--CA AAA AA  <-- T21994C, del_21995_21996, C21998A, C22000A
mm2 pango  TGTTTA--CCA AAA AA  <-- del_21994_21995, A21996C, C21998A, C22000A
mm2 deflt  TGT---TTACCAAAA AA  <-- del_21991_21993, ins_21998_A, C22000A
gisaid     TGTT---TACC AAAAAA  <-- del21992_21994, C22000A, ins22000A

"gisaid" is a reconstruction based on the "Nucl Mutations" listed for EPI_ISL_17556447 on the gisaid website in that position range, "del21992_21994, C22000A, ins22000A". In order to represent the insertion of A after base 22000, I added a space to all other sequences. "mm2 deflt" is minimap2 with defaults (no extra options); like whatever aligner GISAID is using, it deletes 3 bases and then inserts A but after 21998. "mm2 pango" is minimap2 with the added options -x asm20 --score-N=0 as used by pangolin -- and like mafft and nextalign, it deletes two bases and makes 3 substitutions, but all of those tools choose different locations for the deletion and substitutions.

I have not looked into how internally consistent each alignment tool is, i.e. given a bunch of sequences that are the same in that region (reference 21988-22003) but not completely identical, does an aligner always place the indels and substitutions the same. (It looks like mafft is at least consistent enough to create this annoying branch in XBB.1.16, but I should look around for substitutions elsewhere in that position range.)

Hmmm, EPI_ISL_17556447 used Illumina NextSeq 550 not Ion Torrent or nanopore -- but it still makes quite the problem for alignment and interpretation. 🤷

aviczhl2 commented 1 year ago

It is forming a large branch in XBB.1.16 now. #125

AngieHinrichs commented 1 year ago

As of the 2023-07-21 build, nucleotides 21994, 21995 and 21998 will be masked in the XBB branch of the UShER tree -- more on that in https://github.com/cov-lineages/pango-designation/issues/1503#issuecomment-1646243451.