nextstrain / nextclade

Viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement
https://clades.nextstrain.org
MIT License

Most BA.2.38 miscalled by Nextclade, and various other miscalled/missing lineages #935

Closed silcn closed 2 years ago

silcn commented 2 years ago

Issue: as it says on the tin. Most BA.2.38 are called as BA.2 by the latest Nextclade. This can be seen by running on any sample of recent sequences from India, of which a substantial proportion are BA.2.38, identified by the S:K417T mutation.

corneliusroemer commented 2 years ago

Thanks for the report @silcn, great spotting.

I have a few ideas for why this may have happened.

silcn commented 2 years ago

@corneliusroemer also BA.2.74 seems to be getting miscalled as BA.2.56.

corneliusroemer commented 2 years ago

The root cause is that I wanted to fix the tree structure, because homoplasic defining mutations made the wrong lineages group together. Now that this is causing miscalls, I will have to fix it differently: via the constraint tree, or by calculating branch lengths differently. Thanks again for the pointer, it's super useful that you shared this as soon as you noticed.

corneliusroemer commented 2 years ago

In the BA.2.38 case, the problem is that the majority of the designated sequences also have 6091T, since that's the branch that's common in the UK, which sequences a lot.

As a result, my lineage generator thinks that 6091T is defining. Because the Indian BA.2.38 lack that mutation, they are not called as BA.2.38.

Since we do want all of these to be BA.2.38, I'll overwrite the script result so that the UK mutation isn't defining.
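To illustrate the sampling-bias effect described above, here is a minimal sketch in Python. It is not the actual lineage generator: the majority rule, the 0.5 threshold, the function name `majority_defining`, the toy sequence counts and the strict subset test are all assumptions made for illustration. Real Nextclade placement works on a tree, so the subset test is only a simplified stand-in.

```python
from collections import Counter

def majority_defining(samples, threshold=0.5):
    """Return mutations carried by more than `threshold` of the samples.

    Hypothetical helper: a plain majority rule, assumed for illustration.
    """
    counts = Counter(m for muts in samples for m in set(muts))
    n = len(samples)
    return {m for m, c in counts.items() if c / n > threshold}

# Toy data (invented): 8 heavily sampled UK-like sequences carry 6091T,
# 2 India-like sequences do not. All carry S:K417T.
designated = [{"S:K417T", "6091T"}] * 8 + [{"S:K417T"}] * 2

defining = majority_defining(designated)

# An Indian BA.2.38 without 6091T fails a strict "has all defining
# mutations" test, even though it carries S:K417T.
indian_query = {"S:K417T"}
matches = defining <= indian_query
```

With these toy numbers, 6091T ends up in `defining` purely because the UK branch is over-represented among the designated sequences, so `matches` is false for the Indian query.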

I have yet to investigate the BA.2.74 miscalling.

By the way, in this case it's not the nextclade version that's relevant (2.3.0) but the dataset version, which you can see here in the line with "Updated: 2022-07-12..."

image

If you do spot any other problems in the future, these are easy to fix so please do tell me about them here - ideally mentioning the version. Maybe you can even figure out what's wrong so the fix can be done very quickly :)

In this case, the way to check is to go to the tree with the sequence that is misassigned, find the branch leading to the lineage that's not getting assigned correctly and see whether the mutations on the branch are correct or could explain why things don't get assigned.

image image

In this case, you can see that extra UK mutation that isn't present in the Indian BA.2.38. Bingo, now one just needs to overwrite here: https://github.com/neherlab/nextclade_data_workflows/blob/feat/gisaid-v2/sars-cov-2/profiles/clades/lineage_overwrite.tsv

BA.2.38 6091 C
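As a rough sketch of what such an overwrite row could do, the snippet below applies rows of the form "lineage, position, nucleotide" to a table of per-lineage nucleotide states. This is an assumption about the mechanism, not the code in nextclade_data_workflows: the function name `apply_overwrites`, the dictionary layout and the whitespace-separated parsing are hypothetical.

```python
def apply_overwrites(defining, overwrite_rows):
    """Apply 'lineage position nucleotide' rows to per-lineage states.

    defining: {lineage: {position: nucleotide}} (hypothetical layout).
    Returns a new mapping; the input is left unchanged.
    """
    result = {lineage: dict(states) for lineage, states in defining.items()}
    for row in overwrite_rows:
        lineage, position, nucleotide = row.split()
        result.setdefault(lineage, {})[int(position)] = nucleotide
    return result

# State inferred by the script: BA.2.38 carries T at 6091.
inferred = {"BA.2.38": {6091: "T"}}

# The overwrite row from this comment sets 6091 back to C.
fixed = apply_overwrites(inferred, ["BA.2.38\t6091\tC"])
```

After the overwrite, `fixed["BA.2.38"][6091]` is "C", so under the assumption that C is the unmutated state at that position, 6091T no longer counts as defining for BA.2.38.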

I'd be very happy if you flag things you notice, 2 pairs of eyes are better than 1 and I don't think there's anyone out there who knows the lineages better than you!

silcn commented 2 years ago

Thank you for the fix!

The reason for BA.2.74 miscalling seems to be that it isn't actually in the tree yet. I presumed that because BA.2.75 was there it meant BA.2.74 would be too, but I was mistaken :)

silcn commented 2 years ago

Some more issues:

BA.5.5 should just be defined by S:T76I, but in the Nextclade tree it also has C245T, C492T and C21108T, which are only present in a very small proportion of samples.

BA.1.1.5, BA.2.19, BA.2.24, BA.2.43, BA.2.59, BA.2.69, BC.1 are missing from the tree (BA.2.43 spotted by @FedeGueli).
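For the BA.5.5 point, one way to check a branch is to measure how many samples of the lineage actually carry each mutation on it. The sketch below is only illustrative: the function name `mutation_prevalence`, the 0.5 cutoff and the sample counts are invented, and only the mutation names come from this comment.

```python
def mutation_prevalence(branch_mutations, samples):
    """Fraction of samples carrying each mutation listed on a branch."""
    n = len(samples)
    return {m: sum(m in s for s in samples) / n for m in branch_mutations}

# Mutations on the BA.5.5 branch in the tree, as listed above.
branch = ["S:T76I", "C245T", "C492T", "C21108T"]

# Toy samples (invented counts): all carry S:T76I, few carry the rest.
samples = [{"S:T76I"}] * 18 + [{"S:T76I", "C245T", "C492T", "C21108T"}] * 2

prevalence = mutation_prevalence(branch, samples)

# Branch mutations found in under half of the samples are candidates
# for removal from the lineage definition.
suspect = [m for m in branch if prevalence[m] < 0.5]
```

With these toy counts, S:T76I is present in every sample while the other three mutations appear in 10% of them, so they are the ones flagged in `suspect`.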

corneliusroemer commented 2 years ago

Thanks so much @silcn! I'm just making a new tree; you can see the most up-to-date draft version here: https://nextstrain.org/staging/nextclade/sars-cov-2

There is a bit of a challenge surrounding 29868: it seems to have mutated to A in some of BA.5, but probably not all. Since the terminals are often missing, I'm not quite sure what to do there. If you're interested in investigating, that'd be great! It seems to be present in all of BA.5.5, for example.

All of the issues you've listed should be fixed (plus some topology issues, with BA.2.9.3 for example not branching off BA.2.9 but clustering with BA.2.71).

All these things are super easy to fix as long as someone shares the info with me. This is much appreciated @silcn @FedeGueli @Sinickle & co!

FedeGueli commented 2 years ago

@corneliusroemer I have just noticed that the query nextcladepangolineage: BA.2.76* on covspectrum gives 0 sequences as a result.

corneliusroemer commented 2 years ago

Resolved in the latest dataset release

FedeGueli commented 2 years ago

@corneliusroemer I noticed BA.4.1.8 is missing too. It would be important to add because it has 346T and is growing even faster than BA.4.6.