jmcbroome / auto-pango-designation

Repository for automated flagging of new lineages for pango designation.

ENH: Offer alternative covspectrum query that uses mutations only #178

Open corneliusroemer opened 2 years ago

corneliusroemer commented 2 years ago

It's really cool to see the nice covSpectrum integration.

I notice you build the query as lineage + extra mutations. That's definitely good, but there's a risk:

Nextclade does not call lineages 100% the same way as Usher, so the two can disagree. It would be good to offer a lineage-agnostic query as well, one that uses just the mutations leading up to your proposal.

This shouldn't be too hard to extract automatically? Basically, rather than saying nextcladePangoLineage:BF.7* & 1234A, you'd say 22984A & 22847C & 15932T & 1234A (I'm just making the mutations up for illustration).
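For illustration, the two query shapes could be generated roughly as follows. This is a minimal sketch, not the repository's code: the function names and the input shape (a parent lineage name, the mutations on the path to the proposal, and the proposal's extra mutations) are assumptions, and the mutations are the made-up ones from the example above.

```python
def lineage_query(parent_lineage, new_mutations):
    """Lineage-based query: the parent lineage (with descendants) plus the extra mutations."""
    terms = [f"nextcladePangoLineage:{parent_lineage}*"] + list(new_mutations)
    return " & ".join(terms)


def mutation_only_query(path_mutations, new_mutations):
    """Lineage-agnostic query: only the mutations leading up to the proposal."""
    terms = []
    seen = set()
    # Keep the original order, but drop any mutation listed twice.
    for mutation in list(path_mutations) + list(new_mutations):
        if mutation not in seen:
            seen.add(mutation)
            terms.append(mutation)
    return " & ".join(terms)


# Made-up mutations, as in the example above.
with_lineage = lineage_query("BF.7", ["1234A"])
without_lineage = mutation_only_query(["22984A", "22847C", "15932T"], ["1234A"])
print(with_lineage)
print(without_lineage)
```

Both strings describe the same proposal, so a large difference in the sequence counts they return would point to a lineage-calling disagreement rather than to the proposal itself.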

jmcbroome commented 2 years ago

This is straightforward to do, yes. Do you think the latter query is generally going to be superior?

AngieHinrichs commented 2 years ago

IMO it would probably require some manual corrections in some cases -- sometimes those queries with just a few mutations, where two or three mutations are a proxy for a really deep clade, can backfire and you get a few sequences from far-flung parts of BA.2 even though most are in your lineage of interest in BA.5.X. But it works a surprising amount of the time. :)

AngieHinrichs commented 2 years ago

And when those queries with just a few mutations appear in pango-designation GitHub issues, I think they come from a human being who knows which mutations are common to a lot of lineages, which are rarely seen and so more useful for identifying a particular lineage, and which are especially prone to dropout problems. That person picks a set with the right mix and tries it iteratively on cov-spectrum until they get pleasing results. Possible to automate, but not trivial.

corneliusroemer commented 2 years ago

Yes, I wasn't suggesting using these for designation purposes, but they're good for checking whether there's a problem with either Nextclade or Usher. Both can get things wrong.

I'd maybe offer both queries. Possibly even three: one with Nextclade for pango, one with pangoLEARN via GISAID, and one without any lineage at all.
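A rough sketch of emitting all three variants side by side, under stated assumptions: `build_query_variants` is a hypothetical helper, the mutations are made up, and the field prefix used for the GISAID/pangoLEARN lineage is a placeholder, since the thread does not name the actual covSpectrum field. Only `nextcladePangoLineage:` comes from the discussion above.

```python
def build_query_variants(parent_lineage, path_mutations, new_mutations, lineage_prefixes):
    """Return a dict mapping a label to a query string.

    lineage_prefixes maps a label to the field prefix placed before the lineage name.
    A lineage-free "mutations_only" variant is always added.
    """
    queries = {}
    for label, prefix in lineage_prefixes.items():
        terms = [f"{prefix}{parent_lineage}*"] + list(new_mutations)
        queries[label] = " & ".join(terms)
    queries["mutations_only"] = " & ".join(list(path_mutations) + list(new_mutations))
    return queries


variants = build_query_variants(
    "BF.7",
    ["22984A", "22847C", "15932T"],  # made-up path mutations
    ["1234A"],                       # made-up extra mutation
    {
        "nextclade": "nextcladePangoLineage:",
        # Placeholder prefix: the real field name for the GISAID lineage is an assumption.
        "gisaid": "gisaidLineagePlaceholder:",
    },
)
for label, query in variants.items():
    print(f"{label}: {query}")
```

Keeping the prefixes in a dict means a further lineage source can be added without touching the query-building logic.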