Open corneliusroemer opened 2 years ago
This is straightforward to do, yes. Do you think the latter query is generally going to be superior?
IMO it would probably require some manual corrections in some cases -- sometimes those queries with just a few mutations, where two or three mutations are a proxy for a really deep clade, can backfire and you get a few sequences from far-flung parts of BA.2 even though most are in your lineage of interest in BA.5.X. But it works a surprising amount of the time. :)
And when those queries with just a few mutations appear in pango-designation GitHub issues, I think they come from a human being who knows which mutations are common to a lot of lineages, which are rarely seen and therefore more useful for identifying a particular lineage, and which are especially prone to dropout problems. That person picks a set with the right mix and iterates on cov-spectrum until the results look right. Possible to automate, but not trivial.
Yes, I wasn't suggesting using these for designation purposes, but they're good for checking whether there's a problem with either Nextclade or Usher - both can get things wrong.
I'd maybe offer both queries. Possibly even three, one with Nextclade for pango, one with pangoLEARN via GISAID and one without any lineage at all.
It's really cool to see the nice covSpectrum integration
I notice you make the query as "lineage + extra mutations". That's definitely good, but there's a risk: Nextclade does not call lineages 100% the same way as Usher, so the two can disagree. It would be good to also make a lineage-agnostic query that uses just the mutations leading up to your proposal.
This shouldn't be too hard to extract automatically? Basically, rather than saying
nextcladePangoLineage:BF.7* & 1234A
you'd say:
22984A & 22847C & 15932T & 1234A
(just making the mutations up btw for illustration)
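To make the two query shapes concrete, here is a minimal Python sketch of how both strings could be assembled. The function names and the `path`/`extra` lists are hypothetical and not taken from any existing tool; only the `nextcladePangoLineage:` prefix and the ` & ` separator come from the example above, and the mutations are the same made-up ones.

```python
def lineage_query(lineage, extra_mutations, field="nextcladePangoLineage"):
    """Lineage-based form: '<field>:<lineage>* & <mutation> & ...'."""
    return " & ".join([f"{field}:{lineage}*", *extra_mutations])


def lineage_agnostic_query(path_mutations, extra_mutations):
    """Lineage-agnostic form: nucleotide mutations only, no lineage term.

    path_mutations are the mutations leading up to the proposal,
    extra_mutations are the ones specific to it. Duplicates are dropped
    while keeping the original order.
    """
    seen = set()
    terms = []
    for mutation in [*path_mutations, *extra_mutations]:
        if mutation not in seen:
            seen.add(mutation)
            terms.append(mutation)
    return " & ".join(terms)


# Made-up mutations, for illustration only.
path = ["22984A", "22847C", "15932T"]
extra = ["1234A"]

print(lineage_query("BF.7", extra))
# nextcladePangoLineage:BF.7* & 1234A
print(lineage_agnostic_query(path, extra))
# 22984A & 22847C & 15932T & 1234A
```

This only covers string assembly. Deciding which path mutations are informative enough to include is the part that still needs human judgement.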