hodcroftlab / covariants

Real-time updates and information about key SARS-CoV-2 variants, plus the scripts that generate this information.
https://covariants.org/
GNU Affero General Public License v3.0
316 stars 113 forks source link

Use either pango or snps for cluster assignment, not both #357

Closed MoiraZuber closed 1 year ago

MoiraZuber commented 1 year ago

For a given cluster, it is currently possible that both pango lineages and snps could be used to assign sequences to it. With this change, only one of these methods will be used for each cluster:

Thoughts:

vercel[bot] commented 1 year ago

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated
covariants ✅ Ready (Inspect) Visit Preview Oct 26, 2022 at 0:14AM (UTC)
emmahodcroft commented 1 year ago

Thank you for this Moira & apologies that I'm just getting to it now!

I think this sounds very good to me and the logic makes sense - I think it's also nice that it makes it a little clearer about what's happening, hopefully to avoid confusion!

Currently, no warning is printed if none of these methods are possible. Should we print a warning if this is the case or will this never happen anyway?

Would a warnings like "no sequences were assigned to this cluster" (as I've seen sometimes, for example if you have nonsensical SNPs) be printed, or none at all? It might be nice to have a similar warning so that the user can see this rather than it failing 'silently'. I think it would print the above (because it should still be a clus in clus_to_run that's checked at the end, right?), but I might be misinterpreting?

The script currently only checks if use_pango=True, but not if the pango_lineages given in clusters[clus]["pango_lineages"] are real. This would include a separate pass over the metadata to collect all possible pango_lineages in order to compare them. (This is currently already done to find all "legal" Nextstrain_clades). Would this be desired?

I think we'd be ok without this - this should trigger the "no sequences assigned to this cluster" warning, and I think that's enough for the users to re-evaluate what they've put in as definitions (whatever they may be!), including check themselves if they're real Pango lineages!

emmahodcroft commented 1 year ago

I'm happy for this to merge in whenever, as long as you also think it's good to go?

MoiraZuber commented 1 year ago

Would a warnings like "no sequences were assigned to this cluster" (as I've seen sometimes, for example if you have nonsensical SNPs) be printed, or none at all?

Yes it would. If that's enough warning to realize something isn't right, then we can leave it like that.

I think in that case we're ready to merge!

emmahodcroft commented 1 year ago

Fantastic, I'll merge this in now and it should go in today's update! Thanks again @MoiraZuber !