hodcroftlab / covariants

Real-time updates and information about key SARS-CoV-2 variants, plus the scripts that generate this information.
https://covariants.org/
GNU Affero General Public License v3.0

Speed up cluster analysis #340

Closed MoiraZuber closed 1 year ago

MoiraZuber commented 2 years ago

This PR adds cluster_analysis.py, a refactor of the script allClusterDynamics_faster.py. Instead of reading all of the metadata into memory at once, it traverses the metadata line by line, which significantly reduces memory usage and runtime.
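
For readers unfamiliar with the pattern, here is a minimal sketch of line-by-line traversal. It is not the code in cluster_analysis.py: the function name `count_clades`, the `strain` column, and the sample rows are made up for illustration, and only the `Nextstrain_clade` column name comes from this thread.

```python
import csv
import io
from collections import Counter


def count_clades(handle, clade_column="Nextstrain_clade"):
    """Stream a tab-separated metadata file and count rows per clade.

    Hypothetical helper: only one parsed row is held in memory at a time,
    instead of loading the whole table up front.
    """
    counts = Counter()
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        counts[row[clade_column]] += 1
    return counts


# Tiny in-memory stand-in for a metadata file (placeholder values).
sample = io.StringIO(
    "strain\tNextstrain_clade\n"
    "seq1\tCLADE_A\n"
    "seq2\tCLADE_B\n"
    "seq3\tCLADE_A\n"
)
counts = count_clades(sample)
```

With a real file, the same function would be called on an open file handle, so memory use stays roughly constant regardless of how many rows the metadata has.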

TODOs:

MoiraZuber commented 1 year ago

After a last round of testing, this PR should be ready to merge. Here's a quick overview of the differences between this script and the original allClusterDynamics_faster.py script:

Special cases:

  1. If a sequence does not have a known Nextstrain_clade but does have a known Nextclade_pango ("known" means that the field use_pango is set to True in clusters.py), we assign this cluster and continue as in step 1 (step 2 is not performed).
  2. If a sequence has an official Nextstrain_clade but this clade is also found in the parent field of another cluster in clusters.py, we perform step 2 on all possible daughter clades of this cluster. If exactly one daughter clade is found, the sequence is assigned to the daughter clade and removed from the mother clade (with the exception of the files used for the nextstrain runs).
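
The two special cases can be sketched roughly as follows. This is an illustration under assumptions, not the script's code: it assumes step 1 is assignment by Nextstrain_clade and step 2 is SNP-based matching, and the dictionary layout, the `snps` sets, the function names, and all cluster and clade names are hypothetical. Only `Nextstrain_clade`, `Nextclade_pango`, `use_pango`, and `parent` are names taken from the description above.

```python
def snp_match(row, cluster):
    """Assumed stand-in for step 2: every defining SNP is present in the row."""
    return bool(cluster["snps"]) and cluster["snps"] <= row["snps"]


def assign_cluster(row, clusters):
    """Return one cluster name for a metadata row, or None if nothing matches."""
    by_clade = {c["nextstrain_clade"]: name
                for name, c in clusters.items() if c["nextstrain_clade"]}
    by_pango = {c["pango"]: name
                for name, c in clusters.items() if c["use_pango"]}

    clade = row.get("Nextstrain_clade")
    name = by_clade.get(clade)
    if name is None:
        # Special case 1: no known clade, so fall back to Pango (no step 2).
        return by_pango.get(row.get("Nextclade_pango"))

    # Special case 2: the clade is the parent of other clusters, so try the
    # SNP match on each daughter and reassign only if exactly one matches.
    daughters = [d for d, c in clusters.items() if c["parent"] == clade]
    hits = [d for d in daughters if snp_match(row, clusters[d])]
    return hits[0] if len(hits) == 1 else name


# Placeholder cluster definitions (not real clusters.py entries).
clusters = {
    "Parent": {"nextstrain_clade": "CLADE_A", "pango": None,
               "use_pango": False, "parent": None, "snps": set()},
    "Daughter": {"nextstrain_clade": None, "pango": None,
                 "use_pango": False, "parent": "CLADE_A", "snps": {100, 200}},
    "PangoOnly": {"nextstrain_clade": None, "pango": "LINEAGE_X",
                  "use_pango": True, "parent": None, "snps": set()},
}

to_daughter = assign_cluster({"Nextstrain_clade": "CLADE_A", "snps": {100, 200}}, clusters)
stays_parent = assign_cluster({"Nextstrain_clade": "CLADE_A", "snps": {100}}, clusters)
via_pango = assign_cluster({"Nextstrain_clade": "?", "Nextclade_pango": "LINEAGE_X", "snps": set()}, clusters)
unassigned = assign_cluster({"Nextstrain_clade": "?", "Nextclade_pango": "OTHER", "snps": set()}, clusters)
```

Returning a single name models the "assigned to the daughter and removed from the mother" behaviour; the exception for the nextstrain run files is not modelled here.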

Once all clusters have been assigned, the script performs a consistency check. There are two variations:

  1. Default check: Only if step 2 was performed, check if the sequence was assigned to multiple clusters (ignoring mutations and clusters that are not plotted). If yes, remove the sequence from all involved clusters and print out a warning at the end of the file.
  2. Debug check: If enabled (the script asks for this at startup, default False), do step 2 even if step 1 was performed. That way, sequences that are usually masked by the official Nextstrain_clade, but would also be detected by SNPs if allowed to, will be flagged (though the output files will not change).
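
A rough sketch of the default check, under the assumption that step 2 yields a list of SNP-matched cluster names per sequence. The function name `resolve_hits`, the return shape, and the cluster names are hypothetical; the idea that "rest" entries (mutations and non-plotted clusters) are kept regardless of duplicates follows the description in this thread.

```python
def resolve_hits(snp_hits, rest=frozenset()):
    """Apply the default consistency check to one sequence's SNP hits.

    Returns (kept, warning). Entries in `rest` are always kept, since
    duplicates do not matter for them. If more than one plotted cluster
    matched, the sequence is dropped from all of them and a warning
    string is returned for later reporting.
    """
    plotted = [c for c in snp_hits if c not in rest]
    others = [c for c in snp_hits if c in rest]
    if len(plotted) > 1:
        return others, "ambiguous: matched " + ", ".join(sorted(plotted))
    return plotted + others, None


# Placeholder names: two plotted clusters and one "rest" mutation.
kept_ambiguous, warning = resolve_hits(["ClusterB", "ClusterA", "MutX"], rest={"MutX"})
kept_clean, no_warning = resolve_hits(["ClusterA", "MutX"], rest={"MutX"})
```

Collecting the warning strings and printing them once at the end would match the "warning at the end of the file" behaviour described above. The debug check would run the same SNP matching for sequences that already have an official clade, but only report, without changing the kept assignments.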

Other notes:

emmahodcroft commented 1 year ago

Thanks Moira! I think this write-up is suuuuper useful as I can definitely see us finding this explicit clarification indefinitely helpful as more and more time passes since we worked on this! Just a couple points:

In any case, we assign mutations and clusters that are not plotted (named "rest" in the script) since duplicates do not matter in these cases.

Should this read "we assign mutations and clusters that are not plotted using SNPS" ?

For steps 4 & 5 - doesn't 4 also remove the sequence from the parent clade? So in 4, is it right to say it 'proceeds' to step 1, or does the assignment by Pango 'replace' step 1? (And it doesn't end up getting assigned to the Nextstrain clade group?)

Apart from this, it looks great!! 🙌

MoiraZuber commented 1 year ago

Should this read "we assign mutations and clusters that are not plotted using SNPS" ?

Yes!

So in 4, is it right to say it 'proceeds' to step 1, or does the assignment by Pango 'replace' step 1?

The assignment by Pango "replaces" step 1 so to speak. The logic is the following:

i.e. Pango is only used for cluster assignment if no proper Nextstrain_clade was found. Also, the daughter-parent-clade functionality is only present if the parent clade is found via Nextstrain_clade, not via Pango.

Does this answer the question?

emmahodcroft commented 1 year ago

Yes, I think this clarifies things very nicely, thank you Moira!