Speed up cluster analysis

MoiraZuber commented 2 years ago

This PR adds cluster_analysis.py, a refactor of the script allClusterDynamics_faster.py. The new script traverses the metadata line by line instead of reading it all into memory at once, which significantly reduces memory usage and runtime.
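For readers unfamiliar with the pattern, here is a minimal sketch of the line-by-line idea. It is not the code in cluster_analysis.py: the function name, the tab delimiter, the `strain` column and the sample rows are assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def count_clades_streaming(handle, clade_column="Nextstrain_clade"):
    """Count rows per clade while holding only one row in memory at a time."""
    counts = Counter()
    reader = csv.DictReader(handle, delimiter="\t")  # assumes a tab-separated file
    for row in reader:  # rows are parsed lazily, one per iteration
        counts[row[clade_column]] += 1
    return counts


# Tiny in-memory stand-in for the real metadata file (hypothetical rows).
sample = io.StringIO(
    "strain\tNextstrain_clade\n"
    "seqA\t21K\n"
    "seqB\t21L\n"
    "seqC\t21K\n"
)
clade_counts = count_clades_streaming(sample)
```

With a real file, the handle would come from `open(path, newline="")`. Memory use then depends on the size of the collected counts rather than on the size of the metadata file.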

TODOs:

[x] After latest changes, does the output now match the output of the old script? If not, why?
[ ] Address TODOs in script.
[ ] Test Mother-Daughter-Clade functionality

vercel[bot] commented 2 years ago

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Updated
covariants	✅ Ready (Inspect)	Visit Preview	Sep 15, 2022 at 9:10AM (UTC)

MoiraZuber commented 1 year ago

After a last round of testing, this PR should be ready to merge. Here's a quick overview of the differences between this script and the original allClusterDynamics_faster.py script:

The most important change is the switch from reading the entire metadata in at once to traversing the metadata file line by line, collecting all desired data step by step.
To match the output of the original script and avoid assigning one sequence to several clusters, the following logic is applied:
1. If a sequence has a Nextstrain_clade that we use in covariants, only this cluster is assigned.
2. If a sequence does not have a known Nextstrain_clade, we try to assign a cluster using SNPs (Remark: We do NOT assign any clade that could have been assigned in step 1 this way. Those clusters will only ever be assigned by Nextclade).
3. In any case, we assign mutations and clusters that are not plotted (named "rest" in the script) since duplicates do not matter in these cases.
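The three steps above can be sketched roughly as follows. This is an illustration under assumptions, not the script's implementation: the data layout (a `snps` set per sequence, dictionaries mapping cluster names to defining SNP sets) and all cluster names are hypothetical.

```python
def assign_clusters(seq, clade_clusters, snp_clusters, rest_clusters):
    """Return (plotted, rest) cluster names for one sequence.

    seq: dict with "Nextstrain_clade" and a set of "snps" (hypothetical layout).
    clade_clusters: maps a Nextstrain clade name to a cluster name.
    snp_clusters / rest_clusters: map a cluster name to its defining SNP set.
    """
    plotted = []
    clade = seq.get("Nextstrain_clade")
    if clade in clade_clusters:
        # Step 1: a known clade wins and no other plotted cluster is assigned.
        plotted.append(clade_clusters[clade])
    else:
        # Step 2: SNP fallback, skipping clusters that step 1 could assign.
        clade_assignable = set(clade_clusters.values())
        for name, snps in snp_clusters.items():
            if name not in clade_assignable and snps <= seq["snps"]:
                plotted.append(name)
    # Step 3: mutations and non-plotted clusters are always assigned by SNPs.
    rest = [name for name, snps in rest_clusters.items() if snps <= seq["snps"]]
    return plotted, rest


# Hypothetical cluster definitions and sequences.
clade_clusters = {"cladeA": "ClusterA"}
snp_clusters = {"ClusterA": {100, 200}, "ClusterB": {300}}
rest_clusters = {"MutationX": {400}}

with_clade = {"Nextstrain_clade": "cladeA", "snps": {100, 200, 300, 400}}
without_clade = {"Nextstrain_clade": "unknown", "snps": {100, 200, 300}}

result_with_clade = assign_clusters(with_clade, clade_clusters, snp_clusters, rest_clusters)
result_without_clade = assign_clusters(without_clade, clade_clusters, snp_clusters, rest_clusters)
```

In the second call the sequence carries the SNPs of ClusterA, but ClusterA is never assigned by SNPs because it is assignable by clade, so only ClusterB is returned.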

Special cases:

If a sequence does not have a known Nextstrain_clade but has a known Nextclade_pango ("known" means that the field use_pango is set to True in clusters.py), we assign this cluster and continue as in step 1 (step 2 will not be performed).
If a sequence has an official Nextstrain_clade but this clade is also found in the parent field of another cluster in clusters.py, then we perform step 2 on all possible daughter clades of this cluster. If exactly one daughter clade is found, the sequence is assigned to the daughter clade and removed from the mother clade (with the exception of the files used for the nextstrain runs).

Once all clusters have been assigned, the script performs a consistency check. There are two variations:

Default check: Only if step 2 was performed, check if the sequence was assigned to multiple clusters (ignoring mutations and clusters that are not plotted). If yes, remove the sequence from all involved clusters and print out a warning at the end of the file.
Debug check: If enabled (the script prompts for this at startup; the default is False), step 2 is performed even if step 1 was performed. This flags sequences that are usually masked by the official Nextstrain_clade but would also be detected by SNPs if allowed to. The output files do not change.
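A rough sketch of the default check, assuming assignments are kept as a dictionary from cluster name to a set of strain names. The function name, the data layout and the warning text are hypothetical and not taken from the script.

```python
def drop_ambiguous(assignments, snp_assigned):
    """Remove SNP-assigned sequences that landed in more than one cluster.

    assignments: maps cluster name -> set of strain names (mutated in place).
    snp_assigned: strains whose clusters came from SNP matching (step 2).
    Returns the warnings to print at the end of the run.
    """
    hits = {}
    for cluster, strains in assignments.items():
        for strain in strains & snp_assigned:
            hits.setdefault(strain, []).append(cluster)

    warnings = []
    for strain, clusters in sorted(hits.items()):
        if len(clusters) > 1:
            # Ambiguous: remove the sequence from every cluster involved.
            for cluster in clusters:
                assignments[cluster].discard(strain)
            warnings.append(f"{strain} matched several clusters: {sorted(clusters)}")
    return warnings


# Hypothetical data: s2 was matched to both clusters via SNPs.
assignments = {"ClusterA": {"s1", "s2"}, "ClusterB": {"s2", "s3"}}
warnings = drop_ambiguous(assignments, snp_assigned={"s2", "s3"})
```

Sequences assigned via Nextstrain_clade (here s1) are never touched, because only strains in `snp_assigned` are inspected.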

Other notes:

"Meta" clusters (e.g. 21K.21L and Omicron) will be compiled from their respective clades after the first metadata pass. This is done only if the cluster has a field meta_cluster = True in clusters.py. It is important that all clusters in the other_nextstrain_names field of meta clusters have an entry in clusters.py. Otherwise, sequences of this cluster will not be collected.
Sequences with problematic dates are, as before, flagged towards the end of the script. However, they are now automatically excluded from the counts, which removes the need to re-run the script if bad dates are found.
The dictionary first_date_exceptions found in approx_first_dates.py is now parsed and sequences in the list will be included in the counts even if they have problematic dates.
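The notes on meta clusters and bad dates could look roughly like the two helpers below. Several things here are assumptions: the `nextstrain_name` field used to look up members, the definition of a "problematic" date (unparseable, or earlier than a cluster's approximate first date), and treating first_date_exceptions as a simple collection of strain names. Only `meta_cluster`, `other_nextstrain_names` and `first_date_exceptions` are names from the description above.

```python
from datetime import date


def compile_meta_clusters(clusters, strains_by_cluster):
    """Build each meta cluster as the union of its member clusters' strains."""
    by_nextstrain_name = {
        info["nextstrain_name"]: key  # "nextstrain_name" is a hypothetical field
        for key, info in clusters.items()
        if "nextstrain_name" in info
    }
    for key, info in clusters.items():
        if not info.get("meta_cluster"):
            continue
        members = set()
        for name in info.get("other_nextstrain_names", []):
            member_key = by_nextstrain_name.get(name)
            if member_key is None:
                continue  # no entry for this name: its sequences are not collected
            members |= strains_by_cluster.get(member_key, set())
        strains_by_cluster[key] = members
    return strains_by_cluster


def keep_for_counts(strain, date_str, first_date, exceptions):
    """Return True if the sequence should be included in the counts."""
    if strain in exceptions:
        return True  # listed as an exception: counted despite its date
    try:
        sampled = date.fromisoformat(date_str)
    except ValueError:
        return False  # incomplete or malformed date
    return sampled >= first_date


# Hypothetical cluster definitions: "cladeMissing" has no entry of its own.
clusters = {
    "MetaAB": {"meta_cluster": True,
               "other_nextstrain_names": ["cladeA", "cladeB", "cladeMissing"]},
    "A": {"nextstrain_name": "cladeA"},
    "B": {"nextstrain_name": "cladeB"},
}
strains = compile_meta_clusters(clusters, {"A": {"s1"}, "B": {"s2", "s3"}})
```

The `cladeMissing` entry shows why every name in other_nextstrain_names needs its own cluster entry: without one, there is nothing to collect from.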

emmahodcroft commented 1 year ago

Thanks Moira! I think this write-up is suuuuper useful as I can definitely see us finding this explicit clarification indefinitely helpful as more and more time passes since we worked on this! Just a couple points:

In any case, we assign mutations and clusters that are not plotted (named "rest" in the script) since duplicates do not matter in these cases.

Should this read "we assign mutations and clusters that are not plotted using SNPS" ?

For steps 4 & 5 - doesn't 4 also remove the sequence from the parent clade? So in 4, is it right to say it 'proceeds' to step 1, or does the assignment by Pango 'replace' step 1? (And it doesn't end up getting assigned to the Nextstrain clade group?)

Apart from this, it looks great!! 🙌

MoiraZuber commented 1 year ago

Should this read "we assign mutations and clusters that are not plotted using SNPS" ?

Yes!

So in 4, is it right to say it 'proceeds' to step 1, or does the assignment by Pango 'replace' step 1?

The assignment by Pango "replaces" step 1 so to speak. The logic is the following:

Does this sequence have a known Nextstrain_clade? YES:
- Assign sequence to this cluster
- Does this Nextstrain_clade have known daughter clades? YES:
  - Does the Pango_lineage of the current sequence correspond to a possible daughter clade of the Nextstrain_clade?
    - YES: Assign the sequence to that daughter clade and remove it from the parent Nextstrain_clade (except for nextstrain runs)
    - NO: Use SNPs to check all possible daughter clades of Nextstrain_clade. If exactly one is found, assign the sequence to that daughter clade and remove it from the parent Nextstrain_clade (except for nextstrain runs). (If more than one daughter clade is found, print a warning at the end of the script and only assign the parent clade in order to avoid possible inconsistencies)
NO:
- Does this sequence have a known Pango_lineage?
  - YES: Assign sequence to this cluster
  - NO: Proceed with SNPs assignment

i.e. Pango is only used for cluster assignment if no proper Nextstrain_clade was found. Also, the daughter-parent-clade functionality is only present if the parent clade is found via Nextstrain_clade, not via Pango.
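For reference, the decision tree above can be written as a small function. This is only a sketch of the described logic: the cluster field names other than `parent` and `use_pango`, the `snp_match` callback and all cluster names are hypothetical, and clusters that are not plotted ("rest") are left out.

```python
def assign(seq, clusters, snp_match):
    """Assign one sequence; returns (cluster names, warning or None).

    clusters: maps cluster name -> dict with optional "nextstrain_clade",
        "pango_lineage", "use_pango" and "parent" fields (hypothetical layout).
    snp_match: callable(seq, cluster_name) -> bool, standing in for the SNP test.
    """
    clade, pango = seq.get("Nextstrain_clade"), seq.get("Pango_lineage")
    by_clade = {c["nextstrain_clade"]: n for n, c in clusters.items()
                if "nextstrain_clade" in c}
    by_pango = {c["pango_lineage"]: n for n, c in clusters.items()
                if c.get("use_pango") and "pango_lineage" in c}

    if clade in by_clade:
        parent = by_clade[clade]
        daughters = [n for n, c in clusters.items() if c.get("parent") == clade]
        # Pango first: does the lineage point at one of the daughter clades?
        pango_hits = [d for d in daughters
                      if pango is not None and clusters[d].get("pango_lineage") == pango]
        if pango_hits:
            return [pango_hits[0]], None
        # Otherwise test all daughter clades by SNPs.
        snp_hits = [d for d in daughters if snp_match(seq, d)]
        if len(snp_hits) == 1:
            return snp_hits, None
        if len(snp_hits) > 1:
            return [parent], f"{seq['strain']}: several daughter clades match {snp_hits}"
        return [parent], None

    if pango in by_pango:
        return [by_pango[pango]], None  # Pango only applies without a known clade

    # SNP assignment, never for clusters that are assignable by clade.
    clade_assignable = set(by_clade.values())
    return [n for n in clusters if n not in clade_assignable and snp_match(seq, n)], None


# Hypothetical clusters and a stub SNP test that reads precomputed matches.
clusters = {
    "Parent": {"nextstrain_clade": "cladeP"},
    "DaughterX": {"parent": "cladeP", "pango_lineage": "X.1"},
    "DaughterY": {"parent": "cladeP"},
    "PangoOnly": {"pango_lineage": "Z.9", "use_pango": True},
    "SnpOnly": {},
}


def stub_match(seq, name):
    return name in seq.get("snp_clusters", ())
```

Returning only the daughter clade stands in for "assign to the daughter and remove from the parent". The exception for the nextstrain run files is not modelled here.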

Does this answer the question?

emmahodcroft commented 1 year ago

Yes, I think this clarifies things very nicely, thank you Moira!

hodcroftlab / covariants

Speed up cluster analysis #340