Cluster Stability Measures

glstott commented 2 years ago

"1) the proportion of sequences that moved from a cluster in the preceding week to non-clustered in the current week, 2) the number of clusters defined in the previous week that split in the current week (i.e., any instance where sequences that were in a single cluster in the previous week have moved to different clusters in the current week), and 3) the overall entropy score of the clusters found in the current week (with the lowest score of 0 occurring when all sequences are in a single cluster). " - Sobkowiak, et al. medRxiv preprint doi: https://doi.org/10.1101/2022.03.10.22272213

Assuming clusters are labeled, this should be no problem. Option 1 can be done with a quick MATCH filtering on tree and clade. Option 2 is similar: we just need to find clusters in one week that split into two or more in a subsequent week. Option 3 is unclear, but I suspect they're using a standard entropy measure à la https://stats.stackexchange.com/questions/338719/calculating-clusters-entropy-python .
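
For option 3, here is a minimal Python sketch of what I suspect is meant: Shannon entropy over the cluster-size distribution. Whether the preprint uses exactly this formula, and the choice of log base 2, are assumptions on my part. The input shape (a dict from sequence ID to cluster label) and the function name `cluster_entropy` are made up for illustration. It does match the quoted property that the score is 0 when every sequence is in a single cluster.

```python
import math
from collections import Counter


def cluster_entropy(assignments):
    """Shannon entropy (in bits) of the cluster-size distribution.

    `assignments` maps sequence ID -> cluster label (hypothetical input shape).
    Returns 0.0 when there are no sequences or only one cluster.
    """
    sizes = Counter(assignments.values())
    total = sum(sizes.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for size in sizes.values():
        p = size / total  # share of sequences in this cluster
        entropy -= p * math.log2(p)
    return entropy


# Toy data: one cluster vs. two equally sized clusters.
one_cluster = {"s1": "A", "s2": "A", "s3": "A"}
two_clusters = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
print(cluster_entropy(one_cluster))
print(cluster_entropy(two_clusters))
```

The score grows as sequences are spread more evenly over more clusters, so a week-over-week jump would suggest the clustering is fragmenting.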

glstott commented 2 years ago

Option 1:

Another node for clade/lineage may be a useful addition. I'll write with that assumption in mind. The query below is a bit slow, but it's a useful starting point. Note that it returns per-clade counts rather than the proportion itself.

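// Collect the distinct source trees, newest first.
MATCH (:sample)-[r:member_of]->(:clade)
WITH DISTINCT r.source_tree AS src
ORDER BY src DESC
WITH COLLECT(src) AS src_list
// With this ordering, src_list[i-1] is the tree one step newer than src_list[i].
UNWIND range(1, size(src_list) - 1) AS i
MATCH (n:sample)-[r:member_of {source_tree: src_list[i]}]->(c:clade)
// Keep samples that are no longer in the same clade in the newer tree
// (this includes samples reassigned to another clade, not only unclustered ones).
WHERE NOT (n)-[:member_of {source_tree: src_list[i-1]}]->(c)
RETURN r.source_tree AS src_tree, c.clade AS clade, COUNT(DISTINCT n) AS stability_1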
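
To sanity-check the graph results outside Neo4j, here is a minimal Python sketch of measure 1 as a proportion. It assumes each week's assignments are available as a dict from sequence ID to cluster label, with `None` or a missing key meaning unclustered. The function name `proportion_unclustered` and that input shape are hypothetical. I'm also assuming the denominator is the set of sequences that were clustered in the preceding week, which the quoted definition doesn't spell out.

```python
def proportion_unclustered(previous, current):
    """Fraction of sequences clustered in `previous` that are unclustered in `current`.

    Both arguments map sequence ID -> cluster label; None or a missing key
    means the sequence is not in any cluster (hypothetical input shape).
    """
    clustered_before = [seq for seq, label in previous.items() if label is not None]
    if not clustered_before:
        return 0.0
    moved_out = [seq for seq in clustered_before if current.get(seq) is None]
    return len(moved_out) / len(clustered_before)


# Toy data: three sequences were clustered last week; s2 dropped out this week.
previous_week = {"s1": "A", "s2": "A", "s3": "B", "s4": None}
current_week = {"s1": "A", "s2": None, "s3": "B"}
print(proportion_unclustered(previous_week, current_week))
```

Unlike a per-clade count, this only counts sequences that end up with no cluster at all, so a sequence that merely switches clusters does not contribute.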

glstott commented 2 years ago

Option 2:

Depending on how we define clades, we may be able to load these transitions into the graph as they arise rather than inferring them.

One possible way to do this would be to identify new clusters that obtain at least X sequences from a cluster in a previous epiweek, as shown below:

// This starts off by collecting the source trees, ordered oldest first so that src_list[i-1] is the week preceding src_list[i].
//    I think the best naming convention is something like YYYYEW as an integer, with other modifiers for additional filters in other fields.
MATCH (:sample)-[r:member_of]->(:clade)
WITH DISTINCT r.source_tree AS src
ORDER BY src ASC
WITH COLLECT(src) AS src_list
UNWIND range(1, size(src_list) - 1) AS i
// Now find new clades (no members in the preceding tree) that receive at least 3 (chosen arbitrarily) sequences that were in some clade in the preceding tree.
MATCH (n:sample)-[r:member_of {source_tree: src_list[i]}]->(c:clade)
WHERE NOT (:sample)-[:member_of {source_tree: src_list[i-1]}]->(c)
  AND (n)-[:member_of {source_tree: src_list[i-1]}]->(:clade)
WITH r.source_tree AS src_tree, c.clade AS clade, COUNT(DISTINCT n) AS num_split
WHERE num_split >= 3
// r is out of scope after the WITH, so return the aliased src_tree.
RETURN src_tree, COUNT(DISTINCT clade) AS stability_2
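
For comparison, here is a minimal Python sketch of measure 2 following the quoted definition directly (any previous-week cluster whose sequences end up in more than one cluster this week), without the threshold of 3. The dict-based input shape and the name `count_split_clusters` are hypothetical, and ignoring sequences that are unclustered in either week is an assumption on my part.

```python
from collections import defaultdict


def count_split_clusters(previous, current):
    """Count previous-week clusters whose members sit in more than one cluster this week.

    Both arguments map sequence ID -> cluster label; None or a missing key
    means unclustered (hypothetical input shape).
    """
    destinations = defaultdict(set)
    for seq, old_cluster in previous.items():
        new_cluster = current.get(seq)
        if old_cluster is None or new_cluster is None:
            continue  # assumption: unclustered sequences do not count toward a split
        destinations[old_cluster].add(new_cluster)
    return sum(1 for targets in destinations.values() if len(targets) > 1)


# Toy data: cluster A splits into A and A2, cluster B stays intact.
previous_week = {"s1": "A", "s2": "A", "s3": "A", "s4": "B", "s5": "B"}
current_week = {"s1": "A", "s2": "A2", "s3": "A2", "s4": "B", "s5": "B"}
print(count_split_clusters(previous_week, current_week))
```

This relies on cluster labels being comparable within a single week only; it does not need labels to persist across weeks, which may make it a useful cross-check if clade naming changes between trees.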

glstott commented 2 years ago

Next steps:

[ ] 1. Borrow Leke's and Gabriella's scripts to generate trees and label clades (or the finished products).
[ ] 2. Integrate Neo4j scripts with Nextflow pipeline.
[ ] 3. Run these trees through pipeline to load in new data.
[ ] 4. Extract cluster stability metrics.

glstott / PMeND

Cluster Stability Measures #12