glstott closed 2 years ago
Option 1:
MATCH (n:sample)-[r:member_of]->(c:clade)
WITH DISTINCT r.source_tree AS src
ORDER BY src DESC
WITH COLLECT(src) AS src_list
// Start at 1 so that src_list[i-1] always exists; range() is inclusive at both ends.
UNWIND range(1, size(src_list) - 1) AS i
MATCH (n:sample)-[r:member_of {source_tree: src_list[i]}]->(c:clade)
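// Use an anonymous relationship here: r is already bound to the src_list[i] membership, so reusing it could never match.
WHERE i >= 1 AND NOT (n)-[:member_of {source_tree: src_list[i-1]}]->(c)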
RETURN r.source_tree AS src_tree, c.clade AS clade, COUNT( DISTINCT n) AS stability_1
Option 2:
// Start by collecting the source trees into an ordered list, as in Option 1.
// I think the best naming convention for source_tree is an integer like YYYYEW, with modifiers for additional filters stored in other fields.
MATCH (n:sample)-[r:member_of]->(c:clade)
WITH DISTINCT r.source_tree AS src
ORDER BY src DESC
WITH COLLECT(src) AS src_list
// Start at 1 so that src_list[i-1] always exists; range() is inclusive at both ends.
UNWIND range(1, size(src_list) - 1) AS i
// Now find new clusters that receive at least 3 sequences (threshold chosen arbitrarily) from an older clade.
MATCH (n:sample)-[r:member_of {source_tree: src_list[i]}]->(c:clade)
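// Use an anonymous relationship here: r is already bound to the src_list[i] membership, so reusing it could never match.
WHERE i >= 1 AND NOT (:sample)-[:member_of {source_tree: src_list[i-1]}]->(c)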
WITH r.source_tree AS src_tree, c.clade AS clade, COUNT(DISTINCT n) AS num_split
WHERE num_split >=3
RETURN src_tree, COUNT(DISTINCT clade) AS stability_2
Next steps:
"1) the proportion of sequences that moved from a cluster in the preceding week to non-clustered in the current week, 2) the number of clusters defined in the previous week that split in the current week (i.e., any instance where sequences that were in a single cluster in the previous week have moved to different clusters in the current week), and 3) the overall entropy score of the clusters found in the current week (with the lowest score of 0 occurring when all sequences are in a single cluster). " - Sobkowiak, et al. medRxiv preprint doi: https://doi.org/10.1101/2022.03.10.22272213
Assuming clusters are labeled, this is no problem. Metric 1 can be done with a quick match filtering on tree and clade. Metric 2 is similar: we just need to find clusters in one week that split into two or more in the subsequent week. Metric 3 is unclear, but I suspect they're using a standard entropy measure (a la https://stats.stackexchange.com/questions/338719/calculating-clusters-entropy-python).
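To make the three quoted metrics concrete, here is a rough Python sketch of how they could be computed once the per-week assignments have been pulled out of the graph. Everything in it is an assumption for illustration: the function names, the `{sample_id: cluster_label}` dict layout (with `None` meaning non-clustered), the choice to use previously clustered samples as the denominator for metric 1, and Shannon entropy in bits over cluster sizes for metric 3. The preprint may define these differently.

```python
from collections import Counter
from math import log2


def proportion_unclustered(prev, curr):
    """Metric 1 (assumed definition): share of samples clustered in the
    previous week that are present but non-clustered in the current week."""
    clustered_prev = [s for s, c in prev.items() if c is not None]
    if not clustered_prev:
        return 0.0
    moved = sum(1 for s in clustered_prev if s in curr and curr[s] is None)
    return moved / len(clustered_prev)


def count_split_clusters(prev, curr):
    """Metric 2: number of previous-week clusters whose members sit in
    more than one cluster in the current week."""
    destinations = {}
    for sample, old_cluster in prev.items():
        new_cluster = curr.get(sample)
        # Skip samples that were, or have become, non-clustered.
        if old_cluster is not None and new_cluster is not None:
            destinations.setdefault(old_cluster, set()).add(new_cluster)
    return sum(1 for d in destinations.values() if len(d) > 1)


def cluster_entropy(curr):
    """Metric 3 (assumed definition): Shannon entropy, in bits, of the
    cluster-size distribution. It is 0 when every clustered sample is in
    a single cluster."""
    sizes = Counter(c for c in curr.values() if c is not None)
    total = sum(sizes.values())
    if total == 0:
        return 0.0
    return sum((n / total) * log2(total / n) for n in sizes.values())


# Hypothetical toy data: cluster A loses s2 to a new cluster C and s3 to
# non-clustered; cluster B is unchanged.
prev_week = {"s1": "A", "s2": "A", "s3": "A", "s4": "B", "s5": "B", "s6": None}
curr_week = {"s1": "A", "s2": "C", "s3": None, "s4": "B", "s5": "B", "s6": None}

print(proportion_unclustered(prev_week, curr_week))
print(count_split_clusters(prev_week, curr_week))
print(cluster_entropy(curr_week))
```

The dicts could be filled from a query that returns `n`, `c.clade`, and `r.source_tree` for two adjacent trees. Whether non-clustered samples are stored as a missing `member_of` edge or as a special clade is a schema decision the sketch leaves open.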