Patient grouping is based on a hierarchical clustering (hcluster) algorithm. It is a bottom-up method: at every step, the algorithm merges the two clusters with the smallest distance.
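A minimal sketch of this bottom-up merging, assuming each patient has already been turned into a fixed-length feature vector (one of the encoding options discussed below); the linkage method, distance metric, and number of groups here are placeholders, not necessarily the settings used in the tool:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy patient feature vectors (e.g., n-gram counts or pattern indicators)
patients = np.array([
    [1, 0, 2, 0],
    [1, 0, 2, 1],
    [0, 3, 0, 1],
    [0, 2, 0, 1],
])

# bottom-up (agglomerative) clustering: at every step the two clusters
# with the smallest pairwise distance are merged
Z = linkage(patients, method="average", metric="euclidean")

# cut the resulting dendrogram into a fixed number of patient groups
groups = fcluster(Z, t=2, criterion="maxclust")
print(groups)  # e.g., [1 1 2 2]
```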
The original version is modified as below:
I found it is not always easy to observe transitions in a Sankey diagram. For example, in the figure below, it is possible that there are only the sequences [A, C, E] and [B, C, D], and no [B, C, E] exists.
Does it matter?
If it doesn't matter, does it mean that we only need to care about the state transitions between two timepoints?
Several options for patient encoding:
- most straightforward: each patient is encoded based on the raw state sequence, e.g., [A, B, B, C, D]. Of course not a good solution: it is sensitive to sequence length and cannot align sequences well, e.g., [A, C, B, D, E, F] and [B, D, E, _, _, _] are not treated as similar sequences. 🔨 already implemented
- n-gram: each patient is encoded based on a collection of n-grams, i.e., subsequences of n consecutive items, e.g., [A, B]×2, [B, B, D]×1, [A, C]×0. Similar to how people process text documents (see the n-gram sketch after this list).
- frequent patterns: each patient is encoded based on whether it contains the mined frequent patterns (mined using PrefixSpan). Different from n-grams, items in a pattern are not necessarily next to each other; for example, [A, B, C, D, E] has the pattern A_C_D (see the subsequence-matching sketch after this list). 🔨 already implemented 💻 Seems not good on long sequences. 🗒️ Hyper-parameters (minLen, maxLen, minSupport) need to be carefully set to achieve optimal performance (still exploring the best setting). Constraints on the pattern length improve the performance on long sequences.
- MDL: a cluster of patients is encoded by a single sequence based on the minimum description length principle. Ref: Sequence Synopsis: Optimize Visual Summary of Temporal Event Data. 💬 Based on the feedback from Tali, the global patterns are less important than the local patterns.
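A minimal sketch of the n-gram encoding (a bag-of-n-grams over each patient's state sequence); the function names and toy sequences are illustrative only, not the tool's actual implementation:

```python
from collections import Counter

def ngram_counts(sequence, n=2):
    """Count the contiguous n-grams in one patient's state sequence."""
    return Counter(tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1))

def ngram_encode(sequences, n=2):
    """Encode every patient as a fixed-length vector of n-gram counts."""
    vocab = sorted({g for seq in sequences for g in ngram_counts(seq, n)})
    return [[ngram_counts(seq, n)[g] for g in vocab] for seq in sequences], vocab

vectors, vocab = ngram_encode([["A", "B", "B", "C", "D"], ["B", "C", "D"]], n=2)
print(vocab)    # [('A', 'B'), ('B', 'B'), ('B', 'C'), ('C', 'D')]
print(vectors)  # [[1, 1, 1, 1], [0, 0, 1, 1]]
```

These fixed-length count vectors can then be fed to the hierarchical clustering above.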
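For the frequent-pattern encoding, the key difference from n-grams is gapped matching: a pattern only needs to appear as an ordered subsequence, not as consecutive items. A minimal sketch of that membership test, assuming the patterns have already been mined (e.g., with PrefixSpan under the minLen/maxLen/minSupport constraints mentioned above); the patterns listed here are made up for illustration:

```python
def contains_pattern(sequence, pattern):
    """True if `pattern` occurs in `sequence` as an ordered, possibly gapped subsequence."""
    it = iter(sequence)
    return all(item in it for item in pattern)  # each `in` consumes `it` up to the match

def pattern_encode(sequence, patterns):
    """Binary indicator vector: which mined patterns does this patient contain?"""
    return [int(contains_pattern(sequence, p)) for p in patterns]

# hypothetical mined patterns
patterns = [["A", "C", "D"], ["B", "A"]]
print(pattern_encode(["A", "B", "C", "D", "E"], patterns))  # [1, 0]: has A_C_D, but no B..A
```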
❓ more experiments & discussions are needed