Patient grouping is based on a hierarchical clustering (hcluster) algorithm. It is a bottom-up method: at every step, the algorithm merges the two clusters with the smallest distance.
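A minimal sketch of this bottom-up merging, assuming each patient has already been turned into a fixed-length feature vector (one of the encoding options discussed below); the linkage method, distance metric, and number of groups here are placeholders, not necessarily the settings used in the tool:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy patient feature vectors (e.g., n-gram counts or pattern indicators)
patients = np.array([
    [1, 0, 2, 0],
    [1, 0, 2, 1],
    [0, 3, 0, 1],
    [0, 2, 0, 1],
])

# bottom-up (agglomerative) clustering: at every step the two clusters
# with the smallest pairwise distance are merged
Z = linkage(patients, method="average", metric="euclidean")

# cut the resulting dendrogram into a fixed number of patient groups
groups = fcluster(Z, t=2, criterion="maxclust")
print(groups)  # e.g., [1 1 2 2]
```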
The original version is modified as below:
I found it is not always easy to observe transitions in a Sankey diagram. For example, in the figure below, it is possible that there are only the sequences [A, C, E] and [B, C, D], and no [B, C, E] exists.
Does it matter?
If it doesn't matter, does it mean that we only need to care about the state transitions between two timepoints?
Several options for patient encoding:
- most straightforward: each patient is encoded based on the raw state sequence, e.g., [A, B, B, C, D]. Of course not a good solution: it is sensitive to sequence length and cannot align sequences well, e.g., [A, C, B, D, E, F] and [B, D, E, _, _, _] are not treated as similar sequences. 🔨 already implemented
- n-gram: each patient is encoded based on a collection of n-grams, i.e., subsequences of n consecutive items, e.g., [A, B]×2, [B, B, D]×1, [A, C]×0. Similar to how people process text documents (see the n-gram sketch after this list).
- frequent patterns: each patient is encoded based on whether it contains the mined frequent patterns (mined using PrefixSpan). Different from n-grams, items in a pattern are not necessarily next to each other; for example, [A, B, C, D, E] has the pattern A_C_D (see the subsequence-matching sketch after this list). 🔨 already implemented 💻 Seems not good on long sequences. 🗒️ Hyper-parameters (minLen, maxLen, minSupport) need to be carefully set to achieve optimal performance (still exploring the best setting). Constraints on the pattern length improve the performance on long sequences.
- MDL: a cluster of patients is encoded by a single sequence based on the minimum description length principle. Ref: Sequence Synopsis: Optimize Visual Summary of Temporal Event Data. 💬 Based on the feedback from Tali, the global patterns are less important than the local patterns.
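A minimal sketch of the n-gram encoding (a bag-of-n-grams over each patient's state sequence); the function names and toy sequences are illustrative only, not the tool's actual implementation:

```python
from collections import Counter

def ngram_counts(sequence, n=2):
    """Count the contiguous n-grams in one patient's state sequence."""
    return Counter(tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1))

def ngram_encode(sequences, n=2):
    """Encode every patient as a fixed-length vector of n-gram counts."""
    vocab = sorted({g for seq in sequences for g in ngram_counts(seq, n)})
    return [[ngram_counts(seq, n)[g] for g in vocab] for seq in sequences], vocab

vectors, vocab = ngram_encode([["A", "B", "B", "C", "D"], ["B", "C", "D"]], n=2)
print(vocab)    # [('A', 'B'), ('B', 'B'), ('B', 'C'), ('C', 'D')]
print(vectors)  # [[1, 1, 1, 1], [0, 0, 1, 1]]
```

These fixed-length count vectors can then be fed to the hierarchical clustering above.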
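For the frequent-pattern encoding, the key difference from n-grams is gapped matching: a pattern only needs to appear as an ordered subsequence, not as consecutive items. A minimal sketch of that membership test, assuming the patterns have already been mined (e.g., with PrefixSpan under the minLen/maxLen/minSupport constraints mentioned above); the patterns listed here are made up for illustration:

```python
def contains_pattern(sequence, pattern):
    """True if `pattern` occurs in `sequence` as an ordered, possibly gapped subsequence."""
    it = iter(sequence)
    return all(item in it for item in pattern)  # each `in` consumes `it` up to the match

def pattern_encode(sequence, patterns):
    """Binary indicator vector: which mined patterns does this patient contain?"""
    return [int(contains_pattern(sequence, p)) for p in patterns]

# hypothetical mined patterns
patterns = [["A", "C", "D"], ["B", "A"]]
print(pattern_encode(["A", "B", "C", "D", "E"], patterns))  # [1, 0]: has A_C_D, but no B..A
```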
❓ more experiments & discussions are needed