philips-software / latrend

An R package for clustering longitudinal datasets in a standardized way, providing interfaces to various R packages for longitudinal clustering, and facilitating the rapid implementation and evaluation of new methods
https://philips-software.github.io/latrend/
GNU General Public License v2.0

Combine latrend() with SillyPutty #141

Closed hichew22 closed 1 year ago

hichew22 commented 1 year ago

Hello, I am trying to combine the KML clustering method in latrend with SillyPutty (section "Combining SillyPutty With Hierarchical Clustering"). Specifically, I am trying to do the following: "To apply SillyPutty to an already precomputed clustering algorithm, you have to have the cluster identities of the clustering algorithm and the distance matrix of the data set. SillyPutty will then recalculate the clusters from a starting point within the post-clustered clusters and return the best silhouette width score and the new cluster identities."

I am having trouble finding 1) the cluster identities of the clustering algorithm and 2) the distance matrix. Could you explain how I might obtain these two objects when using the lcMethodKML() function? I understand that I can fit the KML model as follows:

kml_method <- lcMethodKML("value", nClusters = 2, nbRedrawing = 1)
kml_model <- latrend(kml_method, data = df)
kml_methods <- lcMethods(kml_method, nClusters = 1:5)
kml_models <- latrendBatch(kml_methods, data = df)

# Select 4-cluster model as preferred representation
kml_model_4 <- subset(kml_models, nClusters == 4, drop = TRUE)

I would like to use the 4-cluster KML model and enhance it with SillyPutty, but I am not sure how to extract the cluster identities and the (Euclidean) distance matrix so that I can run the following, as in their example: hierSilly <- SillyPutty(vis@clusters, dis)

I believe the cluster assignments can be extracted using cluster <- trajectoryAssignments(kml_model_4), but this yields a factor rather than the numeric vector that I believe SillyPutty() requires for the clusters object. I am not sure how to obtain the distance matrix.

Would you be able to help me with this? Thank you!

niekdt commented 1 year ago

Hi @hichew22, thanks for the detailed post.

The cluster assignment for the trajectories can be converted to integer by using as.integer(cluster), where 1=first cluster, 2=second cluster, etc.
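As a small illustration of that conversion, here is a base-R sketch on a made-up factor. The level names A, B and C are an assumption for the example, not output from a fitted model. The point it shows is that the integer codes follow the order of the factor levels, not the order in which clusters first appear in the vector.

```r
# Toy stand-in for the factor returned by trajectoryAssignments();
# the level names "A", "B", "C" are assumed for illustration only.
cluster <- factor(c("B", "A", "C", "A", "B"), levels = c("A", "B", "C"))

# as.integer() returns the position of each value within levels(cluster)
cluster_int <- as.integer(cluster)
print(cluster_int)

# The original names can be recovered from the integer codes
recovered <- levels(cluster)[cluster_int]
print(recovered)
```

Keeping levels(cluster) around makes it possible to translate integer codes back into cluster names later on.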

KmL is not a hierarchical clustering algorithm, so it does not compute or consider the pairwise distances between trajectories. That said, since $k$-means uses the Euclidean distance, we can compute a distance matrix ourselves.

First, the data needs to be structured with one trajectory per row, and each column representing a different time point. You can use latrend's tsmatrix function for that. Then we can compute the Euclidean distance matrix using R's dist function.

tsdata = tsmatrix(df, response='value')
d = dist(tsdata)

If SillyPutty expects d to be a matrix object then use as.matrix(d).

It's important that the cluster assignments vector and distance matrix rows have the same order (i.e., refer to the same trajectories). With the code I posted this is the case. The order of the assignments can be obtained using ids(kml_model_4).
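The reshaping, distance and ordering steps above can be sketched in base R on toy data. This is only a stand-in for what tsmatrix() followed by dist() would produce: the column names id, time and value, the toy values and the cluster vector are all assumptions made for the example.

```r
# Toy long-format data: 3 trajectories observed at 3 time points.
# Column names (id, time, value) are assumptions for this sketch.
df_toy <- data.frame(
  id    = rep(c("p1", "p2", "p3"), each = 3),
  time  = rep(1:3, times = 3),
  value = c(0, 0, 0,  3, 4, 0,  1, 1, 1)
)

# Base-R stand-in for the wide layout: one row per id, one column per time.
# Assumes exactly one observation per id/time combination.
wide <- tapply(df_toy$value, list(df_toy$id, df_toy$time), FUN = mean)

# Pairwise Euclidean distances between the rows (trajectories)
d <- dist(wide, method = "euclidean")
d_mat <- as.matrix(d)

# Hypothetical cluster assignments, named by trajectory id
cluster <- c(p1 = 1L, p2 = 2L, p3 = 1L)

# Guard: assignments and distance-matrix rows must refer to the same ids,
# in the same order
stopifnot(identical(names(cluster), rownames(d_mat)))
print(d_mat)
```

A check like the stopifnot() line is cheap and catches the case where the data was sorted or filtered between fitting the model and computing the distances.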

hichew22 commented 1 year ago

Hi Niek, thank you very much for your help! I think I was able to figure out how to do this. If I have a dataframe containing the new cluster assignments, is there a way to plot the assigned cluster trajectories? I can join the cluster assignments with my longitudinal dataframe and then use ggplot like so:

ggplot(aes(x = day, y = response, group = id), data = df) + 
  geom_line() +
  facet_wrap(vars(newcluster))

However, I would like to include the colored lines for the cluster trajectories as in plotClusterTrajectories(kmlModel4) or plot(kmlModel4) as in your demonstration vignette.

niekdt commented 1 year ago

You're welcome! You can use the plotClusterTrajectories() function to plot an arbitrarily clustered set of trajectories according to a custom cluster center definition (which is the mean in the case of KmL):

plotClusterTrajectories(df, cluster = 'newcluster', trajectories = TRUE, facet = TRUE)

Have a look at the function's documentation for more options.

Alternatively, if you want to overlay the newly assigned trajectories with the original cluster trajectories computed by KmL, we'll need to combine the two manually by plotting the trajectories and then drawing the cluster trajectories over them:

# extract cluster trajectories data.frame
df_cluster = clusterTrajectories(kml_model_4)

# The cluster column needs the same name in df_cluster and df, because plotTrajectories() uses it for facetting
df$Cluster = df$newcluster

plotTrajectories(df, response = 'value', cluster = 'Cluster', facet = TRUE) + geom_line(data = df_cluster, aes(x = day, y = value, color = Cluster))

In case of errors check whether the time, id, response and cluster arguments are correctly specified. Also, the names of the clusters need to be the same between the data frames.
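To make the last point concrete, here is a base-R sketch of relabelling integer cluster codes so that they carry the same names as the model's clusters. The names A to D are an assumption for the example, and the sketch also assumes that code i still refers to the i-th cluster after re-clustering, which is worth verifying for your own result.

```r
# Integer cluster codes as they might come back from a re-clustering step
new_cluster <- c(1L, 3L, 2L, 1L, 4L)

# Assumed cluster names of the fitted model; with a real model these could be
# read from the Cluster column of the cluster trajectories data frame
model_names <- c("A", "B", "C", "D")

# Map code i to the i-th model name so both data frames share the same labels
Cluster <- factor(new_cluster, levels = seq_along(model_names), labels = model_names)
print(Cluster)

# Guard against labels that exist in one data frame but not the other
stopifnot(all(levels(Cluster) %in% model_names))
```

With matching labels, the facets and the overlaid cluster trajectories line up panel by panel.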

hichew22 commented 1 year ago

Hi Niek, thank you for your help! Those are what I would like to plot. I need a little more guidance in setting up the dataframes correctly for the above code.

First, I started with df_lab, which contains the longitudinal laboratory values for each individual (multiple timepoints per individual). This dataframe is what I used to fit the kml_model_4 on. I extracted the cluster assignments and ids as follows:

cluster <- trajectoryAssignments(kml_model_4)
cluster <- as.integer(cluster)
ids <- ids(kml_model_4)

Then, I used your guidance to 1) restructure the longitudinal data with one trajectory per row, and each column representing a different time point using latrend's tsmatrix function and 2) compute the Euclidean distance using R's dist function as follows:

df_ts = tsmatrix(df_lab, response = "value")
dist = dist(df_ts)

Lastly, I used the SillyPutty function to combine SillyPutty with the kml clustering and created a dataframe combining the ids, kml clusters, and SillyPutty clusters:

hierSilly <- SillyPutty(cluster, dist)
df_cluster_combine <- cbind(ids, cluster, hierSilly@cluster) %>%
  as.data.frame()

Could you explain how these dataframes and clusters would fit into the plotClusterTrajectories and plotTrajectories examples you provided? Do I need to join df_lab and df_cluster_combine?
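One way such a join might be set up is sketched below in base R on toy data. All column names and values are made up. The sketch builds the assignment table with data.frame() rather than cbind(), because cbind() on a character id vector together with integer vectors produces a character matrix, so the cluster columns would have to be converted back afterwards.

```r
# Toy long-format data: 2 observations for each of 3 individuals
df_lab_toy <- data.frame(
  id    = rep(c("p1", "p2", "p3"), each = 2),
  day   = rep(c(0, 7), times = 3),
  value = c(5.1, 4.8, 7.2, 7.9, 5.0, 4.7)
)

# One row per individual; data.frame() keeps each column's own type
df_assign <- data.frame(
  id          = c("p1", "p2", "p3"),
  kml_cluster = c(1L, 2L, 1L),
  new_cluster = c(1L, 2L, 2L)
)

# Left join: every observation picks up its individual's cluster columns
df_joined <- merge(df_lab_toy, df_assign, by = "id", all.x = TRUE)
print(df_joined)
```

The joined data frame has one row per observation, which is the long layout that the plotting functions discussed above operate on.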

Thank you!

hichew22 commented 1 year ago

I was able to join df_lab and df_cluster_combine and then plot the trajectories like so:

df_new <- df_lab %>% 
  left_join(df_cluster_combine, by = "id") %>% 
  rename(new_cluster = V3)

plotClusterTrajectories(
  df_new,
  response = "lab",
  cluster = "new_cluster",
  trajectories = TRUE,
  facet = TRUE
)

I think this plot is what I was looking for! I was wondering if you could let me know how to add the percentage of individuals in each cluster like the output from plot(kml_model_4) in your demonstration?

niekdt commented 1 year ago

There's currently no option for the plotClusterTrajectories(data.frame) method to add the percentage, but you can get the same result by creating a second cluster column whose cluster names include the percentage.

I will create an issue because I think it would be a nice feature to have.

For now, you can achieve it by:

props = prop.table(table(df_new$new_cluster))
cluster_labels = sprintf('%s (%d%%)', names(props), round(props * 100))

df_new$new_cluster_label = factor(df_new$new_cluster, levels = names(props), labels = cluster_labels)

plotClusterTrajectories(
  df_new,
  response = "lab",
  cluster = "new_cluster_label",
  trajectories = TRUE,
  facet = TRUE
)
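
The labelling step can be tried out on toy data with the base-R sketch below. The data and column names are made up. One deliberate difference from the snippet above: the proportions are computed on one row per individual, so that the percentages describe individuals rather than observations. The two agree when every individual has the same number of observations, but can differ otherwise.

```r
# Toy long-format data: one row per observation, 2 observations per individual
df_toy <- data.frame(
  id          = rep(c("p1", "p2", "p3", "p4"), each = 2),
  new_cluster = rep(c(1L, 1L, 1L, 2L), each = 2)
)

# Count each individual once, so the percentages describe individuals
# rather than rows
per_id <- unique(df_toy[, c("id", "new_cluster")])
props <- prop.table(table(per_id$new_cluster))

# Build labels such as "1 (75%)"
labels <- sprintf("%s (%.0f%%)", names(props), 100 * as.numeric(props))

# Attach the label to every observation of the long data frame
df_toy$new_cluster_label <- factor(
  df_toy$new_cluster,
  levels = names(props),
  labels = labels
)
print(levels(df_toy$new_cluster_label))
```

The resulting label column can then be passed as the cluster argument in the same way as new_cluster_label above.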

hichew22 commented 1 year ago

Awesome, thank you so much for all your help!!