spatial_lda and recurrent cellular neighborhoods

batukav commented 10 months ago

Dear Scimap developers,

Thank you very much for creating this great repo.

I would like to ask about defining recurrent cellular neighborhoods (RCNs) from histopathology data using the spatial_lda method. Specifically, I'm trying to wrap my head around how spatial_lda is used in the publication "The Spatial Landscape of Progression and Immunoediting in Primary Melanoma at Single-Cell Resolution" by @ajitjohnson and coworkers.

What I understand is that LDA is used to assign a distribution of "topics" to each cell. Then, to define the RCNs, these per-cell topic distributions (latent weights) are clustered using K-means clustering. Finally, the resulting clusters are manually grouped into "meta-clusters", which in turn correspond to the RCNs. Do I understand this approach correctly?
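To make sure we mean the same thing, here is a minimal, dependency-free sketch of those steps as I understand them, on toy data. The topic weights, the value of `k`, the `kmeans` helper, and the `meta_cluster_of` mapping are all made up for illustration; this is not scimap's implementation, and in practice the weights would come from spatial_lda and the clustering from a library K-means.

```python
import math

def kmeans(points, k, n_iter=50):
    """Plain Lloyd's algorithm with Euclidean distance (illustrative helper)."""
    centers = [list(p) for p in points[:k]]  # naive init: the first k points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each cell goes to its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center becomes the mean of its member cells.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy latent weights: one row per cell, one column per topic, rows sum to 1.
topic_weights = [
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.05, 0.90],
    [0.85, 0.10, 0.05],
    [0.10, 0.85, 0.05],
    [0.05, 0.10, 0.85],
]

# Step 2: cluster the per-cell topic distributions.
cluster_labels = kmeans(topic_weights, k=3)

# Step 3: manually merge clusters into meta-clusters (hypothetical mapping).
meta_cluster_of = {0: "RCN_A", 1: "RCN_A", 2: "RCN_B"}
rcn_labels = [meta_cluster_of[c] for c in cluster_labels]
```

The manual merge in the last step is the part that depends on biological interpretation, so the mapping here is only a placeholder.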

I am curious about two things: (1) whether it makes sense to use a different clustering algorithm, such as HDBSCAN (or UMAP + HDBSCAN), to group the latent weights, and (2) whether Euclidean distance is suitable for clustering them. Since the latent weights are probability distributions, would it make sense to cluster them with the Jensen-Shannon distance (or a similar measure) instead? Did you experiment with any of these methods?
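For concreteness, this is the kind of distance I have in mind for the second question, as a small pure-Python sketch. The function names and the toy weights are my own; SciPy's `scipy.spatial.distance.jensenshannon` computes the same quantity. The resulting matrix could be handed to any clustering method that accepts precomputed distances, although a full pairwise matrix is clearly not feasible for millions of cells, so this only illustrates the metric.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_distance(p, q):
    """Square root of the base-2 Jensen-Shannon divergence, bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding

# Toy latent weights for three cells over three topics (made-up values).
weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
]

# Pairwise distance matrix between the cells' topic distributions.
dist = [[jensen_shannon_distance(p, q) for q in weights] for p in weights]
```

Cells with disjoint topic support end up at the maximum distance of 1, identical cells at 0, and the mixed cell falls in between.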

The ease of use and smoothness introduced by Scimap is very valuable, and I would like to use it for analyzing similar data. I hope this is the right place to discuss the above questions. Thank you.

ajitjohnson commented 7 months ago

I apologize for missing this issue. For some reason, I have not been receiving notifications for the issues raised in this channel. Does the issue still persist, and would you need help? Thank you.

batukav commented 6 months ago

Hello,

Yes, the issue still persists, and I think my first question is more important than the second one.

Another issue I have come across concerns the coherence scores. For my dataset, which contains a few million cells and ~10 different cell types, spatial_lda gives almost identical coherence scores for different numbers of topics. I have tried various numbers of topics, ranging from 20 to 50, and all gave identical coherence scores up to the sixth decimal place. Do you have any insight into this? Does this mean that I am requesting too many topics?

ajitjohnson commented 6 months ago

@batukav, I will first address your query regarding coherence scores and respond to your earlier question subsequently.

The coherence score remains constant in this application because the number of words (cell types) is limited (generally 5-15 cell types), contrasting with the standard implementation of Latent Dirichlet Allocation (LDA), where one might encounter millions of words. Due to this limitation, the current version of LDA in scimap does not output identified topics in the traditional sense. Instead, I focus on clustering the latent variables.

batukav commented 6 months ago

Thank you for the explanation. I think I now understand the process better. Do you have any suggestions for picking the number of topics? If the analysis boils down to investigating and merging the clusters, I suppose the overall results won't be very sensitive to the initial choice of the number of topics.

labsyspharm / scimap

spatial_lda and recurrent cellular neighborhoods #71