immunogenomics / harmony

Fast, sensitive and accurate integration of single-cell data with Harmony
https://portals.broadinstitute.org/harmony/

Harmony using a predefined clustering vector #250

Closed tkcaccia closed 6 months ago

tkcaccia commented 6 months ago

I am using Harmony for spatial transcriptomics, even though that is beyond its original intended application. I think k-means may not be the most appropriate clustering method for Harmony on spatial transcriptomic data. Is it possible to run Harmony with cluster labels computed by a different method?

pati-ni commented 6 months ago

We are investigating use cases for spatial transcriptomics workflows. Can you elaborate on why k-means is not suitable for your use case?

tkcaccia commented 6 months ago

I have a "privileged" point of view, since I am developing a new method for feature extraction from spatial data. I am using the DLPFC Visium dataset as a benchmark. The dataset is composed of 12 slides from three different subjects. If I run Harmony on the 4 slides from the same subject, I obtain a worse clustering than without Harmony. Compared with single-cell data, spatial datasets seem to have sparser matrices. I noticed that some points that were probably clustered together by k-means are shifted to a different region of PCA space. I guess this could be due to an erroneous choice of centroids.

pati-ni commented 6 months ago

We have thought about Visium, and we think that Harmony is not the best tool for this data because it is not a single-cell assay. The main problem is that Harmony clusters correspond, in theory, to cell types, whereas in Visium each spot contains a mosaic of cell types. One thing worth trying is to set the number-of-clusters parameter to 1, which disables clustering and performs the regression over the whole dataset at once. We would be interested in hearing your experience with, and take on, batch correcting Visium data.
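To illustrate what a single-cluster correction amounts to, here is a minimal sketch in plain Python. It is not Harmony's implementation: it assumes a simplified model in which the intercept is fixed at the grand mean and each batch contributes one ridge-penalized offset per embedding dimension. The function name `remove_batch_offsets` and the `ridge` parameter are hypothetical and are not part of the Harmony API.

```python
from collections import defaultdict


def remove_batch_offsets(embedding, batches, ridge=1.0):
    """Subtract a ridge-shrunk per-batch offset from every cell.

    embedding: list of equal-length lists (cells x dimensions), e.g. PCA scores.
    batches:   one batch label per cell (e.g. the slide ID).
    ridge:     hypothetical penalty; larger values give a weaker correction.
    """
    n_cells = len(embedding)
    n_dims = len(embedding[0])
    grand_mean = [sum(cell[d] for cell in embedding) / n_cells for d in range(n_dims)]

    # Accumulate, per batch, the residuals from the grand mean.
    resid_sum = defaultdict(lambda: [0.0] * n_dims)
    counts = defaultdict(int)
    for cell, batch in zip(embedding, batches):
        counts[batch] += 1
        for d in range(n_dims):
            resid_sum[batch][d] += cell[d] - grand_mean[d]

    # Ridge estimate of each batch offset: residual sum / (batch size + ridge).
    offsets = {b: [s / (counts[b] + ridge) for s in resid_sum[b]] for b in counts}

    # With one cluster, every cell receives the full offset of its own batch.
    return [
        [cell[d] - offsets[batch][d] for d in range(n_dims)]
        for cell, batch in zip(embedding, batches)
    ]
```

With `ridge=0` this reduces to aligning every batch mean with the grand mean. Because there is only one cluster, the same shift is applied to all spots on a slide regardless of tissue region.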

To answer your question: you cannot easily change the clustering method. What you could do is create your own C++ variant of cluster_cpp (e.g. cluster_cpp2(mycentroids), starting from a copy of the existing cluster_cpp code) in which, instead of updating the centroids, you assign your own centroids (that would be the Y assignment step). You would then have to run Harmony manually; for that, have a look at the advanced tutorial, where we show how to run Harmony step by step. There you could call your custom cluster_cpp2(mycentroids) in place of cluster_cpp().
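As a rough illustration of the input such a custom step would need, the sketch below derives one centroid per predefined cluster label. It is plain Python, not Harmony code. It assumes that centroids should be unit-length means of L2-normalized cells, in line with the cosine-distance clustering mentioned in this thread. The names `l2_normalize` and `centroids_from_labels` are hypothetical, and wiring the result into a modified cluster_cpp is not shown.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def centroids_from_labels(embedding, labels):
    """Return {label: unit-length centroid} for a predefined clustering.

    embedding: list of equal-length lists (cells x dimensions), e.g. PCA scores.
    labels:    one externally computed cluster label per cell.
    """
    sums = {}
    for cell, label in zip(embedding, labels):
        unit_cell = l2_normalize(cell)
        acc = sums.setdefault(label, [0.0] * len(unit_cell))
        for d, x in enumerate(unit_cell):
            acc[d] += x
    # Normalizing the sum gives the same direction as normalizing the mean.
    return {label: l2_normalize(acc) for label, acc in sums.items()}
```

For example, `centroids_from_labels(pca_scores, my_labels)` (both names hypothetical) yields one direction per label that could be held fixed in place of the centroid update.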

ilyakorsunsky commented 6 months ago

@tkcaccia thanks for your interest in using Harmony for spatial data! I fully support @pati-ni's answers above - I'll add a few things from my own experience.

(1) We've had much better success running Harmony v1.2 on Visium data, especially when we use the new lambda estimation feature by setting lambda=NULL. This automatically estimates the degree of ridge regression for each Harmony cluster and, in practice, leads to much more accurate integration. For benchmarking, I would also suggest using more recent technologies. The recent probe-based Visium is much more gene-dense than the original sequencing-based data: https://www.10xgenomics.com/datasets?query=visium&page=1&configure%5BhitsPerPage%5D=50&configure%5BmaxValuesPerFacet%5D=1000

(2) As @pati-ni said, using your own clustering requires a bit of manual work, but it is not hard to do if you follow the advanced tutorial (http://htmlpreview.github.io/?https://github.com/immunogenomics/harmony/blob/master/doc/detailedWalkthrough.html).

At the moment, we still stand by soft k-means with cosine distance and a diversity penalty as the most effective clustering strategy. Therefore, we do not plan to support custom clustering at this time.

tkcaccia commented 6 months ago

Thank you for your suggestions. I tried the lambda=NULL configuration, but it did not give satisfactory results. For benchmarking, I will keep using the DLPFC dataset because it is the mainstream dataset for testing unsupervised learning methods. On the DLPFC data, Harmony creates artifacts.