labsyspharm / scimap

Spatial Single-Cell Analysis Toolkit
https://scimap.xyz/
MIT License

spatial_lda reproducibility #102

Open batukav opened 7 months ago

batukav commented 7 months ago

Dear All,

I am working on generating recurrent neighborhoods using spatial_lda for my dataset that contains ~2 million cells and 19 unique cell types.

My strategy was to (1) run spatial_lda on the AnnData object to extract 20 motifs, (2) run k-means clustering with a large number of clusters (k=30) on the latent weights (anndata.uns['spatial_lda']), and (3) apply agglomerative clustering to the k-means cluster centers to group cells into recurrent neighborhoods (RCNs).
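A minimal sketch of steps 2 and 3, assuming scikit-learn is used for both clustering stages and that the latent weights form a cells × motifs matrix. The function name `assign_rcn`, its defaults, and the synthetic Dirichlet weights are hypothetical stand-ins for illustration, not the exact code from this analysis:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans


def assign_rcn(weights, n_kmeans=30, n_rcn=4, seed=0):
    """Two-stage clustering of per-cell motif weights into RCN labels."""
    weights = np.asarray(weights, dtype=float)
    # Stage 1: over-cluster the cells with k-means.
    km = KMeans(n_clusters=n_kmeans, n_init=10, random_state=seed).fit(weights)
    # Stage 2: merge the k-means centers hierarchically (Ward linkage is the default).
    agg = AgglomerativeClustering(n_clusters=n_rcn).fit(km.cluster_centers_)
    # Map each cell's k-means label to the RCN label of its center.
    return agg.labels_[km.labels_]


# Synthetic stand-in for the latent weights: 1000 cells x 20 motifs.
# With real data this would be something like np.asarray(adata.uns['spatial_lda']).
rng = np.random.default_rng(0)
demo_weights = rng.dirichlet(np.ones(20), size=1000)
rcn = assign_rcn(demo_weights)
print(np.bincount(rcn))  # number of cells per RCN
```

Fixing `seed` here only controls the k-means stage; the random seed passed to spatial_lda is a separate source of run-to-run variation.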

My expectation is that the clusters produced by agglomerative clustering will have similar cell type composition and cell counts across different spatial_lda runs (same parameters, different random seeds) on the same dataset.

My observation is that the above procedure does not give consistent results when spatial_lda is run with a different random seed. That is, the cell type composition and cell counts of the final RCN assignments fluctuate wildly between spatial_lda runs. Is this expected, or what might I be doing wrong? Could this be a sign of overfitting?
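One way to put a number on this inconsistency is to compare the per-cell RCN labels of two runs with the adjusted Rand index, which ignores how the clusters happen to be numbered in each run. The sketch below is a plain-Python implementation on toy labels; the function name and the example label lists are hypothetical and are not taken from the runs described here:

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same cells")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: trivially identical
    # Pairs of cells that share a cluster in both runs, in run A, and in run B.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = in_a * in_b / total_pairs  # agreement expected by chance
    max_index = (in_a + in_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both runs
    return (both - expected) / (max_index - expected)


# Toy labels: run_2 is run_1 with the cluster ids renamed, run_3 is unrelated.
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [2, 2, 0, 0, 1, 1]
run_3 = [0, 1, 0, 1, 0, 1]
print(adjusted_rand_index(run_1, run_2))  # 1.0: same partition, different ids
print(adjusted_rand_index(run_1, run_3))  # below 0: no better than chance
```

Values near 1 mean that two runs partition the cells the same way even if the RCN ids differ; values near 0 mean that the agreement is no better than chance.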

I have also attached some output and screenshots from my analysis (I ran spatial_lda on a randomly sampled subset of cells; same AnnData object, different random seeds):

AnnData object with n_obs × n_vars = 101363 × 1
    obs: 'X_centroid', 'Y_centroid', 'phenotype', 'imageid', 'cell_id', 'kmeans_labels'
    uns: 'spatial_lda', 'spatial_lda_probability'

Agglomerative clustering on the k-means cluster centers, number of final clusters = 4 (columns are the RCN ids, rows are the cell types, values are the number of cells of each type in a given RCN):

Run_1 image

Run_2 image
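For reference, a cell type × RCN count table like the ones in the screenshots can be built with `pandas.crosstab`. This is a minimal sketch on toy data: the `phenotype` column matches the AnnData `obs` shown above, while the `rcn` column name and all values are hypothetical:

```python
import pandas as pd

# Toy stand-in for adata.obs with a per-cell RCN label added (hypothetical 'rcn' column).
obs = pd.DataFrame({
    "phenotype": ["T cell", "T cell", "B cell", "Tumor", "Tumor", "Tumor"],
    "rcn": [0, 1, 1, 2, 2, 0],
})

# Rows are cell types, columns are RCN ids, values are cell counts.
composition = pd.crosstab(obs["phenotype"], obs["rcn"])
print(composition)
```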