I am working on generating recurrent neighborhoods using `spatial_lda` for my dataset that contains ~2 million cells and 19 unique cell types.
My strategy was to (1) run `spatial_lda` on the `anndata` object to extract 20 motifs, (2) run K-means clustering with a large number of clusters (`k=30`) on the latent weights (`anndata.uns['spatial_lda']`), and (3) apply agglomerative clustering to the K-means cluster centers to group cells into recurrent neighborhoods.
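Steps 2 and 3 can be sketched as follows. This is only an illustration under stated assumptions: it uses scikit-learn's `KMeans` and `AgglomerativeClustering`, the helper name `assign_rcn` is hypothetical, and a synthetic Dirichlet matrix stands in for the real latent weights (which would first need to be converted to an `n_cells x 20` NumPy array).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans


def assign_rcn(weights, n_kmeans=30, n_rcn=4, seed=0):
    """Map per-cell motif weights to RCN labels (hypothetical helper).

    Step 2: K-means with a large k on the latent weights.
    Step 3: agglomerative clustering of the K-means centers.
    """
    km = KMeans(n_clusters=n_kmeans, n_init=10, random_state=seed).fit(weights)
    # Each K-means center gets one of n_rcn merged labels (Ward linkage by default).
    center_to_rcn = AgglomerativeClustering(n_clusters=n_rcn).fit_predict(
        km.cluster_centers_
    )
    # Propagate the merged label from each center to the cells assigned to it.
    return center_to_rcn[km.labels_]


# Synthetic stand-in for the latent weights: 2000 cells x 20 motifs.
rng = np.random.default_rng(0)
weights = rng.dirichlet(alpha=np.full(20, 0.3), size=2000)

rcn = assign_rcn(weights)
print(np.bincount(rcn, minlength=4))  # number of cells per RCN
```

Fixing `seed` for the K-means step means that any remaining run-to-run variation should come from `spatial_lda` itself rather than from the downstream clustering.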
My expectation is that the clusters from the agglomerative clustering will have similar cell type compositions and cell counts across different `spatial_lda` runs (same parameters, different random seeds) on the same dataset.
My observation is that this procedure does not give consistent results when `spatial_lda` is run with a different random seed. That is, the cell type composition and cell counts of the final RCN assignments fluctuate wildly between `spatial_lda` runs. Is this expected, or what might I be doing wrong? Could this be a sign of overfitting?
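Cluster ids are arbitrary, so "RCN 0" in one run need not correspond to "RCN 0" in another, and comparing tables column by column can overstate the disagreement. One way to quantify the difference between two runs on the same cells, independent of how the ids are numbered, is the adjusted Rand index (ARI). Below is a minimal standard-library sketch; the function name and the toy label lists are illustrative, not taken from the analysis, and this may or may not account for the fluctuation described above.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two labelings of the same cells (1.0 = identical partitions)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both runs must label the same cells")
    n = len(labels_a)
    if n < 2:
        return 1.0
    # Pairs of cells grouped together in both runs, in run A, and in run B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)  # agreement expected by chance
    max_index = (together_a + together_b) / 2
    if max_index == expected:
        return 1.0
    return (together_both - expected) / (max_index - expected)


# Toy example: run_2 is run_1 with the ids permuted, run_3 is a different partition.
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [2, 2, 0, 0, 1, 1]
run_3 = [0, 1, 0, 1, 0, 1]

print(adjusted_rand_index(run_1, run_2))  # same partition, different ids
print(adjusted_rand_index(run_1, run_3))  # unrelated partition
```

An ARI near 1 between seeds would suggest that the neighborhoods are stable and only the numbering changes, while values near 0 would point to genuinely different partitions.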
I have also added some output/screenshots from my analysis (I applied `spatial_lda` to a subset of randomly sampled cells; same `anndata` object but different random seeds).
Agglomerative clustering on the K-means cluster centers, number of final clusters = 4 (columns are the RCN ids, rows are the cell types, values are the number of cells of each type in a given RCN):
Run_1
Run_2