labsyspharm / scimap

Spatial Single-Cell Analysis Toolkit
https://scimap.xyz/
MIT License

Inquiry on Handling Imbalanced CODEX Data for Cellular Neighborhood Calculations in Scimap #110

Open Yuchen588 opened 1 month ago

Yuchen588 commented 1 month ago

Dear Scimap Team,

Thank you for your exceptional work in advancing spatial-omics biology. I am currently facing a challenge with integrating two different scales of CODEX quantification data processed through mcmicro. We have one group consisting of 300 TMA cores (about 3,000 cells each) and another group comprising 10 whole-slide images (approximately 1.5 million cells per patient). Despite having annotated and aligned the same cell types in each group, we are unable to calculate the Cellular Neighborhoods (CNs), possibly because of the data imbalance [sm.tl.spatial_lda + sm.tl.spatial_cluster(Kmeans)]. Specifically, the annotation in the column adata.obs['Kmeans'] shows as NA, even though the 10 motifs have been identified.

Could you please advise on the most suitable method for analyzing such imbalanced datasets? Here are the approaches we are considering:

1. sm.tl.spatial_expression + sm.tl.spatial_cluster(Kmeans)
2. sm.tl.spatial_count + sm.tl.spatial_cluster(Kmeans)
3. sm.tl.spatial_lda + sm.tl.spatial_cluster(Kmeans)

Alternatively, is there a method that could align or transfer the TMA CN results onto whole-slide image data at single-cell resolution, such as using sm.tl.spatial_similarity_search to find similar patterns?

I appreciate any guidance you can provide.

Best regards,

YC

ajitjohnson commented 1 month ago

Hi @Yuchen588, were these two datasets processed independently and then merged by some means into a single AnnData object? If so, at what step were they merged?

Yuchen588 commented 1 month ago

We combined the metadata (X, Y, cell type, imageID) and raw counts data directly from 300 TMA cores and 10 whole-slide images (WSI) without applying batch effect normalization across the datasets. Our objective is to identify recurrent Cellular Neighborhoods (CN) in both TMA and WSI data. During the cell annotation phase, we harmonized batch effects across both datasets using the 'Harmony' tool, following initial cell type annotation via Seurat's 'Anchor' method. This process yielded a refined metadata set with consistent cell type annotations throughout both datasets.

ajitjohnson commented 1 month ago

Well, that all sounds appropriate. Can you share the code that you used to generate the neighborhoods and the subsequent clustering? It is odd that you get NAs.

Just to make sure: both the TMA and WSI were acquired at the same imaging resolution?

Yuchen588 commented 1 month ago

Of course. We processed the TMA and WSI data at the same resolution and annotated both groups with the same cell types. When the two datasets were run separately, sm.tl.spatial_cluster worked fine with k=10, but the merged dataset came up with NaN. Here is my code; please check it, thanks!

Additionally, can Scimap align the pre-identified spatial-LDA motifs from the TMA data to the large-scale WSI data, or is there another way to assign CNs determined from the TMA to each WSI cell?
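
One generic way to approach the question above, independent of scimap, is to fit the clustering model on TMA neighborhood features only and then assign each WSI cell to the nearest TMA-derived centroid. The sketch below uses scikit-learn's `KMeans` on synthetic data. The feature matrices (`tma_features`, `wsi_features`) are hypothetical stand-ins for per-cell neighborhood vectors such as LDA motif weights or cell-type proportions, and this is not a documented scimap workflow.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_types = 5  # hypothetical number of annotated cell types

# Hypothetical per-cell neighborhood-composition vectors (rows sum to 1).
# In practice these would come from the same feature extraction on both datasets.
tma_features = rng.dirichlet(np.ones(n_types), size=300)
wsi_features = rng.dirichlet(np.ones(n_types), size=1000)

# Fit the CN model on the TMA cells only.
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(tma_features)
tma_cn = km.labels_

# Assign every WSI cell to the nearest TMA-derived centroid.
wsi_cn = km.predict(wsi_features)
```

The assignment is only meaningful if both feature matrices are computed with the same neighborhood definition and the same cell-type vocabulary, so that the columns mean the same thing in both datasets.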


import anndata as ad
import numpy as np
import pandas as pd
import scimap as sm

# ---- Simplified import of tma_meta
# Specify columns to import
cols_to_use = ['CellID', 'X_centroid', 'Y_centroid', 'imageid', 'cell.anno']

# Import specified columns from CSV file
tma_meta = pd.read_csv(
    '/mnt/radonc-li01/private/lyc/CODEX/LUAD/results/cell.anno/TMA.20240613/seurat.obj.comb/cell.anno/TMA.WS.CN.comb.res/rawdata/TMA/tma.codex.metadata.filter.csv',
    index_col='CellID',   # Specify 'CellID' as index column
    usecols=cols_to_use,  # Include CellID and other specified columns
    low_memory=False
)

# ---- Simplified import of tma_count
# Specify columns to import
cols_to_use = ['CellID', 'CD8', 'PanCK']

# Import specified columns from CSV file
tma_count = pd.read_csv(
    '/mnt/radonc-li01/private/lyc/CODEX/LUAD/results/cell.anno/TMA.20240613/seurat.obj.comb/cell.anno/TMA.WS.CN.comb.res/rawdata/TMA/tma.count.data.csv',
    index_col='CellID',   # Specify 'CellID' as index column
    usecols=cols_to_use,  # Include CellID and other specified columns
    low_memory=False
)

# ---- WS_meta
cols_to_use = ['CellID', 'X_centroid', 'Y_centroid', 'imageid', 'cell.anno']
WS_meta = pd.read_csv(
    '/mnt/radonc-li01/private/lyc/CODEX/LUAD/results/cell.anno/TMA.20240613/seurat.obj.comb/cell.anno/TMA.WS.CN.comb.res/rawdata/WS/metadata_combined.csv',
    index_col='CellID',   # Specify 'CellID' as index column
    usecols=cols_to_use,  # Include CellID and other specified columns
    low_memory=False
)

# ---- WS_data
cols_to_use = ['CellID', 'CD8', 'PanCK']
WS_count = pd.read_csv(
    '/mnt/radonc-li01/private/lyc/CODEX/LUAD/results/cell.anno/TMA.20240613/seurat.obj.comb/cell.anno/TMA.WS.CN.comb.res/rawdata/WS/counts_combined.csv',
    index_col='CellID',   # Specify 'CellID' as index column
    usecols=cols_to_use,  # Include CellID and other specified columns
    low_memory=False
)

# ---- Combine metadata and count data
combined_meta = pd.concat([tma_meta, WS_meta], ignore_index=True)
combined_count = pd.concat([tma_count, WS_count], ignore_index=True)
adata = ad.AnnData(combined_count)
adata.obs = combined_meta
adata.raw = adata

# ---- Set random seed for reproducibility
np.random.seed(42)

# ---- Run the LDA tool
adata = sm.tl.spatial_lda(adata, method='radius', radius=80, label='spatial_lda', phenotype='cell.anno')
adata.uns['spatial_lda_probability']

# ---- Perform spatial clustering
adata = sm.tl.spatial_cluster(adata, random_state=0, df_name='spatial_lda', method='kmeans', k=10, label='spatial_lda_kmeans.10.final')
adata.obs

# ---- The NA values in the specified column
adata.obs['spatial_lda_kmeans.10.final']
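
One thing that may be worth checking in a merge like the one above is whether the row labels still line up after concatenation. The pandas-only sketch below shows how `ignore_index=True` discards the CellID index, and how assigning a result keyed by different labels produces NaN in every row. The tiny frames and the composite `imageid_CellID` key are hypothetical, and whether this is the actual cause of the NaN values in this thread is not confirmed.

```python
import pandas as pd

# Hypothetical stand-ins for the two metadata tables; CellID is assumed
# to restart at 1 in each dataset.
tma_meta = pd.DataFrame({"imageid": ["tma_1", "tma_1"]},
                        index=pd.Index([1, 2], name="CellID"))
ws_meta = pd.DataFrame({"imageid": ["ws_1", "ws_1"]},
                       index=pd.Index([1, 2], name="CellID"))

# ignore_index=True drops CellID and relabels the rows 0..n-1 (integers).
obs = pd.concat([tma_meta, ws_meta], ignore_index=True)

# A result table keyed by string labels, as a downstream tool might return.
clusters = pd.Series(["CN1", "CN2", "CN1", "CN3"], index=["0", "1", "2", "3"])

# Column assignment aligns on the index: integer labels never match
# string labels, so every row becomes NaN.
obs["cluster_misaligned"] = clusters

# Alternative: keep CellID as a column and build a unique string key per cell.
merged = pd.concat([tma_meta, ws_meta]).reset_index()
merged.index = merged["imageid"] + "_" + merged["CellID"].astype(str)

# A result keyed by the same labels now aligns row for row.
clusters_ok = pd.Series(["CN1", "CN2", "CN1", "CN3"], index=merged.index)
merged["cluster"] = clusters_ok
```

Comparing `adata.obs.index` with the index of the table stored under `adata.uns['spatial_lda']` would be one way to test this hypothesis on the real data.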

ajitjohnson commented 1 month ago

can you try:

### Run the LDA tool
adata = sm.tl.spatial_lda(adata, method='knn', knn=10, label='spatial_lda', phenotype='cell.anno')

### Perform spatial clustering
adata = sm.tl.spatial_cluster(adata, random_state=0, df_name='spatial_lda', method='kmeans', k=10, label='spatial_lda_kmeans.10.final')
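
For context on why switching from `method='radius'` to `method='knn'` can change the outcome: a radius query may return no neighbors for an isolated cell, whereas a k-nearest-neighbor query always returns k of them. The sketch below illustrates that difference with scikit-learn's `NearestNeighbors` on hypothetical coordinates. It does not reproduce scimap's internals, and how scimap handles cells without neighbors is not stated in this thread.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical coordinates: 20 cells in a 50 x 50 patch plus one isolated cell.
rng = np.random.default_rng(0)
dense = rng.uniform(0, 50, size=(20, 2))
isolated = np.array([[1000.0, 1000.0]])
coords = np.vstack([dense, isolated])

nn = NearestNeighbors().fit(coords)

# Radius query: every cell finds itself, so subtract 1 to count true neighbors.
radius_idx = nn.radius_neighbors(coords, radius=80, return_distance=False)
radius_counts = np.array([len(ix) - 1 for ix in radius_idx])

# kNN query: ask for k + 1 hits because the closest hit is the cell itself.
k = 10
knn_idx = nn.kneighbors(coords, n_neighbors=k + 1, return_distance=False)
knn_counts = np.full(len(coords), knn_idx.shape[1] - 1)

print("cells with no neighbors (radius):", int((radius_counts == 0).sum()))
print("cells with no neighbors (knn):   ", int((knn_counts == 0).sum()))
```

With a fixed radius, cells at tissue edges or in sparse regions can end up with an empty neighborhood; with kNN the neighborhood is never empty, but its physical extent varies with local cell density.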