Multi-sample analysis pipeline

Sengels24 commented 3 months ago

Thanks for creating this tool. I have a similar question to the one in #1: is there a multi-sample analysis pipeline available for Python?

Essentially, we have an AnnData object containing cells from various samples (the sample origin is known for each cell). I was wondering how Banksy goes about creating a spatial graph when there are multiple samples in the same AnnData object. The construction of the spatial nearest-neighbour graph should be done for each sample separately, but I am not sure how to go about it. If I look at slideseqv1_analysis.ipynb, match_labels is not used until cell clustering?

Thank you very much!

chousn commented 3 months ago

Hi Sengels24, thanks for your question!

Currently, the Python version of Banksy doesn't have a built-in multi-sample analysis pipeline like the R version. However, you can achieve this functionality using anndata's concatenate function (https://anndata.readthedocs.io/en/latest/generated/anndata.AnnData.concatenate.html). Assuming all your cells are in a single anndata object, which also contains the x and y coordinates as annotations:

1. Subset by Sample Origin: Split the cells into separate anndata objects based on their sample origin. This ensures you analyze each sample independently.

2. Batch Correction: Since you have multiple samples, you'll need to correct for potential batch effects. This can involve simple z-score normalization for each sample or using a tool like Harmony.

3. Compute the Neighbor Augmented Matrix (BANKSY Matrix) for Each Sample Separately: Apply Banksy's neighbour matrix computation step to each batch-corrected sample dataset individually (this creates separate spatial graphs for each sample), generating anndata objects for each sample.

4. Concatenate BANKSY matrices: Concatenate Banksy's neighbour-augmented matrices (anndata objects) along the cells dimension using `banksy_adata_sample1.concatenate(banksy_adata_sample2, banksy_adata_sample3, ...)`. This merges the information from all samples into a single anndata object, which contains batch-corrected and neighbour-augmented expression features for cells from all samples.

5. Clustering on Concatenated Data: Now you can perform clustering on the concatenated anndata object, which will generate clusters that span all the samples.
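Steps 1 to 4 can be sketched roughly as follows. This is a minimal illustration in plain NumPy, not Banksy_py's own code: `zscore`, `neighbour_mean` and `augment_per_sample` are hypothetical helper names, and the unweighted k-nearest-neighbour mean and the sqrt(1 - lam) / sqrt(lam) scaling are simplifying assumptions standing in for Banksy's neighbour matrix computation. The point it shows is that the spatial neighbours are looked up inside each sample only, and the per-sample results are stacked afterwards.

```python
import numpy as np


def zscore(X):
    """Z-score each gene (column) within a single sample."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0  # constant genes: avoid division by zero
    return (X - mu) / sd


def neighbour_mean(X, coords, k):
    """Mean expression of each cell's k nearest spatial neighbours (brute force)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a cell is not its own neighbour
    idx = np.argsort(d, axis=1)[:, :k]   # indices of the k closest cells
    return X[idx].mean(axis=1)


def augment_per_sample(X, coords, sample, k=3, lam=0.2):
    """Assumed stand-in for the BANKSY matrix, built per sample and then stacked."""
    blocks, labels = [], []
    for s in np.unique(sample):
        mask = sample == s                          # step 1: subset by sample
        Xs = zscore(X[mask])                        # step 2: simple per-sample z-score
        Ns = neighbour_mean(Xs, coords[mask], k)    # step 3: neighbours within this sample only
        blocks.append(np.hstack([np.sqrt(1 - lam) * Xs, np.sqrt(lam) * Ns]))
        labels.append(sample[mask])
    # step 4: concatenate along the cells dimension
    return np.vstack(blocks), np.concatenate(labels)


# Toy data: 12 cells, 4 genes, two samples with their own coordinate systems.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 4))
coords = rng.uniform(size=(12, 2))
sample = np.array(["A"] * 5 + ["B"] * 7)

features, labels = augment_per_sample(X, coords, sample)
print(features.shape, labels)
# Step 5 would run dimensionality reduction and clustering on `features` as a whole.
```

With anndata objects, the stacking in the last line of `augment_per_sample` corresponds to the concatenation call in step 4, and the `labels` array plays the role of a per-cell sample annotation.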

In this approach, you wouldn't need `match_labels` because the clusters should already encompass cells from multiple samples.
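One way to check this after clustering is to count cells per cluster and sample. The snippet below is only a sketch on made-up labels (`sample_of_cell` and `cluster_of_cell` are hypothetical stand-ins for the per-cell sample and cluster annotations); a cluster that has cells in every sample is shared across samples, so its label already means the same thing everywhere and no label matching between samples is required.

```python
from collections import Counter

# Hypothetical per-cell annotations after clustering the concatenated data.
sample_of_cell = ["s1", "s1", "s1", "s2", "s2", "s2", "s2"]
cluster_of_cell = [0, 1, 1, 0, 0, 1, 2]

# Count cells for every (cluster, sample) pair; missing pairs count as 0.
counts = Counter(zip(cluster_of_cell, sample_of_cell))
samples = sorted(set(sample_of_cell))
clusters = sorted(set(cluster_of_cell))

# Contingency table: cluster -> {sample -> number of cells}.
table = {c: {s: counts[(c, s)] for s in samples} for c in clusters}

# Clusters that contain at least one cell from every sample.
shared = [c for c in clusters if all(table[c][s] > 0 for s in samples)]

for c in clusters:
    print(c, table[c])
print("clusters present in all samples:", shared)
```

If the annotations live in a pandas DataFrame, `pandas.crosstab` on the two columns gives the same table.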

We are working on an updated version of Banksy_py and will try to include an example notebook demonstrating this multi-sample workflow.

I hope this clarifies the process! Feel free to ask if you have any further questions.

Sengels24 commented 3 months ago

Hi @chousn,

Thanks for the detailed response! Very helpful, and I will now run the analysis this way!

Andrea-ZW commented 2 weeks ago

Thank you for the clarification. I have a follow-up question: once we concatenate the multiple anndata objects after computing their BANKSY matrices, how can we generate the clustering graph, given that their weight graphs are not generated simultaneously? And what do you mean by "the clusters should already encompass cells from multiple samples"?

prabhakarlab / Banksy_py

Multi-sample analysis pipeline #10