Subclustering capability

Hello Prabhakarlab,

I really appreciate your great efforts and contributions to the community with BANKSY. Recently, I started using BANKSY in our research program and am impressed by its capability.

I have three questions and wanted to hear your advice.

1. Subclustering. If users want to identify granular cell types in each compartment, is subclustering applicable with BANKSY (e.g. identifying 10-20 T cell subtypes)? I felt that simply subsetting the dataset to a particular compartment might degrade the expression data of neighborhood cells. So the workflow would be:
   - generate the neighbor-augmented expression matrix using all cells
   - subset the data to a compartment (e.g. T cells), preserving the neighbor-augmented matrix computed using all cells
   - dimensional reduction, graph, clustering

   For instance in Fig3g-j of the BANKSY paper, 22 clusters were identified using CRC MERSCOPE data. Apparently, some cell types are not granular (T cells, myeloid, B) while some subtypes of epithelial cells are identified. I guess this 22 clusters were identified by one-time clustering using all cells. I wonder how granular cell types you can identify (as granular as scRNAseq data), using BANKSY (unsupervised) without the help of reference data. In Extended Fig2, subpopulations of macrophages are identified using singleR. Is BANKSY necessary in this case as reference data is used? I also wonder if the identifications of subpopulations are doable without the reference scRNAseq data.
2. Niche, spatial domain, spatial neighborhood. Is BANKSY capable of detecting niches in space (like described in Goltsev et al, Cell, 2018)? I wonder what is the best practice in this case. Should I use relatively high mixing index (it seems 0.8 is used in the paper)? In addition, what is the good practice to determine the clustering resolution in this case (also clustering method; typically k-mean has been used to identify niches using cell type labels), and how to determine the optimal number of spatial domains?
3. Integrative analysis of health and disease (also disease samples where tissue conditions are compromised). I wonder if BANKSY works well for integratively identifying shared celltypes from health and disease samples. If we would like to identify shared cell types first from health and disease samples and then, would like to investigate changes in transcriptional states and spatial characteristics, does BANKSY unnecessarily distinguish shared cell types into two distinct clusters because their neighborhood status is different and potentially in some disease cases, tissue conditions are severely compromised (e.g IBD). I wanted to hear your experiences in identifying cell types from health and disease tissues. (perhaps, incorporating in harmony etc).

Thank you so much for your help!

best, Koichi

Thanks for your insightful question. I will respond to this soon!

Hi Koichi,

This is a great set of questions. I will answer point 1 first, and address points 2 and 3 in subsequent answers.

Point 1

Your strategy for round 2 clustering of a subset of cells is correct. Compute the neighbour-augmented matrix using all cells, then subset, then cluster. Note that you can see this type of granular / subclustering approach in Supplementary Figure 13, where we subclustered the neurons in the MERFISH mouse hypothalamus data to get ~70 inhibitory and excitatory neuron subtypes, similar to the original study (Moffitt et al., 2018).
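The order of operations (augment on all cells, then subset, then reduce and cluster) can be sketched in plain NumPy. This is only an illustration under stated assumptions, not the BANKSY implementation: `neighbour_augmented` is a hypothetical helper, the data are random toy values, the neighbourhood is an unweighted mean over the `k_geom` nearest cells excluding the cell itself, the two blocks are weighted by sqrt(1 - lambda) and sqrt(lambda), and an SVD-based PCA stands in for the dimensional reduction step. The final graph clustering step is left out.

```python
import numpy as np

def neighbour_augmented(expr, coords, k_geom=15, lam=0.2):
    """Hypothetical helper: [own expression | mean expression of k_geom nearest cells]."""
    # Dense pairwise squared distances; fine for a toy example, not for large data.
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                  # assumption: a cell is not its own neighbour
    nbr_idx = np.argsort(d2, axis=1)[:, :k_geom]  # indices of the k_geom nearest cells
    nbr_mean = expr[nbr_idx].mean(axis=1)         # unweighted neighbourhood mean (simplification)
    return np.hstack([np.sqrt(1 - lam) * expr, np.sqrt(lam) * nbr_mean])

# Toy data: 200 cells, 10 genes, random positions and compartment labels.
rng = np.random.default_rng(0)
n_cells, n_genes = 200, 10
expr = rng.poisson(2.0, size=(n_cells, n_genes)).astype(float)
coords = rng.uniform(0, 100, size=(n_cells, 2))
compartment = rng.choice(["T", "B", "Epi"], size=n_cells)

# Step 1: build the augmented matrix from ALL cells (lam=0.2 is an illustrative value).
aug_all = neighbour_augmented(expr, coords, lam=0.2)

# Step 2: subset rows to the compartment of interest; neighbourhoods still reflect all cells.
t_mask = compartment == "T"
aug_t = aug_all[t_mask]

# Step 3: dimensional reduction on the subset (PCA via SVD), ready for graph clustering.
centered = aug_t - aug_t.mean(axis=0)
u, s, _ = np.linalg.svd(centered, full_matrices=False)
pcs = u[:, :5] * s[:5]

# For contrast: subsetting first changes the neighbourhood half of the matrix,
# because non-T neighbours are no longer available.
aug_wrong = neighbour_augmented(expr[t_mask], coords[t_mask], lam=0.2)
```

In this sketch the own-expression half of `aug_t` and `aug_wrong` is identical, while the neighbourhood half differs, which is the concern raised in the question about subsetting before computing neighbourhoods.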

On your sub-point about granularity of clustering and the need for reference data:

> some cell types are not granular (T cells, myeloid, B) while some subtypes of epithelial cells are identified. I guess this 22 clusters were identified by one-time clustering using all cells.

i. Yes, the 22 clusters in the CRC MERSCOPE data were found using a single round of clustering. I suspect the 500 genes used there contain enough information to further distinguish the subtypes of B cells, T cells, etc., and that this would happen at higher resolutions, or by taking the subclustering (second round) approach you mentioned above.

> I wonder how granular cell types you can identify (as granular as scRNAseq data), using BANKSY (unsupervised) without the help of reference data.

A necessary, but not sufficient, condition for a clustering algorithm to distinguish any given pair of cell types/states is that at least one measured feature can distinguish that pair. In scRNA-seq data, we measure ~20,000 genes (selected down to something like 2000 informative/variable ones). Usually this means the feature set is rich enough that most cell types of interest can be resolved. In FISH-based spatial data, we commonly have only 100-500 genes (though the number is increasing). I think this tends to be the reason that the cell typing granularity in spatial data is not quite at the level of scRNA-seq, and I think this is independent of the clustering algorithm used.

Side point: if a pair of cells is not distinguished by any of the features in a spatial dataset, then I don't think a reference dataset can distinguish them. There needs to be something in the spatial data that distinguishes two cells in the first place before a reference-based / label-transfer type method (like SingleR) can separate them / label them.

In summary: whether a spatial method can distinguish cells at the granularity of scRNA-seq depends on how informative the features measured in the spatial assay are compared with those in scRNA-seq, and not on the method itself. Using a reference dataset cannot 'fix' the issue.

On a related note: I think a very interesting observation we made with BANKSY was that it could find a dimension of separation not accessible to non-spatial methods. For example, in Supp. Fig. 11, we show that the mature oligodendrocytes that separated into the anterior commissure (white matter MODs) and the rest of the hypothalamic preoptic area (grey matter MODs) could not be separated by non-spatial clustering because the 'axis of separation' was different in the spatial and non-spatial modes.

ii.

> In Extended Fig2, subpopulations of macrophages are identified using singleR. Is BANKSY necessary in this case as reference data is used?

In typical SingleR analysis, we impute labels from reference data onto single cells, and in doing this labeling, we implicitly group the cells too. I think it is worth clarifying that this is not what we are doing when we use SingleR.

In our case, the grouping is done in an unsupervised way (via BANKSY clustering), and then each cluster is given a label corresponding to the majority label given to its cells by SingleR / doing a correlation to the reference. Thus, SingleR itself is not being used to group cells.
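The "majority label per cluster" step can be sketched as follows. This is a minimal illustration, not the code used in the paper: `label_clusters_by_majority` is a hypothetical helper, and the cluster IDs and per-cell labels are made-up toy values standing in for BANKSY cluster assignments and SingleR calls.

```python
from collections import Counter

def label_clusters_by_majority(cluster_ids, per_cell_labels):
    """Hypothetical helper: name each cluster after the most common per-cell label in it."""
    votes = {}
    for cluster, label in zip(cluster_ids, per_cell_labels):
        votes.setdefault(cluster, Counter())[label] += 1
    # The grouping (cluster_ids) is untouched; reference labels only provide the names.
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}

# Toy inputs: unsupervised cluster IDs and reference-based labels for seven cells.
cluster_ids = [0, 0, 0, 1, 1, 1, 1]
per_cell_labels = ["Macrophage", "Macrophage", "Monocyte",
                   "T cell", "T cell", "B cell", "T cell"]

cluster_names = label_clusters_by_majority(cluster_ids, per_cell_labels)
print(cluster_names)
```

Cells whose individual label disagrees with the majority (the "Monocyte" and "B cell" entries here) keep their cluster membership and simply inherit the cluster's name.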

You can think of that analysis this way: non-spatial clustering on the spatial data could not identify that cluster, but BANKSY could. Then we used reference data to check the expression signature of this cluster, and found it to be macrophages.

> I also wonder if the identifications of subpopulations are doable without the reference scRNAseq data.

As above, the subpopulations are indeed being identified without reference data. The reference data is only being used to give each identified subpopulation a name.

@vipulsinghal02 Thanks so much for the meticulous answers and for sharing your deep insights and experience on question #1. With all the convincing info you provided here, I am more convinced that BANKSY is an exciting opportunity to identify granular cell types in an unbiased way.

It is encouraging to hear that BANKSY is compatible with subclustering, as you demonstrated using the MERFISH POA data. I also like your semi-supervised annotation, where you took a creative approach that uses both BANKSY and reference data: respecting the BANKSY clusters while labeling each one by its majority label from SingleR/correlation (perhaps winner-take-all style for each cluster).

I totally agree that granularity depends on the plex size of each assay. Given that 6K plex is commercially available for imaging-based technology (NanoString/Bruker; potentially whole-transcriptome from 2025 onward), I am quite interested in and excited about how granular we can go with these assays and BANKSY.

Thanks!

Hi Koichi, apologies for the delay in getting back to you. It's just one of those weeks!

Thanks for your kind words! Agreed, as plex increases, BANKSY should approach scRNA seq, and exceed it (since it can use spatial information to do things scRNAseq clustering cannot).

Point 2

The pertinent difference between the method of Goltsev et al. and BANKSY's is in the feature set. I think Goltsev's neighbourhood cell type frequency vector is, in some sense, a 'coarsening' of BANKSY's mean of neighbouring cells' expression vectors.
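One way to make the 'coarsening' idea concrete is the toy check below. It is a sketch under assumptions, not either method's actual code: the cell types, positions, and type-level mean profiles (`centroids`) are random made-up values, and the neighbourhood is an unweighted k-nearest-neighbour set. If every cell's expression is replaced by the mean profile of its cell type, the neighbourhood mean expression becomes exactly the neighbourhood cell type frequency vector multiplied by the matrix of type profiles.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells, n_types, n_genes, k = 50, 3, 8, 5

cell_type = rng.integers(0, n_types, size=n_cells)   # toy cell type labels
coords = rng.uniform(0, 10, size=(n_cells, 2))       # toy positions

# k nearest spatial neighbours of each cell (self excluded).
d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
np.fill_diagonal(d2, np.inf)
nbrs = np.argsort(d2, axis=1)[:, :k]

# Frequency-vector feature: fraction of each cell type among the neighbours.
one_hot = np.eye(n_types)[cell_type]
type_freq = one_hot[nbrs].mean(axis=1)               # shape (n_cells, n_types)

# Mean-expression feature: average expression of the neighbours.
centroids = rng.gamma(2.0, 1.0, size=(n_types, n_genes))  # hypothetical type mean profiles
expr = centroids[cell_type] + rng.normal(0, 0.1, size=(n_cells, n_genes))
nbr_mean_expr = expr[nbrs].mean(axis=1)              # shape (n_cells, n_genes)

# Coarsened version: each neighbour is represented only by its type's mean profile.
coarse = centroids[cell_type][nbrs].mean(axis=1)

# The coarsened neighbourhood mean is a linear function of the frequency vector.
same = np.allclose(coarse, type_freq @ centroids)
# What the frequency vector discards is the within-type variation among neighbours.
residual = np.abs(nbr_mean_expr - coarse).max()
```

In this toy setting `same` is true by construction, and `residual` measures the within-type variation that the mean-expression feature retains but the frequency vector drops.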

> Should I use relatively high mixing index (it seems 0.8 is used in the paper)?

Yes, absolutely, unless you have Visium data, where each spot already contains many cells and so is already a neighbourhood average. In that case we have found that lambda = 0.2 does well. I am not sure exactly why it's better to include a bit of contribution from nearby spots, but it seems to work very well in practice! I think it is effectively expanding the neighbourhood influence slightly. For single-cell-resolution data (like FISH-based, SlideSeq, or Xenium), I think lambda = 0.8 is the way to go for finding niches.
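A rough way to read lambda, assuming the square-root weighting of the two feature blocks (own expression scaled by $\sqrt{1-\lambda}$, neighbourhood mean scaled by $\sqrt{\lambda}$) and ignoring any additional terms such as the azimuthal Gabor filter: writing $x_i$ for the own expression of cell $i$ and $m_i$ for its neighbourhood mean (notation introduced here for illustration), the squared distance between two cells in the augmented space splits into two parts.

$$
z_i = \big[\sqrt{1-\lambda}\,x_i,\ \sqrt{\lambda}\,m_i\big],
\qquad
\lVert z_i - z_j \rVert^2 = (1-\lambda)\,\lVert x_i - x_j \rVert^2 + \lambda\,\lVert m_i - m_j \rVert^2
$$

Under that assumption, lambda = 0.8 puts most of the weight on how different the two neighbourhoods are, so cells group by niche, while lambda = 0.2 keeps most of the weight on the cells' own transcriptomes.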

> what is the good practice to determine the clustering resolution in this case

I think both clustering resolution and $k_{geom}$ can be used to tune the size of the domains. Low resolution and large $k_{geom}$ should give you coarser / larger regions, while high resolution and smaller $k_{geom}$ should give you smaller niches.

> also clustering method; typically k-mean has been used to identify niches using cell type labels

I think whether it's Leiden or Louvain or k-means honestly probably doesn't matter too much (there are probably speed / accuracy differences, but the results are probably not terribly different; clustering is clustering, after all!)

> how to determine the optimal number of spatial domains

I think this depends on known biology / what you are expecting. Even in methods like HMRF-based domain finding or k-means, you need to specify a number of clusters / labels. How do you determine a number there? Basically, you try a few different values of k (number of clusters), and see what makes most sense by talking to biologists / looking at markers (like spatially variable genes a la SpatialDE and SPARK). Here, you can tune the resolution (say from 0.2 to 2.5 in steps of 0.1) and see what you get at different resolutions.

> We adjusted the resolution parameter from 0.1 to 1.5 such that the number of clusters obtained matched the number of layers present in the manual annotation. Because more than one resolution parameter can yield the correct number of layers, for each dataset we report the median ARI across all parameter settings that had the correct layer number.

--from 'Human DLPFC 10x Visium data' in Methods of the BANKSY paper.
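The selection rule in that quote (keep only the resolutions that give the expected number of clusters, then report the median ARI over those) can be sketched like this. It is an illustration only: `adjusted_rand_index` and `median_ari_at_target` are hypothetical helpers written here in NumPy, and the label vectors are hand-made toy values standing in for clustering runs at different resolutions.

```python
import numpy as np
from math import comb

def adjusted_rand_index(a, b):
    """Adjusted Rand index between two labelings, computed from the contingency table."""
    _, ai = np.unique(np.asarray(a), return_inverse=True)
    _, bi = np.unique(np.asarray(b), return_inverse=True)
    table = np.zeros((ai.max() + 1, bi.max() + 1), dtype=int)
    np.add.at(table, (ai, bi), 1)
    sum_cells = sum(comb(int(n), 2) for n in table.ravel())
    sum_rows = sum(comb(int(n), 2) for n in table.sum(axis=1))
    sum_cols = sum(comb(int(n), 2) for n in table.sum(axis=0))
    expected = sum_rows * sum_cols / comb(len(ai), 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:          # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

def median_ari_at_target(labels_by_resolution, truth, n_target):
    """Median ARI over the resolutions whose clustering has exactly n_target clusters."""
    aris = [adjusted_rand_index(truth, labels)
            for labels in labels_by_resolution.values()
            if len(set(labels)) == n_target]
    return float(np.median(aris)) if aris else None

# Toy example: 9 spots in 3 annotated layers, clustered at four resolutions.
truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
labels_by_resolution = {
    0.1: [0, 0, 0, 0, 0, 0, 1, 1, 1],   # 2 clusters: skipped
    0.5: [0, 0, 0, 1, 1, 1, 2, 2, 2],   # 3 clusters: kept
    0.6: [0, 0, 1, 1, 1, 1, 2, 2, 2],   # 3 clusters: kept
    1.5: [0, 0, 1, 1, 2, 2, 3, 3, 4],   # 5 clusters: skipped
}
result = median_ari_at_target(labels_by_resolution, truth, n_target=3)
```

When no manual annotation is available, the same sweep can still be run and the candidate resolutions inspected by eye or against marker genes, as described above.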

By the way, there was a study where labels were defined at multiple resolutions:

Neighbourhood < Community < Tissue Unit. See: Hickey et al. Fig. 3 a-e. (and the related Ext Data Fig 8 in the BANKSY paper).

Point 3: Integrative analysis of health and disease

> I wonder if BANKSY works well for integratively identifying shared celltypes from health and disease samples. If we would like to identify shared cell types first from health and disease samples and then, would like to investigate changes in transcriptional states and spatial characteristics, does BANKSY unnecessarily distinguish shared cell types into two distinct clusters because their neighborhood status is different and potentially in some disease cases, tissue conditions are severely compromised (e.g IBD).

This is a great use case. If the neighbourhoods of the same cell type are different in two different regions or samples, and these are clustered jointly / in the same clustering run, then BANKSY will indeed tend to give them separate labels.

This is actually a major use case of BANKSY: one can always merge two clusters manually if one wants to, but separating them by differing environment is nontrivial, at least if the diseased and undiseased tissue is distributed in a single sample in such a way that it is not easy to just cut out the diseased and undiseased regions into separate samples.

Maybe one way to think about diseased and undiseased region cells of the same type getting separate labels is as an algorithmic generalization of manually cutting out diseased and undiseased regions. Now you can either perform analysis on each subpopulation (like finding DE genes between the two subpopulations), or merge them back if you would like.
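Both options are simple once the two clusters exist. The sketch below uses made-up data and hypothetical cluster names (`T_healthy_region`, `T_diseased_region`); the difference in mean log expression is a crude stand-in for a proper DE test, and the shift injected into gene 0 is there only so the example has something to find.

```python
import numpy as np

rng = np.random.default_rng(2)
n_genes = 6

# Toy data: one cell type split into two clusters by its environment.
labels = np.array(["T_healthy_region"] * 40 + ["T_diseased_region"] * 60)
expr = rng.poisson(5.0, size=(100, n_genes)).astype(float)
expr[:40, 0] += 10.0   # illustrative shift in gene 0 for the first subpopulation

# Option A: keep the clusters separate and compare them gene by gene.
log_expr = np.log1p(expr)
diff = (log_expr[labels == "T_healthy_region"].mean(axis=0)
        - log_expr[labels == "T_diseased_region"].mean(axis=0))
top_gene = int(np.argmax(np.abs(diff)))   # gene with the largest mean log difference

# Option B: merge the clusters back into a single label.
merge_map = {"T_healthy_region": "T", "T_diseased_region": "T"}
merged = np.array([merge_map.get(label, label) for label in labels])
```

Merging is a relabeling that can be undone at any time, whereas recovering the split after the fact would require rerunning the spatial clustering.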

Also, it is worth stating that in general, if a cell type lives in very different microenvironments, then it is very likely that the cell type's own expression in the two environments is subtly different. This is why it's useful to have a method that gives them different labels.

This was the use case in the mature oligodendrocyte clusters in Fig 3 of the BANKSY paper, where the two subpopulations (white matter and grey matter) had similar transcriptomes, with subtle gene expression variation.

> I wanted to hear your experiences in identifying cell types from health and disease tissues. (perhaps, incorporating in harmony etc).

Sadly, we have not looked at this (and no current plans to go in this direction on our end). But if you do some analysis on this front, I would be very curious to see what you find. :-)

@vipulsinghal02 Thanks so much again for sharing your super deep insights. They are very helpful and inspiring.

I have been using BANKSY for the last two months or so, and I love it. There will be so much to learn from real experience.

Regarding integrative analysis of health and disease, the ideal scenario is that (1) experimental variation (e.g. batch) is minimized and (2) BANKSY detects differences in space between cells of the same (transcriptional) cell type, assigning them to separate clusters. Perhaps applying a Harmony-type correction in BANKSY space is necessary, as you showed in your vignettes.

As far as I can find now, one of the best public datasets for exploring BANKSY's capability for health/disease integration is the IBD CosMx study from the Salas group, where the authors labeled CosMx cells using reference scRNAseq data. In my assessment, even at the major compartment level (Epi, T, B etc.), transferred labels only match manually clustered labels ~70-80% of the time (and even less for granular cell types within each compartment). I feel there are pros and cons to the supervised approach vs conventional own-transcriptome-only clustering. The supervised approach is more robust for CosMx data, which is sparser/noisier than scRNAseq, especially when detecting sub-cell types, while the unbiased approach may detect cell types that are not present in the reference data (e.g. neutrophils, which are absent from many scRNAseq datasets). However, likely because of the data sparsity, conventional clustering followed by manual labeling of CosMx-type data seems pretty challenging in our experience (and I felt very excited when I came across BANKSY).

I started applying BANKSY to the data recently. So far, what I have found is:

1. Batch correction seems necessary. Simply merging the data from health and IBD results in separation of cells by disease type irrespective of cell type.
2. Normalization methods may matter. The assumption used in conventional log-norm in scRNAseq analysis (same number of transcripts per cell) does not seem to be the best option for sparse CosMx data. I had no issue using Seurat, scTransform, and Harmony when only using transcriptomic data, while I ran into an error using scTransform with BANKSY. Have you used scTransform, Harmony and BANKSY together yet? (I may make a separate thread for this.)
3. Currently, I am testing log-norm/BANKSY/Harmony to see if BANKSY goes beyond conventional clustering and shows good correspondence with the authors' labels (label transferred). I will keep you updated.
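For reference, the assumption behind conventional log-normalization mentioned in the second point can be made concrete with a small sketch. `log_normalise` is a hypothetical helper, the counts are toy values, and the scale factor of 10,000 is a commonly used default taken here as an assumption.

```python
import numpy as np

def log_normalise(counts, scale_factor=10_000.0):
    """Scale each cell to the same total count, then apply log1p."""
    totals = counts.sum(axis=1, keepdims=True)
    totals = np.where(totals == 0, 1.0, totals)   # leave empty cells as all zeros
    return np.log1p(counts / totals * scale_factor)

# Three toy cells: the second has 10x the counts of the first with the same proportions.
counts = np.array([[5, 0, 5],
                   [50, 0, 50],
                   [1, 1, 0]], dtype=float)
normed = log_normalise(counts)

# After normalization the first two cells are indistinguishable: any difference in
# total transcripts per cell is treated as technical and removed.
identical = np.allclose(normed[0], normed[1])
```

Whether discarding per-cell totals is appropriate for sparse imaging-based data, where the total may also reflect cell size or segmentation, is exactly the open question raised above.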

Thanks! Koichi

Thanks for your kind comment @KoichiHashikawa! By the way, you might want to look at this and this post for a bit of a discussion on integrative / Harmony analysis.

You might also be interested in a discussion about SCT and normalization as well.

Hope these help, and let us know how things go. :-)

Closing this for now, feel free to reopen if you would like.

prabhakarlab / Banksy

Subclustering capability #28