theislab / scib

Benchmarking analysis of data integration tools

What type of data do the `anndata` layers need to contain to run scib.me.metrics_fast(), depending on the integration method used? #300

Open antonioggsousa opened 2 years ago

antonioggsousa commented 2 years ago

Hi,

Thank you to all the developers of scIB for the great package, and for suggesting useful metrics to compare the performance of integration methods.

WARNING: You may want to skip to the end of the post to read the more concrete questions!

I've been trying to run the function scib.metrics.metrics_fast() (from scIB) for a while, but I've been having trouble understanding which data layers the anndata objects need to hold, depending on the method and parameters used during integration.

I should say that I'm mostly an R user, and I'm interested in comparing the performance of Seurat and Scanorama across multiple data sets. I would like to run these methods following the parameters used in the paper Benchmarking atlas-level data integration in single-cell genomics (Fig. 3b).

Although I know a bit of Python, I usually prefer to avoid working with it, since I often find myself making simple mistakes, which is quite frustrating. Additionally, I'm less familiar with general Python packages (numpy, pandas, etc.) as well as with the more specific scRNA-seq package scanpy, which makes the code less readable for me. For all these reasons, I thought I would run Seurat and Scanorama (with reticulate) in R and use the final results to compare performance with scIB; that way I keep the flexibility to change parameters in R for the integration tasks while minimizing potential errors when using Python. I should also say that the reason I'm not using the snakemake workflow scib-pipeline is that I have my own workflow, and I don't know how hard it would be to adapt it to run the integration methods as implemented in scib-pipeline.

To try to reproduce the Seurat v3 RPCA method and parameters, I did the following (please see a minimal reproducible example below). Briefly, the data set was split by the batch variable, 2K HVG were found and the data scaled. If I understood correctly, there isn't a wrapper function for the Seurat method in the scIB package, since Seurat is implemented in R. After reading the code in the scib-pipeline repository, I understood that the conversion between anndata and Seurat objects is done with the as.SingleCellExperiment() and as.Seurat() functions. My understanding is that the @data slot of the Seurat object will hold the scaled counts for all genes, in contrast to the Python integration methods, where the HVG are found first, the table of normalized counts is subsetted to the HVG, and only then is scaling performed (see the short scanpy sketch after the R code below). In Seurat, on the other hand, the scaled counts in @data will be scaled again for the HVG exported in RDS format. Is my interpretation correct?

## Packages
library("dplyr")
library("Matrix")
library("Seurat")
library("SeuratData")
library("SeuratDisk")

# Import data
if ( ! ( "panc8" %in% (InstalledData() %>% pull(Dataset)) ) ) InstallData("panc8")
panc8 <- LoadData("panc8")

# Set seed 
set.seed(1024)

## Pre-process data 
# Normalize
panc8 <- NormalizeData(panc8, normalization.method="LogNormalize", scale.factor=10000) # Normalization

# Save the original unintegrated object
panc8 <- UpdateSeuratObject(panc8)
SaveH5Seurat(panc8, "data/unint.h5Seurat") 
Convert("data/unint.h5Seurat", dest="h5ad") # anndata.X : @data w/ 34363 genes x 14890 cells 

## Split data by batch, find 2K HVG, scale & merge Seurat obj
# Split by batch
seu_ls <- SplitObject(panc8, split.by="tech")

# Find 2K HVG 
seu_ls <- lapply(seu_ls, function(x) {
             x <- NormalizeData(x)
             x <- FindVariableFeatures(x, selection.method="vst",nfeatures=2000)
})
features <- SelectIntegrationFeatures(object.list = seu_ls)
saveRDS(features, "data/hvg_2k.rds")

# Scale 
for (s in seq_along(seu_ls)) if (length(seu_ls[[s]]@assays$RNA@var.features)>0) seu_ls[[s]]@assays$RNA@var.features <- character(0) # clear HVG so all genes get scaled; the slot expects a character vector
seu_ls <- lapply(seu_ls, function(x) {
             x <- ScaleData(x)
})
# Substitute 'data' with 'scale.data' & merge Seurat objects
for (s in seq_along(seu_ls)) seu_ls[[s]]@assays$RNA@data <- as(seu_ls[[s]]@assays$RNA@scale.data, "sparseMatrix") 
seu <- merge(seu_ls[[1]], unlist(seu_ls[2:length(seu_ls)]), add.cell.ids = names(seu_ls))

## Integration
seu_ls <- SplitObject(seu, split.by="tech") 
seu_ls <- lapply(X = seu_ls, FUN = function(x) {
             x <- ScaleData(x, features = features, verbose = FALSE)
             x <- RunPCA(x, features = features, verbose = FALSE)
})
anchors <- FindIntegrationAnchors(object.list = seu_ls, 
                  anchor.features = features, 
                  reduction = "rpca")
int <- IntegrateData(anchorset=anchors)
DefaultAssay(int) <- "integrated"
int <- ScaleData(int)
int <- RunPCA(int)
int <- FindNeighbors(int, reduction="pca", dims=1:30) # run neighbour graph
int@assays$integrated@scale.data <- matrix(ncol=0,nrow=0) # empty the scale.data slot before export
SaveH5Seurat(int, "data/seurat_int.h5Seurat")
Convert("data/seurat_int.h5Seurat", dest="h5ad") # anndata.X: norm corrected data w/ 2000 genes x 14890 cells

The case of Scanorama seems simpler: from the code, it seems that the input (given the preferred parameters mentioned above) is a table of scaled counts for the HVG per batch (see the short example below, and the native-Python sketch after it). Is my interpretation correct?

# Packages
library("dplyr")
library("Seurat")
library("SeuratDisk")
library("reticulate")
source("helper_functions.R")
scanorama <- import("scanorama")

# Set seed
set.seed(1024)

# Load scaled obj & HVG genes
panc8 <- LoadH5Seurat("data/scaled.h5Seurat")
features <- readRDS("data/hvg_2k.rds") 

# Subset object to HVG
meta_data <- panc8@meta.data
counts <- GetAssayData(panc8, slot="counts", assay="RNA")   
seu <- CreateSeuratObject(counts=counts[features,], 
              meta.data=meta_data)

# Split data by batch
seu_ls <- SplitObject(seu, split.by="tech")

# Normalize and scale by batch
seu_ls <- lapply(seu_ls, function(x) {
             x <- NormalizeData(x)
             x <- ScaleData(x, features=row.names(x))
              })

## Parse data for Scanorama
mtx <- lapply(seu_ls, function(x) t(as.matrix(x@assays$RNA@scale.data))) 
genes <- lapply(mtx, function(x) colnames(x))
cells <- lapply(mtx, function(x) row.names(x)) 
cells <- unlist(cells) 
names(mtx) <- names(genes)  <- NULL

# Integration & batch-correction
int_ls <- scanorama$correct(mtx, genes, 
                return_dimred=TRUE, 
                return_dense=TRUE)

## Parse results
# Join dimensional reductions
int_dr <- do.call(rbind,int_ls[[1]])
colnames(int_dr) <- paste0("PC_", 1:100)
row.names(int_dr) <- cells

# Join integrated counts
int_t <- lapply(int_ls[[2]],t)
int_counts <- do.call(cbind,int_t)
row.names(int_counts) <- as.character(int_ls[[3]])
colnames(int_counts) <- cells

# Create Seurat object, merge scale.data from the list &
# save @scale.data into @data of the integrated object
int <- CreateSeuratObject(counts=int_counts, 
              assay="integrated") 
              #project="batch")
int@assays$integrated@data <- t(do.call("rbind",mtx))
if (!is.null(seu_ls)) { # add metadata
    meta_ls <- lapply(seu_ls, function(x) x@meta.data)
    meta_all <- merge_metadata(meta_ls)
    int@meta.data <- meta_all
    row.names(int@meta.data) <- colnames(int)
}

# Add dimensional reduction 
std <- apply(int_dr, MARGIN=2, FUN=sd)
int@reductions$pca <- CreateDimReducObject(embeddings=int_dr, 
                       stdev=std, key="PC_", 
                       assay = "integrated")
row.names(int@reductions$pca@cell.embeddings) <- colnames(int)
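
For comparison, my understanding is that the same call in native Python would look roughly like the sketch below (assuming mtx and genes are the per-batch scaled cells-by-genes matrices and matching gene-name lists prepared above).

# Native-Python sketch of the Scanorama call made above via reticulate
# (my own illustration; `mtx` and `genes` are assumed to be the per-batch
# scaled cells-x-genes matrices and their gene-name lists).
import scanorama

integrated, corrected, corrected_genes = scanorama.correct(
    mtx, genes,
    return_dimred=True,   # also return the low-dimensional embedding per batch
    return_dense=True     # return dense arrays instead of sparse matrices
)
# integrated:      list of per-batch embeddings (100-dimensional by default)
# corrected:       list of batch-corrected expression matrices on the shared genes
# corrected_genes: gene names of the corrected matrices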

Finally, I'm assessing the performance with scIB (see below). Here, it seems that I need to fix the type of the categorical variables and compute the KNN graph with scanpy, because SeuratDisk seems unable to save the connectivities of this graph to the right place, anndata.obsp["connectivities"] (perhaps due to an incompatibility between versions). My understanding is the following:

# Import modules
import scib
import scanpy as sc

# Import anndatas files
unint=sc.read("data/unint.h5ad")
seurat_int=sc.read("data/seurat_int.h5ad")
scanorama_int=sc.read("data/scanorama_int.h5ad")

# Turn 'object' variables 'batch_key' & 'label_key' into 'category' variables
key_vars=["tech","celltype"]
for v in key_vars: 
    unint.obs[v]=unint.obs[v].astype("category")
    seurat_int.obs[v]=seurat_int.obs[v].astype("category")
    scanorama_int.obs[v]=scanorama_int.obs[v].astype("category")

# Compute PCA & neighbor graph
scib.pp.reduce_data(seurat_int, n_top_genes=2000, neighbors=True, use_rep="X_pca", pca=True, umap=False)
scib.pp.reduce_data(scanorama_int, n_top_genes=2000, neighbors=True, use_rep="X_pca", pca=True, umap=False)

# Run fast metrics
seurat_res=scib.me.metrics_fast(adata=unint, adata_int=seurat_int, batch_key="tech", label_key="celltype", organism="human", type_="full")
scanorama_res=scib.me.metrics_fast(adata=unint, adata_int=scanorama_int, batch_key="tech", label_key="celltype", organism="human", type_="embed") #it does not fail because of: adata_int.obsm["X_emb"] = adata_int.obsm["X_pca"].copy() (https://github.com/theislab/scib-pipeline/blob/75ae100cf158191ee9097750e658b2f822cc837b/scripts/metrics/metrics.py#L142)
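
In case it helps to see what I'm actually after: as far as I can tell, the pieces I need from scib.pp.reduce_data() above are the PCA and the kNN graph that SeuratDisk failed to export. Computed directly with scanpy (my own sketch, continuing the script above and assuming the integrated values are in .X), that would be:

# Minimal sketch (my own illustration) of recomputing what SeuratDisk did not
# export: a PCA in .obsm["X_pca"] and the kNN graph in .obsp["connectivities"]
# and .obsp["distances"].
sc.pp.pca(seurat_int, n_comps=50)
sc.pp.neighbors(seurat_int, n_neighbors=15, use_rep="X_pca")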

Just to summarise and rephrase the questions above (which might be simpler):

  1. What is the exact input for Seurat and Scanorama if I want to use the top-performing parameters from the paper? (For these methods they were: Seurat v3 RPCA (RPCA + HVG + scaling - full gene expression matrix); Scanorama (HVG + scaling - embedding).)

  2. What information do the unintegrated and integrated anndata objects need to provide to scib.me.metrics_fast(), depending on whether I set type_ to "full" or "embed"? (My current guess is sketched below.)
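
My current guess, based mainly on the scib-pipeline metrics script linked in the code comment above, is sketched here; the exact requirements are what I'm hoping you can confirm or correct.

# My current guess (not verified) of what metrics_fast() expects, based on the
# scib-pipeline metrics script linked above -- please correct me if I'm wrong.

# type_="full": the corrected expression matrix sits in adata_int.X and
# PCA + neighbors are (re)computed on it, e.g. via scib.pp.reduce_data().
seurat_res = scib.me.metrics_fast(adata=unint, adata_int=seurat_int,
                                  batch_key="tech", label_key="celltype",
                                  organism="human", type_="full")

# type_="embed": the corrected embedding is expected in adata_int.obsm["X_emb"]
# (in the pipeline it is copied from .obsm["X_pca"]), with neighbors computed on it.
scanorama_int.obsm["X_emb"] = scanorama_int.obsm["X_pca"].copy()
scanorama_res = scib.me.metrics_fast(adata=unint, adata_int=scanorama_int,
                                     batch_key="tech", label_key="celltype",
                                     organism="human", type_="embed")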

Sorry for the long post.

Thank you in advance for the great package and any feedback that you may give.

Best regards,

António

mumichae commented 1 year ago

Please refer to the documentation for a user guide for the required preprocessing steps and the API documentation for the different metrics and wrappers. Also check the scib-pipeline for the exact workflow that we used in the paper. This might help address some of your questions.

Conversion with R is something that might need to be updated at some point.