theislab / kBET

An R package to test for batch effects in high-dimensional single-cell RNA sequencing data.
Apache License 2.0

High rejection rates for seemingly well integrated data #47

Closed Sum02dean closed 4 years ago

Sum02dean commented 5 years ago

Hello there. I have been trying to get this package working for quite some time now, but no matter what I do, the rejection rates are either 1 or close to 1, regardless of correction. I would greatly appreciate feedback on this issue.

The data before conversion to a matrix is a [cell by gene] data.frame whose last two columns give the batch and cell type as categorical data. The data are for a single cell type collected over three separate batches, and the batches contain similar numbers of observations.

Example:

PCA before correction: image

kBET before correction: image

PCA after correction: image

kBET after correction: image

Code:

# kBET for uncorrected data
library(FNN)
library(kBET)

# coerce to matrix and drop the two categorical columns (batch, cell type)
data <- as.matrix(original[ , 1:(length(original)-2)])
# coerce batch labels (sample) to factor
batch <- as.factor(original$sample)
# compute and plot
k0 = floor(mean(table(batch))) # neighbourhood size: mean batch size
knn <- get.knn(data, k=k0, algorithm = 'cover_tree')
batch.estimate <- kBET(data, batch, plot=TRUE, do.pca = TRUE, dim.pca = 2)

# kBET for corrected data
# coerce to matrix and drop the two categorical columns (batch, cell type)
data <- as.matrix(corrected[ , 1:(length(corrected)-2)])
# coerce batch labels (sample) to factor
batch <- as.factor(corrected$sample)
# compute and plot
k0 = floor(mean(table(batch))) # neighbourhood size: mean batch size
knn <- get.knn(data, k=k0, algorithm = 'cover_tree')
batch.estimate <- kBET(data, batch, plot=TRUE, do.pca = TRUE, dim.pca = 2)

I also tried iterating over many different values of k (recomputing knn as suggested), yet the rejection rate saturates and remains high throughout.

Assessment over varying values of k (corrected data): image

Best regards, Dean

nhuhoa commented 5 years ago

Dear Dean, I just passed by this GitHub repo. I think the problem could be:

Best, Hoa

PrashINRA commented 4 years ago

Hi nhuhoa, can you provide information about running kBET on batch-corrected Seurat objects? Thanks

mbuttner commented 4 years ago

Hi PrashINRA, I am not really familiar with the Seurat objects. However, we tried to convert the Seurat objects into SingleCellExperiment objects:

integrated <- IntegrateData(...)
integrated_sce <- as.SingleCellExperiment(integrated)

In the SingleCellExperiment object, you can access the integrated data matrix as one of the assays (I am not sure how it is going to be named). Check

integrated_sce

in your console to print an overview of the object.
Then, you can run kBET as described; see the sketch below.
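
A minimal sketch of that workflow, under assumptions: the assay name "logcounts" and the batch column "orig.ident" are guesses here, so replace them with whatever the overview of your object actually lists.

library(Seurat)
library(SingleCellExperiment)
library(kBET)

# convert the integrated Seurat object and list the available assays
integrated_sce <- as.SingleCellExperiment(integrated)
assayNames(integrated_sce)

# kBET expects cells in rows and genes in columns, hence the transpose;
# "logcounts" is an assumed assay name -- replace it with one listed above
data <- t(as.matrix(assay(integrated_sce, "logcounts")))
batch <- as.factor(colData(integrated_sce)$orig.ident)  # assumed batch column
batch.estimate <- kBET(data, batch, plot = TRUE)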

mbuttner commented 4 years ago

Hi Dean, thank you a lot for sharing so many details about your data. Looking at the PCA plot of your corrected data, I think that the kBET result reflects the result of the data integration well. kBET is pretty sensitive to batch effects, and you may visualise them better if you plot the cell density per batch for each PC (i.e. the marginals). We showed that in the accompanying publication (Figure 3b in Büttner et al., 2019): before correction, we also observed an acceptable overlap of the two technical batches, but a systematic shift remained visible, and kBET picked that up, in our example and similarly in yours.
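
For illustration, a minimal sketch of such a marginal density plot, assuming data and batch as defined in the code above; prcomp and ggplot2 stand in for whatever PCA and plotting tools you already use.

library(ggplot2)

# PCA on the corrected matrix (cells in rows, genes in columns, as above)
pca <- prcomp(data, rank. = 2)
df <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2], batch = batch)

# cell density per batch along PC1 (the marginal); a systematic batch shift
# shows up as offset density peaks even when the scatter plots overlap
ggplot(df, aes(x = PC1, fill = batch)) + geom_density(alpha = 0.5)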

The second question is how you aim to proceed with your downstream analysis. If you want your cells to cluster together, your mission seems accomplished. If you want to evaluate gradual changes over the different batches, I suggest using a different batch correction method, based on both the bad kBET values and the shifts in distribution that are still visible in the PCA plot.

Best regards, Maren

zenghh1226 commented 4 years ago

Dear mbuttner, thank you for developing this tool. I really want to use this tool to guide single-cell data integration, but I don't know which value indicates whether the data mix well. Could you give a detailed description of the result? image

RuiyuRayWang commented 3 years ago

@Sum02dean Hi Dean, I encountered the same issue as yours. The summary given by kBET showed a high rejection rate for both the original data and the batch-corrected data. Have you found a solution to the issue?

Dear @mbuttner, I've read through several issues in this repo and I feel we really need some clarification from you.

  1. In your publication, you showed in many figures that the so-called "acceptance rate" decreased after the batch effect was corrected. But in the kBET package there is no reference to an "acceptance rate", only a "rejection rate". How do I obtain an "acceptance rate" for my data?
  2. You mentioned kBET only accepts a dense matrix or knn-graph as input (#56), and you also mentioned we should use an assay from a SingleCellExperiment object as input (#58). But typically, an assay from a SingleCellExperiment IS a SPARSE matrix. In #49, someone asked about the data format to use. I feel this is by far the most ambiguous part of using kBET. In the kBET vignette, kBET was applied to count data, CPM-normalized data, and limma batch-corrected data. What exactly should we use as the input data, then?
  3. For my own purposes: in my single-cell analysis, I performed batch correction only on a subset of genes (highly variable genes), so my data is a reduced matrix with only a few thousand genes. Can I use this matrix as input?

Thanks, Ray

RuiyuRayWang commented 3 years ago

After carefully reading your vignette (kBET_vignette.Rmd), I realized what the problems were. I'll document my findings in case others raise the same question.

If I understand correctly, kBET uses Pearson's chi-squared statistic to test for homogeneity of a k0-sized neighborhood. The chi-squared statistic is known to be sensitive to sample size. With a large dataset and higher k0 values, there will be an overwhelmingly large number of significant results reported, thus yielding a high rejection rate.

I think a plausible solution is to reduce k0 to a reasonably low value, but I'm not sure whether the authors approve of such a practice.
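
For instance, a minimal sketch of that idea, reusing the knn-based call pattern from the kBET README (k0 = 10 is an arbitrary example value, not a recommendation):

library(FNN)
library(kBET)

# manually reduced neighbourhood size (arbitrary example value)
k0 <- 10
knn <- get.knn(data, k = k0, algorithm = 'cover_tree')
batch.estimate <- kBET(data, batch, k = k0, knn = knn)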

Best, Ray

mbuttner commented 3 years ago

Dear @zenghh1226,

Dear mbuttner, thank you for developing this tool. I really want to use this tool to guide single-cell data integration, but I don't know which value indicates whether the data mix well. Could you give a detailed description of the result? image

thank you for your question. Indeed, I wrote a rather extensive output for kBET. The value you are looking for is summary$kBET.observed, the observed rejection rate reported by kBET.
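
A minimal sketch of reading that value from a kBET run, assuming data and batch as in the examples above:

batch.estimate <- kBET(data, batch)
# observed rejection rate: values close to 0 indicate good mixing,
# values close to 1 indicate a strong batch effect
batch.estimate$summary$kBET.observed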

Best, Maren

mbuttner commented 3 years ago

Dear Ray,

thank you for your detailed description of your issue. I am glad that you found the vignette useful.

After carefully reading your vignette (kBET_vignette.Rmd), I realized what the problems were. I'll document my findings in case others raise the same question.

If I understand correctly, kBET uses Pearson's chi-squared statistic to test for homogeneity of a k0-sized neighborhood. The chi-squared statistic is known to be sensitive to sample size. With a large dataset and higher k0 values, there will be an overwhelmingly large number of significant results reported, thus yielding a high rejection rate.

I think a plausible solution is to reduce k0 to a reasonably low value, but I'm not sure whether the authors approve of such a practice.

Best, Ray

kBET uses Pearson's chi-squared statistic, indeed. We are aware of the test's dependence on sample size. Therefore, we implemented a heuristic to determine an optimal neighbourhood size for the test of well-mixedness (see Figures 2c-d in the kBET publication, where we scanned over all possible neighbourhood sizes to determine an optimal one). You can adapt the neighbourhood size manually, but we do not recommend this, as you can reduce the rejection rate this way (which might lead to underreporting). We are aware that kBET is a sensitive tool and often reports rejection rates close to 1. If you are interested in alternative measures for batch effect quantification, please have a look at the metrics implemented in scIB.
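
For reference, a minimal sketch of such a scan over neighbourhood sizes; this is a simplified illustration of the idea, not the package's internal heuristic, and it assumes data and batch as above.

library(FNN)
library(kBET)

# scan a few neighbourhood sizes and record the observed rejection rate
k.values <- c(10, 25, 50, 100)
rejection <- sapply(k.values, function(k0) {
  knn <- get.knn(data, k = k0, algorithm = 'cover_tree')
  est <- kBET(data, batch, k = k0, knn = knn, plot = FALSE)
  est$summary$kBET.observed[1]  # first summary row; verify against your output
})
plot(k.values, rejection, type = 'b',
     xlab = 'neighbourhood size k0', ylab = 'observed rejection rate')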

mbuttner commented 3 years ago

Dear Ray,

Dear @mbuttner, I've read through several issues in this repo and I feel we really need some clarification from you.

  1. In your publication, you showed in many figures that the so-called "acceptance rate" decreased after the batch effect was corrected. But in the kBET package there is no reference to an "acceptance rate", only a "rejection rate". How do I obtain an "acceptance rate" for my data?

Thank you for pointing this out. In the publication, we used acceptance rate = 1 - rejection rate. This is a rescaling such that low values (close to 0) correspond to "bad" results and high values (close to 1) correspond to "good" results. However, as the foundation of kBET is a statistical test (and therefore, there is no such thing as acceptance in hypothesis testing), the kBET package itself only reports rejection rates.

  2. You mentioned kBET only accepts a dense matrix or knn-graph as input (Dose kBET allow normalized matrix as input? #56), and you also mentioned we should use an assay from a SingleCellExperiment object as input (Working on SingleCellExperiment object. #58). But typically, an assay from a SingleCellExperiment IS a SPARSE matrix. In data matrix for kBET evaluation #49, someone asked about the data format to use. I feel this is by far the most ambiguous part of using kBET. In the kBET vignette, kBET was applied to count data, CPM-normalized data, and limma batch-corrected data. What exactly should we use as the input data, then?

Thank you for your question. In general, we implemented kBET such that it accepts several types of input and does not depend on a feature matrix per se. The standard workflow is as follows: kBET requires a dense matrix as input and then computes a knn-graph from it. The FNN package, which does the knn computation, requires a dense matrix as input. This is indeed less than optimal. If the assay you want to use from a SingleCellExperiment is a sparse matrix, please consider converting it into a dense matrix. If, for some reason, you have a large data matrix and computing nearest neighbours with kBET (and FNN, respectively) takes too long, you can use a different method to compute the knn-graph instead. If you give a knn-graph as additional input, then kBET skips the computation of the knn-graph and computes the rejection rates directly on it (see section "Variations" in the Readme):

require('FNN')
# data: a matrix (rows: samples, columns: features (genes))
k0=floor(mean(table(batch))) #neighbourhood size: mean batch size 
knn <- get.knn(data, k=k0, algorithm = 'cover_tree')

#now run kBET with pre-defined nearest neighbours.
batch.estimate <- kBET(data, batch, k = k0, knn = knn)

Please note that a data matrix with the same size as your original data is required, but not used (apart from estimating the number of rows and columns).
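
As a concrete example of the dense-matrix requirement, a minimal sketch of converting a sparse SingleCellExperiment assay before calling kBET; the assay name "logcounts" and the batch column "batch" are assumptions here, so adjust them to your object.

library(SingleCellExperiment)
library(kBET)

# assays are stored genes x cells; kBET wants cells in rows, genes in columns
data <- t(as.matrix(assay(integrated_sce, "logcounts")))
batch <- as.factor(colData(integrated_sce)$batch)
batch.estimate <- kBET(data, batch)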

  3. For my own purposes: in my single-cell analysis, I performed batch correction only on a subset of genes (highly variable genes), so my data is a reduced matrix with only a few thousand genes. Can I use this matrix as input?

Yes, you can use a reduced matrix as input. In fact, we observed that using highly variable genes tends to improve the batch effect correction in most methods.


Please let me know if you have further questions.

Best, Maren

RuiyuRayWang commented 3 years ago


Dear Maren,

Thanks so much for your kind explanation. It makes a lot more sense to me now. And thanks for pointing me to the scIB portal. I'm actually reading the scIB paper on bioRxiv and will definitely give it a try.

Best, Ray

liliay commented 1 year ago

Hi there,

I am testing kBET on integrated Seurat objects to assess the integration quality of 2 samples (a WT sample and a KO sample). I therefore tried to convert my integrated Seurat object to an sce object. Here is my code:

# kBET for merged data

# as matrix of the counts assay (corresponds to the count matrix in sce objects??)
data <- as.matrix(sce.data.complete@assays@data@listData[["counts"]])

# vector containing the batch label of each cell
batch <- as.factor(sce.data.combined@colData@rownames)

# compute and plot
batch.estimate <- kBET(data, batch, plot=TRUE, do.pca = TRUE, dim.pca = 2)

You will notice that in the $batch variable, I put the barcode of each cell with the batch info (i.e. WT_ATTCGAT...).

How did you guys manage to find the info to put in $batch and $data within the sce object?

Thank you in advance for your insight/help!

Lilia

nhuhoa commented 1 year ago

Hi Liliay,

Long answer: to make it simple, here is a suggestion in case you want to work with the Seurat platform.

Get the batch info from the metadata, object@meta.data$name, e.g. object@meta.data$batch_info, I guess. You can extract meta.data as a data frame and see the column names there.

Short answer:

# get batch info from the metadata (object@meta.data$name)
meta_data <- as.data.frame(object@meta.data)
meta_data$cell_id <- rownames(meta_data)
meta_data$batch_info <- ifelse(grepl('WT', meta_data$cell_id), 'WT', 'KO')
batch <- meta_data$batch_info

Good luck, Hana

liliay commented 1 year ago

Hi Hana,

The function finally runs but displays this error message: image

Also, when I get the counts slot, "data" is empty, and when creating "data" from the scale.counts slot, data is not the genes x cells matrix (it is a list of barcodes). In the kBET tutorial, data is supposed to be a matrix, which is not what I get...

liliay commented 1 year ago

I finally managed to make the code work, changing it to this:

data <- as.matrix(GetAssayData(object = data.combined, slot = "counts"))
data.combined@meta.data$orig.ident
meta_data <- as.data.frame(data.combined@meta.data)
meta_data$cell_id <- rownames(meta_data)
meta_data$batch_info <- ifelse(grepl('WT_', meta_data$cell_id), 'WT', 'KO')
batch <- meta_data$batch_info
batch.estimate <- kBET(data, batch, plot=TRUE, do.pca = TRUE, dim.pca = 2)

Unfortunately, this runs for too long (> 8 hours) on the lab server. This is not expected according to the timings indicated by the kBET devs.

If anyone has a clue, please feel free to let me know your thoughts !

Best,

Lilia

nhuhoa commented 1 year ago

Hi Liliay, in case you still don't have a clue about the long execution time. Potential issue 1: I guess the issue is that you tried to extract PCA vectors from a matrix with very large dimensions, e.g. using 10,000 to 30,000 genes as input. In general, we use a common processing pipeline, e.g. Seurat, to remove mito genes, normalize the data, and then extract highly variable genes (HVGs), e.g. 3,000 HVGs, as input to compute the PCA. The execution time for 3,000 HVGs x N cell barcodes will be much shorter than for 30K genes x N cells, and the PCA results will be more accurate. I don't recommend running the kBET function on a large matrix as input; see the sketch below. You may want to see this tutorial: https://satijalab.org/seurat/archive/v3.0/pbmc3k_tutorial.html
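
A minimal sketch of that idea, assuming data.combined is the merged/integrated Seurat object and batch is defined as in the code earlier in this thread:

library(Seurat)
library(kBET)

# restrict to ~3000 highly variable genes before running kBET
data.combined <- FindVariableFeatures(data.combined, nfeatures = 3000)
hvg <- VariableFeatures(data.combined)

# subset to HVGs and transpose: kBET expects cells in rows, genes in columns
data <- t(as.matrix(GetAssayData(data.combined, slot = "data")[hvg, ]))
batch.estimate <- kBET(data, batch, plot = TRUE)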

Potential issue 2: as I mentioned above, almost all computational biology platforms store data with genes in rows and cells in columns. But please note that the input for kBET, from the authors' documentation, is:

# data: a matrix (rows: samples, columns: features (genes))
# batch: vector or factor with batch label of each cell

So you may need to check your Seurat object and transpose the matrix; otherwise the input is not correct, and the output of your evaluation will not be correct either.

data <- t(data)

I don't remember exactly, because I don't use Seurat much at the moment; you need to double-check your input data, e.g. View(data[1:5,1:5]), to confirm that rows are cells (barcodes) and columns are genes.

Good luck,