niaid / dsb

Normalize CITEseq Data

batch-effect correction in DSB-normalized data? #12

Closed massonix closed 3 years ago

massonix commented 3 years ago

Dear DSB team,

Thanks so much for developing this amazing resource.

I have a question regarding batch-effect correction. As specified in the preprint: "we implemented this transformation on each staining batch separately to accommodate potential batch specific ambient noise– this helped mitigate batch-to-batch variation". Thus, would you recommend applying dsb to each batch separately? In your Supplementary Figure 1A there is still some residual batch effect. Would you run any single-cell integration tool on top of the normalized values?

Thanks in advance for your help!

MattPM commented 3 years ago

Hi @massonix, thanks, that is a good question. How much batch variation there is depends on how much experiment-specific and expected biological variability there is between the batches. In the dataset used in the preprint, when we normalized all background drops and cells in a single normalization, the resulting dsb normalized values were highly concordant with those from normalizing each batch separately; this held true with multiple definitions of background drops. These 2 batches were run on subsequent days with the exact same protocol and pool of antibodies. I'd recommend trying both single- and multi-batch normalization and seeing which method minimizes the batch effect (I've included a snippet below for doing this). One piece of advice is to QC the background droplets to remove any barcodes with high RNA content, to reduce the impact of potential low quality cells that could drive the background signal.
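A minimal sketch of that background-droplet QC step (the helper name, the `rna_counts` matrix, and the gene-count threshold are my assumptions, not part of the package):

```r
# Sketch of background-droplet QC (assumption: rna_counts is the raw RNA
# count matrix, genes x barcodes, covering the background barcodes).
# "Background" barcodes with many detected genes may be low quality cells
# rather than truly empty drops, so we remove them before normalization.
qc_background = function(neg_mat, rna_counts, max_genes = 80) {
  # detected genes per background barcode
  ngene = colSums(rna_counts[, colnames(neg_mat), drop = FALSE] > 0)
  # keep only barcodes below the gene-count threshold
  neg_mat[, ngene < max_genes, drop = FALSE]
}
# e.g. neg_adt = lapply(neg_adt, qc_background, rna_counts = rna_counts)
```

The threshold is dataset specific; inspect the distribution of detected genes in your background drops before picking it.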

As for using a batch correction tool like Seurat or mnnCorrect: typically those are used on data where the batches are completely non-overlapping due to drastic batch effects, e.g. those arising between different species or single cell technologies (droplet / plate). In Fig S1A, the overlap between cells is already about what you would expect after using one of those tools. Those tools are great for the drastic cases described above for RNA data, but I'm not certain how they would perform on protein data since they use a low dimensional representation; depending on the size of your antibody panel, further compression could add significant noise. I have not tried this, though; I'm basing that on the fact that clustering using PCA representations of our protein data with that 80+ antibody panel performed worse than using a euclidean distance matrix calculated on the dsb normalized cell × protein matrix. It is not described in the preprint, but part of the non-overlapping cells are due to biological variation between the different n=10 donors in each batch, for example recovery of expected donor-specific T cell populations. If there is a large batch effect, before trying a batch correction tool that uses a low dimensional representation of the data, you might first want to try using the dsb normalized values in a simpler linear-model batch removal method (see below), though this may not be necessary.
http://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/limma/html/removeBatchEffect.html
https://genomicsclass.github.io/book/pages/adjusting_with_linear_models.html
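To make the linked limma approach concrete, here is a minimal sketch (the wrapper name is mine; it assumes the limma Bioconductor package is installed and that the input is a dsb-normalized proteins × cells matrix with one batch label per cell):

```r
# Sketch of linear-model batch removal with limma::removeBatchEffect.
# dsb_mat: proteins x cells matrix of dsb normalized values.
# batch: vector/factor with one batch label per cell (column).
# removeBatchEffect fits a linear model and subtracts the estimated
# batch term from the data.
remove_batch_lm = function(dsb_mat, batch) {
  limma::removeBatchEffect(dsb_mat, batch = factor(batch))
}
```

Note limma's documentation cautions that this is intended for visualization and downstream exploratory analysis, not as a substitute for including batch in a statistical model.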

Here is a snippet you can use to test both normalization schemes. Feel free to let us know what you find on your data!

# apply DSB using full background for each batch separately.  

# neg_adt = a list of protein data indexed by batch (background / empty droplets)
# pos_adt = a list of protein data indexed by batch (cell containing droplets)

# modify to match isotype control names in protein data matrix
# isotypes = c("Mouse IgG2bkIsotype_PROT", "MouseIgG1kappaisotype_PROT",
#             "MouseIgG2akappaisotype_PROT", "RatIgG2bkIsotype_PROT")

dsb_norm = list()
for (i in seq_along(neg_adt)) {
  dsb_norm[[i]] = 
    DSBNormalizeProtein(cell_protein_matrix = pos_adt[[i]], 
                        empty_drop_matrix = neg_adt[[i]], 
                        denoise.counts = TRUE, 
                        use.isotype.control = TRUE, 
                        isotype.control.name.vec = isotypes)
}
# merge the per-batch normalized matrices into one multi-batch matrix
dsb_multi = do.call(cbind, dsb_norm)

## Run single batch norm using full definition of background 
full_neg_merged = do.call(cbind, neg_adt)
pos_adt_merged = do.call(cbind, pos_adt)
dsb_merged_full = DSBNormalizeProtein(cell_protein_matrix = pos_adt_merged, 
                                      empty_drop_matrix = full_neg_merged, 
                                      denoise.counts = TRUE, 
                                      use.isotype.control = TRUE, 
                                      isotype.control.name.vec = isotypes)
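One simple way to compare the two schemes (a sketch; the helper name is mine, not a dsb function): compute the per-protein correlation between the per-batch and merged normalizations over the cells shared by both matrices.

```r
# Sketch of a concordance check between the two normalization schemes:
# for each protein, correlate its dsb values from the per-batch run (a)
# against the single merged run (b) over the shared cell barcodes.
compare_norms = function(a, b) {
  common = intersect(colnames(a), colnames(b))
  sapply(rownames(a), function(p) cor(a[p, common], b[p, common]))
}
# e.g. summary(compare_norms(dsb_multi, dsb_merged_full))
```

Values near 1 for most proteins suggest the two schemes are interchangeable on your data, as we found for the two batches in the preprint.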
massonix commented 3 years ago

Thanks @MattPM! That's very good advice. I will apply it and let you know if I have any further questions.

MattPM commented 3 years ago

Closing for now. I referenced this with a figure in the updated README FAQ section; see the significantly updated documentation in the README. The package now redirects to the NIAID GitHub: https://github.com/niaid/dsb/

gt7901b commented 3 years ago

@MattPM I ran dsb_multi = do.call(cbind, dsb_norm); there are 10 elements in dsb_norm, but I got the error below. I believe it is because some elements of dsb_norm have different antibodies after I removed non-staining proteins. How do I reconcile this? Thanks

Error in (function (..., deparse.level = 1) : number of rows of matrices must match (see arg 6)
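For reference, one way to reconcile this (a sketch, not part of dsb; the helper name is mine): subset every batch's matrix to the proteins present in all batches before calling cbind, so the row sets and row order match.

```r
# Sketch: keep only the proteins shared by every batch, in a common row
# order, so cbind sees identically shaped matrices.
merge_shared = function(mat_list) {
  shared = Reduce(intersect, lapply(mat_list, rownames))
  do.call(cbind, lapply(mat_list, function(m) m[shared, , drop = FALSE]))
}
# e.g. dsb_multi = merge_shared(dsb_norm)
```

An alternative, if you want to keep all proteins, is to remove the non-staining proteins with a single shared list applied to every batch before normalization, rather than per batch.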