niaid / dsb

Normalize CITEseq Data

Integrated multiple samples #44

Open Aaron-sqw opened 1 year ago

Aaron-sqw commented 1 year ago

Hi, thanks for the excellent method for CITE-seq! The vignette at "https://cran.r-project.org/web/packages/dsb/vignettes/end_to_end_workflow.html" only deals with one sample. If I have CITE-seq data from multiple samples (more than 3) and want to integrate them with anchors to correct for batch effects, how should I handle this? Can I use the dsb-normalized values directly?

immune.anchors_adt <- FindIntegrationAnchors(object.list = immune.combined_adt.list, anchor.features = features)
immune.combined_adt <- IntegrateData(anchorset = immune.anchors_adt, new.assay.name = "integrated.adt")

Thanks.

MattPM commented 1 year ago

Hi @Aaron-sqw, if by samples you mean you have 3 batches, first verify that you actually have a large batch effect before doing any computational integration or correction. You can do this by calculating the variance explained by batch. If you observe a large batch effect, then yes, you could do that. However, it would not be appropriate if you have only a few proteins measured. Methods like Seurat integration and Harmony compress the ADT data to find shared latent components in high-dimensional space, so they are more geared toward mRNA, where you have thousands of features (genes). For protein, since there are fewer features, we have used a simple linear model batch correction (with limma) after dsb on a few projects with 80-100 proteins, which also works well. If you used one of the larger protein panels currently available, with hundreds of proteins, you could try a less parsimonious correction like the one you're describing, as long as the comparison groups of your experiment are not confounded with batch.
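As a rough illustration of the "variance explained by batch" check mentioned above, here is a minimal sketch in plain Python (not R, and not part of dsb). It computes, for one protein, the fraction of total variance that lies between batches (eta-squared from a one-way decomposition). The function name `variance_explained_by_batch` and the toy values are hypothetical, and a real analysis would typically also account for cell type and other covariates.

```python
from statistics import fmean


def variance_explained_by_batch(values, batches):
    """Fraction of total variance in `values` that lies between batches.

    values  : normalized expression of ONE protein, one entry per cell
    batches : batch label for each cell, same length as `values`
    """
    if len(values) != len(batches):
        raise ValueError("values and batches must have the same length")
    grand_mean = fmean(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    if total_ss == 0:
        return 0.0  # constant protein: there is no variance to explain
    # Group the per-cell values by batch label.
    groups = {}
    for v, b in zip(values, batches):
        groups.setdefault(b, []).append(v)
    # Between-batch sum of squares, weighted by batch size.
    between_ss = sum(
        len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values()
    )
    return between_ss / total_ss


# Toy data, made up for illustration only.
batch = ["b1", "b1", "b1", "b2", "b2", "b2"]
shifted_protein = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]  # differs strongly by batch
stable_protein = [1.0, 3.0, 2.0, 1.0, 3.0, 2.0]   # identical across batches

frac_shifted = variance_explained_by_batch(shifted_protein, batch)
frac_stable = variance_explained_by_batch(stable_protein, batch)
```

A value near 1 for many proteins would suggest a batch effect worth correcting; values near 0 suggest correction may be unnecessary.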

corbettberry commented 2 months ago

Hi @MattPM! I'm hoping you can assist with a similar question. I have 8 distinct biological samples and I want to use dsb to normalize so I can better compare between the samples. They are not separate batches but completely distinct samples.

It looks like the raw feature matrices are no longer generated when aggregating multiple samples using Cell Ranger ("Starting with Cell Ranger 6.1, the unfiltered or raw feature-barcode matrix is no longer output by cellranger aggr."). So I have 8 separate raw_feature_bc_matrix and filtered_feature_bc_matrix outputs. Would you recommend finding some way to concatenate these files and then running DSBNormalizeProtein, or running it separately on each sample before combining? What would be the best way to aggregate these raw files? Thank you so much for any thoughts/recs.

ddmk7 commented 1 month ago

Hi @MattPM,

Thank you for developing this amazing noise reduction technique. I have 4 tumor samples and 4 healthy donor samples, and for each sample, I added a DSB-normalized matrix (containing over 90 antibodies plus 3 isotype controls) before integrating them based on RNA. The Seurat FindMarkers function recommends using either the raw count or normalized data slot (DSB normalized in my case), but doing so generates large log2 fold changes (>100) between tumor and healthy cells in different cell types. I'd appreciate your thoughts on how to handle this. Thank you!

MattPM commented 1 month ago

@corbettberry Can you clarify your experiment design? Do you have 8 separate antibody staining reactions, or are you multiplexing 8 samples together, staining with antibodies in a single tube, then partitioning that across 8 lanes?

MattPM commented 1 month ago

@ddmk7 This is more of a downstream analysis question, a bit outside the scope of this denoising package.

I will say that since you have several distinct individuals, the tests in "FindMarkers" are not ideal. The cells are not independent observations (more technically, the standard errors of the statistical test assume an independence that does not hold) if you're comparing n=4 tumor vs n=4 normal and treating all the cells as "replicates", ignoring the fact that they come from several individuals. That is an example of pseudoreplication. The resulting p-values are too small (significance is inflated) and technically invalid. Unfortunately, it happens a lot in the published literature. This paper explains some of the issues: https://www.nature.com/articles/s41467-021-21038-1.

FindMarkers is OK for defining cluster-specific markers for the purpose of cell type annotation, but not for testing experimental effects in your case.

I would recommend using a mixed effects model and a pseudobulk approach. See here for more information: https://www.nature.com/articles/s41467-020-19894-4.
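To make the pseudobulk idea concrete, here is a minimal sketch in plain Python of the aggregation step only: per-cell values are collapsed to one mean per sample within each cell type, so that the unit of comparison becomes the individual (n=4 vs n=4) rather than the cell. The function names, sample IDs, and values are hypothetical, and the mixed-model testing described in the linked papers is not shown here.

```python
from collections import defaultdict
from statistics import fmean


def pseudobulk_means(cells):
    """Collapse per-cell values to one mean per (cell_type, sample_id).

    cells: iterable of (sample_id, cell_type, value) tuples for one protein.
    """
    buckets = defaultdict(list)
    for sample_id, cell_type, value in cells:
        buckets[(cell_type, sample_id)].append(value)
    return {key: fmean(vals) for key, vals in buckets.items()}


def group_difference(pb, cell_type, sample_to_group, group_a, group_b):
    """Mean pseudobulk value of group_a minus group_b within one cell type."""
    by_group = defaultdict(list)
    for (ct, sample_id), mean_value in pb.items():
        if ct == cell_type:
            by_group[sample_to_group[sample_id]].append(mean_value)
    return fmean(by_group[group_a]) - fmean(by_group[group_b])


# Toy data, made up for illustration: (sample, cell type, normalized value).
cells = [
    ("T1", "CD4 T", 2.0), ("T1", "CD4 T", 4.0),
    ("T2", "CD4 T", 3.0),
    ("H1", "CD4 T", 1.0), ("H1", "CD4 T", 1.0),
    ("H2", "CD4 T", 2.0),
]
sample_to_group = {"T1": "tumor", "T2": "tumor", "H1": "healthy", "H2": "healthy"}

pb = pseudobulk_means(cells)
diff = group_difference(pb, "CD4 T", sample_to_group, "tumor", "healthy")
```

Each sample contributes exactly one value per cell type regardless of how many cells it has, which is what removes the pseudoreplication.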

You can also read the methods of the differential expression approach used in our recent paper: https://mattpm.net/man/pdf/natural_adjuvant_immunity_2024.pdf. The code associated with the paper has a lot of examples: https://niaid.github.io/fsc/

That framework leans heavily on Gabe Hoffman's method for applying mixed models to RNA-seq data. His more recent work on empirical Bayes methods for mixed models with single-cell data would be good to read for a deeper dive: https://academic.oup.com/bioinformatics/article/37/2/192/5878955 https://pubmed.ncbi.nlm.nih.gov/36993704/

-matt

corbettberry commented 1 month ago

@MattPM we had 8 separate antibody staining reactions that were then run across 8 separate lanes. Thanks for your help!

MattPM commented 1 month ago

@corbettberry Edited 6 Oct 2024 for clarity.

@corbettberry Please see the answer and code provided in https://github.com/niaid/dsb/issues/12

dsb is based on the underlying physics of each antibody staining reaction. In theory, each unique staining reaction for your 8 samples is a separate data generation process, so the data should be normalized separately, and then the normalized matrices can be combined. In practice, however, if you carefully stained the cells the same way for the same amount of time, and the samples are relatively similar in composition, the background distributions will be very similar between samples. In that case you can run one normalization after first combining all of the matrices into one large cell matrix and one large background matrix. There is no way of knowing which will be better for your data without some testing, and I have seen datasets where either scheme worked better. In issue 12 linked above, I provided code to run it both ways, so you can see which method works better for you. If combining the matrices, append a string to all the barcodes of each "lane", like "_1" for the first lane, "_2" for the second and so on, to avoid barcode collisions (some barcodes from different lanes will be the same).
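The barcode-suffix step described above can be sketched as follows. This is a minimal, language-agnostic illustration in plain Python, not the code from issue 12; the function name `suffix_barcodes` and the example barcodes are hypothetical. It assumes the same lane suffix is applied to both the cell matrix and the background matrix of a given lane.

```python
def suffix_barcodes(lanes):
    """Append "_<lane number>" to every barcode so lanes can be combined.

    lanes: list of barcode lists, one list per lane, in lane order.
    Returns one flat list of suffixed barcodes.
    """
    combined = []
    for lane_number, barcodes in enumerate(lanes, start=1):
        combined.extend(f"{bc}_{lane_number}" for bc in barcodes)
    return combined


# Made-up barcodes; the first one appears in both lanes (a collision).
lanes = [
    ["AAACCTGAGAAACCAT-1", "AAACCTGAGAAACCGC-1"],
    ["AAACCTGAGAAACCAT-1", "AAACCTGCATTGGTAC-1"],
]
combined = suffix_barcodes(lanes)
```

After suffixing, every barcode in the combined list is unique, so the per-lane matrices can be column-bound without one lane overwriting another.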