Integrating vs merging replicates

annaliem commented 3 years ago

Hello,

I realize this question has been asked several times but I feel as if I haven't been able to find an answer applicable to my situation.

I have two datasets: one with 7K cells and one with 9K cells. No treatments have been applied, and the sample prep, library prep and sequencing were performed with the same resources, technology, people, etc. However, they were completed several months apart. In essence, they are not true technical replicates because they weren't on the same flow cell, but there are no experimental differences. I am not sure if I can merge the raw data I obtained from CellRanger, and then perform QC, run SCTransform, dimensional reduction, cluster identification, etc. on that combined dataset or if integration is required. Every comment or vignette I find on integration is always concerned with minimizing technical variation to allow analysis of biological variation across conditions.

My question is: If the only thing different between my datasets is that they were not sequenced together, can I simply merge the resulting 10X matrices and treat them as one dataset? What is the difference in performing batch correction vs. integration?

I have done the following:

1. Merged the datasets first, filtered out low-quality cells, ran SCTransform, then examined the PCA, UMAP, and t-SNE plots.
2. Filtered and ran SCTransform on each dataset individually, then examined the PCA, UMAP, and t-SNE plots.

Both approaches yield similar plots. When I split the merged object by dataset I see complete overlap on all three plots.
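
One way to put a number on the "complete overlap" described above is to check, for each cell, how many of its nearest neighbours in the PCA or UMAP embedding come from the other dataset, and compare that with what perfect mixing would give. The following is a minimal sketch in Python with NumPy, not a Seurat function: the name `batch_mixing_score`, the choice of `k`, and the simulated coordinates are all assumptions made for illustration. In practice the embedding and dataset labels would be exported from the merged object.

```python
import numpy as np


def batch_mixing_score(embedding, batch, k=15):
    """Ratio of observed to expected cross-batch neighbours.

    Values near 1 mean the batches are interleaved; values near 0 mean
    they sit apart in the embedding.
    """
    embedding = np.asarray(embedding, dtype=float)
    batch = np.asarray(batch)
    n = embedding.shape[0]
    if not 0 < k < n - 1:
        raise ValueError("k must be between 1 and n - 2")

    # Pairwise squared Euclidean distances. This needs O(n^2) memory,
    # so subsample the cells first for a dataset of ~16K cells.
    sq = np.sum(embedding ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (embedding @ embedding.T)
    np.fill_diagonal(d2, np.inf)  # a cell is not its own neighbour

    # Indices of the k nearest neighbours of every cell (unordered).
    nn = np.argpartition(d2, k, axis=1)[:, :k]

    # Observed: fraction of neighbours that belong to a different batch.
    observed = np.mean(batch[nn] != batch[:, None])

    # Expected under perfect mixing: for a cell in a batch of size n_b,
    # (n - n_b) of the other (n - 1) cells belong to a different batch.
    _, inverse, counts = np.unique(batch, return_inverse=True, return_counts=True)
    if counts.size < 2:
        raise ValueError("need at least two batches")
    expected = np.mean((n - counts[inverse]) / (n - 1))
    return float(observed / expected)


# Simulated stand-in for a 2-D embedding of two sequencing runs.
rng = np.random.default_rng(0)
labels = np.array(["run1"] * 300 + ["run2"] * 400)
same = rng.normal(size=(700, 2))       # both runs drawn from one cloud
shifted = same.copy()
shifted[labels == "run2"] += 20.0      # run2 pushed far away from run1

print(batch_mixing_score(same, labels))     # close to 1: well mixed
print(batch_mixing_score(shifted, labels))  # close to 0: separated
```

A score near 1 on the merged object would support what the split plots show by eye; a score well below 1 would suggest a batch effect worth correcting.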

I appreciate anyone's help and/or comments!

NicolaasVanRenne commented 3 years ago

I'm not a member of the Seurat team, but here goes:

It also depends on the original samples. Are your samples from two different patients? Then integrate. Are they two samples from the same patient (to increase the number of cells in the dataset)? Then you should merge.

If they are samples from cell culture, e.g. two wells from the same experiment, they can be seen as spatiotemporal clones and I would merge them (though I have no experience with cell-line datasets). However, if your cell cultures come from two independent experiments, they will have two different transcriptional dynamics and I would integrate them (although it remains to be seen whether that would give better results; I'm not sure, as I have not tried).

Concerning your question "What is the difference in performing batch correction vs. integration?": I use data integration to perform batch-correction. It works well at least for patients (it removes donor-effects).

Can you provide more info on your original samples?

annaliem commented 3 years ago

Thanks for your reply!

I work with zebrafish, and the prep to isolate and enrich the cells of interest for sequencing can require hundreds of fish as input (at the stage I work with, they're less than 1 cm long). This is standard practice for zebrafish researchers. So to answer your question: no, the datasets aren't from the same single organism, but each dataset itself contains cells from a large number of organisms, if that makes sense. However, they are all offspring from the same line, same parents, etc. They're raised under standard conditions, and cells are isolated from the same structure at the same timepoint in the same manner. Using so many siblings as input also minimizes the "between fish" variation that could arise. As you suggest, the only reason the sequencing was spread out over time was to obtain a larger number of cells to analyze, so we don't expect the datasets to show any significant biological variation.

Thank you for clarifying batch correction vs. integration! They seemed like different goals using the same code, so I was unsure. From my understanding, the goal of batch correction is to minimize technical variation so that it doesn't obscure biological variation, whereas the goal of integration is to ensure that cells from different datasets/conditions cluster by cell type rather than by batch, sample, condition, etc. My merged data does that without integration (as expected), but I wasn't sure if batch correction was a separate step, or if it wasn't necessary since the cells look so similar.

Thank you again for your comments! I appreciate your help

saketkc commented 3 years ago

Hi @annaliem, we have tried to answer this question in the discussion here: https://github.com/satijalab/seurat/discussions/3998. Please feel free to reopen if you find anything missing.

annaliem commented 3 years ago

Hello @saketkc! I have seen that discussion previously; I suppose I am unclear on the difference between batch correction and integration. I understand that @NicolaasVanRenne, and I'm sure many others, use integration to perform batch correction. But I am not sure whether I can perform batch correction without integration, whether neither is necessary (in my case), or whether they are exactly the same thing. Perhaps this is a silly question, but I'm just not sure which to use when.

denvercal1234GitHub commented 2 years ago

@annaliem -- Did you ever get an explanation of whether, and in what way, integration and batch correction (which I believe is part of integration in Seurat) differ?

NicolaasVanRenne commented 2 years ago

If you put hundreds of fish in the blender and then perform scRNAseq you have already merged your samples :)

no possibility to un-merge them now haha

That's why merging or integrating your two fish-blend samples doesn't make much difference. You have so many individuals that donor effects will not be visible: every individual that went into the blender contributes to each cell cluster, but its cells are smeared out across it. It is impossible to see which cells come from fish 001, fish 002, fish 003, etc.

So your initial set-up is totally different from, say, a pancreas sample taken from 5 donors. In that case, merge = donor effects remain; integrate = donor effects removed.
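
The blender argument can be made concrete with a toy simulation. This is a minimal sketch, assuming each fish adds an independent additive offset to a gene's expression and every fish contributes equally to its pool; the function `pool_gap_sd` and all parameter values are made up for illustration and are not from this thread or from Seurat. Under that assumption, the typical gap between the mean offsets of two pooled samples shrinks as sqrt(2/n) with n fish per pool.

```python
import numpy as np


def pool_gap_sd(n_fish, donor_sd=1.0, n_trials=2000, seed=0):
    """SD of the difference in mean donor offset between two pooled samples."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated experiment; each column is one fish's offset.
    pool_a = rng.normal(0.0, donor_sd, size=(n_trials, n_fish)).mean(axis=1)
    pool_b = rng.normal(0.0, donor_sd, size=(n_trials, n_fish)).mean(axis=1)
    return float(np.std(pool_a - pool_b))


# Compare the simulated gap with the sqrt(2 / n) value the toy model predicts.
for n_fish in (1, 10, 200):
    simulated = pool_gap_sd(n_fish)
    predicted = np.sqrt(2.0 / n_fish)
    print(n_fish, round(simulated, 3), round(float(predicted), 3))
```

With one donor per sample, as in the pancreas example, the gap between samples is as large as the donor effect itself. With hundreds of fish per sample it nearly vanishes. The sketch says nothing about technical differences between the two sequencing runs, which would have to be checked separately.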

Kind regards

nicolaas

satijalab / seurat

Integrating vs merging replicates #4372