scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Transferring data integration from scanorama to a new dataset #2162

Closed aliechoes closed 2 years ago

aliechoes commented 2 years ago

Hi

I have a question. I am using scanorama to integrate multiple datasets. In my use case, I will later receive a new dataset. The question is: is there any way to transfer the results of scanorama to the new dataset, or should I retrain everything from scratch?

This is also important if one has a train and a test set. Ideally, you do not want to include the test set in the data integration training, and instead only apply the already trained transformation to the test data.

I am wondering if there is a solution for these use cases. Could you please help us with that?

Cheers Ali

LuckyMD commented 2 years ago

Hi!

If you used a neural-network approach, you could use scArches to leverage transfer learning and map the new data across without re-integrating (only minimal additional training is done there). You could also map into the existing embedding space using sc.tl.ingest, for example. But there is always the danger of a residual batch effect that cannot be removed without de novo integration.
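For the ingest route, a minimal sketch could look like this (assuming `adata_ref` is the already integrated and annotated reference, `adata_new` is the new dataset, and `"cell_type"` is a placeholder for whatever annotation column you want to transfer):

```python
import scanpy as sc

# Restrict both objects to a shared gene set so the embeddings are comparable.
shared_genes = adata_ref.var_names.intersection(adata_new.var_names)
adata_ref = adata_ref[:, shared_genes].copy()
adata_new = adata_new[:, shared_genes].copy()

# The reference needs PCA, a neighbour graph and UMAP computed up front.
sc.pp.pca(adata_ref)
sc.pp.neighbors(adata_ref)
sc.tl.umap(adata_ref)

# Map the new cells into the reference embedding and transfer the labels.
sc.tl.ingest(adata_new, adata_ref, obs="cell_type")
```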

aliechoes commented 2 years ago

Hey Malte!

Thanks for the response. That's great. Does this work for every other batch-effect correction method?

LuckyMD commented 2 years ago

Hi, ingest works on any embedding afaik. In the tutorial I believe it maps into UMAP space.

a-munoz-rojas commented 2 years ago

Jumping in on this conversation to ask a related question - I'm using Scanorama to integrate some datasets and generate an aligned low-dimensional embedding. I then subset the data to only look at specific clusters and want to re-make the UMAP/t-SNE plot. Do you usually re-do the integration to generate a new low-dimensional embedding matrix with Scanorama for the subsetted data? I know you can technically subset the original low-dimensional embedding matrix, but I thought it was preferable to re-do the embedding when you have a different subset of cells, to capture more of the variance between those cells (re-select HVGs, etc.). Any advice would be welcome - thanks!

LuckyMD commented 2 years ago

Hey @a-munoz-rojas,

I normally wouldn't redo the batch correction. That can go wrong (or turn out better, tbh)... for Scanorama it could be better, but for DL-based methods you would have fewer data points for learning the difference between batch and biological effects. So unless you have a large dataset, it might cause problems for those methods. Therefore I try to stay consistent.
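In practice that usually means keeping the original Scanorama embedding and only recomputing the neighbour graph / UMAP on the subset, roughly like this (a sketch; it assumes the embedding is stored in `adata.obsm["X_scanorama"]` and the clustering in `adata.obs["leiden"]`):

```python
import scanpy as sc

# Subset to the clusters of interest without redoing the integration.
adata_sub = adata[adata.obs["leiden"].isin(["3", "7"])].copy()

# Rebuild the neighbour graph and UMAP on the existing integrated embedding.
sc.pp.neighbors(adata_sub, use_rep="X_scanorama")
sc.tl.umap(adata_sub)
sc.tl.leiden(adata_sub, key_added="leiden_sub")  # optional re-clustering within the subset
```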

a-munoz-rojas commented 2 years ago

Hi @LuckyMD - thanks for your reply! Yeah, that makes sense. I'm performing these corrections using a subset of highly variable genes, so I guess, to "make up" for the loss of "true" HVGs in the new subclusters of cells, I could select a higher number of HVGs for the original alignment? I could also use a larger number of components from the low-dimensional embedding output by the original alignment for downstream applications. Does that make sense to you?

One more question - when performing differential gene expression analysis, what is your preferred pipeline/method when using aligned datasets? I generally do not perform the correction on the gene expression matrix when aligning, and I think doing DE with corrected matrices is not as common. So maybe other methods that use batch as a covariate would be preferable (e.g., diffxpy or others)? Would really appreciate any suggestions here!

PS. many congratulations on the benchmarking integration paper in Nature Methods - excellent work and very useful resource for the field!

LuckyMD commented 2 years ago

Thanks a lot @a-munoz-rojas!

For the DE approach, I would go with the same thing you suggest. I would definitely not do DE testing on the corrected data (violation of distributional assumptions, potential overcorrection of background variation leading to false significant results).
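As a rough sketch of the covariate approach with diffxpy (the `"condition"` and `"batch"` column names in `adata.obs` are placeholders, and `adata.X` should hold uncorrected counts):

```python
import diffxpy.api as de

# Wald test on "condition" while controlling for "batch" as a covariate.
test = de.test.wald(
    data=adata,
    formula_loc="~ 1 + condition + batch",
    factor_loc_totest="condition",
)
results = test.summary()  # per-gene fold changes, p-values and q-values
```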

Regarding altering the number of HVGs or latent dimensions... this is difficult to say in general. I would normally err on the higher side of the number of HVGs, but the latent dimensions will depend heavily on the complexity of the dataset, I would imagine. I don't think it's possible to give a general recommendation there.
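For the HVG side, scanpy lets you select them per batch in one call before the integration; a small sketch (the value of `n_top_genes` and the `"batch"` column name are just placeholders):

```python
import scanpy as sc

# Select a generous number of HVGs, computed per batch, before running the integration.
sc.pp.highly_variable_genes(adata, n_top_genes=4000, batch_key="batch")
adata_hvg = adata[:, adata.var["highly_variable"]].copy()
```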

a-munoz-rojas commented 2 years ago

Great, thanks @LuckyMD! That makes a lot of sense.

Aside from diffxpy, are there other packages you recommend for more robust DE approaches in these (or related) scenarios? Thanks again for your advice - and sorry for hijacking this conversation!

LuckyMD commented 2 years ago

I've been using diffxpy or MAST so far. I've moved to diffxpy, though it's not 100% mature yet. Aside from pseudobulk approaches, I don't have any others to recommend.
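If you do go the pseudobulk route, the aggregation itself is simple to do by hand before handing the matrix to a bulk DE tool (a sketch; it assumes raw counts in `adata.X` and `"sample"` / `"cell_type"` columns in `adata.obs`):

```python
import pandas as pd

# Sum raw counts per sample within one cell type, giving one pseudobulk profile per sample.
subset = adata[adata.obs["cell_type"] == "T cells"]
X = subset.X.toarray() if hasattr(subset.X, "toarray") else subset.X
counts = pd.DataFrame(X, index=subset.obs["sample"].values, columns=subset.var_names)
pseudobulk = counts.groupby(level=0).sum()
# `pseudobulk` can then go into edgeR/DESeq2 (e.g. via rpy2), with batch in the design matrix.
```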