simonwm / tacco

TACCO: Transfer of Annotations to Cells and their COmbinations
BSD 3-Clause "New" or "Revised" License

Effect of Batch Effect in scRNA reference #11

Closed by pakiessling 1 year ago

pakiessling commented 1 year ago

Thanks for the tool, very impressive!

I am curious since TACCO expects counts from the scRNA reference and most reference sets are composed of multiple patients or 10x batches:

Is this a problem for TACCO? Should I subset to a single batch, or maybe calculate "batch-corrected counts" with something like scVI?

JWatter commented 1 year ago

That is an interesting question. While we did not exhaustively check the effects of having different batches in the reference, here are some experiences and generic advice:

pakiessling commented 1 year ago

Thank you for the in-depth answer.

The part I didn't quite get was the "multi_center" option.

Could you explain a bit more what it does and what a sub-category is? "multi_center" takes a number as input, so how does this help with including a batch column?

JWatter commented 1 year ago

The multi_center solution does not take a batch column as input; it handles batches in the reference more indirectly. It is a generic solution for any kind of within-category heterogeneity in the reference, and batches are a special case of that.

With multi_center you specify the number of representative means for each annotation category: instead of representing a category with a single expression profile, the category is automatically subclustered into multi_center subclusters, and a profile for each of them is used in the annotation. Afterwards the subclusters are automatically collapsed back into the single category. This is useful when annotation categories have large within-category variation, e.g., a coarsely annotated celltype or the same celltype from different batches. It propagates a sense of the variability of the reference expression distribution to the annotation.

For your case with multiple batches, it would be sensible to set multi_center at least as high as the number of batches. If batch is the main driver of heterogeneity, the subclusters will correspond to batches. If biological variability within a celltype is stronger than batch, the subclusters will represent that instead. In either case, the annotation should improve until the remaining within-subcluster heterogeneity is only noise; beyond that point, something similar to overfitting should set in. I would suggest trying a few values of multi_center and inspecting the resulting annotations to find the optimum with respect to the expected biology.
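The subcluster-then-collapse idea described above can be illustrated with a small toy sketch. This is not TACCO's implementation and does not call TACCO: the helper names (`kmeans`, `build_profiles`, `annotate`), the plain k-means subclustering, and the nearest-profile assignment by squared Euclidean distance are all assumptions chosen to keep the example self-contained. Only the `multi_center` name and its meaning (number of profiles per category) come from the discussion.

```python
# Toy illustration only: hypothetical helpers, not TACCO's API or algorithm.

def mean_profile(rows):
    """Column-wise mean of a list of expression vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, n_iter=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [list(rows[0])]
    while len(centers) < k:
        centers.append(list(max(rows, key=lambda r: min(sq_dist(r, c) for c in centers))))
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for r in rows:
            nearest = min(range(k), key=lambda i: sq_dist(r, centers[i]))
            groups[nearest].append(r)
        # Keep the old center if a subcluster ends up empty.
        centers = [mean_profile(g) if g else c for g, c in zip(groups, centers)]
    return centers

def build_profiles(reference, labels, multi_center=1):
    """Return (category, profile) pairs: multi_center profiles per category."""
    profiles = []
    for cat in sorted(set(labels)):
        rows = [r for r, l in zip(reference, labels) if l == cat]
        for center in kmeans(rows, min(multi_center, len(rows))):
            profiles.append((cat, center))  # subcluster keeps its parent label
    return profiles

def annotate(cell, profiles):
    """Assign the parent category of the nearest profile (the 'collapse' step)."""
    return min(profiles, key=lambda p: sq_dist(cell, p[1]))[0]

# Celltype "T" comes from two batches with a strong shift in gene 1;
# celltype "B" sits between them in gene 1 but differs in gene 2.
reference = [[0, 0], [0, 1], [1, 0], [1, 1],      # T, batch 1
             [10, 0], [10, 1], [11, 0], [11, 1],  # T, batch 2
             [5, 4], [5, 5], [6, 4], [6, 5]]      # B
labels = ["T"] * 8 + ["B"] * 4
query = [0, 3]  # resembles a batch-1 T cell

for k in (1, 2):
    print(k, annotate(query, build_profiles(reference, labels, multi_center=k)))
```

With a single profile per category, the mean of "T" falls between the two batches and the query is assigned to "B"; with two profiles per category, the subclusters of "T" line up with the batches and the query is assigned to "T". Trying several values and comparing the resulting annotations mirrors the suggestion above.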

pakiessling commented 1 year ago

Perfect, thank you!