pakiessling closed this issue 1 year ago
That is an interesting question. While we did not exhaustively test the effects of having different batches in the reference, here are some experiences and generic advice:
Different patients can be very different. Therefore we generally try to use matched reference and new data from the same patient, e.g. scRNA and spatial data. If that is available, we run the annotation separately per patient. This also helps in utilizing the prior information contained in the overall celltype composition of the patient, which can be helpful in some algorithms like OT (where the effect can be regulated using the `lamb` option).
If matched samples are not available, the next best thing would be to have balanced batches in the reference, i.e. batches which have approximately the same celltype composition (or composition with respect to any other annotation). In that case the most important information which is propagated to the annotation process is batch independent: The reference celltype profiles are usually closely related to the mean of all cells of an annotation category, so the batch information appears identically in the reference profiles of all celltypes. Therefore the batch does not affect the annotation algorithm much, as it can still find the best fitting combination of celltypes in the reference and the new data.
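To make the cancellation argument concrete, here is a minimal toy sketch. It assumes a made-up one-gene model in which expression is a celltype signal plus an additive batch shift; the numbers and names (`CELLTYPE_SIGNAL`, `BATCH_SHIFT`, `reference_profiles`) are purely illustrative and not part of TACCO.

```python
from statistics import mean

# Hypothetical one-gene toy model: expression = celltype signal + batch shift.
CELLTYPE_SIGNAL = {"A": 10.0, "B": 4.0}
BATCH_SHIFT = {"batch1": 0.0, "batch2": 3.0}


def reference_profiles(counts):
    """Mean expression per celltype, given cell counts per (celltype, batch)."""
    profiles = {}
    for celltype, signal in CELLTYPE_SIGNAL.items():
        values = [
            signal + BATCH_SHIFT[batch]
            for (ct, batch), n in counts.items()
            if ct == celltype
            for _ in range(n)
        ]
        profiles[celltype] = mean(values)
    return profiles


# Balanced: both batches have the same celltype composition.
balanced = {("A", "batch1"): 50, ("A", "batch2"): 50,
            ("B", "batch1"): 50, ("B", "batch2"): 50}
# Skewed: celltype A is mostly from batch1, celltype B mostly from batch2.
skewed = {("A", "batch1"): 90, ("A", "batch2"): 10,
          ("B", "batch1"): 10, ("B", "batch2"): 90}

balanced_profiles = reference_profiles(balanced)
skewed_profiles = reference_profiles(skewed)

# Difference between the two celltype profiles in each scenario.
balanced_gap = balanced_profiles["A"] - balanced_profiles["B"]
skewed_gap = skewed_profiles["A"] - skewed_profiles["B"]
```

In the balanced case both profiles carry the same average batch shift, so the difference between them equals the true signal difference (10 minus 4). With the skewed composition the shift no longer cancels and the difference between the profiles shrinks.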
If the batches in the reference are not balanced, then there can be a leading order effect on the annotation: A single reference profile per celltype cannot disentangle batch and celltype information. If e.g. celltypeA is over-represented in batch1, then the profile for celltypeA is confounded with batch1. Annotating new data with a batch effect close to batch1 will therefore tend to yield more celltypeA annotations. Depending on the strength of the batch effect this can skew an annotation drastically. This can be avoided by resampling the reference so that it is balanced. Alternatively, a representation by more than one profile per celltype can help here (see the `multi_center` option in https://simonwm.github.io/tacco/_autosummary/tacco.tools.annotate.html#tacco.tools.annotate). Then the information about celltype and batch is available in the annotation, celltype information is not confounded by batch, and new data can also choose a celltype from a minority batch.
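The resampling idea could look roughly like the following sketch. It is not a TACCO function: `balance_reference` and its input format (a list of `(cell_id, celltype, batch)` tuples) are assumptions made for illustration. It downsamples so that each celltype contributes the same number of cells from every batch in which it occurs; celltypes missing entirely from some batch are left as they are and would need separate handling.

```python
import random
from collections import defaultdict


def balance_reference(cells, seed=0):
    """Pick cell ids so each celltype is equally represented across its batches.

    `cells` is a list of (cell_id, celltype, batch) tuples (hypothetical format).
    """
    groups = defaultdict(list)
    for cell_id, celltype, batch in cells:
        groups[(celltype, batch)].append(cell_id)

    # For each celltype, find its smallest batch-specific group size.
    target = {}
    for (celltype, _batch), ids in groups.items():
        target[celltype] = min(target.get(celltype, len(ids)), len(ids))

    # Downsample every (celltype, batch) group to that size, reproducibly.
    rng = random.Random(seed)
    keep = []
    for celltype, batch in sorted(groups):
        keep.extend(rng.sample(groups[(celltype, batch)], target[celltype]))
    return keep
```

If the reference is an AnnData object, the returned ids could then be used to subset it, for example `reference[keep]`, assuming the ids are its observation names.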
If batch corrected counts for the reference are available, they should work as well. But I would try sub-sampling and/or the multi profile approach first, as it makes the analysis pipeline simpler and faster and avoids possible artifacts from the batch correction. Hope this helps!
Thank you for the in-depth answer.
The part I didn't quite get was the "multi_center" option.
Could you explain a bit more what it does and what a sub-category is? "multi_center" takes a number as input; how does this help with including a batch column?
The solution with `multi_center` does not provide the input for a batch column, but works more indirectly for batches in the reference: It is a generic solution for any kind of within-category heterogeneity in the reference. Batches would be a special application.
With `multi_center` you specify a number of representative means for each annotation category: instead of representing a category with a single expression profile, it gets automatically subclustered into `multi_center` subclusters and a profile for each of them is used in the annotation. Afterwards these are automatically collapsed into the single category. This is useful in cases where annotation categories have a large within-category variation, e.g., a coarsely annotated celltype or the same celltype from different batches. This propagates a sense of variability of the reference expression distribution to the annotation.
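The subcluster-then-collapse idea can be sketched in a few lines of plain Python. This is only an illustration under strong assumptions: it uses a simple k-means and a nearest-profile rule as a stand-in for the actual annotation method, and the helper names (`kmeans`, `build_profiles`, `annotate`) and the toy data are hypothetical, not TACCO internals.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster (keep it if empty).
        centers = [
            tuple(sum(xs) / len(xs) for xs in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers


def build_profiles(reference, multi_center):
    """Return (profile, category) pairs, up to `multi_center` profiles per category."""
    profiles = []
    for category, points in reference.items():
        k = min(multi_center, len(points))
        profiles.extend((center, category) for center in kmeans(points, k))
    return profiles


def annotate(point, profiles):
    """Pick the nearest profile and report only its parent category."""
    return min(profiles, key=lambda pc: math.dist(point, pc[0]))[1]


# Toy reference: celltype A comes from two very different batches.
reference = {
    "A": [(0.0, 0.0), (0.2, 0.0), (6.0, 6.0), (6.2, 6.0)],
    "B": [(2.0, 2.0), (2.2, 2.0)],
}
query = (0.1, 0.0)  # looks like an A cell from the first batch

single = annotate(query, build_profiles(reference, multi_center=1))
multi = annotate(query, build_profiles(reference, multi_center=2))
```

In this toy setup the single mean profile of A lies between its two batches, so the query ends up closer to B; with two profiles per category the query matches the A subcluster of its own batch and is reported as A after collapsing.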
For your case with multiple batches, it would be sensible to use `multi_center` at least as high as the number of batches. If batch is the driver for heterogeneity, then the subclusters will correspond to batches. If biological variability within a celltype is stronger than batch, then the subclusters will represent this. In any case, the annotation should improve until the within-subcluster heterogeneity is noise only. Then something similar to overfitting should set in. I would suggest trying a few choices of `multi_center` and inspecting the resulting annotations to see where you find an optimum with respect to the expected biology.
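One way to organise such a comparison is sketched below. The helpers `composition` and `sweep_multi_center` are hypothetical and not part of TACCO; the sketch only assumes that you can wrap your annotation call in a function that takes `multi_center` and returns one celltype label per cell.

```python
from collections import Counter


def composition(labels):
    """Fraction of cells assigned to each celltype."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {celltype: n / total for celltype, n in sorted(counts.items())}


def sweep_multi_center(annotate, choices=(1, 2, 4, 8)):
    """Call `annotate(multi_center=k)` for each k and summarise the labels.

    `annotate` is any callable returning one celltype label per cell.
    """
    return {k: composition(annotate(multi_center=k)) for k in choices}


# Possible wiring with TACCO (untested assumption: without a result key,
# tc.tl.annotate returns a cells x celltypes DataFrame; "celltype" is a
# placeholder for your annotation column):
#
#   import tacco as tc
#   def annotate(multi_center):
#       result = tc.tl.annotate(adata, reference, annotation_key="celltype",
#                               multi_center=multi_center)
#       return result.idxmax(axis=1)
#
#   summary = sweep_multi_center(annotate, choices=(n_batches, 2 * n_batches))
```

The resulting per-choice celltype fractions are only a coarse summary; they can be compared against the expected biology alongside a visual inspection of the annotations.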
Perfect, thank you!
Thanks for the tool, very impressive!
I am curious since TACCO expects counts from the scRNA reference and most reference sets are composed of multiple patients or 10x batches:
Is this a problem for TACCO? Should I subset to a single batch or maybe calculate "batch-corrected counts" with something like scVI?