theislab / scCODA

A Bayesian model for compositional single-cell data analysis
BSD 3-Clause "New" or "Revised" License

Influence of subtypes present in the dataset #57

Status: Closed (auesro closed 2 years ago)

auesro commented 2 years ago

Hi,

This is more of a question than a bug. If I should post it somewhere else, please let me know.

I am doing some tests with scCODA and was wondering how much the number of "populations" included in the dataset matters. Say I originally have 20 groups (=clusters) and 2 conditions, and find no compositional difference between conditions. What effect would it have if I subset my dataset to fewer groups (e.g. by removing 2 groups of less related cells)?

Thanks,

A

johannesostner commented 2 years ago

Hi @auesro, generally the Dirichlet-Multinomial distribution used in scCODA is not subcompositionally coherent. That means that adding or removing features (="populations") can change whether the remaining features, which were not added or removed themselves, are found to be compositionally different. However, if you remove a feature that has approximately the same proportion in all samples, it is very unlikely that the result is influenced. So it is fine to try removing features from the composition, but be aware that doing so may change the results.
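To illustrate the intuition at the level of observed proportions (this is a sketch of the renormalization step only, not of the scCODA model itself, and the symbols are introduced here for illustration): let $p_{i,s}$ be the share of population $i$ in sample $s$, and let $r$ be the removed population. After removal, the remaining shares are rescaled so that they sum to one again.

```latex
p'_{i,s} = \frac{p_{i,s}}{1 - p_{r,s}}, \qquad i \neq r

\frac{p'_{i,s}}{p'_{i,t}}
  = \frac{p_{i,s}}{p_{i,t}} \cdot \frac{1 - p_{r,t}}{1 - p_{r,s}}
```

If the removed share is the same in two samples $s$ and $t$ ($p_{r,s} = p_{r,t}$), the second factor equals 1 and the ratio between the samples is unchanged. If the removed share differs, the ratio shifts by that factor.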

auesro commented 2 years ago

I see. So with a low number of samples (4 in my case: 2 controls and 2 experimental), removing a population might influence the end result because of the inherent differences in composition between samples, am I right?

johannesostner commented 2 years ago

Yes, exactly. This is simply due to the nature of compositional data, as all populations are correlated (i.e. increasing the share of one population decreases the combined share of the others). If you remove one population, the shares of all other populations will increase, but not necessarily by the same factor in both conditions, as you removed a different share in each sample.

Also, because of the low number of samples, a single outlier could have quite a large impact on the end result.
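A small numeric sketch of this effect, using plain Python rather than scCODA. The population names, sample names, counts, and the `proportions` helper are all made up for illustration.

```python
# Hypothetical cell counts per population; all numbers are invented for illustration.
counts = {
    "control": {"A": 500, "B": 300, "C": 200},
    "treated": {"A": 400, "B": 240, "C": 360},
}


def proportions(sample, drop=()):
    """Return each remaining population's share after dropping the given populations."""
    kept = {name: n for name, n in sample.items() if name not in drop}
    total = sum(kept.values())
    return {name: n / total for name, n in kept.items()}


# Shares in the full composition and in the subcomposition without population C
full = {name: proportions(sample) for name, sample in counts.items()}
sub = {name: proportions(sample, drop=("C",)) for name, sample in counts.items()}

# Factor by which the share of A grows in each sample once C is removed
growth = {name: sub[name]["A"] / full[name]["A"] for name in counts}

for name in counts:
    print(name, "A before:", full[name]["A"], "A after:", sub[name]["A"],
          "growth factor:", growth[name])
```

In this invented example, population A makes up 50% of the control sample and 40% of the treated sample when all three populations are included. Once C is dropped, A makes up 62.5% of both samples, because C took up a different share in each one. The apparent difference in A depends on which populations are part of the composition.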

My suggestion would be to simply take a look at the result after removing the two populations and see if some cell types are differentially abundant then.

auesro commented 2 years ago

Thanks a lot, @johannesostner, very insightful!