theislab / scCODA

A Bayesian model for compositional single-cell data analysis
BSD 3-Clause "New" or "Revised" License
141 stars 23 forks source link

additive effects and interaction terms #81

Closed hoeflerb closed 11 months ago

hoeflerb commented 11 months ago

Hi,

Thanks for scCODA, great approach to performing compositional analysis. A question about the output tables. If we fit a model with two categorical covariates, with a GLM the additive effect would be determined by summing the coefficients, ex:

Treatments: Control, Treat1
Donors: Donor1, Donor2

Formula: Treatment+Donor
Treatment1,Donor1 effect = α + βtreat1
Treatment1,Donor2 effect = α + βtreat1 + βdonor2

For scCODA, is it correct to sum the log2-fold change terms to determine the additive effects (ie: Treatment1,Donor2 effect = LFCtreat1 + LFCdonor2)? I'm assuming that because the model uses a log-link function that this is valid, but I want to make sure I am thinking about this correctly.

The expected sample is calculated for each covariate separately (covariate value = 1, all other covariates = 0), with the same method as for the intercepts. The log-fold change is then calculated between this expected sample and the expected sample with no active covariates from the intercept section.

Similarly, if we fit a model with an interaction term, can the effect of the interaction be determined in the same way (ie: Treatment1,Donor2 effect = LFCtreat1 + LFCdonor2 + LFCtreat1:donor2)? In other words, can we treat the log2-fold change values in the same way as (log2 (α + β)/α) of a GLM?

Also, is there a way to do model comparisons with scCODA (maybe using WAIC)? If we want to know whether the model fits better with or without an interaction term, how would we determine this?

johannesostner commented 11 months ago

Hi @hoeflerb,

thanks for your interest in scCODA! Your intuition about scCODA having a GLM-like covariate structure is correct. The effects (called "Final Parameter" in the output table) can be combined additively, if you set them like Treatment + Donor in your formula parameter. For interaction parameters, the same thing holds in theory. We never explicitly tested this in practice, though.

Calculating the expected sample and log-fold change is a little bit more involved though due to the compositionality of the data. We assume the data to be Dirichlet-Multinomially distributed (y ~ DirMult(a, N)) with proportions log(a) = α + βtreat1 + βdonor2, which causes non-additivity of the mean wrt a. Therefore, I'd suggest to calculate the expected sample (and subsequently the log-fold change) individually for each scenario. The expected sample is the mean of a Dirichlet-Multinomial distribution with N (the mean number of cells per sample in your data) draws and distribution a (exp of thee sum of your effect vectors of all covariates of interest, as above), i.e. E(sample) = N*(a/sum(a)). The log-fold change is then the change of this expected sample compared to the control group with only the intercept.

johannesostner commented 11 months ago

As for model comparisons, we never looked into this specifically, but you could compare different models through sth like WAIC. Be careful with the metric for determining the goodness of fit though, as the data is non-normal!

hoeflerb commented 11 months ago

Thanks Johannes for the detailed response! I ran a quick mock calculation using your instructions and was able to reproduce the log2-fold change values reported in the results table, so this makes sense now.

For the WAIC calculation, the built-in arviz function requires log likelihoods to be stored in the InferenceData object, so I'm not sure how to go about obtaining those. I will open a new issue for that question.