qiime2 / q2-composition

BUG: ancombc no longer works with '*' as the interaction term in the formula #133

Open lizgehret opened 1 month ago

lizgehret commented 1 month ago

I initially discovered this while building the 2024.5 docs, and have also replicated it locally. Within a 2024.5 amplicon environment (on macOS), the PD mice command that runs ancombc with the formula 'donor * genotype' fails with the following error message:

Error in .data_qc(meta_data = meta_data, formula = formula, group = group,  :
  The following variables specified are not in the meta data: donor*genotype
Calls: ancombc -> .data_qc

This doesn't occur in 2024.2. I need to investigate further, but something seems to have changed with the input handling for ancombc. This error doesn't occur when swapping out '*' for '+' in this particular example.

lizgehret commented 1 month ago

Update: this also fails when ':' is used as the interaction operator.

I tested this with the following command (using PD mice dataset):

qiime composition ancombc \
--i-data table.qza \
--m-metadata-file metadata.tsv \
--p-formula 'donor:genotype + donor + genotype' \
--p-reference-levels 'donor::hc_1' 'genotype::wild type' \
--o-differentials diff.qza \
--verbose

This is the error message:

  The following variables specified are not in the meta data: donor:genotype

lizgehret commented 1 month ago

Another update: this appears to be due to the following code change in ancombc: https://github.com/FrederickHuangLin/ANCOMBC/blob/d402833f7d5ca5033132a0abba63e06674c7b6b1/R/ancombc_prep.R#L132

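This change splits the formula on '+' only, so only additive terms are accepted. A term written with '*' or ':' is treated as a single literal variable name and then fails the metadata check.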

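ebolyen commented 1 month ago

# what should happen (formula is a character string, so it is converted to a formula object first):
vars = rownames(attr(terms(as.formula(paste0("~", formula))), 'factors'))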
# what currently happens:
vars = unlist(strsplit(formula, split = "\\s*\\+\\s*"))
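
To make the difference concrete, here is a minimal Python sketch of the two behaviours. It is an illustration only, not the ANCOMBC or q2-composition code: `split_on_plus` reuses the same regular expression as the `strsplit` call above, while `variable_names` is a hypothetical, simplified stand-in for what R's `terms()` would recover (it only handles bare column names, not function calls or backticked names). The `columns` set is an assumed pair of metadata columns taken from the examples in this thread.

```python
import re

# Same pattern as the R call: strsplit(formula, split = "\\s*\\+\\s*")
PLUS_SPLIT = re.compile(r"\s*\+\s*")


def split_on_plus(formula):
    """Mimic the current behaviour: each '+'-separated chunk is one 'variable'."""
    return PLUS_SPLIT.split(formula.strip())


def variable_names(formula):
    """Hypothetical stand-in for terms(): collect bare names, ignoring operators.

    Simplification: assumes plain column names joined by +, * or : only.
    """
    seen = []
    for name in re.findall(r"[A-Za-z_.][A-Za-z0-9_.]*", formula):
        if name not in seen:
            seen.append(name)
    return seen


def missing_from_metadata(variables, columns):
    """Return the variables that do not match any metadata column."""
    return [v for v in variables if v not in columns]


# Assumed metadata columns, taken from the formulas used in this issue.
columns = {"donor", "genotype"}

for formula in (
    "donor + genotype",
    "donor * genotype",
    "donor:genotype + donor + genotype",
):
    by_split = missing_from_metadata(split_on_plus(formula), columns)
    by_names = missing_from_metadata(variable_names(formula), columns)
    print(f"{formula!r}: split-on-plus missing={by_split}, names missing={by_names}")
```

With the split-on-plus approach, the whole interaction term is reported as a missing variable for both the '*' and ':' formulas, which matches the shape of the errors reported above; extracting the underlying names reports nothing missing.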

ebolyen commented 1 month ago

It appears to be intentional: https://github.com/FrederickHuangLin/ANCOMBC/issues/141

If I remember correctly, we don't use the group argument because it's naive to the contrast for a given factor, and the resulting p-value mask seemed... strange.