qiime2 / q2-composition

BSD 3-Clause "New" or "Revised" License

BUG: chosen intercept column in metadata but not table will default to alphabetical order #101

Closed lizgehret closed 1 year ago

lizgehret commented 1 year ago

Bug Description:

When a reference level is chosen from a column that exists in the metadata, but none of the sample IDs associated with that level are in the feature table, no error is raised. ANCOM-BC silently falls back to alphabetical order when picking the intercept for that column. We should raise an error in this case, because the results no longer correspond to the reference level the user requested.

Example Metadata File:

sample-id         Column1           Column2          Column3
S001              group1            Test1            1823
S002              group2            Test2            2843
S003              group3            Test3            9972

Example Feature Table:

sample-id      feature1      feature2
S002           10            25
S003           2             14 

Example Command:

qiime composition ancombc \
--i-table table.qza \
--m-metadata-file sample-md.tsv \
--p-formula Column1 \
--p-reference-level 'Column1::group1' \
--o-differentials ancombc-diffs.qza

In this example, Column1::group1 was chosen as the reference level, but the sample ID S001 is not included in the feature table, and is thus not included in the actual analysis. This causes the reference level behavior to default back to alphabetical order for the chosen formula column, meaning that group2 is selected as the intercept (i.e. reference level) instead of group1. This would produce the following differential table:

id           (Intercept)       Column1group3
feature1      0.004            0.0005
feature2      0.352            0.00478
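
The fallback described above can be reproduced outside of QIIME 2 with a few lines of pandas. This is only an illustrative sketch, not the plugin's actual code: it assumes that the analysis sees only samples present in the feature table, and that the first level in sorted order becomes the intercept when the requested level is unavailable. The variable names are hypothetical.

```python
import pandas as pd

# Example metadata from this issue (only the formula column is needed).
metadata = pd.DataFrame(
    {"Column1": ["group1", "group2", "group3"]},
    index=pd.Index(["S001", "S002", "S003"], name="sample-id"),
)

# Sample IDs present in the example feature table (S001 is missing).
table_ids = ["S002", "S003"]

# Only samples that are in the feature table take part in the analysis.
analyzed = metadata.loc[metadata.index.intersection(table_ids)]

# Assumption: without a usable reference level, the first level in
# alphabetical order is used as the intercept.
levels = sorted(analyzed["Column1"].unique())
fallback_reference = levels[0]
remaining = levels[1:]

print("intercept:", fallback_reference)
print("other columns:", remaining)
```

With the example inputs, group1 never reaches the analysis, so group2 ends up as the intercept and only group3 gets its own column.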

This output is confusing for users: they expect the (Intercept) column to represent group1, with two additional columns for group2 and group3 from Column1. Nothing in these results shows which level was actually used as the intercept (i.e. reference level), or why one of the expected columns seemingly disappeared.

We should raise an error if the sample IDs associated with the chosen reference level are not included in the feature table (even if they are included in the metadata). cc: @cherman2, as she discovered this error while we were running ANCOM-BC on one of her datasets.
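
A minimal sketch of what such a check could look like is below. It is not the fix that was merged into q2-composition; `check_reference_level` is a hypothetical helper, and it assumes the error condition is "no sample in the feature table carries the requested level", with the reference level given in the `column::level` form used by `--p-reference-level`.

```python
import pandas as pd


def check_reference_level(metadata, table_ids, reference_level):
    """Raise ValueError if the reference level is absent from the analyzed samples.

    Hypothetical helper: `metadata` is a DataFrame indexed by sample ID,
    `table_ids` are the sample IDs in the feature table, and
    `reference_level` is a 'column::level' string.
    """
    column, sep, level = reference_level.partition("::")
    if not sep:
        raise ValueError(
            f"Expected 'column::level', got {reference_level!r}."
        )
    if column not in metadata.columns:
        raise ValueError(f"Column {column!r} is not in the metadata.")

    # Keep only the metadata rows whose sample IDs are in the feature table.
    in_table = metadata.loc[metadata.index.isin(table_ids), column]

    # Compare as strings so numeric columns are handled the same way.
    if level not in set(in_table.astype(str)):
        raise ValueError(
            f"Reference level {level!r} from column {column!r} is in the "
            "metadata, but none of its sample IDs are in the feature table."
        )


# Example inputs from this issue.
metadata = pd.DataFrame(
    {"Column1": ["group1", "group2", "group3"]},
    index=pd.Index(["S001", "S002", "S003"], name="sample-id"),
)
table_ids = ["S002", "S003"]

try:
    check_reference_level(metadata, table_ids, "Column1::group1")
except ValueError as err:
    print(err)
```

Running the check before the model is fit would turn the silent fallback into an explicit failure that names the offending column and level.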