how to deal with ordered categories

krmaas commented 6 months ago

I have a study with control + 3 treatment levels. They are factors and I've reordered the levels so they are in order (control, low, medium, high), does that order impact the emuFit?

all_fit <- emuFit(formula = ~ Treatment,
                  data = alpha.expdata, #df containing the factor Treatment 
                  Y = otu.ab.denit) #df OTU matrix

I only have 20 samples, so when running with all 4 treatment levels I got this warning Constrained optimization failed to converge within iteration limit; retrying with smaller penalty scaling parameter tau and larger inner_maxit.

Which I think means no solution found? Before giving up on this model, I wanted to make sure that I was coding my treatment correctly.

I created another treatment variable with High vs everything else which did complete, so I have emuFit working. Very excited to dive into those results.

adw96 commented 6 months ago

Hi @krmaas ! Thanks for trying this out and I hope you don't give up on the model! We're here to help.

Please don't worry about that warning -- emuFit just needs a bit more time to get you a good estimate.

How many taxa are you working with? You might want to set run_score_tests=FALSE so you can get your estimates right away. Then, you can run your score tests in parallel. Let us know if you would like some help from us in setting up a script to show you how to do this in parallel!

Ordering of your categories (with respect to estimated coefficients) isn't enforced by radEmu -- enforcing ordering of categories is very uncommon and isn't done by DESeq2, corncob, LEFSe, etc. I expect that your data will estimate this ordering naturally at least for some taxa 😊

Here's some code I wrote to check this should all go through:

set.seed(1)
n <- 20
J <- 50
Y <- matrix(rpois(n*J, 100)*rbinom(n*J, 100, 0.7), nrow=n, ncol = J)
treatment <- rep(c(LETTERS[1:4]), 5)
XX <- data.frame(treatment)

library(radEmu)
# ef <- emuFit(formula = ~ treatment, 
#              data = XX, 
#              Y = Y) ### takes a while, as it runs score tests
ef <- emuFit(formula = ~ treatment, 
             data = XX, 
             Y = Y, 
             run_score_tests=FALSE) #### very fast
ef

ailurophilia commented 6 months ago

Hi @krmaas – also very excited that you're taking our method for a spin! 100% agree with the suggestions @adw96 makes.

Just popping my head in to clarify that although the warning you saw is indeed triggered if an optimization attempt for a score test doesn't converge, by default, the fitting function will re-attempt the optimization if this happens, so this warning does not necessarily mean that any part of the model didn't converge.

adw96 commented 6 months ago

@krmaas Please let us know if we've addressed your questions (ideally by closing this issue).

If there's anything else, please don't hesitate to let us know, though!

krmaas commented 6 months ago

thanks for the quick replies. I absolutely haven't given up on this :) I have a bunch of datasets to try it on. I've never been a deseq or lefseq fan. I've continued using indicator species even though it's not great for microbes.

Can you help me understand how it handles more than one category? does it compare a vs b, a vs c, b vs c? Sorry if this is part of the formulas in the paper-I don't follow those well.

ailurophilia commented 6 months ago

Hi @krmaas – radEmu handles (multi-category) categorical variables in essentially the same way that lm and glm do, if you've encountered categorical variables in one of those contexts. That is, there's nothing unique about radEmu in this regard.

Default behavior is that levels of a categorical variable will be compared to the reference level, so if you fit a model with levels a, b, and c, you will get output comparing b to a and c to a. This is a product of R's default behavior for handling factor variables. If you'd like to investigate different comparisons (e.g., b to a and c to b), you might consider changing the contrasts for the Treatment variable. Here's a source with some info on some options in R: https://stats.oarc.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables. (I suspect there are also tutorials, etc., on youtube if you're interested -- contrasts are not the most intuitive thing imo.)

So in short, default behavior is compare all levels to the first/reference level and this is just default behavior in R, but you can change that if you like. Hope this helps!

krmaas commented 5 months ago

thank you all for your quick responses, sorry I haven't been as timely

statdivlab / radEmu

how to deal with ordered categories #31