tpq / propr

[OFFICIAL] [CoDA] An R package to calculate proportionality and other measures for compositional data
69 stars 10 forks source link

Question about lr2glm model.matrix argument #18

Open MeggyC opened 4 years ago

MeggyC commented 4 years ago

Hi there,

I am working on sequence data from coral associated viral communities, which is in a dataframe of CLR transformed abundance data in which the columns are the samples and the rows are the viral species. I have a metadata dataframe too for each sample with ~15 numerical variables.

I want to build a glm for this data in order to see which metadata variables correlate best with the changes in community composition.

From looking at the documentation from propr, I think that the most appropriate course of action would be to use the lr2glm function. In this function I would pass the clr abundance dataframe as the first argument, and it requires a model matrix as the second argument. How do I generate this model matrix? Is that the result of the model.matrix() function in modelr? I would really appreciate an example please?

Also, does the clr transformed abundance matrix need to have the samples as the columns, or the variables (here, viruses).

Many thanks!

tpq commented 4 years ago

Hi Meggy,

There are a few options here. Given that your data are already CLR transformed, you can fit a glm model directly, as you would for any other data. The lr2glm() function in propr does something like this, and you find it helpful to look at the source code. Note however that this function is not really part of propr per se, which is why it is not properly documented. But to answer your question, a model matrix is indeed produced by the model.matrix() function. This function is actually contained within base R! If you check ?model.matrix in base R, you can find an example.

dd <- data.frame(a = gl(3,4), b = gl(4,1,12)) # balanced 2-way
m <- model.matrix(~ a + b, dd)

You would then supply m to the second argument of lr2glm().

tpq commented 4 years ago

Note that the GLM will test each OTU one at a time, via the formula OTU_i ~ Covariates. If you want to perform a fully multivariable analysis, you could instead supply the CLR-transformed data, along with the covariates, to a multivariable method, such as redundancy analysis vegan::rda() or canonical correlation analysis vegan::CCorA().