including correlations with environmental or experimental variables?

danchurch commented 3 months ago

Hello, and thanks for the cool/useful package.

Is there a way to include environmental predictors within a matrix to check for correlations with a categorical predictor column? I'd like to pick out network modules that are associated with certain experimental treatments and not others. In such a case, each of the categories would appear in the network diagram as another node, with edges to the OTUs. I have thought about "tricking" the program and simply adding a 0/1 dummy column for each of the categories in the count matrix.

I'd also like to do this for a continuous variable of interest.

Either way, the distributions of these variables are going to look very different from the count data columns, so I assume this will violate assumptions in the various network construction algorithms? I'm currently just using Pearson's correlation coefficient on CLR-transformed counts, but I'm open to changing this.

Any advice here would be appreciated.

stefpeschel commented 3 months ago

Hi!

You're right! Simply adding covariates to the microbial count matrix would lead to spurious correlations because the data are clr-transformed. Including any covariates in this transformation would therefore give false results.
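One way to see the problem is with a toy example. The sketch below is only an illustration, not NetCoMi code: the counts and the covariate values are made up, and `clr` is a hand-written helper. It applies the clr transform to two samples with identical taxon counts but different covariate values, and shows that the covariate leaks into the transformed value of every taxon.

```python
import math


def clr(row):
    """Centered log-ratio transform of one sample (all entries must be > 0)."""
    logs = [math.log(x) for x in row]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# Hypothetical counts for three taxa, identical in both samples.
taxa = [10.0, 100.0, 1000.0]

# The same taxa with a made-up covariate appended as a fourth "count" column.
sample_low = taxa + [1.0]
sample_high = taxa + [10000.0]

# clr values of the three taxa only (the covariate entry is dropped again).
clr_low = clr(sample_low)[:3]
clr_high = clr(sample_high)[:3]

# The taxa did not change, yet every clr value is shifted by the covariate.
shifts = [low - high for low, high in zip(clr_low, clr_high)]
print(shifts)
```

Because the shift is the same for every taxon in a sample and varies with the covariate across samples, it can create associations that have nothing to do with the taxa themselves.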

In principle, you want to compute two types of correlation: species-species correlations and species-covariate correlations. For the species-species correlations, the data need to be transformed to account for compositionality. Between species and environmental factors, however, no transformation is needed because there are no compositional effects (the species counts and an environmental factor don't add up to a constant sum). So you could compute the two correlation matrices separately using appropriate methods and bind them into a single correlation matrix, which can then be passed to NetCoMi.

If you're using Pearson correlations anyway, you could in principle simply compute Pearson correlations between the (clr-transformed) counts and the covariates. However, this approach does not account for the high-dimensionality of the data.
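A minimal sketch of the "compute separately, then bind" idea for the simple Pearson case is shown here. It is written in plain Python purely as an illustration; the toy counts, the 0/1 treatment dummy, the pH values, and the helpers `clr` and `pearson` are all hypothetical and not part of NetCoMi. Keeping the covariate-covariate block is an arbitrary choice here, since the discussion above does not cover it.

```python
import math


def clr(row):
    """Centered log-ratio transform of one sample (all entries must be > 0)."""
    logs = [math.log(x) for x in row]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def pearson(x, y):
    """Pearson correlation between two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy data: 5 samples x 3 taxa (pseudocount already added).
counts = [
    [12.0, 30.0, 5.0],
    [20.0, 25.0, 9.0],
    [7.0, 40.0, 3.0],
    [15.0, 10.0, 22.0],
    [9.0, 18.0, 14.0],
]
# Hypothetical covariates, one value per sample; they are NOT clr-transformed.
treatment = [0.0, 0.0, 1.0, 1.0, 1.0]  # 0/1 dummy for one treatment level
ph = [5.1, 5.4, 6.0, 6.8, 6.5]         # continuous variable
covariates = [treatment, ph]
names = ["OTU1", "OTU2", "OTU3", "treatment", "pH"]

# clr is applied per sample to the taxa only, then transposed to columns.
taxa_cols = [list(col) for col in zip(*(clr(row) for row in counts))]
p, q = len(taxa_cols), len(covariates)

# The three blocks are computed separately.
taxa_taxa = [[pearson(a, b) for b in taxa_cols] for a in taxa_cols]
taxa_cov = [[pearson(a, c) for c in covariates] for a in taxa_cols]
cov_cov = [[pearson(c, d) for d in covariates] for c in covariates]

# The blocks are bound into one symmetric (p + q) x (p + q) matrix.
top = [tt + tc for tt, tc in zip(taxa_taxa, taxa_cov)]
bottom = [[taxa_cov[i][k] for i in range(p)] + cov_cov[k] for k in range(q)]
assoc = top + bottom

for name, row in zip(names, assoc):
    print(name, [round(v, 2) for v in row])
```

With more suitable estimators, only the code that fills `taxa_taxa` and `taxa_cov` would change; the binding step stays the same, and the combined matrix is what would be handed to NetCoMi.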

I would instead recommend using SPRING or SpiecEasi for the microbial associations. Both estimate conditional dependence instead of "marginal" correlations, and both are appropriate for high-dimensional data. Both packages include Meinshausen & Bühlmann neighborhood selection, where for each node a regression model with all other nodes as predictors is solved.

For the species-covariate associations, you could then use a log-contrast regression, which is recommended for regression problems with compositional data as predictors. There are several R packages that do this, for instance: https://rdrr.io/cran/Compositional/man/lc.reg.html

stefpeschel commented 3 months ago

BTW: SpiecEasi offers the possibility to build cross-domain networks, which goes in the same direction, since within-domain associations are estimated differently than between-domain associations. This is nicely explained in Tipton et al. (2018).

danchurch commented 3 months ago

Ok - so build the two sets of correlations separately with the appropriate statistical methods, then combine them into a custom correlation matrix, and feed this back into the visualization pipelines in NetCoMi. Makes sense. Thank you for the really thoughtful response here.

stefpeschel / NetCoMi

including correlations with environmental or experimental variables? #122