microbiome / mia

Microbiome analysis
https://microbiome.github.io/mia/
Artistic License 2.0
45 stars 25 forks source link

Initialise mediation functions #503

Closed RiboRings closed 2 months ago

RiboRings commented 4 months ago

This PR is not ready yet, but it was opened to begin discussion around mediation analysis for mia. Here, two new main functions are introduced: mediateColData and mediateAssay, both located in R/mediate.R. They depend on the mediation package and if this is not ideal these functions could be placed in an extension of mia.

mediateColData provides an easy way to run mediation analysis between 3 variables in the colData, while mediateAssay allows to run multiple mediation models for every taxon in an assay, or for every component of a reducedDim, and subsequently adjust significance for multiple comparison. They both leverage the mediation::mediate function under the hood.

mediateColData takes a tse object, three colData variables for outcome, treatment and mediator, respectively, plus other covariates from the colData and other arguments supported by mediation::mediate. It returns the output of the mediation model (similar to the output of lm or glm). Below is a self-contained example of how it works:

library(mia)
library(microbiomeDataSets)
library(mediation)
library(dplyr)
source("R/mediate.R")

tse <- OKeefeDSData()

tse <- transformAssay(tse,
                      method = "relabundance")

tse <- estimateDiversity(tse,
                         index = "shannon",
                         assay.type = "relabundance")

colData(tse)$bmi_group <- as.numeric(tse$bmi_group)

med_out <- mediateColData(tse,
                          outcome = "bmi_group",
                          treatment = "nationality",
                          mediator = "shannon",
                          covariates = "timepoint.within.group",
                          boot = TRUE, sims = 1000)

summary(med_out)

# Causal Mediation Analysis 
#
# Nonparametric Bootstrap Confidence Intervals with the Percentile Method
#
# (Inference Conditional on the Covariate Values Specified in `covariates')
#
#                Estimate 95% CI Lower 95% CI Upper p-value    
# ACME             0.1086       0.0344         0.19   0.004 ** 
# ADE             -0.4580      -0.6726        -0.24  <2e-16 ***
# Total Effect    -0.3493      -0.5484        -0.13  <2e-16 ***
# Prop. Mediated  -0.3110      -0.9781        -0.08   0.004 ** 
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Sample Size Used: 222 
#
#
# Simulations: 1000 

plot(med_out)

Screenshot 2024-03-06 at 18 15 50

mediateAssay iterates over the same code used for mediateColData. It takes the extra arguments assay.type and dim.type. If one of them is correctly specified, it runs mediation::mediate for every row of an assay or every column of a reducedDim slot, respectively. As output, it returns a dataframe with effect sizes and p-values for every taxon/component. Below is an example that requires variables from the previous example above:

tse <- transformAssay(tse,
                      method = "clr",
                      pseudocount = 1)

tse <- tse[ , tse$timepoint.within.group == 2]

med_res <- mediateAssay(tse,
                        outcome = "bmi_group",
                        treatment = "nationality",
                        assay.type = "clr",
                        boot = TRUE, sims = 300)

med_res %>%
  filter(ACME_adjpval > 0.05) %>%
  head() %>%
  knitr::kable()
Treatment Mediator Outcome ACME_estimate ADE_estimate ACME_pval ADE_pval ACME_adjpval ADE_adjpval ACME_CI_lower ADE_CI_lower
nationality Uncultured Mollicutes bmi_group -0.1700134 -0.1986718 0.0066667 0.2133333 0.2888889 0.2133333 -0.0469092 0.1062497
nationality Eubacterium ventriosum et rel. bmi_group -0.1087084 -0.2599768 0.0200000 0.0866667 0.4814815 0.0901333 -0.0100311 0.0607181
nationality Eubacterium cylindroides et rel. bmi_group 0.1010478 -0.4697330 0.0200000 0.0133333 0.4814815 0.0279570 0.2347390 -0.1674882
nationality Lactobacillus salivarius et rel. bmi_group 0.0816559 -0.4503411 0.0266667 0.0066667 0.4814815 0.0222222 0.1957793 -0.1609730
nationality Bacteroides uniformis et rel. bmi_group 0.2371214 -0.6058066 0.0266667 0.0133333 0.4814815 0.0279570 0.4490582 -0.2087500
nationality Uncultured Selenomonadaceae bmi_group 0.1084270 -0.4771122 0.0333333 0.0133333 0.4814815 0.0279570 0.2127692 -0.1225544
nationality Parabacteroides distasonis et rel. bmi_group 0.2731733 -0.6418585 0.0333333 0.0000000 0.4814815 0.0000000 0.5877062 -0.2688401
nationality Catenibacterium mitsuokai et rel. bmi_group -0.0810081 -0.2876771 0.0466667 0.0800000 0.5515152 0.0845528 -0.0009472 0.0257356

More examples can be found in this repository

antagomir commented 4 months ago

Great.

  1. Input types: splitting functionality by colData and assay is one option. Another option is to split by univariate vs. multivariate. In such case you could do something like mediate(tse, outcome = ..., treatment = "nationality", assay.type = "clr", ...) and there the "outcome" could refer to colData variable, or to one assay feature (taxon). This would be recognized automatically. It is important to have a chance to run mediation also for colData but then the question is, what is the added value compared to just running the basic mediate::mediation function from the original package on just a single variable from colData.. And why not allow situations where multiple comparisons could be run for colData.. (e.g. if there are different derived indices in addition to Shannon diversity).. This is a more general design aspect of mia and might deserve some thought. On the other hand perhaps we could just go with this colData / assay split first and improve later.

  2. Output: @TuomasBorman might comment but have been wondering if these functions should directly return augmented TreeSE objects as output. BUt I am not sure if it is feasible here. At least the examples should show how to add relevant pats of the results to the TreeSE object

  3. Support for custom mediation functions. If possible, it would be idea to allow users to input also custom mediation functions. This will greately facilitate later innovation and comparisons. In practice, we should define how the inputs and ouputs of such function should look like and as long as the function follows those standards it could be anything. For instance, we have allow the "FUN" argument to insert entirely new dissimilarity measures in functions such as runNMDS: https://microbiome.github.io/mia/reference/runNMDS.html

  4. We need an argument to choose from different methods, even if this would first provide only one method. Some other methods might like to take the entire assay rather than run for loop over features.

TuomasBorman commented 4 months ago

Looks nice!

  1. I think it is better to have one function for one task if possible. For example, now we have 2 functions for cross-correlation which might just confuse user.

In this case one options would be getMediate(tse, outcome = "disease", mediator = "counts", treatment = "diet", outcome.type = NULL, mediator.type = NULL, treatment.type = NULL), where outcome,mediatorand treatment parameters are automatically fetched from colData/assay/reducedDim. -> give error if overlapping name --> user can manually specify with *type where it is searched

  1. Yes, it could be added to metadata but I don't really see additional value. However, the function name could be getMediate to highlight that it does not return TreeSE. (add* would return)
RiboRings commented 4 months ago

Thank you for the comments!

@TuomasBorman, if I understand correctly, you are suggesting that mediateColData and mediateAssay should be combined into a single function, right? It sounds like it would simplify the code, and make it possible to use single or multiple variables for any of the outcome, mediator and treatment arguments.

But then how should we standardise the output? Because when multiple analyses are run (mediateAssay), the most practical output is a data.frame of model statistics for every combination, whereas for a single analysis (mediateColData), the most informative output is the trained model itself (so you can do summary or plot on it). From this aspect, keeping the two functions separate might be more convenient.

RiboRings commented 4 months ago

@antagomir, so you would consider this implementation with mediation::mediate like one option out of many, right? And then other mediation functions such as the hdmed ones could be specified with FUN = ...? This would definitely make this function more generic and complete.

In this case, how would you control the suitability of the custom method? For example, the hdmed (high-dimensional mediation) funs are only appropriate for high-dimensional data like the assay (mediateAssay), but not for colData variables (mediateColData). Also for this aspect having two separate functions might be more convenient(?)

Or actually what do you mean by custom mediation functions?

antagomir commented 3 months ago

Well in fact also the arguments may be very different for different methods. That's why there is for instance runMDS, runNMDS, etc. and not a single runOrdination.. so perhaps it is not good to go very generic for now. Let's just finalize this for the mediation package.

Regarding comment from TB above: it is recommened in general that the function output should always be the same, regardless of the inputs. We could have that, and in addition the user could choose to include the resulting model/s in metadata slot (default FALSE but examples could show how to do this when necessary). This applies also to the assay case, it would be a list of output models.

Would that work?

Default output could indeed be a data frame with effect sizes and (adjusted) p-values.

RiboRings commented 3 months ago

Hi!

The getMediation function is ready for review. I still need to add more argument checks, comment code and add unit testing, but the main parts are there. Now there is a single function that can take mediators from the colData (mediator arg), the assays (assay.type) or reducedDims (dim.type). It can also return the original model in attr(med_df, "metadata")[[model_number]] when add.metadata = TRUE.

Caveats:

antagomir commented 3 months ago

@RiboRings if you can close the conversations one by one when they have been resolved that would be great, to keep track on what was solved and what not

antagomir commented 3 months ago

Up - @RiboRings

RiboRings commented 2 months ago

Hi! I finally got back to this

TuomasBorman commented 2 months ago

Looks good, I will go through this in detail by tomorrow. I ran BiocCheck. There are at least multiple lines that were over 80 characters long.

I think code that we add should follow as closely as possible Bioconductor recommendations (it is harder to fix the code in the future). --> Can you run BiocCheck and fix the issues related to the lines that are added with this PR?

RiboRings commented 2 months ago

Ok, thank you for all the guidance!

I shortened the lines. Just as a reference, I also got this error when running BiocCheck:

* Checking for remote package usage...
    * ERROR: Package dependencies must be on CRAN or Bioconductor. Remove
      'Remotes:' from DESCRIPTION
antagomir commented 2 months ago

Hmm - - perhaps we should move hitchip to mia then..

TuomasBorman commented 2 months ago

There are also SilvermanAGutData from miaTime that is used in unit tests. At some point, we are planning to submit miaTime to Bioconductor so this is temporal thing.

Adding miaTime to Remotes in DESCRIPTION means that miaTime is installed when mia is built. This might cause problems. So it might be better to remove it from there.

Options

  1. Move hitchip data to mia --> We have to then move it back to miaTime when it is in Bioconductor.
  2. Run examples only when miaTime is installed --> This means that examples are not run (because miaTime is not installed by default)
  3. In addition to step 2, add miaTime installation step in the maediate examples. --> miaTime is installed only when documentation is built (not when user installs the package, as far as I know). However, there should be no installation steps in the documentation.

    Maybe the option 3 is the best?

antagomir commented 2 months ago

All options ok to me.

hitchip can move permanently to mia if this helps, there is no particular need to have it in miaTime. It was put there because we needed to limit the number of data sets in mia package and because hitchip had some time series in it

RiboRings commented 2 months ago

Maybe we could comment out the examples and remove miaTime from DESCRIPTION, and then uncomment the examples when miaTime is submitted to BioConductor?

TuomasBorman commented 2 months ago

Maybe we could comment out the examples and remove miaTime from DESCRIPTION, and then uncomment the examples when miaTime is submitted to BioConductor?

That was actually same to where I ended up after I thought this over night. --> So let's do that.

Instead of commenting out do:

if( require("miaTime") ){
examples
}

miaTime is used some other places also, but they are not affected. They are already ran only if miaTime is installed. (vignette is not ran)

antagomir commented 2 months ago

a shining idea

RiboRings commented 2 months ago

R CMD check fails due to warning:

❯ checking for unstated dependencies in examples ... WARNING
  'library' or 'require' call not declared from: ‘miaTime’

Also, for some reason miaTime is being installed even when it is not in DESCRIPTION:

miaTime                    0.1.21     2024-04-17 [1] Github (microbiome/miaTime@9fe9[771](https://github.com/microbiome/mia/actions/runs/8753409443/job/24023056975#step:4:791))
antagomir commented 2 months ago

For me running the latest version of the "mediation" branch through checks throws warning

Warning in data(hitchip1006) : data set ‘hitchip1006’ not found Error: object 'hitchip1006' not found

Also

TuomasBorman commented 2 months ago

I added \donttest which should fix the issue of @RiboRings.

@antagomir I cannot reproduce your error. It runs fine in my local machine, and the same tests are also run in GHA. Although, it is good to investigate this further to get answer on why you get different results

TuomasBorman commented 2 months ago

Did not run, I disabled the examples

antagomir commented 2 months ago

If this goes through automated tests then I think should be ok

RiboRings commented 2 months ago

Ubuntu:

* checking whether package ‘mia’ can be installed ... [30s/30s] WARNING
Found the following significant warnings:
  Warning: package ‘matrixStats’ was built under R version 4.5.0
  Warning: package ‘GenomicRanges’ was built under R version 4.5.0
  Warning: package ‘Biobase’ was built under R version 4.5.0
  Warning: package ‘SingleCellExperiment’ was built under R version 4.5.0

Mac:

Error: Error: Failed to install 'mia' from local:
  unable to load shared object '/Users/runner/work/_temp/Library/cli/libs/cli.so':
  dlopen(/Users/runner/work/_temp/Library/cli/libs/cli.so, 0x0006): tried: '/Users/runner/work/_temp/Library/cli/libs/cli.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64e' or 'arm64')), '/System/Volumes/Preboot/Cryptexes/OS/Users/runner/work/_temp/Library/cli/libs/cli.so' (no such file), '/Users/runner/work/_temp/Library/cli/libs/cli.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64e' or 'arm64'))
TuomasBorman commented 2 months ago

That will be hopefully fixed soon when new Bioc is released. However R 4.5.0 is little bit odd since next Bioc devel is developed against 4.4.0. That is something to do with rworkflows, I think

RiboRings commented 2 months ago

Hi @TuomasBorman! Should I do anything to fix this? How should we proceed to merge this PR?

TuomasBorman commented 2 months ago

I haven't been able to fix the problem, however, everything should be ok if at least one of these runs work and there are no errors related to this PR. --> so we can merge without passing all these rules

(I asked from rworkflows maintainers if they know what is the problem, and they have not answered yet)

TuomasBorman commented 2 months ago

Seems ok, I will merge (I just updated NEWS). Thanks!