insightsengineering / rbmi

Reference based multiple imputation R package
https://insightsengineering.github.io/rbmi/

Better simulation function #225

Closed nociale closed 2 years ago

nociale commented 2 years ago

It might be useful to have a better data simulation function. In particular the following enhancements need to be discussed:

wolbersm commented 2 years ago

Yes, agreed @nociale. Two other arguments which set the proportion of missing data may also be helpful:

nociale commented 2 years ago

@wolbersm @gowerc I have been thinking a bit about the implementation of the function to simulate data. I would like to agree with you on the general set-up and user interface before implementing it. Here is a first proposal, though it probably needs improvement.

#' @title Generate data
#' 
#' @description Generate data for a two-arm clinical trial with a longitudinal continuous outcome and one intercurrent event (ICE).
#'
#' @param mu_c Numeric vector indicating the mean outcome of the control arm assuming no ICE. Should include the outcome at baseline.
#' @param sigma_c Covariance matrix of the outcome from the control arm assuming no ICE.
#' @param mu_t Numeric vector indicating the mean outcome of the treatment arm assuming no ICE. Should include the outcome at baseline.
#' @param sigma_t Covariance matrix of the outcome from the treatment arm assuming no ICE.
#' @param n_c Number of subjects belonging to the control arm.
#' @param n_t Number of subjects belonging to the treatment arm.
#' @param prob_ice_c Numeric vector that specifies the probability of experiencing the ICE at each visit for a patient in the control arm with outcome equal to the control mean at baseline.
#' @param prob_ice_t Numeric vector that specifies the probability of experiencing the ICE at each visit for a patient in the treatment arm with outcome equal to the treatment mean at baseline.
#' @param or_outchg_c Numeric value that specifies the odds ratio corresponding to a worsening in the outcome in the control arm. See details.
#' @param or_outchg_t Numeric value that specifies the odds ratio corresponding to a worsening in the outcome in the treatment arm. See details.
#' @param model Optional. Right-hand side formula object that specifies the model for the probability of experiencing the ICE. See details.
#' @param model_coef_c Optional. Numeric vector that specifies the coefficients of the model for the probability of experiencing the ICE in the control arm. See details.
#' Must contain one coefficient for each variable included in the model. Needed only if `model` is specified.
#' @param model_coef_t Optional. Numeric vector that specifies the coefficients of the model for the probability of experiencing the ICE in the treatment arm. See details.
#' Must contain one coefficient for each variable included in the model. Needed only if `model` is specified.
#' @param drop_out Numeric value that specifies the drop-out rate following the ICE.
#' @param post_ice_traj Character vector that specifies the assumption about the post-ICE trajectory.
#' Possible choices are: Missing At Random `MAR`, Jump to Reference `JR`,
#' Copy Reference `CR`, Copy Increments in Reference `CIR`, Last Mean Carried Forward `LMCF`.
#' Multiple choices are allowed.
#' @param prob_miss Numeric value that specifies the probability that a given observation is missing. Can be used to produce
#' "intermittent" missing values (which are missing completely at random).
#'
#' @details 
#' The data generation works as follows:
#' 
#' - Generate data from a multivariate normal distribution with parameters `mu_c` and `sigma_c`
#' for the control arm and parameters `mu_t` and `sigma_t` for the treatment arm.
#' - Simulate the ICE according to the given logistic model for the probability of experiencing the ICE.
#' - Simulate drop-out after the ICE. The drop-out is conditional on the ICE and is simulated completely at random.
#' - Adjust trajectory after the ICE according to the given assumption expressed with the `post_ice_traj` argument.
#' 
#' If `model` is **not** specified, the following default model for the probability of experiencing the ICE is used:
#' `~ 1 + I(visit == 1) + ... + I(visit == n_visits) + I((x-alpha)/beta)` where:
#' - `n_visits` is the number of visits.
#' - `alpha = mu_c[1]` or `alpha = mu_t[1]`: `alpha` is the baseline outcome mean in the control arm if the subject belongs to the control arm. Otherwise it is the baseline outcome mean in the treatment arm.
#' - `beta = mu_c[n_visits] - mu_c[1]` or `beta = mu_t[n_visits] - mu_t[1]`: `beta` is the difference between the mean outcome at the last visit and at baseline in the control arm if the subject belongs to the control arm.
#' Otherwise it is the difference between the mean outcome at the last visit and at baseline in the treatment arm.
#' The term `I((x-alpha)/beta)` specifies the dependency of the probability of the ICE on the current outcome value.
#' The corresponding coefficient is `log(or_outchg_c)` (or `log(or_outchg_t)`), which represents the increase in the log-odds of the ICE
#' due to a worsening in the outcome from baseline equal to `beta`. `or_outchg_c` (or `or_outchg_t`) is the odds ratio corresponding to such a worsening in the outcome.
#' A larger value indicates a larger probability of experiencing the ICE due to a worsening in the outcome.
#' 
#' Alternatively, the user can provide the model for the probability of experiencing the ICE by specifying `model`, `model_coef_c` and `model_coef_t`.
#' 
#' @returns A `data.frame` containing the simulated data. If multiple assumptions about post-ICE data are provided,
#' the output includes a separate column with the outcome values for each assumption.
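
The first three steps of the data generation described in the details section could be sketched roughly as follows for a single arm. This is only an illustration under stated assumptions, not the rbmi implementation: `simulate_arm` is a hypothetical helper, `prob_ice` is assumed to hold one probability per post-baseline visit, and `x` in the default model is assumed to be the outcome at the visit in question. The post-ICE trajectory adjustment (step 4) is left out.

```r
# Hypothetical helper (not part of rbmi): simulate outcomes, ICE and drop-out for one arm.
simulate_arm <- function(n, mu, sigma, prob_ice, or_outchg, drop_out) {
    n_times <- length(mu)                        # baseline plus post-baseline visits
    stopifnot(length(prob_ice) == n_times - 1)   # assumption: one probability per post-baseline visit

    # Step 1: multivariate normal outcomes via the Cholesky factor of sigma.
    z <- matrix(rnorm(n * n_times), nrow = n)
    y <- sweep(z %*% chol(sigma), 2, mu, "+")

    # Step 2: ICE from the default logistic model,
    # logit(p) = logit(prob_ice[visit]) + log(or_outchg) * (x - alpha) / beta.
    alpha <- mu[1]
    beta <- mu[n_times] - mu[1]
    x_std <- (y[, -1, drop = FALSE] - alpha) / beta
    lin_pred <- sweep(log(or_outchg) * x_std, 2, qlogis(prob_ice), "+")
    event <- matrix(rbinom(length(lin_pred), 1, plogis(lin_pred)), nrow = n)
    # The ICE occurs at the first post-baseline visit with an event (NA if none).
    ice_visit <- apply(event, 1, function(e) {
        if (any(e == 1)) which(e == 1)[1] else NA_integer_
    })

    # Step 3: drop-out is conditional on the ICE and otherwise completely at random.
    has_ice <- !is.na(ice_visit)
    dropped <- has_ice & (rbinom(n, 1, drop_out) == 1)

    list(y = y, ice_visit = ice_visit, dropped = dropped)
}

set.seed(1)
arm_c <- simulate_arm(
    n = 100,
    mu = c(50, 48, 46, 44),
    sigma = diag(4) * 4 + 1,
    prob_ice = c(0.05, 0.05, 0.10),
    or_outchg = 2,
    drop_out = 0.5
)
table(arm_c$ice_visit, useNA = "ifany")
```

With this parameterisation, a subject whose outcome equals the baseline mean has exactly the ICE probability given in `prob_ice`, which matches the description of `prob_ice_c` and `prob_ice_t` above.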

General question:

Model for the probability of the ICE. We have two possible implementations: (1) a fully user-specified model (using the model, model_coef_c and model_coef_t arguments), or (2) a default model with user-specified probabilities and odds ratios (using the arguments prob_ice_c, prob_ice_t, or_outchg_c and or_outchg_t). I would like to know which you think would be better from a user perspective, and whether we should allow both.
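
To make the comparison concrete, here is a minimal sketch of how option (1) could evaluate a user-supplied formula and coefficient vector, and how the default model of option (2) can be written as a special case of it. The helper `ice_prob_from_model` and the column names `visit` and `x` are hypothetical, and the sketch assumes one coefficient per column of the design matrix. It also drops the intercept so that each visit gets its own log-odds term.

```r
# Hypothetical helper (not part of rbmi): ICE probabilities from a
# right-hand side formula and one coefficient per design-matrix column.
ice_prob_from_model <- function(model, coef, data) {
    design <- model.matrix(model, data = data)
    stopifnot(length(coef) == ncol(design))
    as.vector(plogis(design %*% coef))
}

# Two subjects at three post-baseline visits; x is the standardised
# outcome (x - alpha) / beta from the default model.
dat <- data.frame(
    visit = factor(c(1, 2, 3, 1, 2, 3)),
    x = c(0, 0.5, 1, 0, -0.5, -1)
)

# Option (2) expressed through option (1): visit-specific probabilities
# on the logit scale plus the log odds ratio for the outcome term.
prob_ice <- c(0.05, 0.05, 0.10)
or_outchg <- 2
p_ice <- ice_prob_from_model(
    ~ 0 + visit + x,
    coef = c(qlogis(prob_ice), log(or_outchg)),
    data = dat
)
round(p_ice, 3)
```

If both interfaces were offered, the default arguments could simply be translated into a formula and coefficients in this way, so that only one code path needs to compute the probabilities.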

Thanks!

wolbersm commented 2 years ago

Hi @nociale

I like it!

Comments:

wolbersm commented 2 years ago

Hi @nociale

Just one further thought: For the advanced vignette, it would be very nice to have a simulated dataset with two different types of ICEs.

I thought it should be relatively easy to enhance this simulation function as follows:

What do you think about this?

Best, Marcel

nociale commented 2 years ago

@wolbersm I like this idea! It would allow us to simulate trials with two ICEs without complicating the implementation. Just three points to confirm. Are the following true?

  1. The drop-out for ICE2 is simulated independently of the outcome value.
  2. The drop-out governed by prob_dropout_c and prob_dropout_t is simulated as missing completely at random.
  3. prob_dropout_c and prob_dropout_t affect only pre-ICE1 visits (since we have an ad-hoc parameter prob_post_ice1_dropout for drop-out following ICE1).

Best regards, Alessandro

wolbersm commented 2 years ago

@nociale Thanks!

  1. Yes, independently. I would call this "uninformative drop-out" or similar because it only triggers an ICE corresponding to treatment discontinuation if it occurs while the subject is still on treatment (see also 3. below).
  2. Yes, if you mean that they are simulated according to an independent binomial with the same p at each visit.
  3. I think they could also affect post-ICE1 visits, e.g. if the subject had ICE1 but did not drop out directly after ICE1 (as per prob_post_ice1_dropout), they could subsequently drop out while off treatment due to "additional drop-out" guided by prob_dropout_c or prob_dropout_t (simulated completely independently).
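
The drop-out logic agreed in the three points above might be sketched as follows. This is a rough illustration only: `simulate_dropout` is a hypothetical helper, visits are indexed from 1, and the sketch assumes that when the uninformative drop-out falls on the same visit as ICE1, ICE1 takes precedence and no ICE2 is recorded.

```r
# Hypothetical helper (not part of rbmi): combine drop-out directly after
# ICE1 with uninformative drop-out that may trigger ICE2.
simulate_dropout <- function(ice1_visit, n_visits, prob_dropout,
                             prob_post_ice1_dropout) {
    n <- length(ice1_visit)
    has_ice1 <- !is.na(ice1_visit)

    # Uninformative drop-out: independent Bernoulli draw with the same p
    # at each visit; the first success is the drop-out visit (NA if none).
    flags <- matrix(rbinom(n * n_visits, 1, prob_dropout), nrow = n)
    uninf_visit <- apply(flags, 1, function(f) {
        if (any(f == 1)) which(f == 1)[1] else NA_integer_
    })

    # Drop-out directly after ICE1, only for subjects who had ICE1.
    ice1_dropout <- has_ice1 & rbinom(n, 1, prob_post_ice1_dropout) == 1
    ice1_drop_visit <- ifelse(ice1_dropout, ice1_visit, NA_integer_)

    # The subject drops out at whichever mechanism fires first.
    dropout_visit <- pmin(uninf_visit, ice1_drop_visit, na.rm = TRUE)

    # ICE2 only if the uninformative drop-out occurs while still on
    # treatment, i.e. strictly before ICE1 or when there is no ICE1.
    ice2 <- !is.na(uninf_visit) & (!has_ice1 | uninf_visit < ice1_visit)

    data.frame(ice1_visit, dropout_visit, ice2)
}

set.seed(123)
simulate_dropout(
    ice1_visit = c(2L, NA, 1L, NA, 3L),
    n_visits = 4,
    prob_dropout = 0.1,
    prob_post_ice1_dropout = 0.5
)
```

Subjects who had ICE1 but did not drop out directly afterwards can still drop out later through the uninformative mechanism, as described in point 3, without this being counted as ICE2.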

nociale commented 2 years ago

I see. Thanks for the answers, everything seems clear to me now.