kgoldfeld / simstudy

simstudy: Illuminating research methods through data generation
https://kgoldfeld.github.io/simstudy/
GNU General Public License v3.0
81 stars 7 forks source link

Use formula to simulate categorical variable #228

Closed lorenzoFabbri closed 5 months ago

lorenzoFabbri commented 5 months ago

I am working on a simulation study for which I need to known the treatment generating mechanism. That is, given a set of variables, I want to simulate the treatment variable based on a known parametric form:

def <- simstudy::defData(
    varname = "variable1",
    dist = "categorical",
    formula = "0.2;0.5;0.3",
    variance = "low;middle;high"
  )
  def <- simstudy::defData(
    dtDefs = def,
    varname = "variable2",
    dist = "normal",
    formula = 35,
    variance = 15
  )
def <- simstudy::defData(
    dtDefs = def,
    varname = "treatment",
    dist = "categorical",
    formula = "variable1 + log(variable2)",
    variance = "level1;level2;level3"
  )

To the best of my knowledge, this is not possible since for formula I need to pass the probabilities of each level. Is there a workaround? I need to know how the treatment was generated to test some estimators, so I'd do something along the lines of:

mod <- glm(formula = variable1 + log(variable2))

I believe I'd have to obtain the probabilities for each level given a multinomial model, but I ask for confirmation.

kgoldfeld commented 5 months ago

Lorenzo - I am not fully clear on what the treatment assignment mechanism is. Can you describe the probabilities of treatment assignment with an equation (as a function of variable 1 and variable 2)? Or in words. Feel free to e-mail me directly at keith.goldfeld@nyulangone.org.

lorenzoFabbri commented 5 months ago

I apologize!

Let's assume that we have two confounding variables, $C1$ and $C2$. The treatment variable is $X$ and it has 3 levels, and the treatment generating mechanism is given by $X \sim C1 + log(C2)$. So if we wanted to model $X$, and if we were to know the treatment generating mechanism, we would fit a model $X \sim C1 + log(C2)$ (for instance with a multinomial logistic model). So I'd like to generate a categorical variable using the formula $C1 + log(C2)$. I hope now it's clear! Thank you.

kgoldfeld commented 5 months ago

You have written the probability of $X$ as a binary outcome. There need to be at least two probabilities specified. In this post, I generate data from a multinomial distribution (before fitting a model). Take a look and see if it helps.

lorenzoFabbri commented 5 months ago

This is probably what I need, thank you. I do have some questions now. In the example, you:

I am trying to replicate your example but with my own variables:

def <- simstudy::defData(
    varname = "sex",
    dist = "binary",
    link = "logit",
    formula = 0.5
  )
  def <- simstudy::defData(
    dtDefs = def,
    varname = "age",
    dist = "normal",
    formula = 10,
    variance = 2
  )
  def <- simstudy::defData(
    dtDefs = def,
    varname = "eth",
    dist = "categorical",
    formula = "0.4;0.3;0.2;0.1",
    variance = "1;2;3;4"
  )
  def <- simstudy::defData(
    dtDefs = def,
    varname = "edu",
    dist = "categorical",
    formula = "0.2;0.5;0.3",
    variance = "1;2;3"
  )
  def <- simstudy::defData(
    dtDefs = def,
    varname = "sep",
    dist = "categorical",
    formula = "0.2;0.5;0.3",
    variance = "1;2;3"
  )
  def <- simstudy::defData(
    dtDefs = def,
    varname = "weight",
    dist = "normal",
    formula = 35,
    variance = 15
  )

  # Generate exposure given confounders
  def <- simstudy::defData(
    dtDefs = def,
    varname = "exposure",
    dist = "categorical",
    link = "logit",
    formula = "1 + 0.1*sex*age + 0.5*I(age^2) - 1.3*eth + 0.7*edu + 0.2*sep + 0.3*log(weight);
               -1.3 + 0.3*sex*age + 0.5*I(age^2) - 1.7*eth + 0.7*edu + 0.2*sep + 0.9*log(weight)"
  )

  # Simulate data
  ret <- simstudy::genData(
    n = num_obs,
    dtDefs = def
  )

It is not clear to me why in my case, exposure has only two levels rather than 3.

assignUser commented 5 months ago
 # Generate exposure given confounders
  def <- simstudy::defData(
    dtDefs = def,
    varname = "exposure",
    dist = "categorical",
    link = "logit",
    formula = "1 + 0.1*sex*age + 0.5*I(age^2) - 1.3*eth + 0.7*edu + 0.2*sep + 0.3*log(weight);
               -1.3 + 0.3*sex*age + 0.5*I(age^2) - 1.7*eth + 0.7*edu + 0.2*sep + 0.9*log(weight)"
  )

Your definition for the categorical formula only provides 2 probabilities where as the definition in the blog post has 3:

defY <- defDataAdd(defY, varname = "y", 
  formula = "b_j - 1.3 + 0.1*A - 0.3*x1 - 0.5*x2 + .55*A*period;
             b_j - 0.6 + 1.4*A + 0.2*x1 - 0.5*x2;
             -0.3 - 0.3*A - 0.3*x1 - 0.5*x2 ", 
  dist = "categorical", link = "logit")

See also https://kgoldfeld.github.io/simstudy/articles/simstudy.html#categorical

lorenzoFabbri commented 5 months ago

Thank you but in the example, Y has four levels. I understood we have to provide K-1 formulas if the variable has `K levels. It is also not clear how to make sure the probabilities sum to 1 when using complex data generating mechanisms.

kgoldfeld commented 5 months ago

Let me jump back in. If you provide 2 probabilities that sum up to less than 1, a third category is automatically added - so two probabilities can in fact lead to 3 categories. The problem you are having is that both of your probabilities are actually 1 (if you look at the formulas you specified and take the inverse logit, you will see why). Since both are 1, the data generation actually is assigning 50% probability to the two outcomes.

And good catch on the documentation. Categorical data can be defined with the logit link.

lorenzoFabbri commented 5 months ago

Let me jump back in. If you provide 2 probabilities that sum up to less than 1, a third category is automatically added - so two probabilities can in fact lead to 3 categories. The problem you are having is that both of your probabilities are actually 1 (if you look at the formulas you specified and take the inverse logit, you will see why). Since both are 1, the data generation actually is assigning 50% probability to the two outcomes.

And good catch on the documentation. Categorical data can be defined with the logit link.

Okay makes sense. I apologize for this basic question, but with such complex generating mechanisms (especially for variables with many levels), how can we make sure that the sum of the probabilities is indeed 1?

kgoldfeld commented 5 months ago

You need to be careful in your data generating process. If you add a couple of extra lines, you can see the probabilities that you are generating - you generally don't want to create a scenario where the probabilities are crazy high. Looking at your the probabilities your formulas are generating, you would see that the probabilities exceed 50! Here is a simpler example:

library(simstudy)
library(data.table)

def <- 
 defData(varname = "sex", dist = "binary", link = "logit", formula = 0.5) |>
 defData(varname = "p1", dist = "nonrandom", link = "logit", formula = "-0.3 + .2*sex") |>
 defData(varname = "p2", dist = "nonrandom", link = "logit", formula = "-1.3 + 0.3*sex") |>
 defData(varname = "exposure", dist = "categorical", formula = "p1;p2")

# Simulate data

set.seed(123)

ret <- genData(n = 10000, dtDefs = def)
#> Warning: Probabilities do not sum to 1. Adding category to all rows!

ret
#> Key: <id>
#>           id   sex        p1        p2 exposure
#>        <int> <int>     <num>     <num>    <int>
#>     1:     1     1 0.4750208 0.2689414        2
#>     2:     2     0 0.4255575 0.2141650        1
#>     3:     3     1 0.4750208 0.2689414        2
#>     4:     4     0 0.4255575 0.2141650        2
#>     5:     5     0 0.4255575 0.2141650        1
#>    ---                                         
#>  9996:  9996     1 0.4750208 0.2689414        3
#>  9997:  9997     0 0.4255575 0.2141650        1
#>  9998:  9998     1 0.4750208 0.2689414        2
#>  9999:  9999     1 0.4750208 0.2689414        2
#> 10000: 10000     0 0.4255575 0.2141650        1

ret[, table(exposure)]
#> exposure
#>    1    2    3 
#> 4539 2520 2941