Closed lorenzoFabbri closed 5 months ago
Lorenzo - I am not fully clear on what the treatment assignment mechanism is. Can you describe the probabilities of treatment assignment with an equation (as a function of variable 1 and variable 2)? Or in words. Feel free to e-mail me directly at keith.goldfeld@nyulangone.org.
I apologize!
Let's assume that we have two confounding variables, $C1$ and $C2$. The treatment variable is $X$ and it has 3 levels, and the treatment generating mechanism is given by $X \sim C1 + log(C2)$. So if we wanted to model $X$, and if we were to know the treatment generating mechanism, we would fit a model $X \sim C1 + log(C2)$ (for instance with a multinomial logistic model). So I'd like to generate a categorical variable using the formula
$C1 + log(C2)$. I hope now it's clear! Thank you.
You have written the probability of $X$ as a binary outcome. There need to be at least two probabilities specified. In this post, I generate data from a multinomial distribution (before fitting a model). Take a look and see if it helps.
This is probably what I need, thank you. I do have some questions now. In the example, you:
I am trying to replicate your example but with my own variables:
def <- simstudy::defData(
varname = "sex",
dist = "binary",
link = "logit",
formula = 0.5
)
def <- simstudy::defData(
dtDefs = def,
varname = "age",
dist = "normal",
formula = 10,
variance = 2
)
def <- simstudy::defData(
dtDefs = def,
varname = "eth",
dist = "categorical",
formula = "0.4;0.3;0.2;0.1",
variance = "1;2;3;4"
)
def <- simstudy::defData(
dtDefs = def,
varname = "edu",
dist = "categorical",
formula = "0.2;0.5;0.3",
variance = "1;2;3"
)
def <- simstudy::defData(
dtDefs = def,
varname = "sep",
dist = "categorical",
formula = "0.2;0.5;0.3",
variance = "1;2;3"
)
def <- simstudy::defData(
dtDefs = def,
varname = "weight",
dist = "normal",
formula = 35,
variance = 15
)
# Generate exposure given confounders
def <- simstudy::defData(
dtDefs = def,
varname = "exposure",
dist = "categorical",
link = "logit",
formula = "1 + 0.1*sex*age + 0.5*I(age^2) - 1.3*eth + 0.7*edu + 0.2*sep + 0.3*log(weight);
-1.3 + 0.3*sex*age + 0.5*I(age^2) - 1.7*eth + 0.7*edu + 0.2*sep + 0.9*log(weight)"
)
# Simulate data
ret <- simstudy::genData(
n = num_obs,
dtDefs = def
)
It is not clear to me why in my case, exposure
has only two levels rather than 3.
# Generate exposure given confounders
def <- simstudy::defData(
dtDefs = def,
varname = "exposure",
dist = "categorical",
link = "logit",
formula = "1 + 0.1*sex*age + 0.5*I(age^2) - 1.3*eth + 0.7*edu + 0.2*sep + 0.3*log(weight);
-1.3 + 0.3*sex*age + 0.5*I(age^2) - 1.7*eth + 0.7*edu + 0.2*sep + 0.9*log(weight)"
)
Your definition for the categorical formula only provides 2 probabilities where as the definition in the blog post has 3:
defY <- defDataAdd(defY, varname = "y",
formula = "b_j - 1.3 + 0.1*A - 0.3*x1 - 0.5*x2 + .55*A*period;
b_j - 0.6 + 1.4*A + 0.2*x1 - 0.5*x2;
-0.3 - 0.3*A - 0.3*x1 - 0.5*x2 ",
dist = "categorical", link = "logit")
See also https://kgoldfeld.github.io/simstudy/articles/simstudy.html#categorical
Thank you but in the example, Y
has four levels. I understood we have to provide K-1
formulas if the variable has `K levels.
It is also not clear how to make sure the probabilities sum to 1 when using complex data generating mechanisms.
Let me jump back in. If you provide 2 probabilities that sum up to less than 1, a third category is automatically added - so two probabilities can in fact lead to 3 categories. The problem you are having is that both of your probabilities are actually 1 (if you look at the formulas you specified and take the inverse logit, you will see why). Since both are 1, the data generation actually is assigning 50% probability to the two outcomes.
And good catch on the documentation. Categorical data can be defined with the logit link.
Let me jump back in. If you provide 2 probabilities that sum up to less than 1, a third category is automatically added - so two probabilities can in fact lead to 3 categories. The problem you are having is that both of your probabilities are actually 1 (if you look at the formulas you specified and take the inverse logit, you will see why). Since both are 1, the data generation actually is assigning 50% probability to the two outcomes.
And good catch on the documentation. Categorical data can be defined with the logit link.
Okay makes sense. I apologize for this basic question, but with such complex generating mechanisms (especially for variables with many levels), how can we make sure that the sum of the probabilities is indeed 1?
You need to be careful in your data generating process. If you add a couple of extra lines, you can see the probabilities that you are generating - you generally don't want to create a scenario where the probabilities are crazy high. Looking at your the probabilities your formulas are generating, you would see that the probabilities exceed 50! Here is a simpler example:
library(simstudy)
library(data.table)
def <-
defData(varname = "sex", dist = "binary", link = "logit", formula = 0.5) |>
defData(varname = "p1", dist = "nonrandom", link = "logit", formula = "-0.3 + .2*sex") |>
defData(varname = "p2", dist = "nonrandom", link = "logit", formula = "-1.3 + 0.3*sex") |>
defData(varname = "exposure", dist = "categorical", formula = "p1;p2")
# Simulate data
set.seed(123)
ret <- genData(n = 10000, dtDefs = def)
#> Warning: Probabilities do not sum to 1. Adding category to all rows!
ret
#> Key: <id>
#> id sex p1 p2 exposure
#> <int> <int> <num> <num> <int>
#> 1: 1 1 0.4750208 0.2689414 2
#> 2: 2 0 0.4255575 0.2141650 1
#> 3: 3 1 0.4750208 0.2689414 2
#> 4: 4 0 0.4255575 0.2141650 2
#> 5: 5 0 0.4255575 0.2141650 1
#> ---
#> 9996: 9996 1 0.4750208 0.2689414 3
#> 9997: 9997 0 0.4255575 0.2141650 1
#> 9998: 9998 1 0.4750208 0.2689414 2
#> 9999: 9999 1 0.4750208 0.2689414 2
#> 10000: 10000 0 0.4255575 0.2141650 1
ret[, table(exposure)]
#> exposure
#> 1 2 3
#> 4539 2520 2941
I am working on a simulation study for which I need to known the treatment generating mechanism. That is, given a set of variables, I want to simulate the treatment variable based on a known parametric form:
To the best of my knowledge, this is not possible since for
formula
I need to pass the probabilities of each level. Is there a workaround? I need to know how the treatment was generated to test some estimators, so I'd do something along the lines of:I believe I'd have to obtain the probabilities for each level given a multinomial model, but I ask for confirmation.