sambrilleman / simsurv

Simulate Survival Data
GNU General Public License v3.0

Idea: Attempting to Recreate a Dataset using only the Results of a Cox-PH Model #17

Open swaheera opened 1 year ago

swaheera commented 1 year ago

Suppose I fit a Survival Cox-PH Regression Model in R and get the following results:

Call:
coxph(formula = Surv(time, status) ~ age + sex + ph.ecog, data = lung)

             coef exp(coef)  se(coef)      z        p
age      0.011067  1.011128  0.009267  1.194 0.232416
sex     -0.552612  0.575445  0.167739 -3.294 0.000986
ph.ecog  0.463728  1.589991  0.113577  4.083 4.45e-05

Likelihood ratio test=30.5  on 3 df, p=1.083e-06
n= 227, number of events= 164 
   (1 observation deleted due to missingness)
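
The columns of this output are tied together by simple arithmetic, which can be checked from the printed numbers alone. The sketch below copies `coef` and `se(coef)` from the summary (the names `b_hat` and `se` are just illustrative) and recomputes the hazard ratios, Wald z statistics and p-values. It also adds 95% Wald intervals, which are not part of the printed output.

```r
# Values copied by hand from the coxph() summary above
b_hat <- c(age = 0.011067, sex = -0.552612, ph.ecog = 0.463728)
se    <- c(age = 0.009267, sex = 0.167739, ph.ecog = 0.113577)

hr <- exp(b_hat)            # hazard ratios: the exp(coef) column
z  <- b_hat / se            # Wald z statistics: the z column
p  <- 2 * pnorm(-abs(z))    # two-sided p-values: the p column

# 95% Wald confidence intervals for the hazard ratios (not printed above)
ci <- exp(cbind(lower = b_hat - qnorm(0.975) * se,
                upper = b_hat + qnorm(0.975) * se))

round(cbind(hr, z, p, ci), 4)
```

Small differences in the last digit are expected because the inputs are already rounded.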

Based on these results, I can infer information such as:

My Question: Given this information, is it possible to simulate the covariate and response information for n = 227 such observations, such that if a similar Cox-PH model were fit to these newly simulated 227 observations, the resulting regression coefficients would be approximately equal to the original regression coefficients? Can I try to "guess" (and recreate) a plausible set of observations that might have been observed, based on the regression model coefficients?

For example, I know that if I were to "fix" the covariate information for a group of n = 227 arbitrarily created patients, I could then simulate their survival times (e.g. with https://cran.r-project.org/web/packages/simsurv/index.html). However, if I were to then fit a Cox-PH model to these observations, the model coefficients would not necessarily be close to the original model coefficients.
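
A minimal sketch of that workflow, under loudly stated assumptions: the covariate distributions, the Weibull baseline (`lambdas`, `gammas`) and the administrative censoring time `maxt` are all invented for illustration and do not come from the original data. The reported coefficients are passed to `simsurv()` as the true `betas`, so the refitted estimates should land near them only up to sampling variability at n = 227.

```r
library(simsurv)
library(survival)

set.seed(1)
n <- 227

# ASSUMPTION: covariate distributions are made up for illustration only
covs <- data.frame(
  id      = 1:n,
  age     = round(rnorm(n, mean = 62, sd = 9)),
  sex     = sample(1:2, n, replace = TRUE),
  ph.ecog = sample(0:3, n, replace = TRUE, prob = c(0.30, 0.50, 0.19, 0.01))
)

# Coefficients copied from the model summary, used as the "true" effects
betas <- c(age = 0.011067, sex = -0.552612, ph.ecog = 0.463728)

# ASSUMPTION: Weibull baseline hazard with arbitrary parameters,
# and censoring of everyone still at risk at maxt = 1000
sim <- simsurv(dist = "weibull", lambdas = 0.001, gammas = 1,
               betas = betas, x = covs, maxt = 1000)

# simsurv() returns id, eventtime and status; join back to the covariates
dat <- merge(covs, sim, by = "id")
fit <- coxph(Surv(eventtime, status) ~ age + sex + ph.ecog, data = dat)

cbind(original = betas, refit = coef(fit))
```

Changing the seed gives a different dataset and different refitted coefficients, which is one way to see how loosely a single summary pins down the data.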

In general, is this possible to do? Given only the above model summary, could I try to somehow generate the original dataset that this model was trained on?

Thanks!

Note: I realize there are probably an infinite number of n = 227 samples that can be randomly simulated such that a Cox-PH Model produces the same regression coefficient estimates as above.
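
One concrete way to see this non-uniqueness: the Cox partial likelihood depends on the event times only through their ordering, so any strictly increasing transformation of `time` yields a different dataset with the same coefficients. The sketch below assumes the `lung` data in the call above is the one shipped with the `survival` package. The transformation `sqrt(time) + 5` is an arbitrary choice.

```r
library(survival)

fit_orig <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)

# A strictly increasing transformation keeps the ordering (and ties) of times
lung_alt <- lung
lung_alt$time <- sqrt(lung$time) + 5

fit_alt <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung_alt)

# The survival times differ, but the fitted coefficients are expected to agree
cbind(original = coef(fit_orig), transformed = coef(fit_alt))
all.equal(coef(fit_orig), coef(fit_alt))
```

So the coefficients alone cannot recover the time scale of the original data, let alone the individual observations.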

swaheera commented 1 year ago

Possible Pseudocode:

I know this is a very abstract and roundabout approach that might not have any mathematical validity, but I was curious to know whether such an application might be logical.

Note: I realize that it's entirely possible that I happen to simulate a dataset in which all the ages are concentrated around 20-25 years old when in fact the real participants in this dataset were senior citizens. Based on the other simulated variables, a Cox-PH model fit to this simulated data might still coincidentally reproduce the results of the original model, thus rendering this approach useless.
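
This concern can be made concrete. Shifting a covariate by a constant only rescales the baseline hazard, so the Cox coefficients do not change, and a much younger cohort can reproduce the same estimates exactly. The sketch below again assumes the `lung` data from the `survival` package. The 19-year shift and the name `lung_young` are hypothetical choices for illustration.

```r
library(survival)

fit_orig <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)

# Hypothetical younger cohort: every age shifted down by the same constant
lung_young <- lung
lung_young$age <- lung$age - 19

fit_young <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung_young)

range(lung$age)
range(lung_young$age)

# The constant shift is absorbed by the baseline hazard
all.equal(coef(fit_orig), coef(fit_young))
```

The coefficient summary therefore carries no information about where the age distribution is located, only about how the hazard changes per year of age.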