sambrilleman / simsurv

Simulate Survival Data
GNU General Public License v3.0

Importance of Simulating Survival Times? #16

Closed swaheera closed 1 year ago

swaheera commented 1 year ago

Hello Dr. Brilleman,

I was reading about your R package on simulating survival times (https://cran.r-project.org/web/packages/simsurv/index.html)

I have heard that simulation can be used to evaluate properties relating to the "robustness" and "misspecification" of a statistical model.

I was then reading your reference (https://www.jstatsoft.org/article/view/v097i03), which states: "When conducting simulation studies to evaluate the performance of new and existing statistical methods for analyzing survival data, one is required to simulate event times under a known data generating model. Similarly, one may need to simulate event times for the purpose of power calculations when designing new studies."

Given this information, I just wanted to clarify the following point:

This being said, I am trying to understand why exactly this is important and useful:

Your Help Is Greatly Appreciated, Thanks,

sambrilleman commented 1 year ago

Hi @swaheera - some answers inline below, hope they help!

Are we ever required to simulate data from the same distribution as the model itself? Or does this by definition provide no new information to us?

You might want to do this to check that your analysis model implementation can recover the "true" parameters of the data generating model that you used to simulate the data.
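A minimal sketch of that kind of check, written in plain Python rather than with simsurv itself. The exponential model, the sample sizes, and the helper names `simulate_exponential` and `fit_exponential` are all assumptions chosen for illustration: event times are simulated with a known rate, the same model is fitted by maximum likelihood, and the average estimate is compared with the truth.

```python
import random

def simulate_exponential(n, rate, censor_time, rng):
    """Simulate right-censored exponential event times as (time, event) pairs."""
    data = []
    for _ in range(n):
        t = rng.expovariate(rate)            # true event time
        observed = min(t, censor_time)       # administrative censoring
        event = 1 if t <= censor_time else 0
        data.append((observed, event))
    return data

def fit_exponential(data):
    """Exponential MLE of the rate: number of events / total follow-up time."""
    events = sum(d for _, d in data)
    exposure = sum(t for t, _ in data)
    return events / exposure

rng = random.Random(2023)
true_rate = 0.5

# Repeat the simulate-then-fit cycle and average the estimates.
estimates = [
    fit_exponential(simulate_exponential(500, true_rate, 5.0, rng))
    for _ in range(200)
]
mean_estimate = sum(estimates) / len(estimates)
bias = mean_estimate - true_rate
print(f"true rate = {true_rate}, mean estimate = {mean_estimate:.4f}, bias = {bias:.4f}")
```

If the bias were far from zero here, that would point to a problem in the fitting code (or the simulation code) rather than in the model.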

As an example, suppose I decide to fit a specific parametric Survival Model (e.g. AFT - Accelerated Failure Time Model) to some data I collected. I read that there are methods that use the "Inverse Probability Transform Method" that can then simulate Survival Times from this same "data-dependent probability distribution function" corresponding to my model. Such methods also exist in the semi-parametric case, where it becomes significantly more difficult to simulate Survival Times from a probability distribution that has not been explicitly defined.

Sorry I'm not familiar with the "Inverse Probability Transform Method"!
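For context, the term in the question most likely refers to what is usually called inverse transform sampling: draw U uniformly on (0, 1) and solve S(T) = U for T. The sketch below assumes a Weibull survival function S(t) = exp(-lam * t**gamma); the parameter values and the function name `weibull_inverse_survival` are hypothetical, picked only to illustrate the idea.

```python
import math
import random

def weibull_inverse_survival(u, lam, gamma):
    """Solve S(t) = exp(-lam * t**gamma) = u for t."""
    return (-math.log(u) / lam) ** (1.0 / gamma)

rng = random.Random(1)
lam, gamma = 0.1, 1.5

# 1 - rng.random() lies in (0, 1], which avoids taking log(0).
times = [
    weibull_inverse_survival(1.0 - rng.random(), lam, gamma)
    for _ in range(20000)
]

# Sanity check: the empirical survival proportion at t0 should be close
# to the theoretical S(t0) of the distribution we sampled from.
t0 = 3.0
empirical = sum(t > t0 for t in times) / len(times)
theoretical = math.exp(-lam * t0 ** gamma)
print(f"empirical S({t0}) = {empirical:.3f}, theoretical = {theoretical:.3f}")
```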

If I simulate Survival Times from the same distribution corresponding to the model I just fit - won't my model be able to handle this data well by definition (i.e. "home field advantage")?

As mentioned above - yes - and this is why you would do it - you want to check that the "home field advantage" is such that your analysis model can recover the true parameters, and therefore your implementation of the analysis model is correct.

Why is it important to simulate data from the same distribution as the model - would it not almost always be more useful to simulate Survival Times from a different distribution and see how well my model adapts (i.e. how close the predictions are) to this new data?

As above. But yes, using a data generating model that is different from your analysis model is also a useful endeavour - it allows you to assess the impact of model misspecification, e.g. on properties such as bias. So, it really depends on the goals of your simulation study...
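A rough sketch of a misspecification study, again in plain Python and under assumed settings: data are generated from a Weibull model with shape 2 and unit scale (no censoring), but an exponential model is fitted. The quantity examined, the survival probability at `t0 = 0.5`, and all numeric values are choices made for illustration only.

```python
import math
import random

rng = random.Random(42)
shape, t0 = 2.0, 0.5

# True survival under the data generating model: S(t) = exp(-t**shape).
true_surv = math.exp(-t0 ** shape)

estimates = []
for _ in range(200):
    # random.weibullvariate(alpha, beta): alpha is the scale, beta the shape.
    times = [rng.weibullvariate(1.0, shape) for _ in range(300)]
    # Misspecified analysis model: exponential MLE (no censoring),
    # i.e. number of events divided by total time at risk.
    rate_hat = len(times) / sum(times)
    estimates.append(math.exp(-rate_hat * t0))

bias = sum(estimates) / len(estimates) - true_surv
print(f"true S({t0}) = {true_surv:.3f}, bias of exponential fit = {bias:.3f}")
```

Because the exponential model cannot represent an increasing hazard, the estimated survival probability stays systematically too low however many datasets are simulated.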

swaheera commented 1 year ago

@sambrilleman : thank you so much for your kind answers!

swaheera commented 1 year ago

One other question I wanted to ask - in general, is it not possible to simulate data from a Cox-PH model without making some assumption about the Baseline Hazard?

E.g. if I fit a Cox-PH model to some data, I am NOT required to specify a distribution for the Baseline Hazard Function. However, if I want to simulate data from this same model, then I NEED to specify a distribution for the Baseline Hazard Function. Is this correct?

Thank you so much!

sambrilleman commented 1 year ago

Yeah that's correct.
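To illustrate why, here is a small sketch (plain Python, not the simsurv API) of drawing event times under proportional hazards, h(t | x) = h0(t) * exp(beta * x). The simulator cannot run without an inverse cumulative baseline hazard, and two different baselines with the same `beta` give very different event times. The function names and parameter values are hypothetical.

```python
import math
import random
import statistics

def simulate_ph(x, beta, inv_cum_baseline, rng):
    """Draw one event time from h(t | x) = h0(t) * exp(beta * x).

    S(t | x) = exp(-H0(t) * exp(beta * x)), so setting S = U gives
    T = H0^{-1}(-log(U) / exp(beta * x)).
    """
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return inv_cum_baseline(-math.log(u) / math.exp(beta * x))

def inv_exponential(h, lam=0.1):
    """Inverse of the exponential baseline H0(t) = lam * t."""
    return h / lam

def inv_weibull(h, lam=0.1, gamma=2.0):
    """Inverse of the Weibull baseline H0(t) = lam * t**gamma."""
    return (h / lam) ** (1.0 / gamma)

rng = random.Random(7)
beta = 0.7  # same log hazard ratio under both baselines

# Median event time for the reference group (x = 0) under each baseline.
median_exp = statistics.median(
    simulate_ph(0, beta, inv_exponential, rng) for _ in range(5000)
)
median_wei = statistics.median(
    simulate_ph(0, beta, inv_weibull, rng) for _ in range(5000)
)
print(f"median (exponential baseline) = {median_exp:.2f}")
print(f"median (Weibull baseline)     = {median_wei:.2f}")
```

Fitting a Cox model to either simulated dataset would target the same `beta`, since the partial likelihood does not involve h0(t), but generating the times in the first place requires choosing it.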

No worries, happy to have helped. 🙂
