therneau / survival

Survival package for R

Discrepancy in header information when using time variables with start and end format? #257

Closed camilagazolla closed 3 months ago

camilagazolla commented 3 months ago

Hello,

First of all, thank you for your hard work in developing and maintaining the survival package. It has been incredibly valuable for my research.

I am trying to run an Extended Cox Model with "INCIDENT_LIVERDIS" as the outcome and "high_richness_group" as the main predictor, using a time-dependent covariate (INCIDENT_DIAB_T2).

My data frame (5375x26) is structured with time variables in the "start and end format", meaning one participant can have two rows:

Here you see a sample of a participant with two rows (indicating both INCIDENT_LIVERDIS and INCIDENT_DIAB_T2 detected):

For reference, in this dataset we have:

Participants: N = 5339 INCIDENT_LIVERDIS: N = 105 INCIDENT_DIAB_T2: N = 631

I fitted the model using the “id” argument from coxph:

What concerns me is that the header information seems inaccurate even though I set the “id” argument. It states that the number of events is 120 and N = 5375.

I would greatly appreciate your guidance on the following:

Thank you very much for your time and assistance! :)

R version 4.3.1 (2023-06-16) survival_3.5-8

bethatkinson commented 3 months ago

So I read this as stating you have 5375 observations in the dataset with the time-dependent cohort (rows, not unique people) and there are 120 with INCIDENT_LIVERDIS. All the id variable does is specify that the robust variance should be used if there is more than one event/person. If there is only 1 event/person then you don't need it.
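As a minimal sketch of the point above, the counts reported by coxph can be compared against the data directly. The example uses the `cgd` dataset that ships with the survival package as a stand-in, since the poster's data is not available; `n_rows`, `n_ids`, and `n_events` are illustrative names only.

```r
library(survival)

# cgd is already in (tstart, tstop] format, with several rows per subject
data(cgd, package = "survival")

fit <- coxph(Surv(tstart, tstop, status) ~ treat, data = cgd, id = id)

n_rows   <- nrow(cgd)               # rows of data used in the fit
n_ids    <- length(unique(cgd$id))  # distinct subjects
n_events <- sum(cgd$status)         # event rows

# "n" in the printout counts rows, not subjects, even when id is supplied
c(n = fit$n, nevent = fit$nevent, rows = n_rows, ids = n_ids, events = n_events)
```

If the event count in the start/stop dataset is larger than the count in the original one-row-per-person data, that would suggest an event was copied onto more than one row when the intervals were built.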

I did note that in all the examples you've shown, it looks like INCIDENT_DIAB_T2 was defined as having happened at or prior to baseline (in which case it isn't a time-dep variable).

I'm not sure what the purpose of having 2 lines here is if the event happened at 6.227 and DIAB was diagnosed at baseline.

Tstart tstop INCIDENT_LIVERDIS INCIDENT_DIAB_T2
0 3.687 0 1
3.687 6.227 1 1

Perhaps you wanted something where the person didn't have diab from 0-3.687, then did have diab from 3.687 to 6.227, and had the event at 6.227?

Tstart tstop INCIDENT_LIVERDIS INCIDENT_DIAB_T2
0 3.687 0 0
3.687 6.227 1 1

So quick answer, I think your data isn't set up quite correctly but the cox model output looks ok (assuming you have 120 events in your dataset). If you have more events than in the initial dataset, then you may have additional errors in your time-dep dataset.

Beth

camilagazolla commented 3 months ago

Hi @bethatkinson,

Thank you so much for your help!

Let me clarify: INCIDENT_DIAB_T2 and INCIDENT_LIVERDIS are indeed detections that occur after the baseline.

This should be read as follows:

Tstart Tstop INCIDENT_LIVERDIS INCIDENT_DIAB_T2 Description
0 3.687 0 1 we detected DIAB_T2 at 3.687 years after baseline
3.687 6.227 1 1 we also detected LIVERDIS for this participant at 6.227 years after baseline

I am not sure if this is the correct form.

What do you think?

Best,

bethatkinson commented 3 months ago

So at tstart=0 you don't know that they have diabetes, therefore incident_diab_t2 should be 0 on that row. Covariates are known at the beginning of the interval; events are known at the end.
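One way to get rows that follow this rule is to let `tmerge()` from the survival package build the intervals. The sketch below is only an illustration under assumptions: the data frame `base` and its columns (`futime`, `liver`, `diab_time`) are hypothetical, with one row per participant, and the second participant is invented to show someone with no diabetes diagnosis.

```r
library(survival)

# Hypothetical one-row-per-person data: follow-up time, liver event flag,
# and time of diabetes diagnosis (NA if never diagnosed during follow-up)
base <- data.frame(
  id        = c(1, 2),
  futime    = c(6.227, 5.0),
  liver     = c(1, 0),
  diab_time = c(3.687, NA)
)

td <- tmerge(
  base, base, id = id,
  liverdis = event(futime, liver),  # event is recorded at the END of an interval
  diab     = tdc(diab_time)         # covariate switches on at the START of the next one
)

td[, c("id", "tstart", "tstop", "liverdis", "diab")]
```

For participant 1 this should give diab = 0 on the interval ending at 3.687 and diab = 1 on the interval that ends with the liver event, which matches the corrected layout suggested earlier in the thread.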

therneau commented 3 months ago

It would help if you showed a full coxph call, i.e., coxph(Surv(tstart, tstop, incident.liver) ~ age + sex + diabetes, data= ... ). You have hidden your formula away, which is a very bad idea if you are asking for help. (And in my opinion, a bad idea in general. A year from now you won't know what "model_formula" is either, when reading your own code.)
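A small sketch of the difference, again using the package's `cgd` data as a stand-in for the poster's dataset (the covariates here are chosen only for illustration): when the formula is passed through a variable, the recorded call shows the variable name rather than the model.

```r
library(survival)
data(cgd, package = "survival")

# Formula hidden in a variable: the stored call only mentions "model_formula"
model_formula <- Surv(tstart, tstop, status) ~ treat + age
fit_hidden <- coxph(model_formula, data = cgd, id = id)

# Formula written out in the call: the printout documents the model itself
fit_explicit <- coxph(Surv(tstart, tstop, status) ~ treat + age,
                      data = cgd, id = id)

fit_hidden$call      # shows formula = model_formula
fit_explicit$call    # shows the full Surv(...) ~ treat + age
formula(fit_hidden)  # the actual formula can still be recovered from the fit
```

Both fits are the same model; only what a reader sees in the printed call differs.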

therneau commented 3 months ago

One further note: you will see another request just a bit older than yours, that I add "number of id" to the coxph printout, to augment the current "number of rows of data" value, which is what "n" is. Your confusion adds weight to their argument.