mmcdermott / MEDS-DEV

Platform for Inductive Experiments over Medical Data
MIT License

How should we handle censoring / loss-of-follow-up / exclusion criteria for the 1y-5y MI task? #10

Open mmcdermott opened 3 weeks ago

mmcdermott commented 3 weeks ago

Task file: https://github.com/mmcdermott/MEDS-DEV/blob/main/src/MEDS-DEV/tasks/criteria/phenotyping/outpatient/MI/1y-5y.yaml

mmcdermott commented 3 weeks ago

@prockenschaub, I'm very comfortable changing the criteria for this task (or deciding we can't pursue this task if we can't appropriately handle competing risks with death and censoring due to loss of follow-up). What would you recommend?

prockenschaub commented 3 weeks ago

I think what irritates me about this task is the long follow-up. While binary classification is easily defensible for short-term outcomes like 30-day readmission, using it for follow-ups of up to 5 years seems less advisable. After all, this type of question is the poster child for time-to-event analysis. This may simply be a gut feeling, though, because the basic problems remain the same.

Treatment of competing risks in existing CVD prevention tools

Your linking the readmission task to a prospective use case helped me a lot in thinking about it, so I tried to do the same here. To me, this task feels a lot like the traditional CVD risk assessment models, e.g., SCORE2, endorsed by the European Society of Cardiology. The goal there is to estimate a patient's risk of developing CVD. Using this estimated risk, we can stratify patients and recommend high-risk patients for additional preventative care, closer management, etc.

If we accept that as the prospective use case, it is interesting to note that SCORE2 was developed using a Fine-Gray (FG) model, which --- unlike the Cox model, which models hazards --- directly models the relationship between covariates and the cumulative incidence function (CIF). There are two interesting aspects of this choice: (1) the model does not distinguish whether the CIF is reduced because a covariate lowers the risk of CVD specifically or because it merely increases the risk of dying before ever getting the chance to develop CVD, and (2) it is very similar to Option 2 in the readmission task if we do not allow for censoring other than through death (i.e., every patient has either $E=1$ or $M=1$, but never $E=0$ and $M=0$).
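To make precise what "directly models the CIF" means, here are the standard definitions in my own notation (not taken from the SCORE2 paper). The CIF for cause $k$ (here, CVD) is

$$F_k(t \mid x) = \Pr(T \le t, \text{cause} = k \mid x),$$

and the FG model places a proportional-hazards structure on the subdistribution hazard derived from it,

$$\lambda_k(t \mid x) = -\frac{\mathrm{d}}{\mathrm{d}t} \log\bigl\{ 1 - F_k(t \mid x) \bigr\} = \lambda_{k,0}(t) \exp(\beta^\top x),$$

so covariates act directly on the CIF, whereas Cox coefficients act on the cause-specific hazard and only indirectly on the CIF.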

Even though FG was used in this very popular score, it remains debatable whether it was the right choice to begin with. The likely reason it was chosen is that it links covariates directly to the CIF, which is particularly important for a simple scoring system calculated by hand but less important for a computerised decision support tool. In the words of Therneau, the creator of the R survival package:

The primary strength of the Fine-Gray model with respect to the Cox model approach is that if lifetime risk is a primary question, then the model has given us a simple and digestible answer [...] This simplicity is not without a price, however, and these authors are not proponents of the approach. [...] The attempt to capture a complex process as a single value is grasping for a simplicity that does not exist for many (perhaps most) data sets.

What other option might exist?

Pooled binary prediction

A method that has become very popular in causal inference is pooled logistic regression. Here, the follow-up time is divided into discrete bins (e.g., weeks or months) and the model predicts, for each bin, whether the event happens in that bin. For example, we could divide our follow-up into yearly bins (1st year, 2nd year, ...) and treat each one as a binary problem. Once an event has happened or the patient is lost to follow-up, no subsequent years are included. The total predicted risk for a patient can then be estimated as $1 - \prod_{t=1}^{5} (1 - \hat p_t)$.

This is a (sequential) binary task that captures a lot of the intricacies of time-to-event data, and it might be adapted to resemble either a Cox model or an FG model. However, it still adds some (necessary?) complexity that we may not want to have in the initial push.
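For concreteness, here is a minimal sketch of the expand-then-pool idea in Python. The covariates, column names, and toy data are all hypothetical and have nothing to do with how MEDS-DEV tasks are defined; it is only meant to illustrate the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def to_person_period(cohort: pd.DataFrame, n_bins: int = 5) -> pd.DataFrame:
    """Expand one row per patient into one row per (patient, yearly bin).

    `time` is the yearly bin (1..n_bins) in which the event or censoring
    occurred; `event` is 1 for an observed MI, 0 for censoring. Bins after
    the event/censoring bin are dropped, which is the "no subsequent years
    are included" rule from above.
    """
    rows = []
    for _, r in cohort.iterrows():
        for t in range(1, min(int(r["time"]), n_bins) + 1):
            rows.append({
                "age": r["age"],  # hypothetical covariates
                "sbp": r["sbp"],
                "bin": t,
                "y": int(bool(r["event"]) and t == int(r["time"])),
            })
    return pd.DataFrame(rows)

# Hypothetical toy cohort.
cohort = pd.DataFrame({
    "age":   [63, 55, 71, 48],
    "sbp":   [145, 130, 160, 120],
    "time":  [3, 5, 2, 5],   # bin of event or censoring
    "event": [1, 0, 1, 0],   # 1 = MI observed, 0 = censored
})

long = to_person_period(cohort)
model = LogisticRegression().fit(long[["age", "sbp", "bin"]], long["y"])

# Total 5-year risk: 1 - prod_{t=1}^{5} (1 - p_hat_t).
new_patient = pd.DataFrame({"age": [60], "sbp": [140]})
p_hat = [model.predict_proba(new_patient.assign(bin=t))[0, 1] for t in range(1, 6)]
risk_5y = 1 - np.prod([1 - p for p in p_hat])
print(f"estimated 5-year MI risk: {risk_5y:.3f}")
```

Adding interactions between `bin` and the covariates would let the per-bin effects vary over time, which is one way this setup can be pushed towards either the Cox-like or the FG-like behaviour mentioned above.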

Additional thoughts

Delayed prediction

I presume the gap of 1 year between the annual physical and the prediction window is meant to avoid any causal leakage of information around the time of the visit (e.g., because a test was ordered following the physical and the result was backdated)? Whether or not this is necessary probably depends strongly on the dataset. For example, if I knew for sure that this does not happen, I would want to predict over the entire follow-up from 0-5y. This is what SCORE2 and other traditional scores do.

I agree that for benchmarking purposes --- and without knowing much about the underlying datasets --- it may make sense to choose the safer option and include some wash-out period of immortal time to avoid any leakage. We do something similar in the current definition of the mortality task here, and Robin and I also introduced a gap like this in YAIB. However, I wonder whether 1y is far too conservative and we should consider 1m-5y instead, which hopefully introduces a less severe selection bias (it would still exclude those at imminent risk of MI, but far fewer patients overall).
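In code, the proposed change is just a different offset on the window start. A toy illustration (entirely hypothetical; the real definition lives in the linked task YAML):

```python
import pandas as pd

GAP = pd.DateOffset(months=1)     # proposed wash-out: 1 month instead of 1 year
HORIZON = pd.DateOffset(years=5)  # end of follow-up, relative to the physical

def prediction_window(index_ts: pd.Timestamp) -> tuple[pd.Timestamp, pd.Timestamp]:
    """Prediction window for an annual physical at `index_ts`.

    Events inside [index_ts, index_ts + GAP) fall into the immortal
    wash-out period: patients with an MI there are excluded rather than
    labelled, to guard against leakage from backdated results.
    """
    return index_ts + GAP, index_ts + HORIZON

start, end = prediction_window(pd.Timestamp("2015-06-01"))
print(start.date(), end.date())  # 2015-07-01 2020-06-01
```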

Implicit vs. explicit loss to follow-up

One problem in the readmission use case is that we usually don't know whether a patient was lost to follow-up. If all we have is MIMIC, we cannot tell whether a patient simply had no further admissions or was lost to follow-up, for example because they moved to a different state.

Depending on the healthcare system, primary care data like that implied by this task may be different. In the UK, for example, patients are registered with a general practitioner and can be considered under observation for as long as they remain registered with that practice. This makes it possible to distinguish between patients who are followed up but have no visits/data and patients who are truly lost to follow-up. If possible, this would be a better criterion than "any data after 5 years", mainly because it doesn't bias the cohort as much towards patients with longer follow-ups or frequent visits.
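As a toy illustration of the two criteria (column names are hypothetical):

```python
import pandas as pd

def under_observation(patient: pd.Series, window_end: pd.Timestamp, explicit: bool) -> bool:
    """Decide whether a patient is observable through `window_end`.

    explicit=True uses registration periods (UK-style GP registration): the
    patient counts as followed up while registered, even with no visits.
    explicit=False is the "any data after the window" proxy, which biases
    the cohort towards patients with frequent visits.
    """
    if explicit:
        return patient["registration_end"] >= window_end
    return patient["last_event_time"] >= window_end

# The same patient can be kept or dropped depending on the criterion.
pat = pd.Series({
    "registration_end": pd.Timestamp("2021-01-01"),
    "last_event_time": pd.Timestamp("2018-03-15"),
})
print(under_observation(pat, pd.Timestamp("2020-06-01"), explicit=True))   # True
print(under_observation(pat, pd.Timestamp("2020-06-01"), explicit=False))  # False
```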

mmcdermott commented 2 weeks ago

A method that has become very popular in causal inference is pooled logistic regression. Here, the follow-up time is divided into discrete bins (e.g., weeks or months) and the model predicts for each bin if the event is going to happen ... I presume the gap of 1 year between annual physical and the prediction window is meant to avoid any causal leakage of information around the time of the visit (e.g., because a test was ordered following the physical and the result backdated)?

The gap of 1 year is actually there because I anticipate we will eventually "complete" this task by adding the other time bins (e.g., a 0-1 year prediction task, a 1-5 year prediction task, a 5-10 year prediction task, etc.), but I didn't want to add them all to the benchmark at this stage, so that our tasks aren't dominated by this one task, which actually has a bunch of component tasks. Do you think there is a better bin to use for either (a) this initial task or (b) things in general, assuming we eventually complete the other bins? Ostensibly, the bin choice should reflect the disparate kinds of decisions a provider might make -- e.g., if a patient is going to have an MI in the next year, that might warrant more acute action, but if it is only sometime in the next 1-5 years, maybe not so acute (to be clear, this example is pure speculation; I'm just trying to illustrate the style of task I'm highlighting).

After all, this type of question is the poster child for time-to-event analysis. This may simply be a gut feeling, though, because I think the basic problems remain the same.

I see your point as a CS / math person, but practically speaking I'm not sure how many real decision systems can rely on time-to-event predictions as opposed to simple binary or classification predictions. Ultimately, everything comes down to a concrete decision -- do an intervention, or don't; change the medication, or don't -- and while some of those decisions certainly can be made continuous in nature, I'm not sure how many of our systems today are prepped for such paradigms. @Gallifantjack, maybe you can offer a clinical perspective here -- are there settings where TTE predictions could realistically be integrated into decision workflows?

If possible, this would be a better criterion than "any data after 5 years", mainly because it doesn't bias the cohort as much towards patients with longer follow-ups or frequent visits.

I think this is a great point, but I don't know how many of our current datasets have this level of nuance. This also might be something we would want to apply more generally -- e.g., differentiate "complete" datasets from "incomplete" datasets in all metrics and task files.