mmcdermott / MEDS-DEV

Platform for Inductive Experiments over Medical Data

How should we handle censoring / loss-of-follow-up for readmission risk prediction? #9

Open mmcdermott opened 1 month ago

mmcdermott commented 1 month ago

For example, consider the readmission/general_hospital/30d task.

Tagging for comments: @prockenschaub @shalmalijoshi @justin13601 @tompollard @Jwoo5

As of the commit referenced in the link above, this task currently excludes all patients who do not have at least one data element more than 30 days after the prediction time. That may or may not be advisable.

Let's define $x$ to be a patient's data as of a prediction time, $R$ to be the label of whether or not there is an admission event within 30 days ($R=1$ if so, $R=0$ otherwise), $E$ to be a binary variable indicating whether or not we have data after 30 days ($E=1$ if there is data after 30 days, $E=0$ otherwise), and $M$ to be a binary variable indicating whether or not the patient dies within 30 days ($M=1$ if they do, $M=0$ otherwise).

Note that our constraints here are that we are limited to only a binary classification task. We can't, within this current scope, change frameworks to a survival analysis or something.

The question here is which patients to include or exclude, and what label to assign to them, for training this task. There are a few options:

Option 1: Only predict for patients with data observed more than 30 days out

| Description | $R$ | $E$ | $M$ | Include/Exclude | Training label $y$ |
| --- | --- | --- | --- | --- | --- |
| Patients w/ data after 30 days are included & labeled | * | 1 | 0 | Include | $R$ |
| Patients w/o data after 30 days are excluded | * | 0 | * | Exclude | N/A |

In this setting, we combine the notions of "death" and "loss of follow-up" and try to predict readmission only for the population of patients who will still have data after 30 days. One can use this in a clinical pipeline appropriately by also including predictors for a patient's likelihood of leaving the dataset within 30 days, either due to death or lack of follow-up (either jointly or via individual predictors), giving a nuanced picture of the patient's state (e.g., this patient is likely to die within 30 days, vs. this patient is likely to still have data for the full next 30 days but to need a readmission within that period).

Pros:
  1. This ensures that all patients in the training set have the same time-duration within which a readmission could feasibly occur, eliminating a confounder that may cause the model to predict a lower likelihood of readmission if it thinks a patient has a chance of leaving the dataset (either due to death or any other reason) even if the severity of their disease would normally warrant a readmission.
  2. This is a simple inclusion/exclusion criterion that is easy to express, and it does not require identifying patterns that distinguish "loss of follow-up" from "lack of data but true follow-up".
Cons:
  1. If one interprets "readmission prediction" to mean "predict whether this patient will likely be readmitted to the hospital given what will likely happen to them naturally", then this is a biased cohort under that definition as it does not reflect the fact that patients who are likely to die prior to any subsequent admission are unlikely to have a "readmission".
  2. This ignores some "ground truth" data: we know that patients who are readmitted but don't have data after 30 days were readmitted, and we know that patients who died within 30 days without a subsequent admission had no readmission.

Option 2: Predict on all patients where ground truth is known; omit patients where it is not.

| Description | $R$ | $E$ | $M$ | Include/Exclude | Training label $y$ |
| --- | --- | --- | --- | --- | --- |
| Patients w/ data after 30 days are included & labeled | * | 1 | 0 | Include | $R$ |
| Patients who die within 30 days w/o admission get 0 | 0 | 0 | 1 | Include | 0 |
| Patients who are readmitted within 30 days get 1 | 1 | * | * | Include | 1 |
| Patients w/o data or death > 30d out are excluded | 0 | 0 | 0 | Exclude | N/A |

In this setting, whenever we know a definite answer, we include the patient. If the patient is readmitted within 30 days, they get a 1. If they die within 30 days before readmission, they get a 0. If they have the full 30 days observed without a readmission, they get a 0. If they don't meet any of those criteria, they are excluded.

Pros:
  1. This uses all known information.
  2. This definition brings "readmission" as close as we can get, without making assumptions, to "will there be an unconditioned readmission within 30 days for this patient?"
Cons:
  1. Even though this is "closest" to the definition most people intuit for readmission, it isn't that thing. It is still conditioned on not having a loss of follow-up. The fact that it is closer might make that discrepancy more dangerous, as people may not recognize it is something that requires adjustment, whereas omitting death is a more obvious signal.
  2. This is a more complex cohort definition. In particular, we may need to solve https://github.com/justin13601/ACES/issues/54 before we could realize this.
  3. Users may mistakenly think that this is a predictor of some kind of acuity, when in reality the model should be expected to display a sort of "non-monotonic" behavior -- patients who are so acute they are going to die promptly will receive low readmission scores, as will patients with low acuity, whereas patients with acuity in the middle will be at greater risk for readmission.

Option 3: Include all patients, assume data is complete.

| Description | $R$ | $E$ | $M$ | Include/Exclude | Training label $y$ |
| --- | --- | --- | --- | --- | --- |
| Patients are labeled from $R$ alone | * | * | * | Include | $R$ |

In this option, we don't exclude anybody on the basis of future information (either due to death or loss of follow up). If $R=1$, we label $y=1$. If $R=0$, we label $y=0$.

Pros:
  1. This is the simplest possible cohort.
  2. Under the current data distribution, this is the exact task of "will we see this patient again in 30 days, unconditioned on anything?"
  3. This uses the maximum number of possible patients, and thus has the maximal amount of training data.
Cons:
  1. This assumes that patients for whom we have a loss of follow-up also don't have a readmission. As there may be a correlation between the need for a readmission and the likelihood of loss of follow-up, this may introduce unintended biases.
  2. Users may mistakenly think that this is a predictor of some kind of acuity, when in reality the model should be expected to display a sort of "non-monotonic" behavior -- patients who are so acute they are going to die promptly will receive low readmission scores, as will patients with low acuity, whereas patients with acuity in the middle will be at greater risk for readmission.
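
For concreteness, the three labeling rules above can be written as a small sketch (purely illustrative Python, not part of MEDS-DEV or ACES; the function names are hypothetical). Each function maps a patient's $(R, E, M)$ triple to an (include, training label) pair:

```python
from typing import Optional, Tuple

# R: admission within 30 days; E: any data after 30 days; M: death within 30 days.

def option_1(R: int, E: int, M: int) -> Tuple[bool, Optional[int]]:
    """Option 1: only predict for patients with data observed more than 30 days out."""
    if E == 1 and M == 0:
        return True, R
    return False, None

def option_2(R: int, E: int, M: int) -> Tuple[bool, Optional[int]]:
    """Option 2: include whenever the ground-truth answer is known."""
    if R == 1:                 # readmitted within 30 days
        return True, 1
    if E == 1 and M == 0:      # full 30 days observed without a readmission
        return True, R         # R == 0 here
    if E == 0 and M == 1:      # died within 30 days without a readmission
        return True, 0
    return False, None         # lost to follow-up with no known outcome

def option_3(R: int, E: int, M: int) -> Tuple[bool, Optional[int]]:
    """Option 3: include everyone and label from R alone, assuming data are complete."""
    return True, R
```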

Option 4: ???

Other suggestions or options are welcome.

mmcdermott commented 1 month ago

To make this more concrete and timely -- barring community input over the coming days, I will stick with Option 1 as our current default for this initial push.

tompollard commented 1 month ago

@mmcdermott i probably mentioned this already, but sorry for not being very responsive. i'm working towards a tight deadline for a grant proposal so have very little time for the next ~month.

i have only scanned the discussion above, but, if i understand correctly, option 1 only includes patients if they have one or more data points >30 days from prediction time (/ discharge time).

in real life we can't begin by filtering people who we don't see in the future, so making this a requirement for a benchmark seems like a bad idea.

i think it makes more sense to (1) assume that if a patient returns to hospital then they are returning to the same hospital (2) make sure that our prediction window doesn't exceed censor date for the dataset.

tompollard commented 1 month ago

Thinking about the question of how to deal with patients who die within the prediction window...doesn't this just mean that we are trying to force a non-binary classification task into a binary task?

mmcdermott commented 1 month ago

> @mmcdermott i probably mentioned this already, but sorry for not being very responsive. i'm working towards a tight deadline for a grant proposal so have very little time for the next ~month.

No worries @tompollard -- Whatever cycles you have to offer insight is appreciated!

> i have only scanned the discussion above, but, if i understand correctly, option 1 only includes patients if they have one or more data points >30 days from prediction time (/ discharge time). in real life we can't begin by filtering people who we don't see in the future, so making this a requirement for a benchmark seems like a bad idea.

So, obviously you are correct in that we can't filter patients by future (unseen) data in a deployment scenario. However, I disagree with the logic that this makes the task bad for a benchmark. In fact, many tasks are implicitly characterized by future data dependencies -- for example, any study on MIMIC-IV has the implicit exclusion criterion that a patient will be excluded from the task cohort if they have not and will not ever go to the ED while they remain in the dataset. I'm not suggesting that this prevalence means the property is not a problem. Instead, what I would say is problematic about these tasks is not their inclusion in a benchmark, but rather any subsequent use of results over these tasks to justify inappropriate deployment strategies. In particular, in this case, when I say we should do "Option 1", I am also explicitly proposing that this task could not and should not be used in a deployment scenario without an additional predictor also being leveraged to predict whether or not the patient will be in the dataset for more than 30 days. These two tasks together give us the unconditioned probability of an "admission within the next 30 days", when that is of interest. They also give us more precise predictors of things like "is this patient likely to be in the dataset for more than 30 days?" and "presuming this patient doesn't leave the dataset, would they likely be readmitted?"

I would argue that in almost all cases when restricted to binary tasks, multiple predictors will be necessary to form a complete picture of the relevant probability distributions to motivate use in deployment. I would go further and say (while I acknowledge this poses very real HCI and interpretability challenges) that this property is a good thing, because it reflects that we are making more precise predictions of simpler probabilistic outcomes, rather than broader predictions of more complex, often more poorly understood probability distributions.

> i think it makes more sense to (1) assume that if a patient returns to hospital then they are returning to the same hospital (2) make sure that our prediction window doesn't exceed censor date for the dataset.

Can you map this into a concrete proposal of inclusion/exclusion & label under the tabular form above, to make sure I understand?

> Thinking about the question of how to deal with patients who die within the prediction window...doesn't this just mean that we are trying to force a non-binary classification task into a binary task?

I instead prefer to think about this from the perspective that we are breaking down a complex task into simpler binary components, but that argument is at least half just semantics.

tompollard commented 1 month ago

For me this discussion emphasizes the importance of being clear about our intended goals of the benchmarks, and how we intend them to be used (not that we aren't, I'm still behind on reading up).

> In fact, many tasks are implicitly characterized by future data dependencies -- for example, any study on MIMIC-IV has the implicit exclusion criterion that a patient will be excluded from the task cohort if they have not and will not ever go to the ED while they remain in the dataset.

I assume this should be "not ever go to the ED [or ICU] while...". I agree, for me the construction of the cohort is a major problem with MIMIC-IV (along with confusing temporal misalignment of modules). I wouldn't want to use these existing problems as justification for creating a new one.

> I am also explicitly proposing that this task could not and should not be used in a deployment scenario without an additional predictor also being leveraged to predict whether or not the patient will be in the dataset for more than 30 days.

Maybe a side note, but is there also an upper bound (e.g. more than 30 days and less than 365 days)? Otherwise we're skewing the population towards those who were admitted early in the period spanned by the dataset.

> I am also explicitly proposing that this task could not and should not be used in a deployment scenario without an additional predictor also being leveraged to predict whether or not the patient will be in the dataset for more than 30 days.

e.g. without the upper bound on future data, we'll be lowballing the probability of being in the dataset for more than 30 days (e.g. patient has COVID? less likely to show up again).

> Can you map this into a concrete proposal of inclusion/exclusion & label under the tabular form above, to make sure I understand?

I'll try to find the time but no promises!

> I instead prefer to think about this from the perspective that we are breaking down a complex task into simpler binary components, but that argument is at least half just semantics.

So in practice you would run two separate models - e.g. one predicting 30 day mortality and one predicting 30 day readmission - and this would give you a cleaner estimate of both outcomes? That makes sense I guess. It just feels odd to break down the task using a dependency on information not available at prediction time.

mmcdermott commented 1 month ago

> For me this discussion emphasizes the importance of being clear about our intended goals of the benchmarks, and how we intend them to be used (not that we aren't, I'm still behind on reading up).

Couldn't agree more!

> Maybe a side note, but is there also an upper bound (e.g. more than 30 days and less than 365 days)? Otherwise we're skewing the population towards those who were admitted early in the period spanned by the dataset.

Yeah, that is a good point, but I'm not sure the best way to handle it is to add a recency constraint. I'm very open to it, but it seems like it could also introduce other unintended confounders, like focusing the model only on patients with certain diseases who are more likely to be seen more regularly.

> So in practice you would run two separate models - e.g. one predicting 30 day mortality and one predicting 30 day readmission - and this would give you a cleaner estimate of both outcomes? That makes sense I guess. It just feels odd to break down the task using a dependency on information not available at prediction time.

Yes, you'd run two (or more) models. This is not as weird as it sounds (or it shouldn't be, imo) -- in IPW for causal analyses, for example, you do something similar based on predicting whether the patient would receive the treatment, then using that to reweight treatment response predictions. In general, almost all tasks we care about will be on restricted cohorts, with conditions on those cohorts that we can't know in advance. E.g., predicting an abnormal lab result is conditioned on the lab being measured, predicting any generic future event is conditioned on the patient still being in the dataset in that period, predicting treatment response is conditioned on the treatment being continued for a sufficient time to observe a response, etc. From that perspective, I think the fact that binary classification models make some of these things very explicit is an advantage.

prockenschaub commented 3 weeks ago

TL;DR: I like the nuanced modelling of the task but am afraid that nuance will get lost in the paper, making things worse than just going for Option 3.

Long version:

Similar to Tom's comment, I think several of the options outlined above are valid and our choice depends on our intended goals and how well we can explain them.

For example, Option 1 (the current default) estimates $p(R | E = 1, M = 0)$. Your argument @mmcdermott is that although we condition on the future, this is fine, because in practice we can combine it with $p(E, M)$ to get the full data distribution $p(R, E, M)$. This is true and I actually like the idea of more fine-grained modelling. However, I'd like to point out that if it is really the readmission itself that we care about, it is not enough to do this

> an additional predictor also being leveraged to predict whether or not the patient will be in the dataset for more than 30 days. These two tasks together give us the unconditioned probability of an "admission within the next 30 days"

because it gives us an incomplete picture. By modelling $p(R | E = 1, M = 0)$, we only model a third of all possibilities. In order to recover the full joint distribution, we'd also need to estimate $p(R | E = 0, M = 1)$ (the probability of being readmitted before death in patients dying within 30 days) and $p(R | E = 0, M = 0)$ (the probability of being readmitted before being lost to follow-up for any reason that isn't death). Notably, the latter is a censoring/missing data problem that of course is only identifiable under certain conditions. We don't need to cover the fourth case because $p(E = 1, M = 1) = 0$.
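
Written out via the law of total probability (and using the fact that $p(E = 1, M = 1) = 0$), the quantity we would ultimately be after is

$$
p(R = 1) = p(R = 1 \mid E = 1, M = 0)\,p(E = 1, M = 0) + p(R = 1 \mid E = 0, M = 1)\,p(E = 0, M = 1) + p(R = 1 \mid E = 0, M = 0)\,p(E = 0, M = 0),
$$

where the first conditional is what the Option 1 model estimates and the other two are the additional conditionals described above.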

We don't always need all of those additional distributions, though. To see how our underlying prediction goal changes the task, let's contrast this with a setting where we aren't actually interested in the readmission itself but instead are one of those users who just want to create a predictor for acuity as "readmission or death". Now we don't care about the conditional $p(R | E = 0, M = 1)$ and are content with just estimating $p(R | M = 0, E)$ and $p(M | E)$. I consider this close to the current description of Option 1 in the .yaml, because in the config we currently say "Prediction of the patient's likelihood of mortality within 30 days of discharge is a separate task". To me, this description suggests that just estimating $p(E, M)$ is all that is needed to expand our task and model readmission in an "appropriately conditional manner".

Finally, if all we care about is whether the patient turns up at our doorstep (e.g., because our model is not used for medical decision making but solely for planning of staff workload), life gets even simpler and we just need $p(R)$ (Option 3).

So which one of the above is the right task to use for our benchmark? I personally think that either would be fine and may be appropriate in practice depending on the goal for our model. As long as we make very clear in the benchmark what we model and why, all is good. There is a but, though. I think the discussion so far has shown that there is a lot of nuance to Option 1 including issues of both competing risks and censoring. My primary fear is that this crucial nuance may very well get lost in a benchmarking paper with multiple tasks, which is why I think we may want to consider Option 3 with the explicit assumption that any (or at least most) readmissions within 30 days are likely to the same hospital (as per Tom's suggestion). Option 1 would work best if we can show the entire modelling task including all the individual probabilities and devote a lot of space to discussing the intricacies.

I also think that this may be an even bigger problem for some of the other tasks. If we look at the MI task, the current setup estimates the probability of having an MI in the next five years conditional on the patient surviving said MI. Shown in isolation, this is a very odd task. If we are worried about users misinterpreting the predictor (which is listed as a con in both Options 2&3), I am not sure we are making things better by adding this complexity.

mmcdermott commented 3 weeks ago

> However, I'd like to point out that if it is really the readmission itself that we care about, it is not enough to do this

You're right @prockenschaub, I spoke too carelessly. What I meant (but did not say) was that if we predict $f(x) = p(y|x, I=1)$, where $I=1$ is an aggregation of all our inclusion criteria asserting that the patient is included, and we model $p(I=1 | x)$, then we can use this second model to reweight our evaluation of $f(x)$ over the set of patients for whom $I=0$ via inverse propensity weighting. That said, I may be making an error there, as I'm definitely not an expert on causal stuff. Either way, we would definitely need to do more modeling to model $p(y | x)$ more generally.
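
Very roughly, the kind of reweighting I have in mind looks something like the sketch below (illustrative only, with hypothetical names; it reweights the included, $I=1$, cohort by the inverse of the estimated inclusion probability so the metric stands in for the full population):

```python
import numpy as np

def ipw_weighted_brier(y_true: np.ndarray, y_prob: np.ndarray, p_include: np.ndarray) -> float:
    """Brier score over the included (I = 1) cohort, reweighted by 1 / p(I = 1 | x).

    y_true:    observed labels for the included patients
    y_prob:    f(x) = estimated p(y | x, I = 1) for the same patients
    p_include: estimated p(I = 1 | x) from the separate inclusion model
    """
    weights = 1.0 / np.clip(p_include, 1e-3, 1.0)  # clip to avoid exploding weights
    return float(np.average((y_prob - y_true) ** 2, weights=weights))
```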

mmcdermott commented 3 weeks ago

> I also think that this may be an even bigger problem for some of the other tasks. If we look at the MI task, the current setup estimates the probability of having an MI in the next five years conditional on the patient surviving said MI.

You're right. We should have separate issues for each task. I'm going to re-title this issue to focus specifically on readmission, and we can have more targeted discussions for other tasks in new issues. Here's a task for MI: https://github.com/mmcdermott/MEDS-DEV/issues/10

mmcdermott commented 3 weeks ago

> i think it makes more sense to (1) assume that if a patient returns to hospital then they are returning to the same hospital -- @tompollard
>
> ... we may want to consider Option 3 with the explicit assumption that any (or at least most) readmissions within 30 days are likely to the same hospital (as per Tom's suggestion) -- @prockenschaub

I looked into this a bit. Data on how frequently this happens is sparse, but the rate is not insignificant. The only study I found pegs it at around 20%, though this will obviously vary widely. Regardless, I don't think we can assume the fraction that get readmitted to a different hospital is negligible.

That said, even if someone is readmitted to a different hospital, it is not clear whether our criteria here can actually catch that, because going to a different hospital for your readmission is very different from never coming back to the health system of the original admission.

mmcdermott commented 3 weeks ago

So, some overall thoughts.

Thought 1: We should (as a field, and in the long term for this benchmark, not necessarily in the immediate term) make our tasks more aligned with prospective use cases in general.

Right now, I think a lot of the confusion and apparent disagreement in this discussion stems from us having different views in mind of what "readmission risk prediction" means -- or, rather, what we are actually trying to predict, meaning explicitly what operationalized value of information we expect such a prediction to offer.

This is a general, systemic problem in the field, I think, but one that we are well poised to try to target in the long term. This discussion is already diving more deeply into important questions and aspects of this task than I have seen in many more traditional resources in ML for health.

In particular, my actionable takeaway from this is that while simplicity is extremely valuable, we should prioritize providing task definitions that, insofar as we are able within the confines of ACES' syntax and the datasets we have access to, reflect in part the complexity of their use cases, and that clearly document and describe why the tasks are configured the way they are. While this will make our tasks harder to understand, it will make it easier for our community to iterate on these tasks and narrow them down to the tasks that truly matter in this space. Obviously this is a balancing act, but I think simply defaulting to the simplest possible cohort would be a mistake.

Thought 2: For Readmission Risk Specifically, a good "operationalized value proposition" we could consider trying to target is the U.S. Hospital Readmissions Reduction Program (HRRP)

The HRRP aims to reduce unplanned and avoidable readmissions by applying financial penalties to hospitals that have higher than average readmission rates for patients with certain conditions. These financial incentives are, to the best of my (admittedly relatively limited) knowledge, the main driver behind the readmission risk prediction task being of such wide interest in the U.S. in particular. Other countries also have similar readmission programs, though the exact criteria differ. Similar financial penalties do not, to the best of my knowledge, currently exist for mortality within 30 days after discharge, but other metrics do capture and penalize unplanned mortality through more indirect measures.

Here is a ChatGPT summary of unknown veracity of these programs: https://chatgpt.com/share/15f740dc-ba52-47c6-a059-d5e1a8043eac

If we wanted to treat the HRRP (or analogs of the HRRP) as our guiding principle, there are two larger areas of change we should consider:

Change 1: Have "initial admission disease"-based inclusion criteria and "potential readmission disease"-based exclusion criteria.

Under most of these programs, a readmission is only eligible for penalization if it is thought to be both unplanned and avoidable. These criteria are often codified as follows:

  1. Readmissions after hospitalizations are only penalized if the initial hospitalization was for a limited subset of diseases. For example, the HRRP includes: Acute myocardial infarction (AMI), Chronic obstructive pulmonary disease (COPD), Heart failure (HF), Pneumonia, Coronary artery bypass graft (CABG) surgery, & Elective primary total hip arthroplasty and/or total knee arthroplasty (THA/TKA).
  2. Subsequent admissions within 30 days only count as "readmissions" if they are believed to be avoidable. In general, to my limited understanding, this means elective procedures or planned admissions do not count.

We should consider including both of these kinds of conditions in our readmission task (or in a variant of a readmission task).

Change 2: We should adjust our exclusion criteria to mirror this prospective use case.

If we imagine that this model is used to delay discharge for patients who will have a subsequent readmission in a manner that causes a financial penalty, we can break the patient population down into a few groups to examine the possible failure modes. To characterize these groups, imagine that each patient has a time-to-mortality as of discharge, assuming no subsequent admission, given by a random variable $t_M$ (a random variable so that it can reflect uncertainty and/or true variance in how long patients will live after discharge).

  1. Patients who will not leave the dataset or go to a different hospital other than due to death and for whom $t_M$ has a significant probability of being less than 30 days, and are known at the time of discharge to be at high risk of mortality and are being moved to palliative care and/or hospice. In this case, these patients would never be readmitted due to their palliative care status, regardless of the value of $t_M$.
    • Under option 3, these patients would be labeled $y=0$. This is appropriate given they would never be readmitted.
    • Under option 1, these patients would be excluded, which ignores valuable signal in that the label of $y=0$ is appropriate. However, note that in reality only a subset of patients who fit this description will die within 30 days; others will, by chance, survive longer than 30 days. As this population of patients is one that would never be readmitted, regardless of how long their prospective survival time is, those patients who do survive longer than 30 days would be included with a label of $y=0$ under Option 1, so this signal is not fully lost.
  2. Patients who will not leave the dataset or go to a different hospital other than due to death and are at risk of acute decompensation (e.g., their $t_M$ has a significant probability of being less than 30 days), but are not known at the time of discharge to be at high risk of mortality. Given that mortality has (albeit indirect) penalties as well, it is reasonable to assume that if the hospital realized the patient was at risk of acute decompensation, they would choose to re-admit that patient despite the financial penalty. As such, it is likewise reasonable to assume that the probability the patient would be readmitted within some timeframe grows with $t_M$ -- e.g., the longer the patient survives, the greater the chance they'd be readmitted.
    • Under option 3, these patients who die before being readmitted will be labeled as $y=0$. This is likely inappropriate, as for these patients, the hospital would both have preferred not to discharge them in the first place and, were they to survive longer, the chance that they would have been readmitted would increase.
    • Under option 3, those patients who are readmitted, then die, will be labeled as $y=1$, which is appropriate.
    • Under option 1, these patients who die within 30 days and are not readmitted are excluded. This is more appropriate than under option 3, because we would prefer to label these patients with $y=1$ (though we can't know that at either train or inference time, of course). Further, for the subset of patients who do live until 30 days, there is a greater chance that they will receive a label of $y=1$, because we know that $t_M$ is associated with admission likelihood.
    • Under option 1, those patients who are readmitted, then die, all within 30 days, will be excluded, which is inappropriate, as a label of $y=1$ would better align with a hospital's use case.
  3. Patients who will be readmitted but will go to a different hospital. In the case that these patients later return to our dataset, all options fail equally by giving a label of $y=0$ and there is nothing we can do. However, it seems reasonable to imagine that patients who do go to a different hospital are more likely to also leave the dataset permanently, and in that case Option 1 would exclude those patients, which is arguably superior to giving a label of $y=0$.
  4. Patients who will not be readmitted and will leave the dataset (not through mortality). In this case, giving a label of $y=0$, as Option 3 does, would be preferred over omitting the patient as Option 1 does. The extent to which this comparison matters is likely dependent on the dataset -- e.g., if the rate of people leaving the dataset within 30 days of a random discharge is much greater than the rate of readmission, then this choice will impact the perceived labels for many patients, whereas if it is much smaller it will be less impactful. Something also central here, and harder to measure from the dataset, is the extent to which the likelihood that a patient leaves the dataset is statistically independent of the outcome of interest. I suspect in many cases it is not -- e.g., patients who are lower-resourced might be less likely to remain in the hospital system within the U.S. due to changes in or loss of insurance. In such a case, I would argue it would be better to exclude these patients rather than assign them a label of $y=0$, out of concern that erroneous labeling would bias the apparent prior the model observes for similar patients in the dataset overall. Excluding these patients from the cohort could also help surface any apparent bias in the output data, because the number of patients from groups likely to leave the dataset within 30 days would be smaller than if the label were imputed to $y=0$.

Thought 3: We should open github discussions, or a wiki, or something other than just issues to curate discussions on tasks.

mmcdermott commented 3 weeks ago

To expand on Thought 2 above, how do those possible changes reflect in what we might want in our readmission task?

We should only count unplanned readmissions

The desire to count only admissions that are not elective in nature may reveal a limitation in ACES' configuration. @justin13601 and I will consider it. If this is expressible (I do not think it currently is, but it likely would be were https://github.com/justin13601/ACES/issues/54 solved), I would advocate we include it; but for now, given that I do not think it is expressible, we should count all admissions.

Neither Option 1 nor Option 3 is best

Based on the analysis above, I think the following assignment of labels to settings would best align with the hospital use case (though it may not be expressible simply with ACES):

| Description | $R$ | $E$ | $M$ | Include/Exclude | Training label $y$ |
| --- | --- | --- | --- | --- | --- |
| Patients w/ data after 30 days are included & labeled | * | 1 | 0 | Include | $R$ |
| Patients who die within 30 days w/o admission are excluded | 0 | 0 | 1 | Exclude | N/A |
| Patients who are readmitted & die w/in 30 days get 1 | 1 | * | 1 | Include | 1 |
| Patients w/o data or death > 30d out are excluded | 0 | 0 | 0 | Exclude | N/A |

I think that perhaps this would be best because:

  1. patients who (unbeknownst to the model or us) are at known risk of mortality and enter palliative care will be excluded if they die within 30 days, but the subset of similar patients who (by chance or not) survive past 30 days will be included with an appropriate label of 0.
  2. patients who (unbeknownst to the model or us) are at unknown risk of mortality will be excluded if they die before their decompensation can be caught and they are readmitted to the hospital, will be given a label of 0 if they survive past 30 days without an admission (appropriate given the restriction that this task is binarized), and will be given a label of 1 if their decompensation is caught and they are readmitted, then die.
  3. patients who leave the health system before 30 days will be excluded, regardless of whether or not they are first readmitted. The rationale for excluding them regardless of whether they are readmitted is to avoid bias in the model, as patients who are readmitted but then leave may, for example, have greater healthcare access than those who are not readmitted and then leave. In cases where this set of patients is minimal, excluding them will not unduly limit the model, whereas including them could introduce unwanted biases. In cases where this set of patients is significant, then likely something else is wrong with the data w.r.t. this task, so exposing a smaller dataset that may be more obviously limited is superior to exposing a larger dataset that may have a commensurately larger risk of bias.

However, I don't think at first glance that this configuration is capturable in ACES without solving https://github.com/justin13601/ACES/issues/54.
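
Expressed as a plain labeling rule rather than an ACES config (an illustrative sketch only; the function name is hypothetical), the table above amounts to:

```python
from typing import Optional, Tuple

def proposed_label(R: int, E: int, M: int) -> Tuple[bool, Optional[int]]:
    """(include, training label) under the hybrid rule proposed above."""
    if R == 1 and M == 1:      # readmitted and then died within 30 days
        return True, 1
    if E == 1 and M == 0:      # data observed past 30 days
        return True, R
    return False, None         # died w/o readmission, or left the dataset within 30 days
```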

Given ACES' current limitations, I advocate we proceed with Option 1

This is not because I think it is the best configuration, but rather for a more tactical reason. In particular, I think it will be helpful if we have at least one task that has some justifiable, but complex and slightly non-standard inclusion/exclusion criteria, so that we can emphasize to readers and users that a big part of this benchmark is defining and improving task definitions as a community. Of the tasks in our set currently, this task has the lowest potential for dangerous errors due to misconfiguration, as it would likely be used operationally more than clinically; so of the tasks to include more complexity in, even if we're not fully sure of or fully able to express the best version of the criteria, this is the one we should pick. It is also the best choice tactically, because it is a common task, so if we can convincingly show that it is actually much more complex than we typically consider, that will be the most valuable for the community. If we take that as a goal, then Option 3 is eliminated for being too simple. Option 2 is not viable within ACES, nor is the new option I proposed above. Frankly, we also don't have the right expertise to really decide which of the various options is truly "most aligned" with how hospitals are likely to care about this. Given all that -- until we can both get a true expert on this task to weigh in and express more complex relationships in ACES, our capability to produce the "right" cohort is limited -- Option 1 is the only choice that meets our needs: it is not the simplest possible version of this task, and it is a reasonable expression of the task in a way that aligns with hospital needs.

mmcdermott commented 3 weeks ago

That all being said, practically for now we should just decide and stick with it, so I'm going to post a poll on slack between option 1 and 3 so we can just vote and be done for the ML4H push, and we can relegate further improvements to after the benchmark. At this point I'd be fine with either option, and see clear strengths and weaknesses to both.

prockenschaub commented 3 weeks ago

> I think it will be helpful if we have at least one task that has some justifiable, but complex and slightly non-standard inclusion/exclusion criteria, so that we can emphasize to readers and users that a big part of this benchmark is defining and improving task definitions as a community.

> If we take that as a goal, then Option 3 is eliminated for being too simple.

I agree with these points. I think if we devote some space to it in the manuscript, highlighting the complexity of even common tasks like readmission prediction is a strength. The slack poll shows a clear preference for Option 1 anyway :)

prockenschaub commented 3 weeks ago

I also like the real-world use case with HRRP - something to revisit once the initial benchmark push is done.