mmcdermott opened 1 month ago
To make this more concrete and timely -- barring community input over the coming days, I will stick with Option 1 as our current default for this initial push.
@mmcdermott i probably mentioned this already, but sorry for not being very responsive. i'm working towards a tight deadline for a grant proposal so have very little time for the next ~month.
i have only scanned the discussion above, but, if i understand correctly, option 1 only includes patients if they have one or more data points >30 days from prediction time (/ discharge time).
in real life we can't begin by filtering people who we don't see in the future, so making this a requirement for a benchmark seems like a bad idea.
i think it makes more sense to (1) assume that if a patient returns to hospital then they are returning to the same hospital (2) make sure that our prediction window doesn't exceed censor date for the dataset.
Thinking about the question of how to deal with patients who die within the prediction window...doesn't this just mean that we are trying to force a non-binary classification task into a binary task?
@mmcdermott i probably mentioned this already, but sorry for not being very responsive. i'm working towards a tight deadline for a grant proposal so have very little time for the next ~month.
No worries @tompollard -- Whatever cycles you have to offer insight is appreciated!
i have only scanned the discussion above, but, if i understand correctly, option 1 only includes patients if they have one or more data points >30 days from prediction time (/ discharge time). in real life we can't begin by filtering people who we don't see in the future, so making this a requirement for a benchmark seems like a bad idea.
So, obviously you are correct that we can't filter patients by future (unseen) data in a deployment scenario. However, I disagree with the logic that this makes the task bad for a benchmark. In fact, many tasks are implicitly characterized by future data dependencies -- for example, any study on MIMIC-IV has the implicit exclusion criterion that a patient is excluded from the task cohort if they have not and will not ever go to the ED while they remain in the dataset. I'm not suggesting this property is not a problem. Instead, what I would say is problematic about these tasks is not their inclusion in a benchmark, but any subsequent use of results over these tasks to justify inappropriate deployment strategies. In particular, in this case, when I say we should do "Option 1", I am also explicitly proposing that this task could not and should not be used in a deployment scenario without an additional predictor also being leveraged to predict whether or not the patient will be in the dataset for more than 30 days. These two tasks together give us the unconditioned probability of an "admission within the next 30 days", when that is of interest. They also give us more precise predictors of things like "is this patient likely to be in the dataset for more than 30 days?" and "presuming this patient doesn't leave the dataset, would they likely be readmitted?".
I would argue that in almost all cases when restricted to binary tasks, multiple predictors will be necessary to form a complete picture of the relevant probability distributions to motivate use in deployment. I would go further and say (while I acknowledge this poses very real HCI and interpretability challenges) that this property is a good thing, because it reflects that we are making more precise predictions of simpler probabilistic outcomes, rather than broader predictions of more complex, often more poorly understood probability distributions.
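As a hedged illustration of the "two predictors" idea (the function names and numbers below are made up, not part of any existing pipeline), multiplying the conditional readmission predictor by a dataset-presence predictor yields the joint probability $p(R = 1, E = 1)$, which is one of the terms needed for the unconditioned estimate:

```python
# Toy sketch only: both predictor functions are stand-ins for trained models.

def p_readmission_given_present(x) -> float:
    """Estimate of p(R = 1 | patient still has data after 30 days). Stub."""
    return 0.25

def p_present_after_30d(x) -> float:
    """Estimate of p(patient has any data >30 days after discharge). Stub."""
    return 0.8

def p_readmission_and_present(x) -> float:
    # p(R=1, E=1) = p(R=1 | E=1) * p(E=1): the joint term the two
    # predictors recover together. (Recovering p(R=1) fully also needs
    # the terms where E=0, discussed later in this thread.)
    return p_readmission_given_present(x) * p_present_after_30d(x)

print(p_readmission_and_present({"patient_id": 1}))  # 0.2
```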
i think it makes more sense to (1) assume that if a patient returns to hospital then they are returning to the same hospital (2) make sure that our prediction window doesn't exceed censor date for the dataset.
Can you map this into a concrete proposal of inclusion/exclusion & label under the tabular form above, to make sure I understand?
Thinking about the question of how to deal with patients who die within the prediction window...doesn't this just mean that we are trying to force a non-binary classification task into a binary task?
I instead prefer to think about this from the perspective that we are breaking down a complex task into simpler binary components, but that argument is at least half just semantics.
For me this discussion emphasizes the importance of being clear about our intended goals of the benchmarks, and how we intend them to be used (not that we aren't, I'm still behind on reading up).
In fact, implicitly many tasks are characterized with future data dependencies -- for example, any study on MIMIC-IV has the implicit exclusion criteria that a patient will be excluded from the task cohort if they have not and will not ever go to the ED while they remain in the dataset.
I assume this should be "not ever go to the ED [or ICU] while...". I agree, for me the construction of the cohort is a major problem with MIMIC-IV (along with confusing temporal misalignment of modules). I wouldn't want to use these existing problems as justification for creating a new one.
I am also explicitly proposing that this task could not and should not be used in a deployment scenario without an additional predictor also being leveraged to predict whether or not the patient will be in the dataset for more than 30 days.
Maybe a side note, but is there also an upper bound (e.g. more than 30 day and less than 365 day)? Otherwise we're skewing the population towards those who were admitted early in the period spanned by the dataset.
I am also explicitly proposing that this task could not and should not be used in a deployment scenario without an additional predictor also being leveraged to predict whether or not the patient will be in the dataset for more than 30 days
e.g. without the upper bound on future data, we'll be lowballing the probability of being in the dataset for more than 30 days (e.g. patient has COVID? less likely to show up again).
Can you map this into a concrete proposal of inclusion/exclusion & label under the tabular form above, to make sure I understand?
I'll try to find the time but no promises!
I instead prefer to think about this from the perspective that we are breaking down a complex task into simpler binary components, but that argument is at least half just semantics.
So in practice you would run two separate models - e.g. one predicting 30 day mortality and one predicting 30 day readmission - and this would give you a cleaner estimate of both outcomes? That makes sense I guess. It just feels odd to break down the task using a dependency on information not available at prediction time.
For me this discussion emphasizes the importance of being clear about our intended goals of the benchmarks, and how we intend them to be used (not that we aren't, I'm still behind on reading up).
Couldn't agree more!
Maybe a side note, but is there also an upper bound (e.g. more than 30 day and less than 365 day)? Otherwise we're skewing the population towards those who were admitted early in the period spanned by the dataset.
Yeah, that is a good point, but I'm not sure the best way to handle it is to add a recency constraint. I'm very open to it, but it seems like it could also introduce other unintended confounders, like focusing the model only on patients with certain diseases who are more likely to be seen more regularly.
So in practice you would run two separate models - e.g. one predicting 30 day mortality and one predicting 30 day readmission - and this would give you a cleaner estimate of both outcomes? That makes sense I guess. It just feels odd to break down the task using a dependency on information not available at prediction time.
Yes, you'd run two (or more) models. This is not as weird as it sounds (or it shouldn't be, imo) -- in IPW for causal analyses, for example, you do something similar: you predict whether the patient would receive the treatment, then use that to reweight treatment-response predictions. In general, almost all tasks we care about will be on restricted cohorts, with conditions on those cohorts that we can't know in advance. E.g., predicting an abnormal lab result is conditioned on the lab being measured, predicting any generic future event is conditioned on the patient still being in the dataset in that period, predicting treatment response is conditioned on the treatment being continued for long enough to observe a response, etc. From that perspective, I think the fact that binary classification models make some of these things very explicit is an advantage.
TL;DR: I like the nuanced modelling of the task but am afraid that nuance will get lost in the paper, making things worse than just going for Option 3.
Long version:
Similar to Tom's comment, I think several of the options outlined above are valid and our choice depends on our intended goals and how well we can explain them.
For example, Option 1 (the current default) estimates $p(R | E = 1, M = 0)$. Your argument @mmcdermott is that although we condition on the future, this is fine, because in practice we can combine it with $p(E, M)$ to get the full data distribution $p(R, E, M)$. This is true and I actually like the idea of more fine-grained modelling. However, I'd like to point out that if it is really the readmission itself that we care about, it is not enough to do this
an additional predictor also being leveraged to predict whether or not the patient will be in the dataset for more than 30 days. These two tasks together give us the unconditioned probability of an "admission within the next 30 days"
because it gives us an incomplete picture. By modelling $p(R | E = 1, M = 0)$, we only model a third of all possibilities. In order to recover the full joint distribution, we'd also need to estimate $p(R | E = 0, M = 1)$ (the probability of being readmitted before death in patients dying within 30 days) and $p(R | E = 0, M = 0)$ (the probability of being readmitted before being lost to follow-up for any reason other than death). Notably, the latter is a censoring/missing data problem that of course is only identifiable under certain conditions. We don't need to cover the fourth case because $p(E = 1, M = 1) = 0$.
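Writing this out as a law of total probability over the three possible (mutually exclusive) follow-up outcomes, with $p(E = 1, M = 1) = 0$:

```latex
% Decomposition of the unconditional readmission probability, using the
% R / E / M notation defined in this thread and p(E=1, M=1) = 0:
\begin{aligned}
p(R) = {} & p(R \mid E = 1, M = 0)\, p(E = 1, M = 0) \\
     + {} & p(R \mid E = 0, M = 1)\, p(E = 0, M = 1) \\
     + {} & p(R \mid E = 0, M = 0)\, p(E = 0, M = 0)
\end{aligned}
```

Option 1 estimates only the first conditional factor; the other two conditionals are the ones described above.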
We don't always need all of those additional distributions, though. To see how our underlying prediction goal changes the task, let's contrast this with a setting where we aren't actually interested in the readmission itself but instead are one of those users who just want to create a predictor for acuity as "readmission or death". Now we don't care about the conditional $p(R | E = 0, M = 1)$ and are content with just estimating $p(R | M = 0, E)$ and $p(M | E)$. I consider this close to the current description of Option 1 in the .yaml, because in the config we currently say "Prediction of the patient's likelihood of mortality within 30 days of discharge is a separate task". To me, this description suggests that just estimating $p(E, M)$ is all that is needed to expand our task and model readmission in an "appropriately conditional manner".
Finally, if all we care about is whether the patient turns up at our doorstep (e.g., because our model is not used for medical decision making but solely for planning of staff workload), life gets even simpler and we just need $p(R)$ (Option 3).
So which one of the above is the right task to use for our benchmark? I personally think that either would be fine and may be appropriate in practice depending on the goal for our model. As long as we make very clear in the benchmark what we model and why, all is good. There is a but, though. I think the discussion so far has shown that there is a lot of nuance to Option 1 including issues of both competing risks and censoring. My primary fear is that this crucial nuance may very well get lost in a benchmarking paper with multiple tasks, which is why I think we may want to consider Option 3 with the explicit assumption that any (or at least most) readmissions within 30 days are likely to the same hospital (as per Tom's suggestion). Option 1 would work best if we can show the entire modelling task including all the individual probabilities and devote a lot of space to discussing the intricacies.
I also think that this may be an even bigger problem for some of the other tasks. If we look at the MI task, the current setup estimates the probability of having an MI in the next five years conditional on the patient surviving said MI. Shown in isolation, this is a very odd task. If we are worried about users misinterpreting the predictor (which is listed as a con in both Options 2&3), I am not sure we are making things better by adding this complexity.
However, I'd like to point out that if it is really the readmission itself that we care about, it is not enough to do this
You're right @prockenschaub, I spoke too carelessly. What I meant (but did not say), was that if we predict $f(x) = p(y|x, I=1)$, where $I=1$ is an aggregation of all our inclusion criteria asserting that the patient is included, and we model $p(I=1 | x)$, then we can use this second model to reweight our evaluation of $f(x)$ over the set of patients for whom $I=0$ via inverse propensity weighting. That said, I may be making an error there, as I'm definitely not an expert on causal stuff. Either way, we would definitely need to do more modeling to model $p(y | x)$ more generally.
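For concreteness, here is a minimal sketch of the standard inverse-propensity-weighting idea: included patients are upweighted by $1 / p(I = 1 \mid x)$ so the evaluation approximates the full population. Everything below (`predict_inclusion`, `predict_outcome`, the patient records) is a toy assumption, not a MEDS-DEV component:

```python
import random

random.seed(0)

def predict_inclusion(x) -> float:
    """Toy model of p(I = 1 | x): probability the patient meets inclusion criteria."""
    return 0.5 + 0.4 * x["severity"]

def predict_outcome(x) -> float:
    """Toy model of f(x) = p(y = 1 | x, I = 1), fit only on included patients."""
    return 0.1 + 0.6 * x["severity"]

# Reweight a metric computed over included patients so it estimates the metric
# over the full population: each included patient counts 1 / p(I=1 | x)
# (normalized, i.e. the Hajek form of the IPW estimator).
patients = [{"severity": random.random(), "included": True} for _ in range(5)]
num = sum((1.0 / predict_inclusion(p)) * predict_outcome(p)
          for p in patients if p["included"])
den = sum(1.0 / predict_inclusion(p) for p in patients if p["included"])
print(f"IPW-weighted mean predicted risk: {num / den:.3f}")
```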
I also think that this may be an even bigger problem for some of the other tasks. If we look at the MI task, the current setup estimates the probability of having an MI in the next five years conditional on the patient surviving said MI.
You're right. We should have separate issues for each task. I'm going to re-title this issue to focus specifically on readmission, and we can have more targeted discussions for other tasks in new issues. Here's a task for MI: https://github.com/mmcdermott/MEDS-DEV/issues/10
i think it makes more sense to (1) assume that if a patient returns to hospital then they are returning to the same hospital -- <cite>@tompollard</cite>

we may want to consider Option 3 with the explicit assumption that any (or at least most) readmissions within 30 days are likely to the same hospital (as per Tom's suggestion) -- <cite>@prockenschaub</cite>
I looked into this a bit. Data on how frequently this happens is sparse, but the rate is not insignificant. The only study I found pegs it at around 20%, though this will obviously vary widely. Regardless, I don't think we can assume the fraction that get readmitted to a different hospital is negligible.
That said, it is not clear that our criteria here can actually catch this even when someone is readmitted to a different hospital, because going to a different hospital for your readmission is very different from never coming back to the health system of the original admission.
So, some overall thoughts.
Right now, a lot of the confusion and apparent disagreement in this discussion I think stems from us having different views in mind of what "readmission risk prediction" means -- or, rather, what we are actually trying to predict, meaning explicitly what operationalized value of information we expect such a prediction to offer.
This is a general, systemic problem in the field, I think, but one that we are well poised to try to target in the long term. This discussion is already diving more deeply into important questions and aspects of this task than I have seen in many more traditional resources in ML for health.
In particular, my actionable take away from this is that while simplicity is extremely valuable, I think we should prioritize first providing task definitions that, insofar as we are able within the confines of ACES' syntax and the datasets we have access to, reflect in part the complexity of their use cases and clearly document and describe why the tasks are configured the way they are. While this will make our tasks harder to understand, it will make it easier for our community to iterate on these tasks and narrow them down to the tasks that truly matter in this space. Obviously this is a balancing act, but I think simply defaulting to the simplest possible cohort would be a mistake.
The HRRP aims to reduce unplanned and avoidable readmissions by applying financial penalties to hospitals that have higher-than-average readmission rates for patients with certain conditions. These financial incentives are, to the best of my (admittedly relatively limited) knowledge, the main driver behind the readmission risk prediction task being of such wide interest in the U.S. in particular. Other countries also have similar readmission programs, though the exact criteria differ. Similar financial penalties do not currently exist, to the best of my knowledge, for mortality within 30 days after discharge, but other metrics do capture and penalize unplanned mortality through more indirect measures.
Here is a ChatGPT summary of unknown veracity of these programs: https://chatgpt.com/share/15f740dc-ba52-47c6-a059-d5e1a8043eac
If we wanted to treat the HRRP (or analogs of the HRRP) as our guiding principle, there are two larger areas of change we should consider:
Under most of these programs, a readmission is only viable for penalization if it is thought to be both unplanned and avoidable. These are often codified in that
We should consider including both of these kinds of conditions in our readmission task (or in a variant of a readmission task).
If we imagine that this model is used to delay discharge for patients who will have a subsequent readmission in a manner that causes a financial penalty, we can break patient populations down into a few groups to examine the possible failure modes. To characterize these groups, imagine that each patient has a known time-to-mortality as of discharge, assuming no subsequent admission, given by a random variable $t_M$ (modeled as random so that it can reflect uncertainty and/or true variance in how long patients will live after discharge).
To expand on Thought 2 above, how do those possible changes reflect in what we might want in our readmission task?
The desire to count only admissions that are not elective in nature may reveal a limitation in ACES' configuration. @justin13601 and I will consider. If this is expressible (I do not think it currently is, but likely would be were https://github.com/justin13601/ACES/issues/54 solved), I would advocate we include it, but for now given I do not think it is expressible we should consider all admissions.
Based on the analysis above, I think the following assignment of labels to settings would best align with the hospital use case (though it may not be expressible simply with ACES):

| Description | $R$ | $E$ | $M$ | Include/Exclude | Training label $y$ |
|---|---|---|---|---|---|
| Patients w/ data after 30 days are included & labeled | * | 1 | 0 | Include | $R$ |
| Patients who die within 30 days w/o admission are excluded | 0 | 0 | 1 | Exclude | N/A |
| Patients who are readmitted & die w/in 30 days get 1 | 1 | * | 1 | Include | 1 |
| Patients w/o data or death > 30d out are excluded | 0 | 0 | 0 | Exclude | N/A |
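As a minimal sketch (plain Python, explicitly not an ACES config, since the thread notes this may not yet be expressible in ACES), the inclusion/label rule above can be written as a function over the $R$/$E$/$M$ variables; any combination the table does not cover defaults to exclusion here:

```python
def assign_label(R: int, E: int, M: int):
    """Return the training label for a patient, or None if excluded.

    Implements the assignment table above; combinations the table does not
    cover default to exclusion.
    """
    if M == 1 and R == 1:   # readmitted & died within 30 days -> labeled 1
        return 1
    if E == 1 and M == 0:   # data observed after 30 days -> label is R itself
        return R
    return None             # all other rows in the table are excluded

# The four table rows:
assert assign_label(R=1, E=1, M=0) == 1     # data after 30d, readmitted
assert assign_label(R=0, E=1, M=0) == 0     # data after 30d, not readmitted
assert assign_label(R=0, E=0, M=1) is None  # died within 30d w/o readmission
assert assign_label(R=1, E=0, M=1) == 1     # readmitted, then died within 30d
assert assign_label(R=0, E=0, M=0) is None  # no data, no death past 30d
print("all table rows reproduced")
```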
I think that perhaps this would be best because:
However, I don't think at first glance that this configuration is capturable in ACES without solving https://github.com/justin13601/ACES/issues/54.
This is not because I think it is the best configuration, but rather for a more tactical reason. In particular, I think it will be helpful if we have at least one task that has some justifiable, but complex and slightly non-standard inclusion/exclusion criteria, that way we can emphasize to readers and users that a big part of this benchmark is defining and improving task definitions as a community.

Of the tasks in our set currently, this task has the lowest potential for dangerous errors due to misconfiguration, as it would likely be used operationally more than clinically; so of the tasks to include more complexity in, even if we're not fully sure of or fully able to express the best version of the criteria, this task is the one we should pick. It is also the best choice tactically, because it is a common task, so if we can convincingly show that it is actually much more complex than we typically consider, that will be the most valuable for the community.

If we take that as a goal, then Option 3 is eliminated for being too simple. Option 2 is not viable within ACES, nor is the new option I proposed above. Frankly, we also don't have the right expertise to really decide which of the various options is truly "most aligned" with how hospitals are likely to care about this. Given all that, until we can both get a true expert on this task to weigh in and express more complex relationships with ACES, our capability to produce the "right" cohort is limited, and Option 1 is the only choice that meets our needs: it is both not the simplest possible version of this task and a reasonable expression of this task in a way that aligns with hospital needs.
That all being said, practically for now we should just decide and stick with it, so I'm going to post a poll on slack between option 1 and 3 so we can just vote and be done for the ML4H push, and we can relegate further improvements to after the benchmark. At this point I'd be fine with either option, and see clear strengths and weaknesses to both.
I think it will be helpful if we have at least one task that has some justifiable, but complex and slightly non-standard inclusion/exclusion criteria, that way we can emphasize to readers and users that a big part of this benchmark is defining and improving task definitions as a community.
If we take that as a goal, then Option 3 is eliminated for being too simple.
I agree with these points. I think if we devote some space to it in the manuscript, highlighting the complexity of even common tasks like readmission prediction is a strength. The slack poll shows a clear preference for Option 1 anyway :)
I also like the real-world use case with HRRP - something to revisit once the initial benchmark push is done.
For example, consider the `readmission/general_hospital/30d` task.

Tagging for comments: @prockenschaub @shalmalijoshi @justin13601 @tompollard @Jwoo5
As of the commit referenced in the link above, this task currently excludes all patients who do not have one data element after 30 days. That may or may not be advisable.
Let's define $x$ to be a patient's data as of a prediction time, $R$ to be the label of whether or not there is an admission event within 30 days ($R=1$ if so, $R=0$ otherwise), $E$ to be a binary variable indicating whether or not we have data after 30 days ($E=1$ if there is data after 30 days, $E=0$ otherwise), and $M$ to be a binary variable indicating whether or not the patient dies within 30 days ($M=1$ if they do, $M=0$ otherwise).
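As a hedged illustration (the helper function and its arguments below are made up for this sketch, not part of any MEDS schema), the three binary variables could be derived from raw timestamps roughly as follows:

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

WINDOW = timedelta(days=30)

def derive_rem(
    discharge: datetime,
    event_times: List[datetime],       # timestamps of any data for the patient
    admission_times: List[datetime],   # timestamps of (re)admission events
    death_time: Optional[datetime],    # None if no recorded death
) -> Tuple[int, int, int]:
    """Compute (R, E, M) for one patient relative to a discharge time."""
    horizon = discharge + WINDOW
    R = int(any(discharge < t <= horizon for t in admission_times))  # readmitted within 30d
    E = int(any(t > horizon for t in event_times))                   # any data after 30d
    M = int(death_time is not None and death_time <= horizon)        # died within 30d
    return R, E, M

d = datetime(2024, 1, 1)
print(derive_rem(d, [d + timedelta(days=45)], [d + timedelta(days=10)], None))  # (1, 1, 0)
```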
Note that our constraints here are that we are limited to only a binary classification task. We can't, within this current scope, change frameworks to a survival analysis or something.
The question here is whether to include, exclude, and what label to assign for training this task. There are a few options:
Option 1: Only predict for patients with data observed more than 30 days out
In this setting, we combine the notions of "death" and "loss of follow up" and try to predict readmission only for the population of patients who will still have data after 30 days. One can use this in a clinical pipeline appropriately by also including predictors for a patient's likelihood to leave the dataset within 30 days, either due to death or lack of follow up (either jointly or via individual predictors), giving a nuanced picture of the patient's state (e.g., this patient is likely to die within 30 days, vs. this patient is likely to still have data for the full next 30 days but within that period to need a readmission).
Pros:
Cons:
Option 2: Predict on all patients where ground truth is known; omit patients where it is not.
In this setting, whenever we know a definite answer, we include the patient. If the patient is readmitted within 30 days, they get a 1. If they die within 30 days before readmission, they get a 0. If they have the full 30 days observed without a readmission, they get a 0. If they don't meet any of those criteria, they are excluded.
Pros:
Cons:
Option 3: Include all patients, assume data is complete.
In this option, we don't exclude anybody on the basis of future information (either due to death or loss of follow up). If $R=1$, we label $y=1$. If $R=0$, we label $y=0$.
Pros:
Cons:
Option 4: ???
Other suggestions or options are welcome.