jambo6 / sepsis_competition_physionet_2019

Code (rewritten) for our winning submission to the PhysioNet 2019 sepsis challenge. Team name: Can I get your signature?

Question: Why is the model not generalizing better? #1

Open harshblue opened 4 years ago

harshblue commented 4 years ago

I am interested in predicting/diagnosing Sepsis and came across your winning submission to the Physionet 2019 Challenge. Congrats!

I had a chance to go through the results across Hospitals A, B and C and observed that the model performs poorly on Hospital C [where no training data was provided]. Do you have any hypothesis on why that might be the case, and how the model could have generalized better?

Also if there was some training data available from Hospital C, do you think techniques like transfer learning could have been applied to generalize better?

Thanks for your input. -Reddy

jambo6 commented 4 years ago

Dear Sam,

Thanks a lot for the message!

Yes, the performance on Hospital C is a big shame and really brings into question the feasibility of training algorithms on data from one hospital and deploying them in another. I have written a longer report, which has since been extended for publication in Critical Care Medicine, where I go into a bit more detail about this. In particular I noted:

""" Finally, as previously noted, the models inability to maintain high prediction per- formance on hospital C indicates a limitation in generalising predictions to hospital data-sets on which the algorithm was not trained. There are a number of potential reasons as to why this might be the case: For example, the models were trained on variables that were highly dependent on physician decision-making processes and thus local hospital policies. A measurement or assessment that is encouraged to be taken by a doctor in one hospital may not comport with the practices of another hospital. As such, any model that is trained of data-sets from healthcare systems in which a gold standard for assessment procedures and measurements does not exist, will inherently adopt some of the biases of the underlying training set. One remedy to this limitation would be to only train on variables that are sampled at pre-determined times of a patients stay and are independent of a doctor decision- Finally, as previously noted, the models inability to maintain high prediction per- formance on hospital C indicates a limitation in generalising predictions to hospital data-sets on which the algorithm was not trained. There are a number of potential reasons as to why this might be the case: For example, the models were trained on variables that were highly dependent on physician decision-making processes and thus local hospital policies. A measurement or assessment that is encouraged to be taken by a doctor in one hospital may not comport with the practices of another hospital. As such, any model that is trained of data-sets from healthcare systems in which a gold standard for assessment procedures and measurements does not exist, will inherently adopt some of the biases of the underlying training set. One remedy to this limitation would be to only train on variables that are sampled at pre-determined times of a patients stay and are independent of a doctor decision- Finally, as previously noted, the models inability to maintain high prediction performance on hospital C indicates a limitation in generalising predictions to hospital data-sets on which the algorithm was not trained. There are a number of potential reasons as to why this might be the case: For example, the models were trained on variables that were highly dependent on physician decision-making processes and thus local hospital policies. A measurement or assessment that is encouraged to be taken by a doctor in one hospital may not comport with the practices of another hospital. As such, any model that is trained of data-sets from healthcare systems in which a gold standard for assessment procedures and measurements does not exist, will inherently adopt some of the biases of the underlying training set. One remedy to this limitation would be to only train on variables that are sampled at pre-determined times of a patients stay and are independent of a doctor decision-making process. One could then expect to more safely transfer an algorithm from across multiple hospital systems. Any remaining uncontrolled variables between hospitals such as patient demographics, socio-economic factors, etc would, of course, persist as sources of error. """

Some of the most important variables for improving the challenge utility score were the frequencies at which the laboratory measurements were taken. Our hypothesis is that this is intrinsically related to the doctor's opinion of the patient's health: the more worried the doctor is, the more frequently measurements will be taken. Which measurements are taken, and the rate at which this happens, will be highly variable between hospitals. Suppose, for example, that hospitals A and B always require a PaCO2 measurement if there is suspicion of sepsis, but hospital C does not (obviously this is not literally the case, but it serves as an extreme illustration). Then we have trained on data where sepsis always occurs after a PaCO2 measurement, whereas in our new hospital this is rarely the case. Incidentally, this is another issue with the challenge: the positive labels are themselves doctor-defined, since they require blood cultures to be taken, which is administered by a doctor, and this only exacerbates the whole hospital-policy issue.
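To make this concrete, here is a rough sketch of what such sampling-frequency features might look like. The column names, function name, and data layout are assumptions for illustration only, not the code used in this repository:

```python
import pandas as pd

# Hypothetical sketch: in the challenge data, laboratory columns are NaN except
# at the hours a sample was actually drawn, so counting non-NaN entries per
# patient gives a crude "measurement frequency" signal of the kind described above.

def add_measurement_frequency(df: pd.DataFrame, lab_cols, id_col="patient_id"):
    """Append cumulative measurement counts and per-hour rates for each lab column."""
    out = df.copy()
    hours_so_far = out.groupby(id_col).cumcount() + 1  # 1-indexed hour of stay
    for col in lab_cols:
        measured = out[col].notna().astype(int)
        out[f"{col}_count"] = measured.groupby(out[id_col]).cumsum()
        out[f"{col}_rate"] = out[f"{col}_count"] / hours_so_far
    return out

# Example usage (assumed column names):
# frames = add_measurement_frequency(frames, ["PaCO2", "Lactate"])
```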

We have never been given access to examine the data from hospital C, so all of this is just hypothesizing. Hopefully, someday it will be released so we can examine further how its characteristics differ. Notably, if I train on hospital A and test on hospital B, the score on B gets worse, but nowhere near as significantly as it does for hospital C.

To generalise better, training only on variables that are explicitly not influenced by the doctor would be very important. Features like the number of samples taken, or the rate at which samples are taken, should not be included. Taking care to remove the influence of doctor actions, and anything else that might be shaped by hospital policy, seems crucial. However, I know almost nothing about how policy differs between ICUs, so it would definitely be interesting to get an expert's opinion on this.
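For instance, a minimal sketch of the kind of filter I mean, assuming the standard challenge vital-sign column names (illustrative only, not something from this repository):

```python
# Keep only variables recorded at fixed intervals regardless of clinical
# suspicion; drop labs and any engineered features that encode how often a
# doctor ordered a test. The vital-sign names below are assumptions.
ROUTINE_VITALS = {"HR", "O2Sat", "Temp", "SBP", "MAP", "DBP", "Resp"}

def policy_neutral_columns(columns):
    """Return the subset of columns unlikely to encode doctor/hospital policy."""
    return [c for c in columns if c in ROUTINE_VITALS]
```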

One final point is that the way the utility score is defined depends significantly on how many cases of sepsis there are in the dataset. For example, if there are no cases of sepsis in the dataset, the maximum score you can get is zero. So if the distribution of sepsis cases was significantly different in hospital C, this could have a large impact. Similarly, it turned out that the likelihood of developing sepsis increased significantly the longer a patient was in the ICU. If sepsis developed early in the stay it was hard to detect; if it developed more than 48 hours into the stay it was much easier (likely because people who stay over two days are much sicker). So if the distribution of cases along the time axis was also significantly different in hospital C, this could have big implications for the score as well.
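As a hedged back-of-the-envelope illustration of that first point (the real challenge utility ramps over a window around sepsis onset and is then normalised, so treat the numbers below as assumptions rather than the official scoring code):

```python
# Non-septic timesteps can contribute at most 0 utility (a false alarm is
# penalised, a correct negative earns nothing), so the achievable ceiling
# scales with how many septic timesteps the cohort contains.
U_TN, MAX_TP_REWARD = 0.0, 1.0  # assumed values for illustration

def utility_ceiling(non_septic_hours, septic_hours):
    return non_septic_hours * U_TN + septic_hours * MAX_TP_REWARD

print(utility_ceiling(10_000, 0))    # 0.0   -> a sepsis-free cohort caps the score at zero
print(utility_ceiling(10_000, 200))  # 200.0 -> positive utility only comes from septic hours
```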

I apologise, this has turned into a longer email than I anticipated. I hope it is useful though!

Best wishes, James

From: "Sam Red" notifications@github.com To: "jambo6/sepsis_competition_physionet_2019" sepsis_competition_physionet_2019@noreply.github.com Cc: "Subscribed" subscribed@noreply.github.com Sent: Thursday, 4 June, 2020 19:12:55 Subject: [jambo6/sepsis_competition_physionet_2019] Question: Why is the model not generalizing better? (#1)

I am interested in predicting/diagnosing Sepsis and came across your winning submission to the Physionet 2019 Challenge. Congrats!

I had a chance to go through results across Hospital A, B and C and observe that model is performing poorly on Hospital C [where no training data was provided]. Do you have any hypothesis on why that might be the case, and how the model could have generalized better?

Also if there was some training data available from Hospital C, do you think techniques like transfer learning could have been applied to generalize better?

Thanks for your input. -Reddy

— You are receiving this because you are subscribed to this thread. Reply to this email directly, [ https://github.com/jambo6/sepsis_competition_physionet_2019/issues/1 | view it on GitHub ] , or [ https://github.com/notifications/unsubscribe-auth/AHYJXDJ6UTRNSJRGQTPPOTTRU7P2PANCNFSM4NS3WKMQ | unsubscribe ] .

harshblue commented 4 years ago

@jambo6 Thanks James for your time and valuable input. Aspects like doctor behaviour and the likelihood of sepsis across hospitals could have made the models less generalizable. I will attempt to build a model using variables not influenced by doctors to check its performance, as well as its generalizability across hospitals A and B. I will share my findings if I uncover anything interesting.