sunlabuiuc / PyHealth

A Deep Learning Python Toolkit for Healthcare Applications.
https://pyhealth.readthedocs.io
MIT License
994 stars 212 forks

Question about MIMIC-iii dataset #283

Closed zjs123 closed 7 months ago

zjs123 commented 7 months ago

Hi, I found that in the MoleRec paper, the processed MIMIC-III dataset has 6,350 patients and 14,995 visits. However, I only got 5,449 patients and 14,141 visits when using PyHealth to process this dataset. Here is my screenshot.

image

ycq091044 commented 7 months ago

Hello @zjs123, thanks for your question.

First, judging by the patient count of 6,350, I assume that the MoleRec paper uses this GitHub repo to process the MIMIC-III data: https://github.com/ycq091044/SafeDrug.

Second, drug_recommendation_mimic3_fn in the PyHealth package differs slightly from the data processing script in https://github.com/ycq091044/SafeDrug. The major difference is at https://github.com/sunlabuiuc/PyHealth/blob/master/pyhealth/tasks/drug_recommendation.py#L53.

image

In the SafeDrug repo, a visit is included if it has either diagnoses or procedures. In PyHealth, a visit is included only if it has both diagnoses and procedures, so the requirement here is stricter. There might be other minor differences.
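
To make the either/both distinction concrete, here is a minimal sketch on toy data. The visit records, field names, and the two helper functions are hypothetical and only illustrate the two inclusion rules described above; they are not the actual code from either repo.

```python
# Toy visits for illustration only; these field names are assumptions,
# not the schema used by PyHealth or the SafeDrug script.
visits = [
    {"visit_id": "v1", "diagnoses": ["4019"], "procedures": ["3893"]},
    {"visit_id": "v2", "diagnoses": ["25000"], "procedures": []},
    {"visit_id": "v3", "diagnoses": [], "procedures": ["9604"]},
    {"visit_id": "v4", "diagnoses": [], "procedures": []},
]


def keep_if_either(visit):
    """Looser rule: keep the visit if it has diagnoses OR procedures."""
    return len(visit["diagnoses"]) > 0 or len(visit["procedures"]) > 0


def keep_if_both(visit):
    """Stricter rule: keep the visit only if it has diagnoses AND procedures."""
    return len(visit["diagnoses"]) > 0 and len(visit["procedures"]) > 0


either_ids = [v["visit_id"] for v in visits if keep_if_either(v)]
both_ids = [v["visit_id"] for v in visits if keep_if_both(v)]

print(either_ids)  # ['v1', 'v2', 'v3']
print(both_ids)    # ['v1']
```

Every visit kept by the stricter rule is also kept by the looser one, so the stricter rule can only yield the same number of visits or fewer.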

P.S. If you look at the differences, the patient gap is 6,350 - 5,449 = 901 patients and the visit gap is 14,995 - 14,141 = 854 visits. It is interesting that the patient gap is larger than the visit gap (I would have expected the other way around). Anyway, it suggests that the missing patients mostly have only one visit, and such patients do not help in training sequential models.
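
For reference, the arithmetic can be checked with a few lines of Python. The four counts come from this thread; the variable names are only illustrative.

```python
# Counts quoted in this thread.
paper_patients, paper_visits = 6350, 14995        # reported in the MoleRec paper
pyhealth_patients, pyhealth_visits = 5449, 14141  # obtained with PyHealth

patient_diff = paper_patients - pyhealth_patients
visit_diff = paper_visits - pyhealth_visits
print(patient_diff, visit_diff)  # 901 854

# Average visits per patient in each processed dataset.
paper_ratio = paper_visits / paper_patients
pyhealth_ratio = pyhealth_visits / pyhealth_patients
print(f"{paper_ratio:.2f} vs {pyhealth_ratio:.2f} visits per patient")
```

The PyHealth output has more visits per patient on average, which fits the idea that the patients it drops have few visits each.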