Reproducing the eICU experiment from section 5 of the manuscript

ghost commented 7 years ago

Hi,

I'm trying to train the RGAN to reproduce your results from Section 5, Table 2 of the manuscript. I've got the eICU dataset, but I'm not sure how you pre-processed/reshaped it. To be honest, I'm kind of lost.

Would it be possible to share the Linux command-line instructions and the directory structure needed to get the code training? Pretty please with cherries on top.

Fan Q

corcra commented 7 years ago

Hi there!

The processing pipeline is pretty much this:

1. starting from the CSVs you get from eICU, turn them into hdf5 files
   - I think @XinruiLyu did this in our group, since we use eICU for multiple projects
   - I think you can probably skip this step and load from the CSVs directly, but this call: https://github.com/ratschlab/RGAN/blob/master/data_utils.py#L609 will need updating, probably to read it as a CSV and then select out the parts you want (if you have memory)
2. In several places it looks for a list of patient IDs in eICU (pids) - we do this because we access the hdf5 using the pid as a key, and have to iterate through that. If you're loading from CSV and pulling the whole thing into memory, you don't really need it, but it's probably simplest to just do it anyway - I'd recommend using something like sed '1d' vitalPeriodic.csv | cut -f 1 -d ',' | sort -u > pids.txt to grab that list of patient IDs, but whatever works!
3. In data_utils.py, you've got resampled_eICU, which should do most of the heavy lifting, it calls these main functions:
   - generate_eICU_resampled_patients: resample patients to measurements every 15 minutes (by default, this is an option), in the variables of interest
   - get_cohort_of_complete_downsampled_patients: subset the output of the above to patients missing no data (not sure why I made these separate functions!)
   - ... and does some other preprocessing steps like filtering and so on
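For anyone who prefers to build the pids list in Python rather than with the shell one-liner, here is a minimal sketch. It is not code from the RGAN repository: the helper names collect_patient_ids and write_pids are hypothetical, and it assumes the CSV has a column named patientunitstayid. Note that it selects the column by name rather than by position, and sorts numerically rather than lexicographically as sort -u does.

```python
import pandas as pd


def collect_patient_ids(csv_path, id_column="patientunitstayid", chunksize=1_000_000):
    """Return the sorted unique values of id_column, reading the CSV in chunks.

    Only the ID column is loaded, so memory use stays small even for a
    large file. The default column name is an assumption.
    """
    pids = set()
    for chunk in pd.read_csv(csv_path, usecols=[id_column], chunksize=chunksize):
        pids.update(chunk[id_column].dropna().unique().tolist())
    return sorted(pids)


def write_pids(pids, out_path):
    """Write one patient ID per line, like the pids.txt from the shell version."""
    with open(out_path, "w") as handle:
        for pid in pids:
            handle.write(f"{pid}\n")
```

Usage would be along the lines of write_pids(collect_patient_ids('vitalPeriodic.csv'), 'pids.txt').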

Regarding the directory structure, for this part of the code there's nothing very specific about it - there's an eICU_dir variable which gives the location of the h5 files (e.g. pat_df = pd.read_hdf(eICU_dir + '/vitalPeriodic.h5', ...)), and the other functions will save their intermediate files in the folder they're run from, I think.

If you hit a particular snag/error I can try to give you more specific information, otherwise I'll just end up describing the whole script unfortunately.
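To make the two main steps above more concrete (resampling to 15-minute intervals, then keeping only patients with no missing data), here is a rough sketch of the idea in pandas. This is not the code in data_utils.py. The column names pid, offset_min and bin are hypothetical, taking the median within each bin is an assumption, and so is restricting the completeness check to the first n_bins bins.

```python
import pandas as pd


def resample_to_bins(df, variables, id_col="pid", time_col="offset_min", bin_minutes=15):
    """Aggregate irregular measurements into fixed-width time bins per patient."""
    binned = df.copy()
    # Start of the bin each measurement falls into, e.g. 0, 15, 30, ...
    binned["bin"] = (binned[time_col] // bin_minutes) * bin_minutes
    # Median per patient and bin (the choice of aggregation is an assumption).
    return binned.groupby([id_col, "bin"])[variables].median().reset_index()


def complete_patients(resampled, variables, id_col="pid", n_bins=4, bin_minutes=15):
    """Keep patients with a value for every variable in each of the first n_bins bins."""
    expected = [i * bin_minutes for i in range(n_bins)]
    kept = []
    for pid, group in resampled.groupby(id_col):
        # Reindexing on the expected bins turns absent bins into NaN rows.
        window = group.set_index("bin")[variables].reindex(expected)
        if not window.isna().any().any():
            kept.append(pid)
    mask = resampled[id_col].isin(kept) & resampled["bin"].isin(expected)
    return resampled[mask].reset_index(drop=True)
```

Keeping the two steps separate makes it easy to inspect how many patients are lost to missing data before committing to a cohort.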

ghost commented 7 years ago

Hey Hello!

and Fanks for the great help - I think I get what data_utils.py does.

So it looks like you only need to convert patient.csv and vitalPeriodic.csv to hdf5. That's actually the bit that usually trips me up, as my data conversion kung fu is out of practice.

I guess I should post here the steps in case anyone else wants to try doing the eICU experiment?

Here's what I've got so far,

import numpy as np
import pandas as pd
import h5py

patients_filename_hdf5 = '/home/ajay/PythonProjects/eicu-code-master/Data/patient.h5'
patients_filename_csv  = '/home/ajay/PythonProjects/eicu-code-master/Data/patient.csv'

# Load csv into memory as a pandas Dataframe
patients = pd.read_csv(patients_filename_csv)

# Have a look at the columns of the Dataframe
patients.head(5)

# convert it to a dictionary
patients_dict = patients.to_dict()

# have a look at the keys
patients_dict.keys()

# should be the same as the column names of the Dataframe
patients.columns.values

# This creates the HDF5 file object we use to work on the file, in write ('w') mode.
h5f = h5py.File(patients_filename_hdf5, 'w')

# Now we add each of the arrays in the dictionary to the hdf file 
# see - https://stackoverflow.com/questions/37214482/saving-with-h5py-arrays-of-different-sizes

for k,v in patients_dict.items():
    print(k)
    h5f.create_dataset(k,data=v)

Which returns my old friend,

patientunitstayid
TypeError: Object dtype dtype('O') has no native HDF5 equivalent
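A likely cause of this error: patients.to_dict() returns a nested dict for each column rather than an array, and h5py cannot map a Python object (or a text column with the object dtype) to a native HDF5 type. One possible workaround, sketched below, is to pass NumPy arrays instead and to encode text columns as fixed-length byte strings. The helper names to_hdf5_compatible and write_columns are hypothetical, and replacing missing text values with an empty string is an assumption you may want to change.

```python
import numpy as np
import pandas as pd


def to_hdf5_compatible(series):
    """Return a NumPy array whose dtype h5py can store natively."""
    if pd.api.types.is_numeric_dtype(series):
        return series.to_numpy()
    # Text columns: replace missing values with '' (an assumption),
    # then encode to fixed-length UTF-8 byte strings.
    text = series.fillna("").astype(str).to_numpy(dtype=str)
    return np.char.encode(text, "utf-8")


def write_columns(h5f, df):
    """Write each DataFrame column as its own dataset.

    h5f is expected to be an open, writable h5py.File (or h5py group).
    """
    for name in df.columns:
        h5f.create_dataset(name, data=to_hdf5_compatible(df[name]))
```

With the variables from the script above, this would be called as write_columns(h5f, patients) in place of the loop over patients_dict. Note that this per-column layout is not necessarily what pd.read_hdf in data_utils.py expects, since pandas reads files written through its own HDF5 interface.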

I've also tried using this script I found online - csv_to_hdf5.py - but I got a Segmentation fault (core dumped) when I tried to convert the patient table, and ran out of memory for the vitalPeriodic table.

Any ideas @XinruiLyu ?

As you said,

> I think you can probably skip this step and load from the CSVs directly

~~So I'll try that~~ ... bad idea ... not enough memory.
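If memory is the limiting factor, one option is to convert the CSV to HDF5 in chunks through pandas, so the whole table is never held in memory at once. The sketch below is an assumption about how this could be done, not the conversion the authors used: the function names, the key argument and the choice of columns are all hypothetical, and writing requires the PyTables package. Forcing the value columns to float64 keeps the schema identical across chunks, which appending to an HDF5 table needs.

```python
import pandas as pd


def iter_typed_chunks(csv_path, id_column, value_columns, chunksize=500_000):
    """Yield DataFrame chunks containing only the requested columns.

    The ID column is read as int64 and every value column as float64,
    so each chunk has the same dtypes even when some values are missing.
    """
    dtypes = {id_column: "int64", **{col: "float64" for col in value_columns}}
    with pd.read_csv(
        csv_path,
        usecols=[id_column, *value_columns],
        dtype=dtypes,
        chunksize=chunksize,
    ) as reader:
        yield from reader


def csv_to_hdf5_in_chunks(csv_path, h5_path, key, id_column, value_columns, chunksize=500_000):
    """Append the CSV to an HDF5 table chunk by chunk; return the rows written."""
    rows = 0
    with pd.HDFStore(h5_path, mode="w") as store:
        for chunk in iter_typed_chunks(csv_path, id_column, value_columns, chunksize):
            # data_columns makes the ID column queryable with where=...
            store.append(key, chunk, format="table", data_columns=[id_column])
            rows += len(chunk)
    return rows
```

After conversion, a single patient could then be read with something like pd.read_hdf(h5_path, key, where='patientunitstayid == 141168') (the ID here is just an illustrative value), which loads only the matching rows.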

sbagchi12 commented 7 years ago

Could you please share the eICU data files? I am not able to locate them in the repository. Also, how much time does it take to train on the MNIST data on a CPU?

ratsch commented 7 years ago

We cannot share the eICU data. It’s not permitted by the data use agreement. Data access can be obtained here: http://eicu-crd.mit.edu/gettingstarted/access/


kazemSafari commented 6 years ago

@AjayTalati I downloaded and unzipped the eICU dataset. Then I tried using your script and got the same error. @AjayTalati, @corcra, @ratsch and @XinruiLyu: is there a fix for this issue? Thank you in advance.

data-boss commented 6 years ago

Hi,

I'm also trying to train the RGAN to reproduce the results on the eICU data. I don't have the eICU dataset yet. Could you give me a brief introduction to the structure of the dataset?
What are the 7 labels in the manuscript?
Thank you.

Contact me: zhangxuewen2018@gmail.com
