ratschlab / HIRID-ICU-Benchmark

Repository for the HiRID ICU Benchmark (HiB) project
MIT License

Index of specific clinical variables and subsets. #28

Closed xinformatics closed 8 months ago

xinformatics commented 10 months ago

Hi, Thank you for providing access to the dataset. I have two questions.

1: Could you please tell me how I can get the index of a specific clinical variable? For example, I observed that there are 231 clinical variables, so how can I find the index of Heart Rate or Cardiac Output in the dataset?

2: Instead of all 231 features, say I would like to train the model on a subset of features; what's the best way to go about it?

Thank you.

hugoych commented 10 months ago

Hi, thank you for your questions.

Variables index

Regarding the index of the different variables, it depends on which stage of the pipeline you are working with.

For the final ML stage, because we use h5 tables, you can access the column names, in the same order as in the data loader, here: https://github.com/ratschlab/HIRID-ICU-Benchmark/blob/bee770094bf8389920bc09823895b87e09a563dd/icu_benchmarks/data/loader.py#L136C9-L136C103

For the previous stages, the data frame header provides the names. However, before the common stage, the names are metavariable ids. You can find the matching in the preprocessing/resources/varref.tsv table.
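As a sketch of the lookup once the column names are in hand (the list below is a toy stand-in, not the real ordering; in the actual loader the names come from the h5 table linked above):

```python
# Toy stand-in for the column names stored alongside the data in the h5 table.
columns = ["Age", "Heart rate", "Cardiac output", "Lactate"]

def variable_index(columns, name):
    """Return the position of a clinical variable along the feature axis."""
    return list(columns).index(name)

hr_idx = variable_index(columns, "Heart rate")
print(hr_idx)  # -> 1
```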

Subset of variables training

We haven't implemented this feature; however, you can easily adapt the current codebase to do so.

The easiest way is to add a parameter subset (the list of column indices you want) to the data loader and only load the desired columns here: https://github.com/ratschlab/HIRID-ICU-Benchmark/blob/bee770094bf8389920bc09823895b87e09a563dd/icu_benchmarks/data/loader.py#L153C1-L153C101

with:

```python
self.lookup_table = {split: self.data_h5['data'][split][:][:, self.subset] for split in self.splits}
```

You can also provide column names instead and use self.columns to recover indexes.
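As an illustrative sketch of both steps, with in-memory NumPy arrays standing in for the HDF5 datasets (`data`, `columns`, `subset`, and `wanted` are placeholder names, not the loader's actual attributes):

```python
import numpy as np

splits = ("train", "val", "test")
columns = np.array(["Age", "Heart rate", "Cardiac output", "Lactate"])

# Toy stand-in for self.data_h5['data'][split]: (timesteps, features) arrays.
data = {s: np.arange(20.0).reshape(5, 4) for s in splits}

# Recover the indices from column names, then keep only those columns.
wanted = ["Heart rate", "Lactate"]
subset = [int(np.where(columns == name)[0][0]) for name in wanted]
lookup_table = {s: data[s][:, subset] for s in splits}

print(lookup_table["train"].shape)  # -> (5, 2)
```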

Don't forget to adapt your config to have the correct input dimension.

Hope this answers your questions!

xinformatics commented 8 months ago

Hi @hugoych, Thank you for your reply. It helped me a lot. I have one more question:

1: For the variable indexes corresponding to column names, I explored the data and found that feature number 65 (0-indexed) corresponds to heart rate. But when I plotted the variable, it did not look like heart rate; it varies over a negative-to-positive range. Could you please tell me whether some transformation is applied to the dataset before the ML stage?

Thank you again

hugoych commented 8 months ago

Hi again,

Indeed, at the ML stage, the data is scaled (standard scaling in this case), hence the positive and negative values. If you want to plot some variables, I suggest using the data in the common stage, which is 5-minute gridded data before imputation and scaling.
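For illustration, standard scaling (and its inverse, should you want to view ML-stage values in physical units) works as follows; the heart-rate numbers are made up:

```python
import numpy as np

# Raw heart-rate values in bpm (made-up numbers).
hr = np.array([60.0, 72.0, 85.0, 95.0, 110.0])

# Standard scaling: zero mean, unit variance per feature -> the scaled
# values span a negative-to-positive range around 0.
mean, std = hr.mean(), hr.std()
scaled = (hr - mean) / std

# Inverting the transform recovers the physical values.
recovered = scaled * std + mean
print(np.allclose(recovered, hr))  # -> True
```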

xinformatics commented 8 months ago

Thank you again. Also, how does dynamic prediction work? How is the data passed into the neural network for dynamic prediction, and how does it differ from the ICU mortality and patient phenotyping tasks? I need help understanding this because the number of samples has gone up multifold for dynamic prediction, but the data loader still outputs (batch, 2016, 231)-dimensional input.

hugoych commented 8 months ago

For dynamic tasks, which are usually monitoring tasks, you make predictions at multiple points during a stay, whereas for mortality or phenotyping, you make one prediction per stay.

For instance, in the circulatory task, you predict at every timestep, i.e. every 5 minutes. Thus, you get SEQ_LEN examples for a single stay of length SEQ_LEN. In the phenotyping task, on the other hand, you make one prediction for the entire stay, hence you get one example for a single stay of length SEQ_LEN.

This explains why your number of samples has gone up multifold, as it should.
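In terms of counts, the reasoning above gives, with toy stay lengths:

```python
# Toy stay lengths in 5-minute timesteps (2016 steps = 7 days).
stay_lengths = [2016, 500, 1200]

# Static tasks (mortality, phenotyping): one example per stay.
static_samples = len(stay_lengths)

# Dynamic tasks (e.g. circulatory failure): one example per timestep.
dynamic_samples = sum(stay_lengths)

print(static_samples, dynamic_samples)  # -> 3 3716
```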

Hope this helps! All this information can be found in the manuscript https://arxiv.org/abs/2111.08536

xinformatics commented 8 months ago

Thank you so much. It makes complete sense now. I think prediction at each timestep happens in every case; it is just that the prediction mask used in wrappers.py takes care of the output dimension associated with a given task.
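A minimal sketch of that masking idea (names and shapes are hypothetical; the benchmark's wrappers.py has its own implementation):

```python
import numpy as np

seq_len = 8
outputs = np.random.rand(seq_len)  # one prediction per 5-min timestep

# Dynamic task: keep every timestep's prediction.
dynamic_mask = np.ones(seq_len, dtype=bool)

# Static task (e.g. phenotyping): keep a single prediction per stay.
static_mask = np.zeros(seq_len, dtype=bool)
static_mask[-1] = True

print(outputs[dynamic_mask].shape, outputs[static_mask].shape)  # -> (8,) (1,)
```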

xinformatics commented 5 months ago

Hi @hugoych, based on our previous discussion I was able to implement feature-subset training based on the variable names. Thanks again for that. Now I have a related question: I saw some variables such as 'AiwayCode_1.0' and 'Ventilator mode_1.0'. Could you please tell me where I can find the clinical descriptions of these variables?

Thank you

hugoych commented 5 months ago

Hi, You can find the information in the following spreadsheet from the original documentation (https://hirid.intensivecare.ai/data-details): https://docs.google.com/spreadsheets/d/1MjihfhyXX4dwni8Fxy3Ji5RCvSvnhipDCyjYo_6rixY/edit#gid=1345496616

AiwayCode is a typo; it should be AirwayCode, the ventilator airway code. According to the spreadsheet, it can take the following values: 1 = Intubated, 2 = Tracheostomy, 3 = Mask, 4 = Helmet, 5 = Mouth piece, 6 = Nose mask
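For convenience, the mapping above can be written as a plain lookup dict (a hypothetical helper, not part of the codebase):

```python
# Airway codes as listed in the HiRID documentation spreadsheet.
AIRWAY_CODES = {
    1: "Intubated",
    2: "Tracheostomy",
    3: "Mask",
    4: "Helmet",
    5: "Mouth piece",
    6: "Nose mask",
}

print(AIRWAY_CODES[1])  # -> Intubated
```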

As for the ventilator modes, the variable refers to which mode the ventilator is on, but it is rather specific. The meta subgroups are defined between 1 and 17 by the authors of HiRID. The grouping is done as follows:

Subgroup 1 mapped to Group 1 = Stand-by
Subgroups 2-8 mapped to Group 2 = Controlled
Subgroups 9-10 mapped to Group 3 = Spontaneous
Subgroups 11+ mapped to Group 4 = Others (NIV, CPAP, etc.)
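The grouping above can be sketched as a small helper (hypothetical, not part of the codebase):

```python
def ventilator_mode_group(subgroup: int) -> str:
    """Map a HiRID ventilator-mode subgroup (1-17) to its meta group,
    following the grouping described above."""
    if subgroup == 1:
        return "Stand-by"
    if 2 <= subgroup <= 8:
        return "Controlled"
    if 9 <= subgroup <= 10:
        return "Spontaneous"
    return "Others"  # NIV, CPAP, etc.

print(ventilator_mode_group(9))  # -> Spontaneous
```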