ED Monitoring Data Processing

Collection of utilities to process ED monitor files.

edm

edm is a PyTorch Lightning module that helps train machine learning models on the ED waveform data.

Install edm locally by running pip install . from the edm directory.

Raw Data

The raw data represents the data exported from the Philips IntelliVue system.

Folder Structure

Root folder for all ED monitoring data (synced from Box):

/deep/group/ed-monitor

Study folders (named STUDY-XXXXX) contain the raw data files pulled off the Philips IntelliVue:

/deep/group/ed-monitor/2020_08_23_2020_09_23/
    ExportPatientList_2020.09.23_16.28.37.csv
    data/2020_08_23_2020_09_23/
        STUDY-00001
        STUDY-00002

Each study folder (STUDY-XXXXX) contains files similar to the following:

STUDY-018087_2020-09-05_00-00-00.clock.txt      <- Used only to extract the start time
STUDY-018087_2020-09-05_00-00-00.CO2.dat        <- CO2 waveform (only available for limited studies)
STUDY-018087_2020-09-05_00-00-00.EctSta.txt     <- Not used
STUDY-018087_2020-09-05_00-00-00.hea            <- Header file for all waveforms
STUDY-018087_2020-09-05_00-00-00.II.dat         <- ECG data for Lead II
STUDY-018087_2020-09-05_00-00-00.II.pace.txt    <- Not used
STUDY-018087_2020-09-05_00-00-00.info           <- Information file on sampling rate, gain, etc.
STUDY-018087_2020-09-05_00-00-00.numerics.csv   <- Information on SpO2 and other vital signs over time
STUDY-018087_2020-09-05_00-00-00.Pleth.dat      <- Plethysmograph waveform (available for most studies)
STUDY-018087_2020-09-05_00-00-00.Resp.dat       <- Respiration waveform (available for most studies)
STUDY-018087_2020-09-05_00-00-00.RhySta.txt     <- Not used
STUDY-018087_2020-09-05_00-00-00.SQI.txt        <- Not used
STUDY-018087_2020-09-05_00-00-00.timeJump.txt   <- Not used
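The .hea and .dat files follow the WFDB format, so a record can be read with the open-source wfdb Python package. A minimal sketch, assuming the exported files are standard WFDB records (the record name is taken from the listing above):

```python
import wfdb

# Read all channels of a record; pass the path without any file extension
record = wfdb.rdrecord("STUDY-018087_2020-09-05_00-00-00")

print(record.sig_name)     # channel names, e.g. ['II', 'Pleth', 'Resp']
print(record.fs)           # sampling frequency from the .hea header
signals = record.p_signal  # (num_samples, num_channels) NumPy array
```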

Cohort Files

Contains general information on each patient visit.

/deep/group/ed-monitor/cp_cohort_v1.csv

This file is expected to have the following structure, in order to be compatible with the pre-processing scripts:

CSN MRN Age Gender  CC  Triage_acuity   Arrival_time    Arrival_area    Admit_area  First_room  Roomed_time First_bed   Last_bed    ED_dispo    DC_dispo    Dispo_time  Admit_time  Departure_time  Admit_service   ED_LOS  Dx_ICD9 Dx_ICD10    Dx_name Arrival_to_roomed   CC_CP_SOB   SpO2    RR  HR  Temp    SBP DBP
123456789012    12345678    64  M   ABDOMINAL PAIN,NAUSEA,EMESIS    3-Urgent    2015-08-01T01:57:00Z    SHC ED 500P ED LOBBY    SHC ED 500P ALPHA Z ZONE    X08 2015-08-01T03:34:00Z    A08 ADULT ED OVRFLW Admit to Inpatient  Home/Work (includes foster care)    2015-08-01T11:50:00Z    2015-08-01T11:17:00Z    2015-08-01T14:38:00Z    General Surgery 12.68   789.01, 338.19  R10.11  Acute abdominal pain in right upper quadrant    97  0   92  20  126 36.3    129 82
123456789013    12345679    25  F   SHORTNESS OF BREATH,ABDOMINAL PAIN  3-Urgent    2015-08-01T03:40:00Z    SHC ED 500P ED LOBBY    SHC ED 500P ALPHA Z ZONE    X09 2015-08-01T04:36:00Z    A09 A09 Discharge   Home/Work (includes foster care)            2015-08-01T14:06:00Z        10.43   305 F10.10  Alcohol abuse   56  1   98  19  98  36.3    118 80
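A minimal sketch of loading the cohort file with pandas; the tab separator is an assumption based on the layout above (the file may instead be comma-separated with quoting):

```python
import pandas as pd

# Load the cohort file and parse the ISO-8601 timestamp columns
cohort = pd.read_csv("/deep/group/ed-monitor/cp_cohort_v1.csv", sep="\t")
for col in ["Arrival_time", "Roomed_time", "Dispo_time", "Admit_time", "Departure_time"]:
    cohort[col] = pd.to_datetime(cohort[col])  # empty cells become NaT

print(cohort[["CSN", "Age", "Gender", "Triage_acuity", "ED_LOS"]].head())
```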

Bedside Summary Files (Export Patient List)

Each of these files is found at the root of each ED monitor data date-range folder. The file contains the study ID, bed label, and patient times, which we can use to map to the cohort files.

/deep/group/ed-monitor/2020_08_23_2020_09_23/ExportPatientList_2020.09.23_16.28.37.csv 

Each file has the following structure:

"StudyId","PatientId","ClinicalUnit","BedLabel","LifetimeId","AlternateId","EncounterId","FirstName","MiddleName","LastName","Name","Gender","AdmitState","StartTime","Start?","EndTime","Discharged","ExportStartTime","ExportEndTime"
"STUDY-024636","0172dd32-f471-43e1-9df6-1e3573a83396","ED","04DELTA","","","","","","","","GenderUnknown","NotAdmitted","09/10/20 21:11:52","?","09/12/20 20:30:34","","08/23/20 15:00:00","09/23/20 16:00:00"
"STUDY-017322","0193f321-f28a-4ed8-8f55-e66386dd61c0","ED","03CARD","","","","","","","","GenderUnknown","NotAdmitted","08/30/20 18:24:51","?","08/30/20 20:12:59","","08/23/20 15:00:00","09/23/20 16:00:00"

Contrary to the header labels, patient information is never actually populated here. Instead, the pre-processing steps below use the bed label and start/end times to map each study to the cohort file.

Data Pre-Processing

Refer to this notebook for a full walk-through of the data pre-processing steps, including the inputs to the scripts that were run.

Matching

The matching step (see the notebook above) matches patients to their corresponding beds. This results in two files, for example:

/deep/group/ed-monitor/patient_data_v9/matched-cohort.csv
/deep/group/ed-monitor/patient_data_v9/matched-export.csv
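The authoritative matching logic lives in the notebook; the sketch below only illustrates the underlying idea of joining on bed label plus time overlap. Column names come from the file examples above, but the join keys, timestamp handling, and overlap rule are assumptions:

```python
import pandas as pd

export = pd.read_csv("ExportPatientList_2020.09.23_16.28.37.csv")
cohort = pd.read_csv("cp_cohort_v1.csv", sep="\t")

# Export times are local "MM/DD/YY HH:MM:SS"; cohort times are ISO-8601 UTC.
# Real matching must reconcile timezones and bed label formats (e.g. "04DELTA" vs "A08").
export["StartTime"] = pd.to_datetime(export["StartTime"], format="%m/%d/%y %H:%M:%S")
export["EndTime"] = pd.to_datetime(export["EndTime"], format="%m/%d/%y %H:%M:%S")
for col in ["Roomed_time", "Dispo_time"]:
    cohort[col] = pd.to_datetime(cohort[col]).dt.tz_localize(None)

# Candidate matches: same bed, and the monitoring interval overlaps the rooming interval
candidates = export.merge(cohort, left_on="BedLabel", right_on="First_bed")
overlap = (candidates["StartTime"] <= candidates["Dispo_time"]) & (
    candidates["EndTime"] >= candidates["Roomed_time"]
)
matched = candidates[overlap]
```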

Consolidating Data

The raw data was pre-processed by extracting the relevant waveforms from the underlying studies and trimming them to the time the patient was actually in the room (based on the roomed_time and dispo_time). These files are known as the "consolidated" files.

Waveform Description

Consolidated File Summary

An example of this file is located at: /deep/group/ed-monitor/patient_data_v9/consolidated.csv.

Note that a filtered suffix indicates the file was filtered to remove patients whose first_trop_time was before the roomed_time.

The column descriptions:

Consolidated Folder Structure

An example folder that contains all of the pre-processed waveforms is located at:

/deep/group/ed-monitor/patient_data_v9/patient-data/<Patient ID>

The v9 suffix here denotes that the dataset contains patients extracted between Aug 2020 and Aug 2021. Different dataset versions were generated as more data was incrementally made available.

Each patient folder contains the following consolidated files. Note that all waveforms are pre-trimmed according to the roomed_time and dispo_time (to the extent that the data is available), but it is recommended to apply the recommended_trim_start_sec and recommended_trim_end_sec values when actually using the data.
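For illustration, a sketch of applying those trim values when loading one consolidated lead II waveform. The patient_id and trim column names follow the descriptions in this README, but the exact consolidated.csv schema and the interpretation of the trim offsets should be verified:

```python
import numpy as np
import pandas as pd

FS = 500  # Hz; matches the 500hz extraction settings used later in this README

consolidated = pd.read_csv("/deep/group/ed-monitor/patient_data_v9/consolidated.csv")
row = consolidated.iloc[0]

waveform = np.load(
    f"/deep/group/ed-monitor/patient_data_v9/patient-data/{row['patient_id']}/II.dat.npy"
)
# Assumes both trim values are offsets in seconds from the start of the recording
start = int(row["recommended_trim_start_sec"] * FS)
end = int(row["recommended_trim_end_sec"] * FS)
trimmed = waveform[start:end]
```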

Filtering

Multiple patients were excluded for reasons such as missing data files, empty waveforms, or ambiguous bed times. The full flow is detailed in the notebook.

An example of this file is located at: /deep/group/ed-monitor/patient_data_v9/consolidated.filtered.csv.

Train/Val/Test Data

The filtered consolidated file was then split into train/val/test datasets as described in the notebook.

Waveform Sampling

In the extracted consolidated dataset, each patient may have multiple hours' worth of waveform data. To facilitate downstream training, we can extract random or continuous waveform segments.

For most experiments, we extracted ten random waveform segments from the consolidated files and later passed them through a pre-trained Transformer-based model to obtain embeddings. The final embedding for each patient was produced by averaging the ten segment embeddings.

The script to accomplish this is shown below with example input (note that it may run for several hours, so it is advisable to submit a SLURM job):

python -u processing/prepare_ed_waveforms.py -i /deep/group/ed-monitor/patient_data_v9/consolidated.filtered.csv -d /deep/group/ed-monitor/patient_data_v9/patient-data -o /deep/group/ed-monitor/patient_data_v9/waveforms -l 15 -f 500 -n -w II -m 10 -b First_trop_result_time-waveform_start_time

Note the following parameters to this script:

The output of this script will be the following directory format:

/deep/group/ed-monitor/patient_data_v9/waveforms/
    15sec-500hz-1norm-10wpp/
        waveforms.dat.npy
        summary.csv

Waveforms Summary File

The summary.csv file is an index into the waveforms 2D NumPy array, where each row in this file corresponds to a row in the NumPy array. The file also indicates where in the original waveform each sampled waveform came from. Note that since we chose to sample 10 waveforms per patient, there are 10 rows per record. A record_name is the same as the patient_id or CSN.

record_name,pointer
831309290000,1263036
831309290000,132258
831309290000,395342
831309290000,840013
831309290000,701279
831309290000,422066
831309290000,1437245
831309290000,2029409
831309290000,2152970
831309290000,550134
831309290001,331637
831309290001,728771
831309290001,413427
...

Getting Original Waveform Given Embedding File

Let's say we wanted to see the first sampled waveform for patient 831309290001.

  1. Look up the first pointer for record_name 831309290001 in summary.csv; in the example above, this is 331637.
  2. Read the file located at /deep/group/ed-monitor/patient_data_v9/patient-data/831309290001/II.dat.npy into a NumPy array using np.load(...)
  3. The waveform will then be located at: waveform[331637:331637+15*500].
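The same lookup as a short Python sketch (paths follow the examples in this README):

```python
import numpy as np
import pandas as pd

SEGMENT_SEC, FS = 15, 500  # 15 s segments sampled at 500 Hz

summary = pd.read_csv(
    "/deep/group/ed-monitor/patient_data_v9/waveforms/15sec-500hz-1norm-10wpp/summary.csv"
)
pointer = summary.loc[summary["record_name"] == 831309290001, "pointer"].iloc[0]

waveform = np.load(
    "/deep/group/ed-monitor/patient_data_v9/patient-data/831309290001/II.dat.npy"
)
segment = waveform[pointer : pointer + SEGMENT_SEC * FS]
```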

Waveforms NumPy Array

The waveforms.dat.npy is a single 2D NumPy array containing the extracted and sampled waveforms. It has the dimensions (n, l*f), where n is the number of rows in the summary file, l is the length of each sampled waveform in seconds, and f is the sampling frequency of the waveform.
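A quick sanity check of that layout (paths follow the example above):

```python
import numpy as np
import pandas as pd

base = "/deep/group/ed-monitor/patient_data_v9/waveforms/15sec-500hz-1norm-10wpp"
waveforms = np.load(f"{base}/waveforms.dat.npy")
summary = pd.read_csv(f"{base}/summary.csv")

assert waveforms.shape[0] == len(summary)  # one array row per summary row
assert waveforms.shape[1] == 15 * 500      # l = 15 s at f = 500 Hz
```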

Embedding Extraction

Waveforms are nice, but the real power comes from creating embeddings from a pre-trained deep learning model. To do this, we can run the following script:

python -u processing/apply_pretrained_transformer.py -i /deep/group/ed-monitor/patient_data_v9/waveforms/15sec-500hz-1norm-10wpp/II -m /deep/u/tomjin/aihc-aut20-selfecg/prna/outputs-wide-64-15sec-bs64/saved_models/ctn/fold_1/ctn.tar -d 64

This script takes the waveform array produced previously and, for each waveform, runs it through the pretrained model provided as input. If there are multiple waveforms per patient, the embeddings will be averaged for the patient.
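The per-patient averaging amounts to grouping the waveform embeddings by record_name and taking the mean. A sketch of that step (not the script's actual internals; the random array is a stand-in for real model outputs):

```python
import numpy as np
import pandas as pd

summary = pd.read_csv("summary.csv")
# Stand-in for the (n_waveforms, d) embeddings the model would produce
embeddings_per_waveform = np.random.randn(len(summary), 64)

df = pd.DataFrame(embeddings_per_waveform)
df["record_name"] = summary["record_name"].values
patient_embeddings = df.groupby("record_name").mean()  # one d-dim row per patient
```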

Depending on which embedding script is run, it will produce files similar to the following structure:

/deep/group/ed-monitor/patient_data_v9/waveforms/15sec-500hz-1norm-10wpp/
    transformer-64/
        embeddings.dat.npy
        embeddings_summary.csv

While this script demonstrates how you may run a pre-trained Transformer model, similar scripts also exist for pre-trained SimCLR and PCLR models. See the respective scripts for more details.

Embeddings Summary File

The embeddings_summary.csv file is an index into the embeddings 2D NumPy array, where each row in this file corresponds to a row in the NumPy array. Because embeddings are averaged per patient, there is one row per record_name.

record_name
831309290001
831309290002
831309290003
831309290004
...

Embeddings NumPy Array

The embeddings.dat.npy is a single 2D NumPy array containing the extracted embeddings. It has the dimensions (n, d), where n is the number of rows in the summary file and d is the size of the embedding.
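A sketch of loading the embeddings keyed by record for downstream use (paths follow the structure above):

```python
import numpy as np
import pandas as pd

base = "/deep/group/ed-monitor/patient_data_v9/waveforms/15sec-500hz-1norm-10wpp/transformer-64"
embeddings = np.load(f"{base}/embeddings.dat.npy")  # shape (n, d)
records = pd.read_csv(f"{base}/embeddings_summary.csv")["record_name"]

# Map each CSN to its embedding vector
embedding_by_csn = dict(zip(records, embeddings))
```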

Numerics Extraction

Numerics files are generated alongside the waveform files. They contain data produced by the bedside monitor, and generally correspond to the information displayed on the bedside monitor screen. The data may be slightly different for each patient.

An example file is:

Date,Time,Diff,Sequence,SpO2,Pulse (SpO2),NBPs,NBPd,NBPm,Perf,Normal Beats,Supra-ventricular Beats,Supra-ventricular Premature Beats,PVC Beats,Number of Multiform PVCs,Number of R-on-T PVC Beats,Number of PVC Pairs,Total Beats,Percent Ventricular Trigeminy,Percent Ventricular Bigeminy,Pause Events,Number of SVPB Runs,Maximum HR in SVPB Runs,Minimum HR in SVPB Runs,Number of V? Runs,Number of PVC Runs,Maximum HR in PVC Runs,Minimum HR in PVC Runs,Longest PVC Run,Atrial Paced Beats,Ventricular Paced Beats,Dual Paced Beats,Total Paced Beats,Percent of Beats that were Paced,Percent Paced Beats that were Atrially Paced,Percent Paced Beats that were Dual Paced,Percent Paced Beats that were Ventricularly Paced,Number of Paced Runs,Maximum HR in Paced Runs,Minimum HR in Paced Runs,Number of Pacer Not Pacing Events,Number of Pacer Not Capture Events,Percent Irregular Heart Rate,pNN50,Square Root of NN Variance,Percent Poor Signal
07/02/2021, 21:04:47.000 -07:00, 00:00:00.000,678557127136,,,119,80,90,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
07/02/2021, 21:04:50.103 -07:00, 00:00:03.103,678557090240,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
07/02/2021, 21:05:33.143 -07:00, 00:00:43.040,678557133280,99,76,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
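A sketch of reading a raw numerics file directly with pandas; the leading-space handling and timestamp parsing are assumptions based on the example above:

```python
import pandas as pd

numerics = pd.read_csv(
    "STUDY-018087_2020-09-05_00-00-00.numerics.csv", skipinitialspace=True
)

# Combine Date and Time (which carries a UTC offset) into one timestamp column
numerics["timestamp"] = pd.to_datetime(
    numerics["Date"] + " " + numerics["Time"], utc=True
)

# Each row populates only a subset of columns, so drop empty cells per measure
spo2 = numerics.dropna(subset=["SpO2"])[["timestamp", "SpO2"]]
```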

Because this file is messy and sometimes difficult to follow, we can extract the data for each patient we are interested in. This can be done using the following command, with the matched cohort file from one of the earliest steps as a starting point. Note that the example below uses a different folder than the previous arguments because we never ran the numerics extraction for the myocardial injury project.

python prepare_ed_numerics_from_matched_cohort.py -i /deep/group/physiologic-states/v1/matched-cohort.csv -d /deep/group/ed-monitor/2020_08_23_2020_09_23,/deep/group/ed-monitor/2020_09_23_2020_11_30,/deep/group/ed-monitor/2020_11_30_2020_12_31,/deep/group/ed-monitor/2021_01_01_2021_01_31,/deep/group/ed-monitor/2021_02_01_2021_02_28,/deep/group/ed-monitor/2021_03_01_2021_03_31,/deep/group/ed-monitor/2021_04_01_2021_05_12,/deep/group/ed-monitor/2021_05_13_2021_05_31,/deep/group/ed-monitor/2021_06_01_2021_06_30,/deep/group/ed-monitor/2021_07_01_2021_07_31 -o /deep/group/physiologic-states/v1/processed -p 100

Note the following parameters to this script:

If the script exits prematurely, it can be safely restarted with the same input.

Numerics Summary File

A summary file records which numerics files were produced, including the length and availability of each extracted measure.

csn,HR_len,RR_len,SpO2_len,NBPs_len,NBPd_len
831309290001,6854,6849,7861,10,10
831309290002,919,1029,555,0,0
...

Numerics Output Files

A Pickle object is produced for each patient by this script, with the following folder structure:

/deep/group/physiologic-states/v1/processed/

    00/
        831309290000.pkl
        831309300000.pkl
    01/
        831309290001.pkl
        831309300001.pkl
    ...

Note that the subfolders are used to make the file listing manageable, and are based on the last two digits of each patient ID or CSN.

Each pkl file contains a set of keys, which may vary depending on what fields the script is configured to extract. Also note that the lengths of each numeric series (e.g. HR, RR) may differ, as they are recorded at different time intervals.
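A sketch of locating and inspecting one patient's pickle; the subfolder rule follows the note above, and any specific key names (e.g. HR, RR) are assumptions based on the summary columns:

```python
import pickle

csn = "831309290001"
# Subfolder is the last two digits of the CSN
path = f"/deep/group/physiologic-states/v1/processed/{csn[-2:]}/{csn}.pkl"

with open(path, "rb") as f:
    data = pickle.load(f)

# Series lengths may differ since measures are recorded at different intervals
for key, value in data.items():
    print(key, len(value) if hasattr(value, "__len__") else value)
```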

Training

Refer to this notebook for how to use the extracted embeddings to train downstream models with the edm library.