Code to Reproduce the Dataset

Hey, thanks for your interest. Here are the main steps we followed:

Step 1: Obtain access to MIMIC-III and download the data.
Step 2: We first ran the following notebook (more details are in their repo): CAML - Data Load
Step 3: We then ran (more details are in their repo): KEPT - Data Load. It doesn't do anything extra, but it makes some changes for the "50" task. The final result is a JSON file for each of the splits (train, dev, test).
Step 4: As an extra step, we ran the following code to do some minor cleanup of the text and save the result as JSONL, a form that can be used for model training. The paths shown are for the train split.

import re
import json
from collections import Counter, defaultdict
from pathlib import Path

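def clean_text(text, long_char_max=8):
    # Remove MIMIC-III de-identification placeholders such as [**Name**]
    text = re.sub(r'\[\*\*[^\]]*\*\*\]', '', text)
    # Replace line breaks with spaces and collapse repeated whitespace
    text = text.replace('\n', ' ').replace('\r', ' ').strip()
    text = re.sub(r'\s+', ' ', text)
    # Shorten any run of long_char_max or more non-alphanumeric characters
    # to exactly long_char_max copies of the last character in the run
    return re.sub(
        rf'(?P<specchar>([^a-zA-Z0-9_\s]|_)){{{long_char_max},}}', r'\g<specchar>' * long_char_max, text
    )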

input_file = '<directory>/physionet.org/files/mimiciii/1.4/mimic3_train.json'

output_file = \
'<directory>phenotype_classification/data/multi_label_mimic_icd_9/full/train.jsonl'

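# Load the list of records produced in Step 3
with open(input_file) as file:
    data_list = json.load(file)

with open(output_file, 'w') as file:
    for data in data_list:
        # Drop the 'Addition' field
        data.pop('Addition')
        data['text'] = clean_text(data.pop('TEXT'))
        # Labels are ';'-separated in the input; store them space-separated
        data['labels'] = data.pop('LABELS').replace(';', ' ')
        # Write one JSON object per line (JSONL)
        file.write(json.dumps(data) + '\n')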
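
To make the effect of the cleanup concrete, here is a small sketch that applies the same per-record logic to a single made-up record. The `convert_record` helper and the sample record are hypothetical and only for illustration; they are not part of our pipeline, and real records may carry other fields. The key names 'Addition', 'TEXT' and 'LABELS' are the ones used in the loop above.

```python
import json
import re


def clean_text(text, long_char_max=8):
    # Same logic as clean_text above, repeated so this sketch runs on its own.
    text = re.sub(r'\[\*\*[^\]]*\*\*\]', '', text)
    text = text.replace('\n', ' ').replace('\r', ' ').strip()
    text = re.sub(r'\s+', ' ', text)
    return re.sub(
        rf'(?P<specchar>([^a-zA-Z0-9_\s]|_)){{{long_char_max},}}',
        r'\g<specchar>' * long_char_max,
        text,
    )


def convert_record(data):
    # Hypothetical helper: the per-record body of the loop above.
    data = dict(data)  # work on a copy so the caller's record is not mutated
    data.pop('Addition')
    data['text'] = clean_text(data.pop('TEXT'))
    data['labels'] = data.pop('LABELS').replace(';', ' ')
    return data


# Made-up record for illustration only; this is not real MIMIC-III data.
sample = {
    'Addition': '',
    'TEXT': 'Admission Date: [**2101-1-1**]\nPatient is stable\r\n------------ end',
    'LABELS': '401.9;250.00',
}

converted = convert_record(sample)
print(json.dumps(converted))
```

In this sample the bracketed placeholder is removed, the line breaks become single spaces, the run of twelve dashes is shortened to eight, and the labels end up space-separated.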

Step 5: Ran this for the train, dev and test splits: https://github.com/obi-ml-public/NoteContrast/blob/main/note_pretraining/pre_train_datasets/setup/mimic/mimic_icd10_map.py

Let us know if you want any more details or if you have any questions! Thanks.
