Open stefanhgm opened 4 days ago
Hey, thanks for your interest. Here are the main steps we followed:
Step 1: Obtain access and download mimic-iii data
Step 2: We first ran the following notebook (more details are on their repo): CAML - Data Load
Step 3: We then ran (more details are on their repo): KEPT - Data Load (It doesn't do anything extra - but makes some changes for the "50" task). The final result is a json file for each of the splits (train, dev, test). We did an extra step where we saved it as jsonl instead. Given below.
Step 4: Ran the following code to save as jsonl and some minor cleanup. Clean the data and store in a form that can be used for model training.
import json
import re


def clean_text(text, long_char_max=8):
    # Remove MIMIC-III de-identification placeholders such as [**Name**].
    text = re.sub(r'\[\*\*[^\]]*\*\*\]', '', text)
    # Turn line breaks into spaces, then collapse repeated whitespace.
    text = text.replace('\n', ' ').replace('\r', ' ').strip()
    text = re.sub(r'\s+', ' ', text)
    # Cap runs of special characters (e.g. long "-----" dividers) at long_char_max.
    return re.sub(
        rf'(?P<specchar>([^a-zA-Z0-9_\s]|_)){{{long_char_max},}}',
        r'\g<specchar>' * long_char_max,
        text,
    )


input_file = '<directory>/physionet.org/files/mimiciii/1.4/mimic3_train.json'
output_file = \
    '<directory>phenotype_classification/data/multi_label_mimic_icd_9/full/train.jsonl'

with open(input_file) as file:
    data_list = json.load(file)

with open(output_file, 'w') as file:
    for data in data_list:
        # Drop the extra 'Addition' field and rename the remaining keys.
        data.pop('Addition')
        data['text'] = clean_text(data.pop('TEXT'))
        # Labels are ';'-separated codes; store them space-separated.
        data['labels'] = data.pop('LABELS').replace(';', ' ')
        file.write(json.dumps(data) + '\n')
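The same conversion can be applied to all three splits in one go. The following is only a minimal sketch, not the exact script we ran: it assumes the dev and test files follow the same `mimic3_<split>.json` naming as the train file, and `convert_split` and `convert_all` are hypothetical helper names that are not part of CAML or KEPT. Unlike the snippet above, it tolerates records without an `Addition` field and creates the output directory if it is missing.

```python
import json
import re
from pathlib import Path


def clean_text(text, long_char_max=8):
    # Same cleanup as above: drop [** ... **] placeholders, collapse
    # whitespace, and cap long runs of special characters.
    text = re.sub(r'\[\*\*[^\]]*\*\*\]', '', text)
    text = text.replace('\n', ' ').replace('\r', ' ').strip()
    text = re.sub(r'\s+', ' ', text)
    return re.sub(
        rf'(?P<specchar>([^a-zA-Z0-9_\s]|_)){{{long_char_max},}}',
        r'\g<specchar>' * long_char_max,
        text,
    )


def convert_split(input_file, output_file):
    """Convert one JSON split file to JSONL and return the record count."""
    with open(input_file) as f:
        records = json.load(f)
    Path(output_file).parent.mkdir(parents=True, exist_ok=True)
    with open(output_file, 'w') as out:
        for record in records:
            # Assumption: 'Addition' may be absent, so do not fail on it.
            record.pop('Addition', None)
            record['text'] = clean_text(record.pop('TEXT'))
            record['labels'] = record.pop('LABELS').replace(';', ' ')
            out.write(json.dumps(record) + '\n')
    return len(records)


def convert_all(input_dir, output_dir, splits=('train', 'dev', 'test')):
    # Assumption: inputs are named mimic3_<split>.json, like the train file.
    return {
        split: convert_split(
            Path(input_dir) / f'mimic3_{split}.json',
            Path(output_dir) / f'{split}.jsonl',
        )
        for split in splits
    }
```

Calling `convert_all(input_dir, output_dir)` with your own directories returns a dict mapping each split name to the number of records written, which is a quick way to check the split sizes.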
Let us know if you want any more details or if you have any questions! Thanks.
Hello!
Thanks for sharing the code of your work. Could you point me to the code to reproduce the dataset used for the contrastive learning? I would be very interested in using this dataset.
Thank you Stefan