obi-ml-public / NoteContrast

MIT License

Code to Reproduce the Dataset #1

Open stefanhgm opened 4 days ago

stefanhgm commented 4 days ago

Hello!

Thanks for sharing the code of your work. Could you point me to the code to reproduce the dataset used for the contrastive learning? I would be very interested in using this dataset.

Thank you, Stefan

prajwal967 commented 4 days ago

Hey, thanks for your interest. Here are the main steps we followed:

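import json
import re


def clean_text(text, long_char_max=8):
    # Remove MIMIC-III de-identification placeholders such as [**Name**].
    text = re.sub(r'\[\*\*[^\]]*\*\*\]', '', text)
    # Replace line breaks with spaces, then collapse runs of whitespace.
    text = text.replace('\n', ' ').replace('\r', ' ').strip()
    text = re.sub(r'\s+', ' ', text)
    # Shorten any run of long_char_max or more non-alphanumeric characters
    # (e.g. long divider lines) to long_char_max copies of the run's last character.
    return re.sub(
        rf'(?P<specchar>([^a-zA-Z0-9_\s]|_)){{{long_char_max},}}', r'\g<specchar>' * long_char_max, text
    )


input_file = '<directory>/physionet.org/files/mimiciii/1.4/mimic3_train.json'

output_file = \
'<directory>phenotype_classification/data/multi_label_mimic_icd_9/full/train.jsonl'

# Load the full list of records from the input JSON file.
with open(input_file) as file:
    data_list = json.load(file)

# Write one JSON object per line (JSONL).
with open(output_file, 'w') as file:
    for data in data_list:
        # Drop the unused 'Addition' field.
        data.pop('Addition')
        # Clean the note text and store it under a lowercase key.
        data['text'] = clean_text(data.pop('TEXT'))
        # Convert semicolon-separated labels to a space-separated string.
        data['labels'] = data.pop('LABELS').replace(';', ' ')
        file.write(json.dumps(data) + '\n')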
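
To sanity-check these steps without access to the MIMIC files, here is a minimal sketch that applies the same transformation to a single record. The sample note text and label codes are made up for illustration and are not real MIMIC data; the field names 'Addition', 'TEXT', and 'LABELS' are the ones used in the script above.

```python
import json
import re


def clean_text(text, long_char_max=8):
    # Same logic as the function above, repeated so this snippet runs on its own.
    text = re.sub(r'\[\*\*[^\]]*\*\*\]', '', text)
    text = text.replace('\n', ' ').replace('\r', ' ').strip()
    text = re.sub(r'\s+', ' ', text)
    return re.sub(
        rf'(?P<specchar>([^a-zA-Z0-9_\s]|_)){{{long_char_max},}}',
        r'\g<specchar>' * long_char_max,
        text,
    )


# Made-up record in the assumed input layout (illustrative values only).
record = {
    'Addition': 'unused',
    'TEXT': 'Patient [**Known lastname 123**] admitted.\n\n------------\nStable.',
    'LABELS': '4019;25000',
}

record.pop('Addition')
record['text'] = clean_text(record.pop('TEXT'))
record['labels'] = record.pop('LABELS').replace(';', ' ')

# The placeholder is removed, whitespace is collapsed, and the
# 12-dash divider is shortened to 8 dashes:
# record['text'] == 'Patient admitted. -------- Stable.'
jsonl_line = json.dumps(record)
print(jsonl_line)
```

Each line of the output file is then a JSON object with a 'text' string and a space-separated 'labels' string, so a reader can recover the individual codes with record['labels'].split().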

Let us know if you want any more details or if you have any questions! Thanks.