jind11 / MedQA

Code and data for MedQA
MIT License

An answer ends with extraneous characters #5

Open teetone opened 1 year ago

teetone commented 1 year ago

I found an example in the MedQA EN questions in dev.jsonl where one of the answer options has extra characters (a newline and a double quote) appended to it:

Answer choice E is Pulmonary embolism\n\"

{"question": "A 47-year-old man comes to the physician because of severe retrosternal chest pain and shortness of breath for 45 minutes. He has dyslipidemia, hypertension, and type 2 diabetes mellitus. Current medications include hydrochlorothiazide, lisinopril, metformin, and atorvastatin. He has smoked 1 pack of cigarettes daily for 20 years. He appears pale and diaphoretic. His temperature is 37°C (98.6°F), pulse is 115/min, and blood pressure is 140/70 mm Hg. Breath sounds are normal. The remainder of the examination shows no abnormalities. An ECG shows left ventricular hypertrophy with ST-segment elevation in leads I, aVL, and V1–V6. High-dose aspirin, clopidogrel, metoprolol, sublingual nitroglycerin, and unfractionated heparin are administered. As the patient awaits transport to the nearest emergency room, he collapses and becomes unresponsive. His pulse and blood pressure cannot be detected. Despite resuscitative efforts, the patient dies. Which of the following is the most likely cause of death in this patient?", "answer": "Ventricular fibrillation", "options": {"A": "Papillary muscle rupture", "B": "Left ventricular failure", "C": "Ventricular fibrillation", "D": "Septal wall rupture", "E": "Pulmonary embolism\n\""}, "meta_info": "step2&3", "answer_idx": "C"}

teetone commented 1 year ago

I also noticed the same issue for multiple answers in the 4_options set.

jind11 commented 1 year ago

Hi, thanks for the report! These are probably special strings that were attached to the options in the web pages we parsed during data collection. An easy fix is to remove them by string matching, e.g., x.replace('\"', '').replace('\n', ''), where x is the option string.
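
For anyone who wants to apply this to a whole file, here is a minimal sketch, not an official fix from the repository. It assumes each line of the .jsonl file is a record shaped like the one quoted above, with an "options" dict of strings. The helper names (clean_option, clean_record, clean_jsonl) are hypothetical. It also swaps replace() for rstrip(), so that only trailing characters are removed and any double quote inside an option is left alone.

```python
import json

# Characters reported in this issue as appended to an option: newline and
# double quote. Assumption: other trailing whitespace is also safe to trim.
_TRAILING_JUNK = ' \t\r\n"'


def clean_option(text: str) -> str:
    """Strip trailing whitespace and double quotes from one option string."""
    return text.rstrip(_TRAILING_JUNK)


def clean_record(record: dict) -> dict:
    """Return a shallow copy of a record with every option string cleaned."""
    cleaned = dict(record)
    cleaned["options"] = {
        key: clean_option(value) for key, value in record["options"].items()
    }
    return cleaned


def clean_jsonl(src_path: str, dst_path: str) -> int:
    """Write a cleaned copy of a .jsonl file; return how many records changed."""
    changed = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            cleaned = clean_record(record)
            changed += cleaned != record  # True counts as 1
            dst.write(json.dumps(cleaned, ensure_ascii=False) + "\n")
    return changed
```

Calling clean_jsonl("dev.jsonl", "dev.cleaned.jsonl") would write a cleaned copy and report how many records were affected (the output file name is only an example). One trade-off: rstrip() would also drop a legitimate closing quote at the very end of an option, so it is worth checking the changed records by eye.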