stanford-crfm / BioMedLM


Using a UMLS based retriever to enhance MedQA-USMLE performance #14

Open manusikka opened 1 year ago

manusikka commented 1 year ago

We intended to supplement this MedQA-USMLE evaluation with a search on UMLS. UMLS is a large biomedical corpus that can be queried through APIs. We expected accuracy to rise above the baseline once the retrieved medical term descriptions were added to the prompt during evaluation.

The technique for adding context is: for each answer choice, do a lookup against UMLS and retrieve a description of that choice, then concatenate the description with the corresponding question/answer pair.

For example, in the case below the prompt is the original question and the correct answer is c3 (Common iliac artery aneurysm). The UMLS retriever returns searcher1 for c1, searcher2 for c2, searcher3 for c3, and searcher4 for c4.

prompt: A 68-year-old male comes to the physician for evaluation of right flank pain. He has a history of diabetes and peripheral artery disease. His blood pressure is 160/90 mm Hg. Physical examination shows abdominal tenderness and right flank tenderness. An ultrasound shows dilation of the right ureter and renal pelvis. Which of the following is the most likely underlying cause of this patient's condition?
c1: Renal artery stenosis
c2: Benign prostatic hyperplasia
c3: Common iliac artery aneurysm
c4: Urethral stricture
searcher1: Narrowing of a main artery in the kidney.
searcher2: Obstructive nephropathy which has developed in a patient with evidence of bladder outflow obstruction caused by benign prostatic hypertrophy.
searcher3: An artery arising from the bifurcation of the abdominal aorta which then bifurcates forming the internal and external iliac arteries.
searcher4: Narrowing of the urethra associated with inflammation or scar tissue. [HPO:probinson]
predicted_label: 0
actual_label: 2

Our hypothesis is that the model will be more accurate when the answers are supplemented with the definitions from searcher1, searcher2, etc.

Here is what the supplemented test data looks like, with the searcher result prepended to each answer choice:

{"id": "test-00006",
 "sent1": "A 68-year-old male comes to the physician for evaluation of right flank pain. He has a history of diabetes and peripheral artery disease. His blood pressure is 160/90 mm Hg. Physical examination shows abdominal tenderness and right flank tenderness. An ultrasound shows dilation of the right ureter and renal pelvis. Which of the following is the most likely underlying cause of this patient's condition?",
 "sent2": "",
 "ending0": "Narrowing of a main artery in the kidney. Renal artery stenosis",
 "ending1": "Obstructive nephropathy which has developed in a patient with evidence of bladder outflow obstruction caused by benign prostatic hypertrophy. Benign prostatic hyperplasia",
 "ending2": "An artery arising from the bifurcation of the abdominal aorta which then bifurcates forming the internal and external iliac arteries. Common iliac artery aneurysm",
 "ending3": "Narrowing of the urethra associated with inflammation or scar tissue. [HPO:probinson] Urethral stricture",
 "label": 2}
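As a rough illustration of the concatenation step, here is a minimal sketch that prepends a retrieved definition to each `endingN` field. The function name `augment_record` and the `definitions` dictionary are hypothetical; the dictionary only stands in for the UMLS lookup, which is not shown, and the question text is shortened for the example.

```python
import json


def augment_record(record, definitions, num_choices=4):
    """Return a copy of `record` with a definition prepended to each ending.

    `definitions` maps an answer-choice string to retrieved text. It is a
    stand-in for the UMLS lookup (an assumption, not the real retriever).
    """
    augmented = dict(record)
    for i in range(num_choices):
        key = f"ending{i}"
        choice = record[key]
        definition = definitions.get(choice, "").strip()
        # Leave the choice unchanged when the retriever returned nothing.
        augmented[key] = f"{definition} {choice}" if definition else choice
    return augmented


# Question text shortened for illustration.
record = {
    "id": "test-00006",
    "sent1": "Which of the following is the most likely underlying cause?",
    "sent2": "",
    "ending0": "Renal artery stenosis",
    "ending1": "Benign prostatic hyperplasia",
    "ending2": "Common iliac artery aneurysm",
    "ending3": "Urethral stricture",
    "label": 2,
}
definitions = {
    "Renal artery stenosis": "Narrowing of a main artery in the kidney.",
}

augmented = augment_record(record, definitions)
print(json.dumps(augmented, indent=2))
```

Choices with no retrieved definition are passed through untouched, so the label index stays valid either way.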

However, accuracy actually drops slightly when using the retriever. Do we need to change any command-line parameters because the answers are longer? Any thoughts on why we are not seeing an improvement would be welcome. Here is what we used:

deepspeed --num_gpus 1 --num_nodes 1 run_multiple_choice.py \
  --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer \
  --model_name_or_path "/content/drive/MyDrive/Colab Notebooks/SavedModel300" \
  --train_file $datadir/train300.json \
  --validation_file $datadir/dev300.json \
  --test_file $datadir/test300newRetr.json \
  --do_predict \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 2 \
  --learning_rate 2e-06 \
  --warmup_ratio 0.5 \
  --num_train_epochs 20 \
  --max_seq_length 560 \
  --logging_steps 100 \
  --save_strategy no \
  --evaluation_strategy no \
  --output_dir medqa-finetune-demo \
  --overwrite_output_dir \
  --fp16 \
  --seed 1 \
  --run_name medqa-finetune-demo \
  --deepspeed deepspeed_config.json
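One way to check whether the longer answers run into `--max_seq_length` is to count how many records exceed the limit. The sketch below is only an approximation: `count_truncated` is a hypothetical helper, it assumes each sequence is roughly the question plus one ending, and the default whitespace tokenizer undercounts subword tokens. Passing the `tokenize` method of the actual tokenizer as the `tokenize` argument would give a more realistic count.

```python
def count_truncated(records, max_seq_length, tokenize=str.split, num_choices=4):
    """Count records where question + longest ending exceeds max_seq_length.

    `tokenize` is any callable that turns a string into a list of tokens.
    The default (whitespace split) is a crude stand-in for a real tokenizer.
    """
    truncated = 0
    for rec in records:
        question_len = len(tokenize(rec["sent1"]))
        longest_ending = max(
            len(tokenize(rec[f"ending{i}"])) for i in range(num_choices)
        )
        if question_len + longest_ending > max_seq_length:
            truncated += 1
    return truncated


# Toy records: the first has a long augmented ending, the second does not.
records = [
    {"sent1": "a b c d e", "ending0": "x", "ending1": "x y",
     "ending2": "x", "ending3": "x y z w"},
    {"sent1": "a b c", "ending0": "x", "ending1": "x",
     "ending2": "x", "ending3": "x"},
]

over_small_limit = count_truncated(records, max_seq_length=8)
over_large_limit = count_truncated(records, max_seq_length=560)
print(over_small_limit, over_large_limit)
```

If the count is zero for the augmented test file under the real tokenizer, truncation is unlikely to explain the drop.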

J38 commented 1 year ago

I think you need to supplement all of the training data and fine-tune a model on that.
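The suggestion above amounts to running the same augmentation over the training and validation files, not just the test file, before fine-tuning. A minimal sketch follows, assuming each split is stored as JSON lines (one record per line). `augment_file`, `toy_lookup`, and the output file name are hypothetical, and `toy_lookup` returns a shortened toy definition in place of a real UMLS query.

```python
import json
import tempfile
from pathlib import Path


def augment_file(src, dst, lookup, num_choices=4):
    """Rewrite a JSON-lines split, prepending lookup(choice) to every ending.

    `lookup` is any callable mapping an answer choice to a definition string
    (empty string when nothing is found). Returns the number of records written.
    """
    count = 0
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            rec = json.loads(line)
            for i in range(num_choices):
                choice = rec[f"ending{i}"]
                definition = lookup(choice)
                if definition:
                    rec[f"ending{i}"] = f"{definition} {choice}"
            fout.write(json.dumps(rec) + "\n")
            count += 1
    return count


def toy_lookup(choice):
    # Stand-in for the UMLS retriever (assumption, toy definition).
    return {"Urethral stricture": "Narrowing of the urethra."}.get(choice, "")


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "train300.json"
    dst = Path(tmp) / "train300Retr.json"  # hypothetical output name
    sample = {"id": "train-00000", "sent1": "Question text", "sent2": "",
              "ending0": "Renal artery stenosis",
              "ending1": "Benign prostatic hyperplasia",
              "ending2": "Common iliac artery aneurysm",
              "ending3": "Urethral stricture", "label": 2}
    src.write_text(json.dumps(sample) + "\n", encoding="utf-8")
    written = augment_file(src, dst, toy_lookup)
    result = [json.loads(l) for l in dst.read_text(encoding="utf-8").splitlines()]

print(written, result[0]["ending3"])
```

The same call would be repeated for the validation and test splits so that the model sees the same input format during fine-tuning and prediction.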