Open manusikka opened 1 year ago
Do you want to fine-tune on MedQA, or just run evaluation of a model?
I work with @manusikka on the class research project. We are looking to evaluate a model, establish a baseline, and confirm the reported result: "new state of the art for the MedQA task of 50.3%". Any guidance would be appreciated. Thanks.
num_devices = number of GPUs
checkpoint = file path of Hugging Face model checkpoint dir
These two settings are related and depend on number of GPUs and how much memory the GPUs have:
train_per_device_batch_size = examples per device
grad_accum = number of steps to accumulate gradient
batch_size = train_per_device_batch_size x num_devices x grad_accum
So for example, if you want batch_size=8, you'd set train_per_device_batch_size=1, num_devices=8, grad_accum=1 (assuming you have 8 GPUs).
If you want batch_size=32 you might do:
train_per_device_batch_size=1, num_devices=8, grad_accum=4
You could try train_per_device_batch_size=2, but you may run out of GPU memory.
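The batch size relationship above can be checked with a few lines of Python. This is only a minimal sketch: the function names `effective_batch_size` and `grad_accum_for` are made up for illustration and are not part of the fine-tuning scripts.

```python
def effective_batch_size(train_per_device_batch_size: int,
                         num_devices: int,
                         grad_accum: int) -> int:
    """Number of examples that contribute to each optimizer step."""
    return train_per_device_batch_size * num_devices * grad_accum


def grad_accum_for(target_batch_size: int,
                   train_per_device_batch_size: int,
                   num_devices: int) -> int:
    """Accumulation steps needed to hit a target batch size exactly."""
    per_step = train_per_device_batch_size * num_devices
    if target_batch_size % per_step != 0:
        raise ValueError("target batch size must be a multiple of "
                         "train_per_device_batch_size * num_devices")
    return target_batch_size // per_step


print(effective_batch_size(1, 8, 1))  # the batch_size=8 example
print(effective_batch_size(1, 8, 4))  # the batch_size=32 example
print(grad_accum_for(32, 1, 1))       # one GPU reaching batch_size=32
```

With fewer GPUs, the same batch size can be kept by raising grad_accum, at the cost of more forward/backward passes per optimizer step.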
lr = learning rate, for example 2e-06
num_train_epochs = number of epochs, for example 10
numerical_format = bf16
seed = random seed; set this differently for each experiment, to something like 1, 2, or 3
you can remove the data_seed option
run_name = name for your experiment
Let me know if that clarifies and if you have any other questions ...
One note: the 50.3% is an average with seed=1, seed=2, and seed=3 ... so any given experiment won't yield that exact number, and experiments on your machine will probably yield different results since randomness will be different ... so don't expect to fall on 50.3% exactly or even on average, but hopefully it should be close to that on average
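Averaging over seeds, as described above, could be done along these lines. This is a sketch under assumptions: `summarize_runs` is a hypothetical helper, and the accuracies in `example` are placeholder numbers for illustration, not real results from any run.

```python
from statistics import mean, stdev
from typing import Dict, Tuple


def summarize_runs(accuracy_by_seed: Dict[int, float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of accuracy across seeds."""
    values = list(accuracy_by_seed.values())
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread


# Placeholder values only; replace with the test accuracy of each seed's run.
example = {1: 0.49, 2: 0.51, 3: 0.50}
avg, spread = summarize_runs(example)
print(f"mean={avg:.3f} stdev={spread:.3f}")
```

Reporting the spread next to the mean makes it easier to judge whether a single run that misses 50.3% is within normal seed-to-seed variation.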
Thank you @J38 for such a detailed explanation. It appears that many of the parameters you've mentioned are needed for training. I am confused: why are we training a model if we are only trying to run evaluation on an existing model? Or are we first building a model AND then running eval on it, all in one batch command?
Also, can you clarify a conceptual question and let me know if I am thinking about this correctly: BioMedLM is a model that has already been trained on data and is saved to HuggingFace: https://huggingface.co/stanford-crfm/BioMedLM.
I should be able to just download the BioMedLM model and run evaluation on MedQA WITHOUT training, right? For example, I would do something like this:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Use the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = GPT2Tokenizer.from_pretrained("stanford-crfm/BioMedLM")
model = GPT2LMHeadModel.from_pretrained("stanford-crfm/BioMedLM").to(device)
input_ids = tokenizer.encode(
    "A 20-year-old woman presents with menorrhagia for the past several years..... Which of the following is the most likely cause of this patient’s symptoms? A: Factor V Leiden ...",
    return_tensors="pt"
).to(device)
sample_output = model.generate(input_ids, do_sample=True, max_length=50, top_k=50)
Then, compare output to correct label and see if there is an exact match to the answer. Is this evaluation method appropriate?
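The exact-match idea described above could look roughly like the sketch below. It is not the scoring used by run_multiple_choice.py; `extract_choice` and `exact_match_accuracy` are hypothetical helpers, and the assumption that the model emits a standalone answer letter right after the prompt may not hold for a model that has not been fine-tuned on this format.

```python
import re
from typing import List, Optional

# A standalone capital letter A-E, e.g. "E" or "E:" but not the "A" in "An".
CHOICE_PATTERN = re.compile(r"\b([A-E])\b")


def extract_choice(prompt: str, generated: str) -> Optional[str]:
    """Return the first standalone answer letter in the text after the prompt."""
    # generate() output usually echoes the prompt, so strip it first.
    continuation = generated[len(prompt):] if generated.startswith(prompt) else generated
    match = CHOICE_PATTERN.search(continuation)
    return match.group(1) if match else None


def exact_match_accuracy(predictions: List[Optional[str]], labels: List[str]) -> float:
    """Fraction of questions whose extracted letter equals the gold letter."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(pred == gold for pred, gold in zip(predictions, labels))
    return correct / len(labels)


print(extract_choice("Q? ", "Q? E: Von Willebrand disease"))
print(exact_match_accuracy(["E", "B", None], ["E", "A", "C"]))
```

One known weakness of this heuristic is that a continuation starting with the article "A" would be read as choice A, so free-text generation is a fragile way to score multiple-choice questions.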
Also, I am trying to run evaluation on a MedQA question via the model, as in:
`question = (
    "A 20-year-old woman presents with menorrhagia for the past several years."
    "She says that her menses “have always been heavy”, and she has experienced easy bruising for as long as she can remember."
    "Family history is significant for her mother, who had similar problems with bruising easily. "
    "The patient's vital signs include: heart rate 98/min, respiratory rate 14/min, temperature 36.1°C (96.9°F),"
    " and blood pressure 110/87 mm Hg. Physical examination is unremarkable. "
    " Laboratory tests show the following: platelet count 200,000/mm3, PT 12 seconds,"
    " and PTT 43 seconds. Which of the following is the most likely cause of this patient’s symptoms?"
    "A: Factor V Leiden B: Hemophilia A C: Lupus anticoagulant D: Protein C deficiency E Von Willebrand disease"
)
input_ids = tokenizer.encode(question, return_tensors="pt").to(device)
sample_output = model.generate(input_ids, do_sample=True, max_length=50, top_k=50)
print("Output:\n" + 100 * "-")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))`
... attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:28895 for open-end generation.
Input length of input_ids is 209, but max_length is set to 50. This can lead to unexpected behavior. You should consider increasing max_new_tokens.
Output:A 20-year-old woman presents with menorrhagia for the past several years.She says that her menses “have always been heavy”, and she has experienced easy bruising for as long as she can remember.Family history is significant for her mother, who had similar problems with bruising easily. The patient's vital signs include: heart rate 98/min, respiratory rate 14/min, temperature 36.1°C (96.9°F), and blood pressure 110/87 mm Hg. Physical examination is unremarkable. Laboratory tests show the following: platelet count 200,000/mm3, PT 12 seconds, and PTT 43 seconds. Which of the following is the most likely cause of this patient’s symptoms?A: Factor V Leiden B: Hemophilia A C: Lupus anticoagulant D: Protein C deficiency E Von Willebrand disease An
Notice that the answer seems to be truncated (the very last "An"). Is there a way to use the above code snippet to display the answer to the multiple-choice MedQA question? Thanks!
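The warning in the output hints at a likely cause of the truncation: `max_length` counts the prompt tokens too, and the prompt (209 tokens) is already longer than `max_length=50`. The sketch below only illustrates that arithmetic; `new_token_budget` is a hypothetical helper, not the library's implementation.

```python
from typing import Optional


def new_token_budget(prompt_len: int,
                     max_length: Optional[int] = None,
                     max_new_tokens: Optional[int] = None) -> int:
    """Tokens left for the model to generate beyond the prompt."""
    if max_new_tokens is not None:
        # max_new_tokens is counted on top of the prompt.
        return max_new_tokens
    if max_length is None:
        raise ValueError("set either max_length or max_new_tokens")
    # max_length is a cap on prompt + generated tokens together.
    return max(max_length - prompt_len, 0)


print(new_token_budget(209, max_length=50))      # 0
print(new_token_budget(209, max_new_tokens=20))  # 20
```

With a budget of zero the generation stops almost immediately, which is consistent with the output ending in "An". As the warning itself suggests, passing `max_new_tokens` to `model.generate(...)` instead of `max_length=50` would leave room for an answer, although a model that has not been fine-tuned may still not produce a clean answer letter.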
I was able to run the following command in the terminal:
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 --node_rank=0 \
  run_multiple_choice.py --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer --model_name_or_path \
  /root/.cache/huggingface/hub/models--stanford-crfm--BioMedLM --stanford-crfm--BioMedLM --train_file data/medqa_usmle_hf/train.json --validation_file data/medqa_usmle_hf/dev.json \
  --test_file data/medqa_usmle_hf/test.json --do_train --do_eval --do_predict --per_device_train_batch_size \
  1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1 \
  --learning_rate 2e-06 --warmup_ratio 0.5 --num_train_epochs 10 --max_seq_length 512 \
  --bf16 --seed 1 --data_seed 1 --logging_first_step --logging_steps 20 \
  --save_strategy no --evaluation_strategy steps --eval_steps 500 --run_name alex \
  --output_dir trash/ \
  --overwrite_output_dir
I see all THREE flags: --do_train, --do_eval, and --do_predict. Should I be able to use just --do_eval for my evaluation? Where would I see the results from eval? Thank you.
I've found this tutorial on multi-choice inference: https://huggingface.co/docs/transformers/tasks/multiple_choice#inference Are we supposed to train our BioMedLM on Multi-Choice task, before running inference, as in this example: https://huggingface.co/docs/transformers/tasks/multiple_choice#train ?
Thank you.
The results will be printed out after the training is complete. I think do_eval will just work for eval. That command is running fine-tuning for multiple choice, and at the end prints out the results and puts .json files in the directory for the fine-tuned model.
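To look at whatever .json files end up in the output directory, something like the sketch below could be used. It makes no assumption about the file names the script writes; `load_json_results` is a hypothetical helper, and `trash` is simply the `--output_dir` value from the command above.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_json_results(output_dir: str) -> Dict[str, Any]:
    """Load every parseable .json file found directly in output_dir."""
    results: Dict[str, Any] = {}
    directory = Path(output_dir)
    if not directory.is_dir():
        return results
    for path in sorted(directory.glob("*.json")):
        try:
            with path.open() as handle:
                results[path.name] = json.load(handle)
        except json.JSONDecodeError:
            # Skip files that are not valid JSON instead of failing.
            continue
    return results


for name, payload in load_json_results("trash").items():
    print(name, payload)
```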
Thank you, @J38. Appreciate your response.
I am running the following command on a single GPU (on https://colab.research.google.com/ using a Pro+ GPU):
task=medqa_usmle_hf
datadir=data/$task
outdir=runs/$task/GPT2
mkdir -p $outdir
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 --node_rank=0 \
  run_multiple_choice.py --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer --model_name_or_path \
  stanford-crfm/BioMedLM --train_file data/medqa_usmle_hf/train.json --validation_file data/medqa_usmle_hf/dev.json \
  --test_file data/medqa_usmle_hf/test.json --do_train --do_eval --do_predict --per_device_train_batch_size \
  1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1 \
  --learning_rate 2e-06 --warmup_ratio 0.5 --num_train_epochs 10 --max_seq_length 512 \
  --fp16 --seed 1 --data_seed 1 --logging_first_step --logging_steps 20 \
  --save_strategy no --evaluation_strategy steps --eval_steps 500 --run_name alex \
  --output_dir trash/ \
  --overwrite_output_dir
I am getting a GPU error:
I've been experimenting with export 'PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128' but I am getting the same error.
Do you have recommendations for parameters when running train/eval on single GPU?
Thanks
You're going to have to use cpu_offloading if you're trying to train this on a single GPU.
Here is a thread where I got it working on 1 GPU for sequence classification:
I think it may be sufficient to just update the deepspeed config to use cpu_offloading ... there is an example deepspeed config in that thread I shared in the previous comment.
What this will do is drop information to machine RAM allowing you to work with much larger models at the cost of running much more slowly. But it is the only option for a model this large when you don't have a lot of GPU memory ...
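A rough back-of-the-envelope estimate shows why a single GPU struggles here. The sketch rests on assumptions that are not stated in this thread: a model of about 2.7 billion parameters, and the common rule of thumb of roughly 16 bytes per parameter for mixed-precision Adam training (2 for half-precision weights, 2 for gradients, 12 for full-precision weights and the two optimizer moments). Activations are ignored, and `training_state_gib` is a hypothetical helper.

```python
ASSUMED_NUM_PARAMS = 2.7e9       # assumption: approximate model size
ASSUMED_BYTES_PER_PARAM = 16     # assumption: mixed-precision Adam rule of thumb


def training_state_gib(num_params: float,
                       bytes_per_param: int = ASSUMED_BYTES_PER_PARAM) -> float:
    """Approximate memory for weights, gradients and optimizer state, in GiB."""
    return num_params * bytes_per_param / 2**30


estimate = training_state_gib(ASSUMED_NUM_PARAMS)
print(f"about {estimate:.0f} GiB before activations")
```

Under these assumptions the training state alone is on the order of the memory of the largest single GPUs, which is why moving optimizer state to machine RAM helps, at the cost of speed.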
You will need to use DeepSpeed rather than the torch distributed launch ... so I can see if I can get an example for the multiple-choice code working. It should be similar to what I posted for the sequence classification example.
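For orientation, a DeepSpeed config that offloads optimizer state to CPU might be built like the sketch below. This is not the config from the thread referenced above; it is an assumed minimal example using ZeRO stage 2 with `offload_optimizer`, and the "auto" values rely on the Hugging Face Trainer integration filling them in from the command line. `build_offload_config` is a hypothetical helper.

```python
import json
from typing import Any, Dict


def build_offload_config() -> Dict[str, Any]:
    """Assumed minimal DeepSpeed config with optimizer state offloaded to CPU."""
    return {
        "fp16": {"enabled": "auto"},
        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": "auto",
                "betas": "auto",
                "eps": "auto",
                "weight_decay": "auto",
            },
        },
        "zero_optimization": {
            "stage": 2,
            # Keep optimizer state in machine RAM instead of GPU memory.
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
        },
        "gradient_accumulation_steps": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "train_batch_size": "auto",
    }


config_text = json.dumps(build_offload_config(), indent=2)
print(config_text)
```

The printed JSON could be saved to a file and passed with `--deepspeed`, but any real run should start from a config known to work with this model.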
@J38 Thank you for the guidance. We've just got deepspeed to work!
Here is the code in Jupyter Notebook:
!pip install fairscale
!pip install accelerate
!pip install deepspeed
Here is the command line that worked (but ran VERY slowly):
`task=medqa_usmle_hf ; datadir=data/$task ; export WANDB_PROJECT=biomedical-nlp-eval
deepspeed --num_gpus 1 --num_nodes 1 run_multiple_choice.py \
  --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer --model_name_or_path stanford-crfm/BioMedLM \
  --train_file $datadir/train.json --validation_file $datadir/dev.json --test_file $datadir/test.json \
  --do_train --do_eval --do_predict --per_device_train_batch_size 1 --gradient_accumulation_steps 2 \
  --learning_rate 2e-06 --warmup_ratio 0.5 --num_train_epochs 20 --max_seq_length 560 \
  --logging_steps 100 --save_strategy no --evaluation_strategy no \
  --output_dir medqa-finetune-demo --overwrite_output_dir --fp16 --seed 1 \
  --run_name medqa-finetune-demo --deepspeed deepspeed_config.json`
The deepspeed_config.json was taken from this thread: https://github.com/stanford-crfm/BioMedLM/issues/9
Just to summarize, there are several ways to run a fine-tuning process, including:
If you use deepspeed, the deepspeed config will determine optimizer settings. So for instance that config sets the learning rate, so make sure you review the deepspeed config and set the training parameters the way you want for the experiment.
I think the --learning_rate 2e-06 in your command may be overridden by the config. It's possible deepspeed will just notice this, but I would advise carefully reviewing the config to make sure all of the settings are what you want.
Now it happens the deepspeed config I showed had learning rate 2e-06 ... but just wanted to let you know that that config will influence the optimizer settings, because deepspeed executes the optimization.
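A small check like the one below can catch a learning rate in the config that differs from the one on the command line. It is a sketch under assumptions: the config is taken to store the rate under `optimizer.params.lr`, an "auto" value is taken to mean the Hugging Face Trainer integration fills it in from the command line, and `check_learning_rate` is a hypothetical helper.

```python
import math
from typing import Any, Dict


def check_learning_rate(ds_config: Dict[str, Any], cli_lr: float) -> str:
    """Compare the config's optimizer learning rate with the command-line value."""
    lr = ds_config.get("optimizer", {}).get("params", {}).get("lr")
    if lr is None:
        return "config does not set a learning rate"
    if lr == "auto":
        return "config defers to the command line"
    if not math.isclose(float(lr), cli_lr, rel_tol=1e-9):
        return f"mismatch: config has {lr}, command line has {cli_lr}"
    return "config and command line agree"


print(check_learning_rate({"optimizer": {"params": {"lr": 2e-06}}}, 2e-06))
print(check_learning_rate({"optimizer": {"params": {"lr": 1e-05}}}, 2e-06))
```

The same pattern extends to other optimizer settings, such as weight decay, that appear both in the config and on the command line.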
It is expected to be really slow, sorry, but training a model this large on 1 GPU is going to take a bit of time vs. using multiple GPUs. I think 8 GPUs take 1.5h to fine tune on this set, so it will be substantially slower with 1 GPU and cpu_offloading.
I will work to take notes from these issues and update the documentation to have some clear fine-tune on 1 GPU examples ... I think 1 GPU with cpu_offloading is going to be a common use case for a lot of users.
The PubMedQA task should only take like 4 hours, but that is a much smaller training set ...
We were able to run preprocess_medqa.py based on the steps in https://github.com/stanford-crfm/BioMedLM/tree/main/finetune/mc
Next we wanted to run the evaluator as we already downloaded the question and answers
We went here https://github.com/stanford-crfm/BioMedLM/tree/main/finetune and ran:
task=medqa_usmle_hf
datadir=data/$task
outdir=runs/$task/GPT2
mkdir -p $outdir
python -m torch.distributed.launch --nproc_per_node={num_devices} --nnodes=1 --node_rank=0 \
  run_multiple_choice.py --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer --model_name_or_path \
  {checkpoint} --train_file data/medqa_usmle_hf/train.json --validation_file data/medqa_usmle_hf/dev.json \
  --test_file data/medqa_usmle_hf/test.json --do_train --do_eval --do_predict --per_device_train_batch_size \
  {train_per_device_batch_size} --per_device_eval_batch_size 1 --gradient_accumulation_steps {grad_accum} \
  --learning_rate {lr} --warmup_ratio 0.5 --num_train_epochs {epochs} --max_seq_length 512 \
  --{numerical_format} --seed {seed} --data_seed {seed} --logging_first_step --logging_steps 20 \
  --save_strategy no --evaluation_strategy steps --eval_steps 500 --run_name {run_name} \
  --output_dir trash/ \
  --overwrite_output_dir
It asks for various arguments that are missing, e.g. {num_devices}, {checkpoint}, {train_per_device_batch_size}, etc.
Can someone give us the exact command to execute "run_multiple_choice.py", with arguments?
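The curly-brace names in that command are placeholders to be filled in, not literal arguments. The sketch below fills them with example values mentioned elsewhere in this thread (8 GPUs, learning rate 2e-06, 10 epochs, bf16, seed 1); the `run_name` value is made up, and the right settings depend on the available hardware.

```python
# Template copied from the README-style command, placeholders left in braces.
TEMPLATE = (
    "python -m torch.distributed.launch --nproc_per_node={num_devices} "
    "--nnodes=1 --node_rank=0 run_multiple_choice.py "
    "--tokenizer_name stanford-crfm/pubmed_gpt_tokenizer "
    "--model_name_or_path {checkpoint} "
    "--train_file data/medqa_usmle_hf/train.json "
    "--validation_file data/medqa_usmle_hf/dev.json "
    "--test_file data/medqa_usmle_hf/test.json "
    "--do_train --do_eval --do_predict "
    "--per_device_train_batch_size {train_per_device_batch_size} "
    "--per_device_eval_batch_size 1 "
    "--gradient_accumulation_steps {grad_accum} "
    "--learning_rate {lr} --warmup_ratio 0.5 "
    "--num_train_epochs {epochs} --max_seq_length 512 "
    "--{numerical_format} --seed {seed} --data_seed {seed} "
    "--logging_first_step --logging_steps 20 "
    "--save_strategy no --evaluation_strategy steps --eval_steps 500 "
    "--run_name {run_name} --output_dir trash/ --overwrite_output_dir"
)

# Example values only; adjust to the machine and experiment.
values = {
    "num_devices": 8,
    "checkpoint": "stanford-crfm/BioMedLM",
    "train_per_device_batch_size": 1,
    "grad_accum": 1,
    "lr": "2e-06",
    "epochs": 10,
    "numerical_format": "bf16",
    "seed": 1,
    "run_name": "medqa-baseline",  # hypothetical experiment name
}

command = TEMPLATE.format(**values)
print(command)
```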