stanford-crfm / BioMedLM


Can you share preprocessed datasets for fine-tuning? #5

Closed: Nardien closed this issue 1 year ago

Nardien commented 1 year ago

Thank you for open-sourcing this valuable resource. I am interested in reproducing the experiments in this repository and would like to follow the fine-tuning setup you used.

However, I noticed that the data folders for the fine-tuning tasks do not contain the datasets. Could you please share the preprocessed datasets or provide guidelines on how to preprocess the data for the reproduction of the results?

I would greatly appreciate any assistance you can provide.

J38 commented 1 year ago

Sure thing, I will add more details about how to do that and some helper scripts.

Nardien commented 1 year ago

Thanks a lot!! Please close this issue when you finish the work.

J38 commented 1 year ago

I have added some basic instructions for getting MedQA, PubMedQA, and BioASQ running. There are preprocessing scripts in finetune/mc and finetune/seqcls. You do need to go to the original sources for these tasks to download the raw data. Then you can run the preprocessing scripts, and they will produce the one-example-per-line .jsonl files that are expected.
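For a quick sanity check that a produced file really is one example per line, something like the sketch below works (the file name is just a placeholder, not a path the scripts actually write):

import json

# Placeholder path; the real file names come from the preprocessing scripts
# in finetune/mc and finetune/seqcls.
with open("train.jsonl") as f:
    for i, line in enumerate(f):
        example = json.loads(line)  # every line should be a standalone JSON object
        assert isinstance(example, dict), f"line {i} is not a JSON object"
print("every line parsed as its own example")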

I haven't really tested rebuilding this data, so please let me know if you encounter any issues or if anything is unclear, and I will update the scripts and instructions!

evanbrociner commented 1 year ago

Are there pretrained weights available for the MedQA or PubMedQA fine-tuned models? Thank you!

J38 commented 1 year ago

We didn't save the fine-tuned models for the tasks. There are some upcoming conference deadlines, so our compute is really busy at this time, but when it frees up I intend to redo the fine-tuning, save the models this time, and make those available.

evanbrociner commented 1 year ago

No problem, thank you so much!

Nardien commented 1 year ago

Thank you for the quick update! I have checked that the preprocessing code for PubMedQA and MedQA works well, and I successfully fine-tuned the GPT-2 model on both datasets. (Unfortunately, my compute resources also cannot handle PubMedGPT fine-tuning...)
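For reference, this is roughly the kind of GPT-2 (124M) setup I used for PubMedQA-style classification; it is only a sketch using the standard Hugging Face Transformers API, not the exact arguments of the finetune/seqcls script:

from transformers import GPT2Tokenizer, GPT2ForSequenceClassification

# Sketch only: 3-way PubMedQA-style classification (yes / no / maybe) with the
# 124M "gpt2" checkpoint; finetune/seqcls may configure things differently.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=3)
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("question text plus abstract context here",
                   return_tensors="pt", truncation=True, max_length=512)
logits = model(**inputs).logits  # shape (1, 3): yes / no / maybe scores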

Regarding BioASQ, I could not find the appropriate raw dataset files in TSV format on its official homepage. I found a GitHub repo providing a preprocessed version of BioASQ, but I am not sure it is the same version you used in your experiments. If it's okay, could you share more details on the BioASQ dataset you used?

Thanks again for your kind and responsive comment!

Nardien commented 1 year ago

++ Update

I found that LinkBERT offers detailed preprocessing scripts for BioASQ following the BLURB benchmark!

It seems that this repo uses the same preprocessing pipeline as LinkBERT, so I will try it.

++ Update 2

I think I can now successfully run the BioASQ fine-tuning. Finally, I obtained the following scores by fine-tuning GPT-2 (124M) on the three tasks:

These results might be good starting points for my research... Thanks for your help!

J38 commented 1 year ago

Yes, I directly copied those scripts into this repo as well, in finetune/seqcls and finetune/mc.

J38 commented 1 year ago

Did I miss anything when copying over? I did delete the non-QA stuff for now ... though technically people might want that too ... we haven't focused on the full BLURB yet ...

J38 commented 1 year ago

@michiyasunaga, who created LinkBERT, is one of the team members on this project as well ...

Nardien commented 1 year ago

At least for BioASQ, I think people need the following code from the BLURB benchmark to preprocess it successfully.

wget https://microsoft.github.io/BLURB/sample_code/data_generation.tar.gz
tar -xf data_generation.tar.gz

In the downloaded code, there is a preprocessing script (preprocessor.py) that covers BioASQ; it parses BioASQ-training7b/trainining7b.json and Task7BGoldenEnriched into train.tsv, dev.tsv, and test.tsv files. (Note that BioASQ-training7b/trainining7b.json and Task7BGoldenEnriched can be downloaded from the official BioASQ homepage under Datasets for task b - BioASQ7.)

With the preprocessed *.tsv files, I can finally produce the .jsonl files using the preprocessing code in this repo.
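For anyone curious, that last TSV-to-JSONL step amounts to something like the following sketch (the column order and field names here are assumed for illustration; the actual script in finetune/seqcls is authoritative):

import csv
import json

# Assumed layout: tab-separated (question, passage, label) rows; the real column
# order comes from the BLURB preprocessor.py output, and the field names expected
# downstream are defined by the preprocessing script in finetune/seqcls.
with open("test.tsv") as tsv_in, open("test.jsonl", "w") as jsonl_out:
    for question, passage, label in csv.reader(tsv_in, delimiter="\t"):
        record = {"question": question, "passage": passage, "label": label}
        jsonl_out.write(json.dumps(record) + "\n")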

I think adding the above information to this repo might be helpful for future practitioners!

J38 commented 1 year ago

Sounds good ... the scripts from LinkBERT are exactly what we used for this ...

evanbrociner commented 1 year ago

@Nardien any chance you can share the weights of the fine-tuned model? Thank you