stanford-crfm / BioMedLM


Distilling PubMedGPT #3

Open ChantalMP opened 1 year ago

ChantalMP commented 1 year ago

Thank you very much for this great work and for publishing the model! Do you have any plans to train or publish a distilled version of your model, since the current size requires a lot of resources?

J38 commented 1 year ago

We are very committed to helping people use the model, and I think part of this project is figuring out how to make a large-scale model like this useful to the broader research community.

A simple solution would be for us to release one of the smaller models we trained on the way to the 2.7B. This would come at the cost of reduced task performance.

There are two aspects to this problem: handling fine-tuning and handling inference.

For fine-tuning, one possible way forward could be for us to fine-tune several biomedical task models (e.g. QA, summarization) ... and then make those fine-tuned models available to researchers. You could imagine making a general biomedical QA model, and then if users put their custom QA task into the proper format, they could get reasonable results. I can't make any promises, but another possible direction is for users to give us their task data (if it is not private) and we fine-tune models for them to make the model more accessible. I am checking whether that is feasible for cases where it would only take us 30 minutes to an hour.

For inference, I think we could explore the kinds of things Tim Dettmers is working on, for instance making an 8-bit version of the model for inference time. This would greatly reduce the resources needed to run inference.
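For illustration, a minimal sketch of what 8-bit loading could look like with Hugging Face transformers and bitsandbytes (the `stanford-crfm/BioMedLM` model id, the flags, and the prompt are assumptions for illustration, not a tested recipe):

```python
# Sketch: load the model with int8 weights (requires `bitsandbytes` and `accelerate`).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stanford-crfm/BioMedLM"  # assumed Hugging Face model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # place layers across available GPUs
    load_in_8bit=True,   # int8 weights via bitsandbytes (LLM.int8())
)

inputs = tokenizer("Metformin is commonly used to treat", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```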

Please feel free to let us know what projects you are working on, and we can see what we can do to help make the model useful for you!

ChantalMP commented 1 year ago

Hi, I wanted to use the model as a decoder for medical VQA, where I would need to fine-tune it to also take the image information into account. Fine-tuning only a few layers is a possibility, but that might hurt performance, and it is still very slow for me because of the model size. This is just one example of an application where it would be beneficial to have a model small enough for fine-tuning.
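As a rough illustration of the "fine-tune only a few layers" option, here is a sketch that freezes everything except the last few transformer blocks of a GPT-2-style checkpoint (the model id, attribute names, and block count are assumptions based on BioMedLM using a GPT-2 architecture):

```python
# Sketch: partial fine-tuning by freezing all but the last N transformer blocks.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("stanford-crfm/BioMedLM")  # assumed id

NUM_TRAINABLE_BLOCKS = 2  # hypothetical choice; tune for your compute budget

# Freeze every parameter, then re-enable gradients for the last blocks
# and the final layer norm (GPT-2 layout: model.transformer.h / ln_f).
for param in model.parameters():
    param.requires_grad = False
for block in model.transformer.h[-NUM_TRAINABLE_BLOCKS:]:
    for param in block.parameters():
        param.requires_grad = True
for param in model.transformer.ln_f.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```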

I was thinking about distillation as a potential way of reducing size while keeping the performance as high as possible.

J38 commented 1 year ago

Okay, I understand. We're open-minded about looking into that, but may not have the time to get it working.

At the moment, this is the best resource I know for trying a distillation experiment: https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation
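Roughly, the core ingredient of that recipe is a soft-target loss between teacher and student logits, combined with the student's own language-modeling loss. A minimal sketch of such a loss (the temperature and mixing weight are illustrative assumptions, not values from the linked project):

```python
# Sketch: soft-target knowledge distillation loss (DistilBERT/DistilGPT2-style).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, lm_loss,
                      temperature=2.0, alpha=0.5):
    """Mix a temperature-scaled KL term with the student's LM loss."""
    t = temperature
    kl = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)  # standard temperature-squared scaling
    return alpha * kl + (1.0 - alpha) * lm_loss
```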

Is there anything better you know of?