microsoft / BioGPT

MIT License

BioGPT is now available in 🤗 Transformers #31

Open NielsRogge opened 1 year ago

NielsRogge commented 1 year ago

BioGPT is now available for usage in 🤗 Transformers!

Docs: https://huggingface.co/docs/transformers/main/en/model_doc/biogpt.

Checkpoints on the hub: https://huggingface.co/microsoft/biogpt

It'd be very nice if someone converted the remaining BioGPT checkpoints to the HuggingFace format. The conversion script can be found here.

kamalkraj commented 1 year ago

https://huggingface.co/kamalkraj/BioGPT-Large-PubMedQA

harveenchadha commented 1 year ago

@kamalkraj Where did you find the model dict? It should be inside the checkpoint folder, but it is not provided explicitly. I then remembered the change in fairseq where they started storing the dict as part of the model, but even after loading the model I was unable to find the dict.

kamalkraj commented 1 year ago

Hi @harveenchadha,

Once the model is loaded as shown below

import torch
torch.manual_seed(42)

from src.transformer_lm_prompt import TransformerLanguageModelPrompt

model = TransformerLanguageModelPrompt.from_pretrained(
        "../QA-PubMedQA-BioGPT-Large",
        "checkpoint.pt",
        "../QA-PubMedQA-BioGPT-Large",
        max_len_b=1024,
        max_tokens=12000,
        source_lang="x",
        target_lang="y")

you can save the dict using:

model.src_dict.save("new_dict/dict.txt")
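For completeness, a minimal generation sketch with the model loaded above; this assumes the fairseq hub interface's encode/generate/decode methods and a CUDA-capable GPU, and the prompt text is only an illustration, not the exact PubMedQA prompt format.

# Generation sketch for the fairseq checkpoint loaded above (illustrative prompt only).
model.cuda()
src_text = "question: ... context: ... answer:"  # placeholder prompt
src_tokens = model.encode(src_text)
hypotheses = model.generate([src_tokens], beam=5)[0]
print(model.decode(hypotheses[0]["tokens"]))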
harveenchadha commented 1 year ago

Hi @kamalkraj,

Thanks for the reply, but it looks like you need a dict to load the model itself. What am I doing wrong?

Here is a colab

harveenchadha commented 1 year ago

Oh man! I just found out the dict and bpecodes are present in the data folder itself :D

NielsRogge commented 1 year ago

@kamalkraj do you mind converting the other BioGPT checkpoints?

Can I transfer this checkpoint to the Microsoft organization?

kamalkraj commented 1 year ago

@kamalkraj do you mind converting the other BioGPT checkpoints?

Can I transfer this checkpoint to the Microsoft organization?

You can transfer https://huggingface.co/kamalkraj/BioGPT-Large-PubMedQA to Microsoft.

I will update this issue as I convert the other models.

kamalkraj commented 1 year ago

@NielsRogge https://huggingface.co/kamalkraj/BioGPT-Large

evanbrociner commented 1 year ago

Is it possible to fine-tune a model through the Hugging Face package? Thank you!

sockthem commented 1 year ago

@harveenchadha were you able to execute it in Colab?

sockthem commented 1 year ago

@NielsRogge can you help me with question-answering inference documentation for the same model? I got multiple errors.

NielsRogge commented 1 year ago

@evanbrociner yes, fine-tuning can be done easily. See our example notebook and example script to fine-tune any GPT-like model (like BioGPT) on your custom dataset.
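For reference, a rough sketch of what causal-LM fine-tuning looks like with the Trainer API; the corpus file, sequence length, and hyperparameters below are placeholders, not recommendations.

# Causal-LM fine-tuning sketch; "my_corpus.txt" and the hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")

dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# mlm=False gives standard next-token (causal) language-modeling labels.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="biogpt-finetuned",
                           per_device_train_batch_size=4, num_train_epochs=3),
    train_dataset=tokenized["train"],
    data_collator=collator)
trainer.train()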

NielsRogge commented 1 year ago

@sockthem sure, note that BioGptForCausalLM is just a generative model which you can prompt with text and it will continue the prompt. It's not like BertForQuestionAnswering, which does extractive question answering from a piece of text.
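For example, a minimal prompting sketch (the prompt and generation settings are arbitrary):

# Text-generation sketch: BioGPT simply continues the prompt.
from transformers import BioGptForCausalLM, BioGptTokenizer, pipeline, set_seed

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
set_seed(42)
print(generator("COVID-19 is", max_length=40, num_return_sequences=1, do_sample=True))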

evanbrociner commented 1 year ago

@NielsRogge Thank you for all the amazing help! Another quick question: might a Hugging Face implementation of fine-tuned BioGPT for the document classification task on HoC be in the works?

NielsRogge commented 1 year ago

@evanbrociner there's currently a contributor adding a BioGptForSequenceClassification class, which could be used for this purpose. Alternatively, you could fine-tune BioGPT to simply make it generate the appropriate class as next token.

However note that GPT-like (decoder-only Transformer) models oftentimes aren't the best at classification tasks, as they have a causal attention mask instead of a bidirectional attention mask (meaning they can only look at previous tokens when making a prediction, whereas BERT-like or encoder-only Transformers can look in both directions).

For classifying biomedical texts, a model like BioClinicalBERT might work better.
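Once that class lands, usage should look roughly like the sketch below; this assumes BioGptForSequenceClassification is available in your installed version of Transformers, the classification head is randomly initialized and still needs fine-tuning, and num_labels=10 is only because HoC has ten classes.

# Sequence-classification sketch (head is untrained; num_labels=10 mirrors HoC).
import torch
from transformers import BioGptForSequenceClassification, BioGptTokenizer

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForSequenceClassification.from_pretrained("microsoft/biogpt", num_labels=10)

inputs = tokenizer("The tumour suppressor p53 is frequently mutated in this cancer.",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1))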

SalvatoreRa commented 1 year ago

Hi,

Can the BioGPT checkpoint on Transformers be used for relation extraction on PubMed?

timothylimyl commented 1 year ago

@evanbrociner there's currently a contributor adding a BioGptForSequenceClassification class, which could be used for this purpose. Alternatively, you could fine-tune BioGPT to simply make it generate the appropriate class as next token.

However note that GPT-like (decoder-only Transformer) models oftentimes aren't the best at classification tasks, as they have a causal attention mask instead of a bidirectional attention mask (meaning they can only look at previous tokens when making a prediction, whereas BERT-like or encoder-only Transformers can look in both directions).

For classifying biomedical texts, a model like BioClinicalBERT might work better.

I found it surprising that BioGPT works better than BioBERT variants in the downstream tasks as shown by BioGPT's paper.

ZON-ZONG-MIN commented 1 year ago

@sockthem sure, note that BioGptForCausalLM is just a generative model which you can prompt with text and it will continue the prompt. It's not like BertForQuestionAnswering, which does extractive question answering from a piece of text.

I want to make sure my BioGPT knowledge is correct.

The link below is an example where it seems to be able to handle only text-generation tasks: https://colab.research.google.com/drive/1YZxASGlrTOzM5Mxv3yF1rzyxehRa3SIh?usp=sharing#scrollTo=C8uvWlZGOtY_

If I want to try the relation extraction task, I need to add and train other modules (e.g. BioGPT-RE-BC5CDR or BioGPT-RE-DDI).

Is that right?

NielsRogge commented 1 year ago

Yeah from this list it looks like only 3 models have been converted to the HF format so far.

The conversion script (to convert models from this repository to the HF format) can be found here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/biogpt/convert_biogpt_original_pytorch_checkpoint_to_pytorch.py. cc @kamalkraj
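After running that script, a quick way to sanity-check a converted folder is to load it with the generic Auto classes and generate a few tokens. This is only a sketch: "converted-biogpt" is a hypothetical output directory, and it assumes the conversion output also contains the tokenizer files (otherwise load the tokenizer from microsoft/biogpt).

# Sanity-check sketch for a converted checkpoint; "converted-biogpt" is hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer

converted_dir = "converted-biogpt"
tokenizer = AutoTokenizer.from_pretrained(converted_dir)
model = AutoModelForCausalLM.from_pretrained(converted_dir)

inputs = tokenizer("Aspirin is commonly used to treat", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))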

esko22 commented 1 year ago

Hi @NielsRogge, @kamalkraj,

I wanted to take a stab at converting the fine-tuned models but came up short with the following error:

RuntimeError: Error(s) in loading state_dict for BioGptForCausalLM:
        size mismatch for biogpt.embed_tokens.weight: copying a param with shape torch.Size([42393, 1024]) from checkpoint, the shape in current model is torch.Size([42384, 1024]).
        size mismatch for output_projection.weight: copying a param with shape torch.Size([42393, 1024]) from checkpoint, the shape in current model is torch.Size([42384, 1024]).

It appears that the new model shapes are off by 9 parameters, but I am not sure why. If I am missing something obvious, bear with me as I am just getting my feet wet here. I was able to run the script mentioned above with success on the pre-trained BioGPT with no problems at all. Regarding the bpecodes and the dict.txt, I ran the preprocessing step for all the models and copied them from the corresponding /data directories.

I pulled down the checkpoint files for DDI, DTI and BC5CDR as I am interested in trying out some of the NER tasks, but I have not been able to run any of those models successfully using PyTorch, as I keep getting the following:

AssertionError: Could not infer task type from {'_name': 'language_modeling_prompt', 'data': 'data', 'sample_break_mode': 'none', 'tokens_per_sample': 1024, 'output_dictionary_size': -1, 'self_target': False, 'future_target': False, 'past_target': False, 'add_bos_token': False, 'max_target_positions': 1024, 'shorten_method': 'none', 'shorten_data_split_list': '', 'pad_to_fixed_length': False, 'pad_to_fixed_bsz': False, 'seed': 1, 'batch_size': None, 'batch_size_valid': None, 'dataset_impl': None, 'data_buffer_size': 10, 'tpu': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma', 'source_lang': None, 'target_lang': None, 'max_source_positions': 640, 'manual_prompt': None, 'learned_prompt': 9, 'learned_prompt_pattern': 'learned', 'prefix': False, 'sep_token': '<seqsep>'}. Available argparse tasks: dict_keys(['sentence_prediction', 'sentence_prediction_adapters', 'speech_unit_modeling', 'hubert_pretraining', 'denoising', 'multilingual_denoising', 'translation', 'multilingual_translation', 'translation_from_pretrained_bart', 'translation_lev', 'language_modeling', 'speech_to_text', 'legacy_masked_lm', 'text_to_speech', 'speech_to_speech', 'online_backtranslation', 'simul_speech_to_text', 'simul_text_to_text', 'audio_pretraining', 'semisupervised_translation', 'frm_text_to_speech', 'cross_lingual_lm', 'translation_from_pretrained_xlm', 'multilingual_language_modeling', 'audio_finetuning', 'masked_lm', 'sentence_ranking', 'translation_multi_simple_epoch', 'multilingual_masked_lm', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt']). Available hydra tasks: dict_keys(['sentence_prediction', 'sentence_prediction_adapters', 'speech_unit_modeling', 'hubert_pretraining', 'translation', 'translation_lev', 'language_modeling', 'simul_text_to_text', 'audio_pretraining', 'translation_from_pretrained_xlm', 'multilingual_language_modeling', 'audio_finetuning', 'masked_lm', 'dummy_lm', 'dummy_masked_lm'])

It could easily be something I'm doing wrong here, but being able to run the pre-trained model via PyTorch and through the HF conversion script, while failing on the others, makes me think there is something off with the fine-tuned checkpoint files - checkpoint_avg.pt

Cheers

TRGanesh commented 9 months ago

Hi, I want to perform question answering using BioGPT. Could you please help me with that?