Open NielsRogge opened 1 year ago
@kamalkraj Where did you find the model dict? It should be inside the checkpoint folder but it is not provided explicitly. I then remembered the change in fairseq where they used to store the dict as a part of the model. But even after loading the model, I was unable to find the dict.
Hi @harveenchadha,
Once the model is loaded as shown below:
import torch
torch.manual_seed(42)
from src.transformer_lm_prompt import TransformerLanguageModelPrompt
model = TransformerLanguageModelPrompt.from_pretrained(
    "../QA-PubMedQA-BioGPT-Large",
    "checkpoint.pt",
    "../QA-PubMedQA-BioGPT-Large",
    max_len_b=1024,
    max_tokens=12000,
    source_lang="x",
    target_lang="y")
you can save the dict using:
model.src_dict.save("new_dict/dict.txt")
Hi @kamalkraj,
Thanks for the reply, but it looks like you need a dict to load the model in the first place. What am I doing wrong?
Here is a colab
Oh man! I just found out that the dict and bpecodes are present in the data folder itself :D
@kamalkraj do you mind converting the other BioGPT checkpoints?
Can I transfer this checkpoint to the Microsoft organization?
You can transfer https://huggingface.co/kamalkraj/BioGPT-Large-PubMedQA to Microsoft.
I will post updates in this issue as I convert the other models.
@NielsRogge https://huggingface.co/kamalkraj/BioGPT-Large
Is it possible to fine-tune a model through the huggingface package? Thank you!
@harveenchadha were you able to run it in Colab?
@NielsRogge can you help me with Question Answering inferencing documentation from the same model? Got multiple errors.
@evanbrociner yes fine-tuning can be done easily. See our example notebook and example script to fine-tune any GPT-like model (like BioGPT) on your custom dataset.
@sockthem sure, note that BioGptForCausalLM is just a generative model which you can prompt with text and it will continue the prompt. It's not like BertForQuestionAnswering, which does extractive question answering from a piece of text.
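A minimal sketch of what "prompting a generative model" could look like with the Transformers classes, assuming the microsoft/biogpt checkpoint from the hub. The `build_qa_prompt` helper and its template are hypothetical illustrations, not the format any BioGPT checkpoint was trained on, so the quality of the continuation is not guaranteed.

```python
def build_qa_prompt(question: str, context: str) -> str:
    """Hypothetical prompt template (an assumption, not BioGPT's training format)."""
    return f"question: {question} context: {context} answer:"


def continue_prompt(prompt: str, max_new_tokens: int = 30) -> str:
    """Let the causal LM continue the prompt and return the decoded text."""
    # Imported lazily so build_qa_prompt can be used without transformers installed.
    from transformers import BioGptForCausalLM, BioGptTokenizer

    tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
    model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

    inputs = tokenizer(prompt, return_tensors="pt")
    # generate() appends new tokens to the prompt; it does not extract a span
    # from the context the way an extractive QA head would.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `continue_prompt(build_qa_prompt(question, context))` downloads the checkpoint on first use and returns the prompt followed by the generated continuation.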
@NielsRogge Thank you for all the amazing help! Another quick question: might a Hugging Face implementation of fine-tuned BioGPT for the document classification task on HoC be in the works?
@evanbrociner there's currently a contributor adding a BioGptForSequenceClassification class, which could be used for this purpose. Alternatively, you could fine-tune BioGPT to simply make it generate the appropriate class as next token.
However note that GPT-like (decoder-only Transformer) models oftentimes aren't the best at classification tasks, as they have a causal attention mask instead of a bidirectional attention mask (meaning they can only look at previous tokens when making a prediction, whereas BERT-like or encoder-only Transformers can look in both directions).
For classifying biomedical texts, a model like BioClinicalBERT might work better.
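The difference between the two attention masks can be illustrated with a small, library-free sketch. The function names are illustrative only; real models build these masks internally as tensors.

```python
def causal_mask(n: int) -> list[list[int]]:
    """Row i marks which positions token i may attend to: only positions <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> list[list[int]]:
    """Every token may attend to every position, as in BERT-like encoders."""
    return [[1] * n for _ in range(n)]


for name, mask in [("causal", causal_mask(4)), ("bidirectional", bidirectional_mask(4))]:
    print(name)
    for row in mask:
        print(" ", row)
```

In the causal mask the first token sees only itself, so a classification decision read from an early position cannot use later context; only the last position sees the full input.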
Hi,
Can the BioGPT checkpoint in transformers be used for relation extraction on PubMed?
I found it surprising that BioGPT works better than BioBERT variants in the downstream tasks as shown by BioGPT's paper.
I want to make sure my bioGPT knowledge is correct.
The link below is an example where it seems to only be able to handle Text-Generation tasks. https://colab.research.google.com/drive/1YZxASGlrTOzM5Mxv3yF1rzyxehRa3SIh?usp=sharing#scrollTo=C8uvWlZGOtY_
If I want to try the Relation Extraction task, I need to add and train other modules (e.g. BioGPT-RE-BC5CDR or BioGPT-RE-DDI)
Is that right?
Yeah from this list it looks like only 3 models have been converted to the HF format so far.
The conversion script (to convert models from this repository to the HF format) can be found here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/biogpt/convert_biogpt_original_pytorch_checkpoint_to_pytorch.py. cc @kamalkraj
Hi @NielsRogge, @kamalkraj,
I wanted to take a stab at converting the fine-tuned models but came up short with the following error:
RuntimeError: Error(s) in loading state_dict for BioGptForCausalLM:
size mismatch for biogpt.embed_tokens.weight: copying a param with shape torch.Size([42393, 1024]) from checkpoint, the shape in current model is torch.Size([42384, 1024]).
size mismatch for output_projection.weight: copying a param with shape torch.Size([42393, 1024]) from checkpoint, the shape in current model is torch.Size([42384, 1024]).
It appears that the new model's shapes are off by 9 rows, but I am not sure why. If I am missing something obvious, bear with me as I am just getting my feet wet here. I was able to run the script mentioned above successfully on the Pre-trained-BioGPT with no problems at all. Regarding the bpecodes and the dict.txt, I ran the preprocessing step for all the models and copied them from the corresponding /data directories.
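The mismatch reported above can be quantified with a short sketch that uses only the shapes copied from the error message. The helper name `extra_rows` is hypothetical. Note that the task config printed in the AssertionError in this same post lists 'learned_prompt': 9; whether the 9 extra embedding rows correspond to those learned prompt tokens is an assumption that is not confirmed anywhere in this thread.

```python
# Shapes copied from the RuntimeError above: (vocab_size, hidden_size).
checkpoint_shapes = {
    "biogpt.embed_tokens.weight": (42393, 1024),
    "output_projection.weight": (42393, 1024),
}
model_shapes = {
    "biogpt.embed_tokens.weight": (42384, 1024),
    "output_projection.weight": (42384, 1024),
}


def extra_rows(checkpoint_shapes: dict, model_shapes: dict) -> dict:
    """Return {param_name: checkpoint_rows - model_rows} for mismatched params."""
    diffs = {}
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        if model_shape is not None and ckpt_shape != model_shape:
            diffs[name] = ckpt_shape[0] - model_shape[0]
    return diffs


diffs = extra_rows(checkpoint_shapes, model_shapes)
print(diffs)
```

Both parameters differ only in the first (vocabulary) dimension and by the same amount, which suggests the fine-tuned checkpoint was trained with a larger vocabulary than the one the target model config declares.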
I pulled down the checkpoint files for DDI, DTI and BC5CDR as I am interested in trying out some of the NER tasks, but I have not been able to run any of those models successfully using PyTorch, as I keep getting the following:
AssertionError: Could not infer task type from {'_name': 'language_modeling_prompt', 'data': 'data', 'sample_break_mode': 'none', 'tokens_per_sample': 1024, 'output_dictionary_size': -1, 'self_target': False, 'future_target': False, 'past_target': False, 'add_bos_token': False, 'max_target_positions': 1024, 'shorten_method': 'none', 'shorten_data_split_list': '', 'pad_to_fixed_length': False, 'pad_to_fixed_bsz': False, 'seed': 1, 'batch_size': None, 'batch_size_valid': None, 'dataset_impl': None, 'data_buffer_size': 10, 'tpu': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma', 'source_lang': None, 'target_lang': None, 'max_source_positions': 640, 'manual_prompt': None, 'learned_prompt': 9, 'learned_prompt_pattern': 'learned', 'prefix': False, 'sep_token': '<seqsep>'}. Available argparse tasks: dict_keys(['sentence_prediction', 'sentence_prediction_adapters', 'speech_unit_modeling', 'hubert_pretraining', 'denoising', 'multilingual_denoising', 'translation', 'multilingual_translation', 'translation_from_pretrained_bart', 'translation_lev', 'language_modeling', 'speech_to_text', 'legacy_masked_lm', 'text_to_speech', 'speech_to_speech', 'online_backtranslation', 'simul_speech_to_text', 'simul_text_to_text', 'audio_pretraining', 'semisupervised_translation', 'frm_text_to_speech', 'cross_lingual_lm', 'translation_from_pretrained_xlm', 'multilingual_language_modeling', 'audio_finetuning', 'masked_lm', 'sentence_ranking', 'translation_multi_simple_epoch', 'multilingual_masked_lm', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt']). Available hydra tasks: dict_keys(['sentence_prediction', 'sentence_prediction_adapters', 'speech_unit_modeling', 'hubert_pretraining', 'translation', 'translation_lev', 'language_modeling', 'simul_text_to_text', 'audio_pretraining', 'translation_from_pretrained_xlm', 'multilingual_language_modeling', 'audio_finetuning', 'masked_lm', 'dummy_lm', 'dummy_masked_lm'])
I could easily be doing something wrong here, but being able to run the pre-trained model via PyTorch and through the HF conversion script, but not the others, makes me think there is something off with the fine-tuned checkpoint files (checkpoint_avg.pt).
Cheers
Hi, I want to perform question answering using BioGPT. Could you please help me with that?
BioGPT is now available for usage in 🤗 Transformers!
Docs: https://huggingface.co/docs/transformers/main/en/model_doc/biogpt.
Checkpoints on the hub: https://huggingface.co/microsoft/biogpt
It'd be very nice if someone converted the remaining BioGPT checkpoints to the HuggingFace format. The conversion script can be found here.