thunlp / OpenPrompt

An Open-Source Framework for Prompt-Learning.
https://thunlp.github.io/OpenPrompt/
Apache License 2.0

Can't load tokenizer for 'xlm-roberta-base'. #199

Open cmgchess opened 1 year ago

cmgchess commented 1 year ago

This is what I get when trying to load xlm-roberta-base:

```python
from openprompt.plms import load_plm
plm, tokenizer, model_config, WrapperClass = load_plm("roberta", "xlm-roberta-base")
```

```
OSError                                   Traceback (most recent call last)
<ipython-input-3-bc593607bff3> in <module>
      1 from openprompt.plms import load_plm
----> 2 plm, tokenizer, model_config, WrapperClass = load_plm("roberta", "xlm-roberta-base")

1 frames
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1758         if all(full_file_name is None for full_file_name in resolved_vocab_files.values()):
   1759             raise EnvironmentError(
-> 1760                 f"Can't load tokenizer for '{pretrained_model_name_or_path}'. If you were trying to load it from "
   1761                 "'https://huggingface.co/models', make sure you don't have a local directory with the same name. "
   1762                 f"Otherwise, make sure '{pretrained_model_name_or_path}' is the correct path to a directory "

OSError: Can't load tokenizer for 'xlm-roberta-base'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'xlm-roberta-base' is the correct path to a directory containing all relevant files for a RobertaTokenizer tokenizer.
```


Help is much appreciated, thanks!

Achazwl commented 1 year ago

Look into https://github.com/thunlp/OpenPrompt/blob/main/openprompt/plms/__init__.py#L87. 'xlm-roberta-base' is not the same as "roberta": it uses XLMRobertaConfig rather than RobertaConfig, and XLMRobertaTokenizer instead of RobertaTokenizer.

You could modify that file to add an "xlm-roberta-base" entry to _MODEL_CLASSES. Alternatively, copy the code from load_plm into your Jupyter notebook and change model_class.config, model_class.tokenizer, etc. to their XLM-RoBERTa equivalents.
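A minimal sketch of the second option, assembling the pieces by hand instead of going through load_plm (the class choices here are an assumption based on XLM-R being a masked language model; adjust to your setup):

```python
from transformers import XLMRobertaConfig, XLMRobertaTokenizer, XLMRobertaForMaskedLM
from openprompt.plms.mlm import MLMTokenizerWrapper

model_name = "xlm-roberta-base"

# Mirror the four objects load_plm would return, but with XLM-R classes
# instead of the RoBERTa classes hard-coded in _MODEL_CLASSES.
model_config = XLMRobertaConfig.from_pretrained(model_name)
plm = XLMRobertaForMaskedLM.from_pretrained(model_name, config=model_config)
tokenizer = XLMRobertaTokenizer.from_pretrained(model_name)
WrapperClass = MLMTokenizerWrapper  # XLM-R is a masked LM, so the MLM wrapper fits
```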

cmgchess commented 1 year ago

@Achazwl thank you! Any plans to extend the framework to XLM-R as well?

HodaMemar commented 1 year ago

Dear all, I have a question about modifying __init__.py; please guide me. I want to use the SciBERT model from Hugging Face, so I tried adding the model and tokenizer to __init__.py in Colab, but I don't know what the config or wrapper should be. After that, I closed __init__.py and ran again, but SciBERT is not recognized. How can I test other models from Hugging Face?

kinghmy commented 1 year ago

> Dear all, I have a question about modifying __init__.py; please guide me. I want to use the SciBERT model from Hugging Face, so I tried adding the model and tokenizer to __init__.py in Colab, but I don't know what the config or wrapper should be. After that, I closed __init__.py and ran again, but SciBERT is not recognized. How can I test other models from Hugging Face?

After you modify the code, you should reload it in your Python working space, e.g.:

```python
from imp import reload

openprompt = reload(openprompt)
load_plm = openprompt.plms.load_plm
```

And you should make sure your modified copy is the one being imported, by putting its location first on the path:

```python
import sys

sys.path.insert(0, '/location_path/OpenPrompt')
```

HodaMemar commented 1 year ago

Thank you for your reply.

I changed the code in Colab as below (the screenshot failed to upload). Adding the model results in an error. I probably didn't reload the module correctly; your guidance in this regard would be very valuable.


HodaMemar commented 1 year ago

Thank you for your reply

I changed the code in Colab as below:

1. Add this model to `__init__.py`:

```python
'PubMedBERT': ModelClass(**{
    'config': BertConfig,
    'tokenizer': AutoTokenizer.from_pretrained('microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext'),
    'model': AutoModel.from_pretrained('microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext'),
    'wrapper': MLMTokenizerWrapper,
}),
```

2. Reload the module:

```python
import sys
import importlib

sys.path.insert(0, '/content/OpenPrompt')
importlib.reload(sys)
```

3. Run the cell:

```python
plm, tokenizer, model_config, WrapperClass = load_plm("PubMedBERT", 'microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext')
```

4. I get this error:

```
KeyError                                  Traceback (most recent call last)
<ipython-input> in <cell line: 1>()
----> 1 plm, tokenizer, model_config, WrapperClass = load_plm("PubMedBERT", 'microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext')

1 frames
/content/OpenPrompt/OpenPrompt/openprompt/plms/__init__.py in get_model_class(plm_type)
     89         "tokenizer": GPT2Tokenizer,
     90         "model": GPTJForCausalLM,
---> 91         "wrapper": LMTokenizerWrapper
     92     }),
     93 }

KeyError: 'PubMedBERT'
```

Adding the model results in an error. I probably didn't reload the module correctly; your guidance in this regard would be very valuable.
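(Aside: a likely culprit in step 2 is that `importlib.reload(sys)` reloads Python's own sys module rather than OpenPrompt, so the edited `__init__.py` never gets re-imported. A minimal sketch of reloading the actual package instead, assuming openprompt was already imported earlier in the session:)

```python
import importlib
import openprompt.plms

# Re-import the edited module so the new 'PubMedBERT' entry in
# _MODEL_CLASSES becomes visible, then re-bind load_plm from it.
importlib.reload(openprompt.plms)
from openprompt.plms import load_plm
```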


NtaylorOX commented 1 year ago

Hi,

If you want a potential fix that goes around OpenPrompt's load_plm function, you can load each component separately and then piece them together. For instance, the SciBERT model should still work with OpenPrompt's MLM tokenizer wrapper.

Imports

```python
from openprompt.plms.seq2seq import T5TokenizerWrapper, T5LMTokenizerWrapper
from openprompt.plms.lm import LMTokenizerWrapper
from openprompt.plms.mlm import MLMTokenizerWrapper
from transformers import T5Config, T5Tokenizer, T5ForConditionalGeneration
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoModelForMaskedLM, AutoTokenizer
```

Load components separately

```python
model_name = "your_model_name_here"
plm = AutoModelForMaskedLM.from_pretrained(model_name)
WrapperClass = MLMTokenizerWrapper
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
```

Then you pass these to the prompt dataloader as you normally would, as in the sketch below. I don't have time right now to test this for the models discussed in this issue, but it has worked for me with custom models. Under the hood, SciBERT should work directly with OpenPrompt's MLMTokenizerWrapper.
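For example, a minimal sketch of that wiring (the toy dataset and template here are placeholders, not from the original thread):

```python
from openprompt import PromptDataLoader
from openprompt.prompts import ManualTemplate
from openprompt.data_utils import InputExample

# A toy dataset and template just to show the wiring; replace with your own.
dataset = [InputExample(guid=0, text_a="The protein binds the receptor.")]
template = ManualTemplate(
    tokenizer=tokenizer,
    text='{"placeholder":"text_a"} It was {"mask"}.',
)

data_loader = PromptDataLoader(
    dataset=dataset,
    template=template,
    tokenizer=tokenizer,
    tokenizer_wrapper_class=WrapperClass,  # the manually chosen MLMTokenizerWrapper
    max_seq_length=128,
    batch_size=4,
)
```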

kinghmy commented 1 year ago

Hello, I have received your email and will handle it as soon as possible. Thank you!

HodaMemar commented 1 year ago

Hi

Thank you very much for your time and explanation.
