Hello! Thanks a lot for the well-crafted issue and reproducer, this is very helpful. Regarding your problem 2, I have a question: why are you saving the tokenizer's model, rather than the tokenizer itself?
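For context, a minimal sketch of the difference (my illustration; it assumes the model-level save was used in the notebook):

```python
# Model-level save: writes only the BPE model's vocab.json / merges.txt
tokenizer.model.save('./tok')

# Tokenizer-level save: serializes the whole pipeline (normalizer,
# pre-tokenizer, model, post-processor) into a single JSON file
tokenizer.save('./tok/tokenizer.json')
```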
I would argue that saving the entire tokenizer in a `tokenizer.json` would be better:

```python
# And now it is ready; we can save the full tokenizer with
tokenizer.save('./tok/tokenizer.json')
```
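As a quick sanity check (a sketch, not part of the original reply), the saved file can also be reloaded directly with the standalone `tokenizers` library:

```python
from tokenizers import Tokenizer

# Reload the tokenizer straight from the serialized JSON
tok = Tokenizer.from_file('./tok/tokenizer.json')
```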
Then you'll be able to reload your fast tokenizer (which looks for a `tokenizer.json` file!) seamlessly:

```python
from transformers import BigBirdTokenizerFast

tokenizer = BigBirdTokenizerFast.from_pretrained("tok", max_len=16000)
```
I also verified that you do indeed recover the same encoding as when using the `tokenizers` library:

```python
>>> tokenizer("23 39999 999 8888 212").tokens()
['23', '39999', '999', '8888', '212']
```
Regarding your first question, I don't see anywhere in your code where you're adding a BERT template processor. I've taken the liberty of adding it right after your tokenizer creation, see below. I'm not sure which error you got, but when trying it I had an error saying that `tokenizer.token_to_id("<s>")` was returning `None`. To fix this, you can specify that `<s>` and `</s>` are special tokens when initializing your BPE trainer, as I have done below.
```python
from tokenizers import Tokenizer, trainers
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import CharDelimiterSplit
from tokenizers.processors import BertProcessing

# We build our custom tokenizer:
tokenizer = Tokenizer(BPE())
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = CharDelimiterSplit(' ')

# We can train this tokenizer by giving it a list of paths to text files.
# Declaring <s> and </s> as special tokens here ensures they receive IDs:
trainer = trainers.BpeTrainer(special_tokens=["[UNK]", "<s>", "</s>"], show_progress=True)
tokenizer.train(files=['/content/dataset.txt'], trainer=trainer)

# The BERT template processor wraps every encoded sequence in <s> ... </s>:
tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=16000)
```
After this, encoding a sequence returns the correct tokens with the correct special tokens:

```python
>>> tokenizer.encode("23 39999 999 8888 212").tokens
['<s>', '23', '39999', '999', '8888', '212', '</s>']
```
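For completeness, a hedged sketch (my addition, not part of the original exchange): in recent `transformers` versions the trained object can also be wrapped directly, without writing it to disk first, via the `tokenizer_object` argument:

```python
from transformers import PreTrainedTokenizerFast

# Wrap the in-memory tokenizers.Tokenizer as a transformers fast tokenizer;
# the token names below mirror the special tokens declared for the trainer.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    bos_token="<s>",
    eos_token="</s>",
    model_max_length=16000,
)
```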
Thanks a ton @LysandreJik for replying so quickly and efficiently :cake: :+1: :rocket: !!!
For anyone else who might stumble on this problem, I have modified a simple example via the Colab link attached above. In case it stops working, I have uploaded the `.ipynb` file alongside this comment too. :hugs:
Have a fantastic day!
@LysandreJik Sorry to disturb you again, but I ran into a peculiar problem. I wanted to train BigBird on TPU, and it's reporting that the `config.json` might have missing parameters.
```
[INFO|tokenization_auto.py:427] 2021-06-25 12:16:10,662 >> Could not locate the tokenizer configuration file, will try to use the model config instead.
[INFO|configuration_utils.py:528] 2021-06-25 12:16:10,668 >> loading configuration file ./tok/config.json
Exception in device=TPU:0: Unrecognized model in ./tok. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: visual_bert, roformer, clip, bigbird_pegasus, deit, luke, detr, gpt_neo, big_bird, speech_to_text, vit, wav2vec2, m2m_100, convbert, led, blenderbot-small, retribert, ibert, mt5, t5, mobilebert, distilbert, albert, bert-generation, camembert, xlm-roberta, pegasus, marian, mbart, megatron_bert, mpnet, bart, blenderbot, reformer, longformer, roberta, deberta-v2, deberta, flaubert, fsmt, squeezebert, hubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm-prophetnet, prophetnet, xlm, ctrl, electra, encoder-decoder, funnel, lxmert, dpr, layoutlm, rag, tapas
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/content/run_mlm.py", line 520, in _mp_fn
    main()
  File "/content/run_mlm.py", line 313, in main
    tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/auto/tokenization_auto.py", line 529, in from_pretrained
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/auto/configuration_auto.py", line 457, in from_pretrained
    f"Unrecognized model in {pretrained_model_name_or_path}. "
ValueError: Unrecognized model in ./tok. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: visual_bert, roformer, clip, bigbird_pegasus, deit, luke, detr, gpt_neo, big_bird, speech_to_text, vit, wav2vec2, m2m_100, convbert, led, blenderbot-small, retribert, ibert, mt5, t5, mobilebert, distilbert, albert, bert-generation, camembert, xlm-roberta, pegasus, marian, mbart, megatron_bert, mpnet, bart, blenderbot, reformer, longformer, roberta, deberta-v2, deberta, flaubert, fsmt, squeezebert, hubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm-prophetnet, prophetnet, xlm, ctrl, electra, encoder-decoder, funnel, lxmert, dpr, layoutlm, rag, tapas
```
So apparently, I have been saving the tokenizer's state only, not the entire model. This is how I am doing it:

```python
!mkdir tok
# Saving the tokenizer's state only (note the file is named config.json)
tokenizer.save('./tok/config.json')
```
I think that `config.json` might be a product of the tokenizer's model when saving, which we are omitting by saving the state only? To make sure, I searched the JSON file and confirmed that the `model_type` key is indeed not present there.
Would you happen to have a clue as to what I can do here?
Assuming the tokenizer state to be saved is the specific one for the model, I did this:

```python
tokenizer = BigBirdTokenizerFast.from_pretrained("/content/tok", max_len=16000)
tokenizer.save_pretrained('./tokenizer')
```
And tried to load the tokenizer again. However, I can't verify whether it works because upon running the script, I lose connection to the instance :thinking:
Is this the correct usage though?
Hi @neel04.
I'm thinking you're facing an issue that was solved in the latest `transformers` release. Before that release, `AutoTokenizer` couldn't guess which tokenizer to load from just the tokenizer files; it also needed access to the model's `config.json` in order to see the model and tokenizer classes. This was addressed in the latest `transformers` release, where the tokenizer class is now saved in `tokenizer_config.json`.
Please let me know if either of these fixes works:

1. Upgrade to the latest version, and complete the `tokenizer_config.json` in your `./tok` directory with the following (if the file is not present, create it):
   `"tokenizer_class": "BigBirdTokenizer"`
2. Stay at your current version, and add a `config.json` file containing the same information in your `./tok` folder.
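For the first fix, a minimal sketch (my illustration, not from the original reply) of writing that key from Python:

```python
import json

# Record the tokenizer class so AutoTokenizer can resolve it from ./tok alone;
# "BigBirdTokenizer" follows the suggestion above, adjust the path if yours differs.
with open("./tok/tokenizer_config.json", "w") as f:
    json.dump({"tokenizer_class": "BigBirdTokenizer"}, f)
```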
Regarding your second question, yes, using `save_pretrained` alongside `from_pretrained` is the correct usage.
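As a quick round-trip check (a sketch, assuming a `transformers` version recent enough to record the tokenizer class in `tokenizer_config.json`):

```python
from transformers import AutoTokenizer

# If tokenizer_config.json carries "tokenizer_class", this resolves
# the right class without needing a model config.json
tokenizer = AutoTokenizer.from_pretrained("./tokenizer")
print(tokenizer("23 39999 999 8888 212").tokens())
```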
Hey @LysandreJik, thanks a ton for the tips, I will surely try them if I face this error again! :hugs:
I am using the `master` branch now for my project, so I hope I won't face this problem again. However, I can't completely verify whether it works, because I am unable to run it on TPU due to some memory leak.
If related problems arise, I would surely try out either of your fixes :rocket:
Have a fantastic day!
Environment info
`transformers` version: 4.6.1

Who can help
Information
Model I am using (Bert, XLNet ...): BigBird
The problem arises when using:
The tasks I am working on is:
To reproduce
Steps to reproduce the behavior:
I have a dataset for which I wanted to use a tokenizer based on whitespace rather than any subword segmentation approach.
This snippet, which I got off GitHub, constructs and uses a custom tokenizer that operates on whitespace:
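The snippet, as reconstructed from the code quoted earlier in this thread (before the special-token and post-processor fixes were added), looked roughly like this:

```python
from tokenizers import Tokenizer, trainers
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import CharDelimiterSplit

# Build a BPE tokenizer that splits purely on whitespace
tokenizer = Tokenizer(BPE())
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = CharDelimiterSplit(' ')

trainer = trainers.BpeTrainer(show_progress=True)
tokenizer.train(files=['/content/dataset.txt'], trainer=trainer)
tokenizer.enable_truncation(max_length=16000)
```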
I wanted to use it for pre-training the BigBird model, but I am facing two issues:

1. Using the tokenizer above to convert tokenized sentences into model-friendly sequences returns an error, and without any preprocessing the output does not contain the sequence start and end tokens (`<s>`; `</s>`) as expected.
2. Reloading the saved tokenizer for BigBird fails, since it yields the error that my directory does not 'reference' the tokenizer files. This shouldn't be an issue, since using `RobertaTokenizerFast` does work; I assume it has something to do with the tokenization post-processing phase.

Fully Reproducible Colab
I am really confused about this, so I have created a fully reproducible Colab notebook, with commented problems and synthetic data. Please find it here.
Thanks a ton in advance!!