huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Using whitespace tokenizer for training models #12075

Closed neel04 closed 3 years ago

neel04 commented 3 years ago

Environment info

Who can help

Information

Model I am using (Bert, XLNet ...): BigBird

The problem arises when using:

The task I am working on is:

To reproduce

Steps to reproduce the behavior:

I have a dataset for which I wanted to use a tokenizer based on whitespace rather than any subword segmentation approach.

This snippet, which I got off GitHub, shows how to construct and train a custom tokenizer that operates on whitespace:

from tokenizers import Tokenizer, trainers
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import CharDelimiterSplit

# We build our custom tokenizer:
tokenizer = Tokenizer(BPE()) 
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = CharDelimiterSplit(' ')

# We can train this tokenizer by giving it a list of paths to text files:
trainer = trainers.BpeTrainer(special_tokens=["[UNK]"], show_progress=True)
tokenizer.train(files=['/content/dataset.txt'], trainer=trainer)
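As a quick sanity check (a sketch; the sample string is an assumption), frequent whitespace-separated "words" in the training data should come out as single tokens:

# Sanity check: each whitespace-separated "word" that occurs often enough in the
# training data should be kept whole by the trained BPE model
print(tokenizer.encode("23 39999 999").tokens)
# e.g. ['23', '39999', '999']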

I wanted to use it for pre-training the BigBird model, but I am facing two issues:

  1. I can't seem to use this snippet with the custom tokenizer above to convert tokenized sentences into model-friendly sequences:
from tokenizers.processors import BertProcessing

tokenizer._tokenizer.post_processor = tokenizers.processors.BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=16000)

This returns an error, and without any post-processing the output does not contain the sequence start and end tokens (<s>, </s>) as expected.

  2. The next problem arises when I save the tokenizer state in the specified folder: I am unable to use it via
tokenizer = BigBirdTokenizerFast.from_pretrained("./tok", max_len=16000)

since it yields an error saying that my directory does not ‘reference’ the tokenizer files. This shouldn't be an issue, since using RobertaTokenizerFast does work, so I assume it has something to do with the tokenization post-processing phase.
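For reference, the files in ./tok were produced with something along these lines (a sketch; the exact call is an assumption):

# This saves only the BPE model files (vocab.json and merges.txt),
# not a full tokenizer.json
tokenizer.model.save("./tok")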

Fully Reproducible Colab

I am really confused about this, so I have created a fully reproducible Colab notebook with commented problems and synthetic data. Please find it here.

Thanks a ton in advance!!

LysandreJik commented 3 years ago

Hello! Thanks a lot for the well-crafted issue and reproducer, this is very helpful. Regarding your problem 2, I have a question: why are you saving the tokenizer's model, rather than the tokenizer itself?

I would argue that saving the entire tokenizer in a tokenizer.json would be better:

# And now it is ready; we can save the full tokenizer with
tokenizer.save('./tok/tokenizer.json')

Then you'll be able to reload your fast tokenizer (which looks for a tokenizer.json file!) seamlessly:

from transformers import BigBirdTokenizerFast

tokenizer = BigBirdTokenizerFast.from_pretrained("tok", max_len=16000)

I also verified that you do indeed recover the same encoding as when using the tokenizers library:

>>> tokenizer("23 39999 999 8888 212").tokens()
['23', '39999', '999', '8888', '212']

Regarding your first question, I don't see anywhere in your code where you're adding a BERT template processor. I've taken the liberty of adding it right after your tokenizer creation; see below. I'm not sure which error you got, but when I tried it I had an error saying that tokenizer.token_to_id("<s>") was returning None.

To fix this you can specify that <s> and </s> are special tokens when initializing your BPE trainer, as I have done below.

from tokenizers import Tokenizer, trainers
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import CharDelimiterSplit

# We build our custom tokenizer:
tokenizer = Tokenizer(BPE()) 
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = CharDelimiterSplit(' ')

# We can train this tokenizer by giving it a list of paths to text files:
trainer = trainers.BpeTrainer(special_tokens=["[UNK]", "<s>", "</s>"], show_progress=True)
tokenizer.train(files=['/content/dataset.txt'], trainer=trainer)

from tokenizers.processors import BertProcessing

# Wrap every encoded sequence with the <s> ... </s> special tokens
tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=16000)

After this, encoding a sequence returns the correct tokens with the correct special tokens:

>>> tokenizer.encode("23 39999 999 8888 212").tokens
['<s>', '23', '39999', '999', '8888', '212', '</s>']
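For reference, a minimal end-to-end sketch combining the two fixes above (the directory, sample string, and expected output are assumptions):

import os
from transformers import BigBirdTokenizerFast

# Save the full tokenizer (normalizer, pre-tokenizer, post-processor and all),
# not just its BPE model
os.makedirs("./tok", exist_ok=True)
tokenizer.save("./tok/tokenizer.json")

# Reload it as a fast tokenizer and check that the special tokens are added
fast_tokenizer = BigBirdTokenizerFast.from_pretrained("./tok", max_len=16000)
print(fast_tokenizer("23 39999 999 8888 212").tokens())
# expected (assuming the post-processor was set before saving):
# ['<s>', '23', '39999', '999', '8888', '212', '</s>']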
neel04 commented 3 years ago

Thanks a ton @LysandreJik for replying so quickly and efficiently :cake: :+1: :rocket: !!!

For anyone else who might stumble on this problem, I have modified a simple example via the Colab link attached above. In case it is not working, I have also uploaded the .ipynb file alongside this comment. :hugs:

Have a fantastic day!

HF_issue_repro.zip

neel04 commented 3 years ago

@LysandreJik Sorry to disturb you again, but I have run into a peculiar problem. I wanted to train BigBird on TPU, and it's reporting that the config.json might have missing parameters.

[INFO|tokenization_auto.py:427] 2021-06-25 12:16:10,662 >> Could not locate the tokenizer configuration file, will try to use the model config instead.
[INFO|configuration_utils.py:528] 2021-06-25 12:16:10,668 >> loading configuration file ./tok/config.json
Exception in device=TPU:0: Unrecognized model in ./tok. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: visual_bert, roformer, clip, bigbird_pegasus, deit, luke, detr, gpt_neo, big_bird, speech_to_text, vit, wav2vec2, m2m_100, convbert, led, blenderbot-small, retribert, ibert, mt5, t5, mobilebert, distilbert, albert, bert-generation, camembert, xlm-roberta, pegasus, marian, mbart, megatron_bert, mpnet, bart, blenderbot, reformer, longformer, roberta, deberta-v2, deberta, flaubert, fsmt, squeezebert, hubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm-prophetnet, prophetnet, xlm, ctrl, electra, encoder-decoder, funnel, lxmert, dpr, layoutlm, rag, tapas
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/content/run_mlm.py", line 520, in _mp_fn
    main()
  File "/content/run_mlm.py", line 313, in main
    tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/auto/tokenization_auto.py", line 529, in from_pretrained
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/auto/configuration_auto.py", line 457, in from_pretrained
    f"Unrecognized model in {pretrained_model_name_or_path}. "
ValueError: Unrecognized model in ./tok. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: visual_bert, roformer, clip, bigbird_pegasus, deit, luke, detr, gpt_neo, big_bird, speech_to_text, vit, wav2vec2, m2m_100, convbert, led, blenderbot-small, retribert, ibert, mt5, t5, mobilebert, distilbert, albert, bert-generation, camembert, xlm-roberta, pegasus, marian, mbart, megatron_bert, mpnet, bart, blenderbot, reformer, longformer, roberta, deberta-v2, deberta, flaubert, fsmt, squeezebert, hubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm-prophetnet, prophetnet, xlm, ctrl, electra, encoder-decoder, funnel, lxmert, dpr, layoutlm, rag, tapas

So apparently, I have been saving only the tokenizer's state, not the entire model. This is how I am doing it:

!mkdir tok
# And now it is ready, we can save the tokenizer's state only, not the model
tokenizer.save('./tok/config.json')

I think config.json might be something that is produced when saving the tokenizer's model, which we are omitting by saving only the state? To make sure, I searched the JSON file and confirmed that the model_type key is indeed not present there.

Would you happen to have a clue as to what I can do here?

neel04 commented 3 years ago

Assuming the tokenizer state to be saved should be the model-specific one, I did this:

tokenizer = BigBirdTokenizerFast.from_pretrained("/content/tok", max_len=16000)
tokenizer.save_pretrained('./tokenizer')

And tried to load the tokenizer again. However, I can't verify whether it works because upon running the script, I lose connection to the instance :thinking:

Is this the correct usage though?

LysandreJik commented 3 years ago

Hi @neel04.

I think you're facing an issue that was solved in the latest transformers release. Before that release, AutoTokenizer couldn't guess which tokenizer to load from just the tokenizer files; it also needed access to the model's config.json in order to see the model and tokenizer classes.

This was addressed in the latest transformers release, where the tokenizer class is now saved in tokenizer_config.json.

Please let me know if either of these fixes work:

  1. Upgrade to the latest version and complete the tokenizer_config.json in your ./tok directory with the following:

    "tokenizer_class": "BigBirdTokenizer"

    If it's not present, then create it (see the sketch after this list).

  2. Stay at your current version, and add a config.json file containing the same information in your ./tok folder.
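A minimal sketch of option 1, assuming the ./tok directory from earlier and a recent transformers version:

import json
import os

# Add the tokenizer class to ./tok/tokenizer_config.json (creating the file if
# needed) so AutoTokenizer can tell which tokenizer to load
path = "./tok/tokenizer_config.json"
config = {}
if os.path.exists(path):
    with open(path) as f:
        config = json.load(f)
config["tokenizer_class"] = "BigBirdTokenizer"
with open(path, "w") as f:
    json.dump(config, f)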

Regarding your second question, yes, using save_pretrained alongside from_pretrained is the correct usage.

neel04 commented 3 years ago

Hey @LysandreJik, Thanks a ton for the tips, I will surely try them if I face this error again! :hugs:

I am using the master branch now for my project, so I hope I won't face this problem again. However, I can't completely verify whether it works because I am unable to run it on TPU due to some memory leak.

If related problems arise, I would surely try out either of your fixes :rocket:

Have a fantastic day!