huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Exception raised when running `T5-like span-masked language modeling` example in `examples/flax/language-modeling/` #32124

Closed · hyunjongkimmath closed 1 month ago

hyunjongkimmath commented 1 month ago

System Info

Who can help?

@sanchit-gandhi

Information

Tasks

Reproduction

I am trying out the example at https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling#t5-like-span-masked-language-modeling. Unfortunately, when I try to initialize the tokenizer, I get an error.

Steps to reproduce the error:

  1. Copy t5_tokenizer_model.py into the working directory.
  2. Create a ./norwegian-t5-base subdirectory in the working directory and cd into it: cd ./norwegian-t5-base
  3. Run the following excerpt of code as in the example:
import datasets

from t5_tokenizer_model import SentencePieceUnigramTokenizer

vocab_size = 32_000
input_sentence_size = None

# Initialize a dataset
dataset = datasets.load_dataset("oscar", name="unshuffled_deduplicated_no", split="train")

tokenizer = SentencePieceUnigramTokenizer(unk_token="<unk>", eos_token="</s>", pad_token="<pad>")
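
Note: the failure seems to depend on which tokenizers release is installed. As a quick check (not part of the original example), the installed versions can be printed with:

import tokenizers
import transformers

# The Metaspace signature differs across tokenizers releases, so knowing the
# installed versions helps pin down the mismatch.
print(tokenizers.__version__)
print(transformers.__version__)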

The following is the error message that I got:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[2], line 1
----> 1 tokenizer = SentencePieceUnigramTokenizer(unk_token="<unk>", eos_token="</s>", pad_token="<pad>")

File ~/Documents/Development/Python/trouver_personal_playground/ml_model_training_playground/t5_tokenizer_model.py:49, in SentencePieceUnigramTokenizer.__init__(self, replacement, add_prefix_space, unk_token, eos_token, pad_token)
     37 tokenizer = Tokenizer(Unigram())
     39 tokenizer.normalizer = normalizers.Sequence(
     40     [
     41         normalizers.Nmt(),
   (...)
     45     ]
     46 )
     47 tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
     48     [
---> 49         pre_tokenizers.Metaspace(
     50             replacement=replacement, add_prefix_space="always" if add_prefix_space else "never"
     51         ),
     52         pre_tokenizers.Digits(individual_digits=True),
     53         pre_tokenizers.Punctuation(),
     54     ]
     55 )
     56 tokenizer.decoder = decoders.Metaspace(
     57     replacement=replacement, add_prefix_space="always" if add_prefix_space else "never"
     58 )
...
     61     single=f"$A {self.special_tokens['eos']['token']}",
     62     special_tokens=[(self.special_tokens["eos"]["token"], self.special_tokens["eos"]["id"])],
     63 )

TypeError: Metaspace.__new__() got an unexpected keyword argument 'add_prefix_space'

Expected behavior

I expect the tokenizer to be initialized.

amyeroberts commented 1 month ago

cc @ArthurZucker

fshp971 commented 1 month ago

I think this is because tokenizers.pre_tokenizers.Metaspace has been refactored and the add_prefix_space argument has been replaced with prepend_scheme (see this commit).
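
For example, the Metaspace construction in t5_tokenizer_model.py could be updated along these lines (a minimal sketch, assuming a tokenizers release where prepend_scheme is available; the actual fix may differ):

from tokenizers import Tokenizer, decoders, pre_tokenizers
from tokenizers.models import Unigram

replacement = "▁"
add_prefix_space = True

tokenizer = Tokenizer(Unigram())
# `prepend_scheme` replaces the removed `add_prefix_space` keyword argument.
tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
    [
        pre_tokenizers.Metaspace(
            replacement=replacement,
            prepend_scheme="always" if add_prefix_space else "never",
        ),
        pre_tokenizers.Digits(individual_digits=True),
        pre_tokenizers.Punctuation(),
    ]
)
# The decoder in the same file presumably needs the analogous rename.
tokenizer.decoder = decoders.Metaspace(
    replacement=replacement,
    prepend_scheme="always" if add_prefix_space else "never",
)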

I can open a PR to fix this.