huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Exception raised when running `T5-like span-masked language modeling` example in `examples/flax/language-modeling/` #32124

Closed · hyunjongkimmath closed 1 month ago

hyunjongkimmath commented 1 month ago

System Info

Who can help?

@sanchit-gandhi

Information

Tasks

Reproduction

I am trying out the example at https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling#t5-like-span-masked-language-modeling. Unfortunately, when I try to initialize the tokenizer, I get an error.

Steps to reproduce the error:

  1. Copy t5_tokenizer_model.py into the working directory.
  2. Create a ./norwegian-t5-base subdirectory in the working directory and cd into it: cd ./norwegian-t5-base
  3. Run the following excerpt of code as in the example:
import datasets

from t5_tokenizer_model import SentencePieceUnigramTokenizer

vocab_size = 32_000
input_sentence_size = None

# Initialize a dataset
dataset = datasets.load_dataset("oscar", name="unshuffled_deduplicated_no", split="train")

tokenizer = SentencePieceUnigramTokenizer(unk_token="<unk>", eos_token="</s>", pad_token="<pad>")
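
Note: the failure seems to depend on which tokenizers release is installed. As a quick check (not part of the original example), the installed versions can be printed with:

import tokenizers
import transformers

# The Metaspace signature differs across tokenizers releases, so knowing the
# installed versions helps pin down the mismatch.
print(tokenizers.__version__)
print(transformers.__version__)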

The following is the error message that I got:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[2], line 1
----> 1 tokenizer = SentencePieceUnigramTokenizer(unk_token="<unk>", eos_token="</s>", pad_token="<pad>")

File ~/Documents/Development/Python/trouver_personal_playground/ml_model_training_playground/t5_tokenizer_model.py:49, in SentencePieceUnigramTokenizer.__init__(self, replacement, add_prefix_space, unk_token, eos_token, pad_token)
     37 tokenizer = Tokenizer(Unigram())
     39 tokenizer.normalizer = normalizers.Sequence(
     40     [
     41         normalizers.Nmt(),
   (...)
     45     ]
     46 )
     47 tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
     48     [
---> 49         pre_tokenizers.Metaspace(
     50             replacement=replacement, add_prefix_space="always" if add_prefix_space else "never"
     51         ),
     52         pre_tokenizers.Digits(individual_digits=True),
     53         pre_tokenizers.Punctuation(),
     54     ]
     55 )
     56 tokenizer.decoder = decoders.Metaspace(
     57     replacement=replacement, add_prefix_space="always" if add_prefix_space else "never"
     58 )
...
     61     single=f"$A {self.special_tokens['eos']['token']}",
     62     special_tokens=[(self.special_tokens["eos"]["token"], self.special_tokens["eos"]["id"])],
     63 )

TypeError: Metaspace.__new__() got an unexpected keyword argument 'add_prefix_space'

Expected behavior

I expect the tokenizer to be initialized.

amyeroberts commented 1 month ago

cc @ArthurZucker

fshp971 commented 1 month ago

I think this is because tokenizers.pre_tokenizers.Metaspace has been refactored and the add_prefix_space argument has been replaced with prepend_scheme (see this commit).
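
For example, the Metaspace construction in t5_tokenizer_model.py could be updated along these lines (a minimal sketch, assuming a tokenizers release where prepend_scheme is available; the actual fix may differ):

from tokenizers import Tokenizer, decoders, pre_tokenizers
from tokenizers.models import Unigram

replacement = "▁"
add_prefix_space = True

tokenizer = Tokenizer(Unigram())
# `prepend_scheme` replaces the removed `add_prefix_space` keyword argument.
tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
    [
        pre_tokenizers.Metaspace(
            replacement=replacement,
            prepend_scheme="always" if add_prefix_space else "never",
        ),
        pre_tokenizers.Digits(individual_digits=True),
        pre_tokenizers.Punctuation(),
    ]
)
# The decoder in the same file presumably needs the analogous rename.
tokenizer.decoder = decoders.Metaspace(
    replacement=replacement,
    prepend_scheme="always" if add_prefix_space else "never",
)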

I can open a PR to fix this.