huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Bug with tokenizer save in 0.8.0.dev2 #297

Closed: sarahwie closed this issue 4 years ago

sarahwie commented 4 years ago

Version info: transformers==2.9.1 tokenizers==0.8.0.dev2

tokenizer = GPT2TokenizerFast.from_pretrained('gpt2', cache_dir=cache_dir)
tokenizer.save_pretrained(output_directory)
Traceback (most recent call last):
  File "/home/sarahw/miniconda3/envs/project_huggingface/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 1117, in save_pretrained
    vocab_files = self.save_vocabulary(save_directory)
  File "/home/sarahw/miniconda3/envs/project_huggingface/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 2657, in save_vocabulary
    files = self._tokenizer.save(save_directory)
  File "/home/sarahw/miniconda3/envs/project_huggingface/lib/python3.8/site-packages/tokenizers/implementations/base_tokenizer.py", line 312, in save
    return self._tokenizer.save(path, pretty)
Exception: Is a directory (os error 21)
sarahwie commented 4 years ago

Fix: update the two lines in transformers to call save_model() instead.
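
For anyone stuck on these versions before upgrading, a rough user-side sketch of the same idea (my assumption, not the actual transformers patch: it reaches into the private _tokenizer attribute and relies on tokenizers 0.8.0's save_model(), which writes the vocabulary files into a directory):

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')

# save_pretrained() fails because it ends up passing a directory to
# Tokenizer.save(), which now expects a single file path. Calling
# save_model() on the wrapped fast tokenizer writes vocab.json / merges.txt
# into the directory instead (hypothetical workaround, untested on 0.8.0.dev2).
tokenizer._tokenizer.save_model('output_directory')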

n1t0 commented 4 years ago

Hi @sarahwie, thank you for reporting this. Support for tokenizers==0.8.0 was added to transformers in version 3.0.0, which we just released.

IssaIssa1 commented 4 years ago

Version info: transformers==3.0.0 tokenizers==0.8.0

When I try to save a BertWordPieceTokenizer after training

tokenizer.save("./", "data")

I get this error:

/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py in save(self, path, pretty)
    330                 A path to the destination Tokenizer file
    331         """
--> 332         return self._tokenizer.save(path, pretty)
    333 
    334     def to_str(self, pretty: bool = False):

TypeError: 

Could you help, please?

n1t0 commented 4 years ago

tokenizer.save("./tokenizer.json")
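
For context, and as my reading of the tokenizers 0.8.0 API rather than something stated in this thread: save() now serializes the whole tokenizer to a single JSON file, while the previous behavior of writing the model files into a directory lives in save_model(). A minimal sketch, assuming that API and a local training corpus data.txt:

from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer()
tokenizer.train(files=['data.txt'])  # any training corpus

# New in 0.8.0: save() takes one file path and writes the full tokenizer
# (vocab, normalizer, pre-tokenizer, ...) as a single JSON file.
tokenizer.save("./tokenizer.json")

# The old behavior (vocab.txt written into a directory, with an optional
# filename prefix) should be available as save_model().
tokenizer.save_model("./", "data")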
seyonechithrananda commented 4 years ago

Hi @n1t0,

I happen to be having the same issue with tokenizer.save when saving a trained tokenizer.

tokenizer = ByteLevelBPETokenizer()

tokenizer.train(files='/content/drive/My Drive/Project De Novo/Molecule Transformer/pubchem/shard_00.txt', vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

Any advice on how to fix this issue?
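
Presumably the same answer as above applies here too (my sketch, not a confirmed reply from the maintainers: it assumes the tokenizers 0.8.0 API, where save() writes a single JSON file and save_model() writes vocab.json / merges.txt into a directory):

# Continuing the training snippet above:
tokenizer.save("byte-level-bpe.tokenizer.json")  # single-file serialization
tokenizer.save_model(".")                        # vocab.json + merges.txt in the current directory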