huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

A whitespace character not displaying at a specific position #1399

scissorstail closed this issue 7 months ago

scissorstail commented 7 months ago

It seems that a whitespace character at a specific position is dropped from the decoded output.

To Reproduce

Please run the sample below.

sample.txt

1234567890...

sample.py

from transformers import PreTrainedTokenizerFast
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

files = [
    "./sample.txt",
]

tokenizer = Tokenizer(models.BPE())
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFC(),
])
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.ByteLevel(add_prefix_space=True),
    pre_tokenizers.Digits(individual_digits=True),
    pre_tokenizers.Punctuation(behavior="contiguous"),
])
tokenizer.decoder = decoders.Sequence([
    decoders.ByteLevel(),
])
trainer = trainers.BpeTrainer(
    vocab_size=100,
    special_tokens=["<pad>", "<bos>", "<sep>", "<eos>"],
    max_token_length=5,
    min_frequency=1,
)

# train on sample.txt and save so PreTrainedTokenizerFast can load it below
tokenizer.train(files, trainer=trainer)
tokenizer.save("tokenizer.json")

def test(tokenizer, text):
    tokens = tokenizer.tokenize(text)
    encoding = tokenizer.encode(text)
    result = tokenizer.decode(encoding)
    # prepend a space to mirror ByteLevel(add_prefix_space=True) before comparing
    text = (' ' + normalizers.NFC().normalize_str(text))

    print(tokens)
    print(encoding)

    print(result)
    print(text)
    print(result == text)

tokenizer = PreTrainedTokenizerFast.from_pretrained('./')

text = "123 ..."

test(tokenizer, text)

# python sample.py
# [00:00:00] Pre-processing files (0 Mo)    ██████████████████                100%
# [00:00:00] Tokenize words                 ██████████████████ 14       /       14
# [00:00:00] Count pairs                    ██████████████████ 14       /       14
# [00:00:00] Compute merges                 ██████████████████ 3        /        3
# ['Ġ', '1', '2', '3', 'Ġ', '...']
# [16, 6, 7, 8, 16, 19]
#  123...
#  123 ...
# False

Expected behavior

['Ġ', '1', '2', '3', 'Ġ', '...']
[16, 6, 7, 8, 16, 19]
 123 ...
 123 ...
True

Actual behavior

['Ġ', '1', '2', '3', 'Ġ', '...']
[16, 6, 7, 8, 16, 19]
 123...
 123 ...
False

I might have configured something incorrectly. This behavior is different from what I was expecting.

ArthurZucker commented 7 months ago

Hey! This might be related to the clean_up_tokenization_spaces argument available in transformers. Would you mind trying tokenizer.decode(encoding, clean_up_tokenization_spaces=False)? If that does not work, could you push the tokenizer to the Hub?
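For reference, a minimal sketch of how that suggestion would slot into the test function from the repro script; only the decode call changes, and the variable names are the ones already used above:

def test(tokenizer, text):
    encoding = tokenizer.encode(text)
    # disable the post-decode cleanup suggested above
    result = tokenizer.decode(encoding, clean_up_tokenization_spaces=False)
    text = (' ' + normalizers.NFC().normalize_str(text))

    print(result)
    print(text)
    print(result == text)  # should print True if the cleanup was removing the space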

scissorstail commented 7 months ago

It seems to be working. However, I'm not quite sure what this option does. Thanks anyway.
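For context on that option: when clean_up_tokenization_spaces is enabled (the default), transformers post-processes the decoded string by removing spaces before common punctuation marks and contractions. A rough, simplified illustration of that cleanup step (not the exact transformers source) shows why " 123 ..." loses its space:

def clean_up(text):
    # simplified stand-in for the cleanup applied when
    # clean_up_tokenization_spaces=True; the real replacement list
    # in transformers is longer
    for old, new in ((" .", "."), (" ,", ","), (" !", "!"), (" ?", "?")):
        text = text.replace(old, new)
    return text

print(clean_up(" 123 ..."))  # ' 123...': the space before the first '.' is removed

Passing clean_up_tokenization_spaces=False to decode skips this step, which is why the whitespace is preserved.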