huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

How to add byte_fallback tokens? #1407

Open dinhanhx opened 7 months ago

dinhanhx commented 7 months ago

Alternative title

How to make a tokenizer that behaves like Llama's

Background

The Llama tokenizer does not treat byte_fallback tokens as special. When it decodes, it strips only the real special tokens (unk, pad, bos, eos) and keeps the byte_fallback tokens, so the ByteFallback decoder can turn them back into the original bytes.
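For reference, here is a minimal sketch of that behavior. It assumes the hf-internal-testing/llama-tokenizer repo on the Hub as a stand-in for the Llama tokenizer; any Llama-style tokenizer.json should behave the same way.

from tokenizers import Tokenizer

# Load a Llama-style tokenizer (assumption: this Hub repo mirrors Llama's tokenizer.json).
llama_tok = Tokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

enc = llama_tok.encode("🤗")
print(enc.tokens)
# The emoji is not in the vocab, so it falls back to byte tokens
# such as <0xF0> <0x9F> <0xA4> <0x97>.

# Even with skip_special_tokens=True the byte tokens are kept; only <s>, </s>,
# <unk> are stripped, and the ByteFallback decoder reassembles the emoji.
print(llama_tok.decode(enc.ids, skip_special_tokens=True))  # -> 🤗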

What I am trying to do

I'm trying to create a tokenizer that behaves like Llama's. However, I am only able to add the byte_fallback tokens as special tokens.

from tokenizers import Tokenizer
from tokenizers import decoders, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import BpeTrainer
from tokenizers import AddedToken

from datasets import load_dataset

dataset = load_dataset("tapaco")

def topaco_generator():
    for i in dataset['train']:
        yield i['paraphrase']

bpe_trainer = BpeTrainer(
    special_tokens=["<unk>", "<s>", "</s>", "<pad>"]
    + [f"<0x{i:02X}>" for i in range(256)]  # byte_fallback tokens
)

tokenizer = Tokenizer(BPE(byte_fallback=True))
tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.Metaspace(), pre_tokenizers.Digits(individual_digits=True)]
)
tokenizer.enable_padding(pad_id=3, pad_token="<pad>")
tokenizer.post_processor = TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> $B </s>",
    special_tokens=[
        ("<s>", 1),
        ("</s>", 2),
    ],
)
tokenizer.decoder = decoders.Sequence(
    [
        decoders.Metaspace(),
        decoders.ByteFallback(),
    ]
)
# my attempt to add byte_fallback as non-special tokens
# tokenizer.add_tokens([AddedToken(content=f"<0x{i:02X}>", special=True, normalized=False) for i in range(256)])

tokenizer.train_from_iterator(topaco_generator(), trainer=bpe_trainer)
tokenizer.save("topaco_tokenizer.json")

tokenizer = Tokenizer.from_file("topaco_tokenizer.json")

text = "I love you more than I can say 🤗"
encoded_text = tokenizer.encode(text)
print(encoded_text.tokens)
# My workaround to preserve byte_fallback tokens
# and remove other special tokens
decoded_text = tokenizer.decode(encoded_text.ids, skip_special_tokens=False)
print(decoded_text.removeprefix('<s> ').removesuffix('</s>'))

Problem

No matter where I place the line tokenizer.add_tokens([AddedToken(content=f"<0x{i:02X}>", special=True, normalized=False) for i in range(256)]) in my code (before training, after training), and no matter which AddedToken parameters I use, I still cannot achieve Llama's behavior.
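For example, one variant registers the byte tokens as non-special added tokens after training, hoping that decode(..., skip_special_tokens=True) would then keep them (sketch only; tokenizer is the object from the script above):

from tokenizers import AddedToken

# Variant: add the byte tokens as NON-special added tokens after training.
tokenizer.add_tokens(
    [
        AddedToken(content=f"<0x{i:02X}>", special=False, normalized=False)
        for i in range(256)
    ]
)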

ArthurZucker commented 5 months ago

Sorry, I did not have time to check this. The tokens should be added directly to the vocab, not as special tokens. The trainer should handle this, but it's not supported yet.
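Concretely, "directly in the vocab" means the <0xNN> entries live in the BPE model's vocabulary inside tokenizer.json rather than in added_tokens, which is how Llama's tokenizer.json stores them. Until the trainers support this, one possible stopgap is to post-process the saved file. A rough sketch (it assumes the usual tokenizer.json layout, with an added_tokens list and a model.vocab map, and the file name from the script above):

import json

with open("topaco_tokenizer.json", encoding="utf-8") as f:
    data = json.load(f)

byte_tokens = {f"<0x{i:02X}>" for i in range(256)}

# Make sure every byte token has an entry in the BPE model's vocab
# (usually a no-op, since the trainer already placed them there),
# then drop them from added_tokens so decoding no longer treats them as special.
for entry in data["added_tokens"]:
    if entry["content"] in byte_tokens:
        data["model"]["vocab"].setdefault(entry["content"], entry["id"])

data["added_tokens"] = [
    e for e in data["added_tokens"] if e["content"] not in byte_tokens
]

with open("topaco_tokenizer.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)

After reloading the edited file with Tokenizer.from_file, decode(..., skip_special_tokens=True) should keep the byte tokens and let the ByteFallback decoder reconstruct the original bytes.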

ArthurZucker commented 5 months ago

It is planned

ArthurZucker commented 4 months ago

Neither trainer supports that yet. I'll work on this at some point!

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.