huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

How to allow the merging of consecutive newline tokens \n when training a byte-level bpe tokenizer? #1534

Open liuslnlp opened 1 month ago

liuslnlp commented 1 month ago

Hello, I'm currently working on training a byte-level BPE tokenizer with the Hugging Face tokenizers library. I've put together a minimal training script and a sample corpus, and included the output the script produces. My aim is to understand why consecutive newline tokens (\n) are not merged into a single token (\n\n) during tokenization. Below are the details:

from tokenizers import (
    Tokenizer,
    pre_tokenizers,
    models,
    decoders,
    trainers,
    processors,
)

files = ["demo_corpus.txt"]
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    # split each digit into its own pre-token
    pre_tokenizers.Digits(individual_digits=True),
    # byte-level mapping; use_regex=True keeps the GPT-2 splitting pattern
    pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=True)
])
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel()

trainer = trainers.BpeTrainer(
    # make sure all 256 byte-level symbols are in the vocabulary
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    vocab_size=2000,
    special_tokens=[
        "<pad>", "<|beginoftext|>", "<|endoftext|>"
    ]
)
tokenizer.train(files, trainer)
test_text = "#include <set>\n\n\n\n\n"

print("pre-tokenize spans:", tokenizer.pre_tokenizer.pre_tokenize_str(test_text))
ids = tokenizer.encode(test_text).ids
print(f"tokens: {[tokenizer.decode([tid]) for tid in ids]}")

demo_corpus.txt:

#include <cstdio>

#include <vector>

#include <set>

using namespace std;

int main(){
    int N, A[100000], p = 0;

    multiset<int> S;

    scanf("%d", &N);

    int p0 = 0, q0 = 1, q = N-1;

    vector<int> result;

    for(int i: result)

        printf("%d\n", i);
}

Output of the training script:

pre-tokenize spans: [('#', (0, 1)), ('include', (1, 8)), ('Ġ<', (8, 10)), ('set', (10, 13)), ('>', (13, 14)), ('ĊĊĊĊĊ', (14, 19))]
tokens: ['#', 'include', ' <', 'set', '>', '\n', '\n', '\n', '\n', '\n']
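
For what it's worth, one way to check whether a merged newline token was ever learned is to look it up in the trained vocabulary. A quick sketch (it assumes "Ċ" is the byte-level form of "\n", consistent with the 'ĊĊĊĊĊ' span in the pre-tokenize output above):

# Check whether merged newline tokens exist in the trained vocab.
# Assumption: "Ċ" is the byte-level representation of "\n", matching
# the 'ĊĊĊĊĊ' span shown in the pre-tokenize output.
vocab = tokenizer.get_vocab()
for n in range(1, 6):
    tok = "Ċ" * n
    print(f"{tok!r} in vocab: {tok in vocab}")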

The following are the tokens produced by the Llama 3 tokenizer:

from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast.from_pretrained("my llama3 vocab path")
test_text = "#include <set>\n\n\n\n\n"
print([tokenizer.decode([tid]) for tid in tokenizer(test_text)["input_ids"]])

# output
# ['<|begin_of_text|>', '#include', ' <', 'set', '>\n\n\n\n\n']
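
For comparison, a rough way to see which multi-newline tokens the Llama 3 vocabulary contains, assuming its vocab strings use the same byte-level "Ċ" convention for "\n":

# List vocab entries that consist only of byte-level newlines ("Ċ").
# Assumption: the Llama 3 fast tokenizer exposes byte-level vocab
# strings via get_vocab(), with "Ċ" standing for "\n".
newline_tokens = [t for t in tokenizer.get_vocab() if t and set(t) == {"Ċ"}]
print(sorted(newline_tokens, key=len))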
liuslnlp commented 1 month ago

Hi @Narsil, @ArthurZucker, I need some help.

josharian commented 3 weeks ago

Possibly related: https://github.com/meta-llama/llama3/issues/227

ArthurZucker commented 2 weeks ago

Hey! That is a good question, I will answer in a bit.