Hello, I'm training a byte-level BPE tokenizer with the Hugging Face tokenizers library. I've put together a simple training script and a sample corpus, and included the output the script produces. I'd like to understand why consecutive newline tokens \n are not merged into a single token \n\n during tokenization. Below are the details:
demo_corpus.txt:
Output of the training script:
The following are the tokens produced by the Llama 3 tokenizer: