huggingface / course

The Hugging Face course on Transformers
https://huggingface.co/course
Apache License 2.0
2.14k stars, 693 forks

Tokenization Course Issues #121

Open KeremTurgutlu opened 2 years ago

KeremTurgutlu commented 2 years ago

Hello,

I believe the corpus and the word_freqs output used in the BPE / WordPiece implementations are mismatched: "course" is not capitalized in the corpus, but word_freqs seems to use the capitalized version, "Course".

To reproduce

from collections import defaultdict

from transformers import AutoTokenizer

corpus = [
    "This is the Hugging Face course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

word_freqs = defaultdict(int)
for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    words = [word for word, _ in words_with_offsets]
    for word in words:
        word_freqs[word] += 1

# Expected word_freqs as shown in the course. With the corpus above, this
# assertion fails because the corpus contains 'course', not 'Course'.
assert word_freqs == defaultdict(int, {'This': 3, 'is': 2, 'the': 1, 'Hugging': 1, 'Face': 1, 'Course': 1, '.': 4, 'chapter': 1, 'about': 1,
    'tokenization': 1, 'section': 1, 'shows': 1, 'several': 1, 'tokenizer': 1, 'algorithms': 1, 'Hopefully': 1,
    ',': 1, 'you': 1, 'will': 1, 'be': 1, 'able': 1, 'to': 1, 'understand': 1, 'how': 1, 'they': 1, 'are': 1,
    'trained': 1, 'and': 1, 'generate': 1, 'tokens': 1})
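
A bare assert only reports that the two dictionaries differ, not where. The sketch below is one way to pinpoint the mismatched key without downloading a model. It assumes a simple regex (`pre_tokenize`, a hypothetical helper) is a good enough stand-in for the bert-base-cased pre-tokenizer on this corpus; it is not the tokenizer's actual implementation.

```python
import re
from collections import Counter

corpus = [
    "This is the Hugging Face course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]


def pre_tokenize(text):
    # Assumption: approximate the pre-tokenizer by splitting into runs of
    # word characters and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


# Word frequencies actually produced from the corpus.
actual = Counter(word for text in corpus for word in pre_tokenize(text))

# Word frequencies as quoted in the assertion above.
expected = {
    'This': 3, 'is': 2, 'the': 1, 'Hugging': 1, 'Face': 1, 'Course': 1, '.': 4,
    'chapter': 1, 'about': 1, 'tokenization': 1, 'section': 1, 'shows': 1,
    'several': 1, 'tokenizer': 1, 'algorithms': 1, 'Hopefully': 1, ',': 1,
    'you': 1, 'will': 1, 'be': 1, 'able': 1, 'to': 1, 'understand': 1,
    'how': 1, 'they': 1, 'are': 1, 'trained': 1, 'and': 1, 'generate': 1,
    'tokens': 1,
}

# Keys present on one side only, plus shared keys whose counts disagree.
only_in_expected = sorted(set(expected) - set(actual))
only_in_actual = sorted(set(actual) - set(expected))
count_mismatches = {
    word: (expected[word], actual[word])
    for word in expected.keys() & actual.keys()
    if expected[word] != actual[word]
}

print("only in expected:", only_in_expected)
print("only in actual:", only_in_actual)
print("count mismatches:", count_mismatches)
```

Under that assumption, the only difference is the capitalization of a single key; all shared counts agree.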
KeremTurgutlu commented 2 years ago

In WordPiece, if you go to the line where we train the tokenizer and print the learned vocab:

print(vocab)

The vocab from this print statement is missing the merge ab and has 69 merges, although vocab_size is set to 70.

KeremTurgutlu commented 2 years ago

The same typo (Course -> course) is also present in Unigram. The final tokenization assumes the capitalized Course is used and results in ['▁This', '▁is', '▁the', '▁Hugging', '▁Face', '▁', 'c', 'ou', 'r', 's', 'e', '.']. However, if the lowercase course is used, the tokenization would be ['▁This', '▁is', '▁the', '▁Hugging', '▁Face', '▁course.']

lewtun commented 2 years ago

Thanks for reporting these typos @KeremTurgutlu - you're totally right that the capitalization isn't applied consistently. I think the simplest change would be to capitalise Course in the corpus list - would you like to open a PR with the fixes?

KeremTurgutlu commented 2 years ago

@lewtun created https://github.com/huggingface/course/pull/166