huggingface / course

The Hugging Face course on Transformers
https://huggingface.co/course
Apache License 2.0

Mistake in Unigram tokenization #277

Open CapBlood opened 2 years ago

CapBlood commented 2 years ago

Hi! Is this a mistake? The last number should be 17 instead of 5. [Image: Screenshot 2022-07-08 at 17:41:45]

lewtun commented 1 year ago

Hi @CapBlood, are you referring to the frequency of ugs? Since the corpus is defined by these words:

("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)

I think the frequency for this token is correct.

CapBlood commented 1 year ago

Hi @lewtun, no, I'm referring to the frequency of p. It should be 17 / 210 instead of 5 / 210 in the formula, because p occurs in both pug (5) and pun (12). The token pu has the same error: it should also be 17 / 210 instead of 5 / 210.
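
The counts can be checked with a minimal sketch like the one below. It assumes the initial vocabulary is every strict substring of the corpus words (a substring shorter than the word it comes from), and that a token's frequency is the summed frequency of every word occurrence that contains it. The helper name `substrings` is only illustrative and is not taken from the course code.

```python
from collections import Counter

# Corpus quoted above: word -> frequency
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}


def substrings(word, strict=False):
    """Yield every contiguous substring occurrence of `word`.

    With strict=True the whole word itself is skipped.
    """
    n = len(word)
    for i in range(n):
        for j in range(i + 1, n + 1):
            if strict and j - i == n:
                continue
            yield word[i:j]


# Assumed initial vocabulary: all strict substrings of the corpus words.
vocab = {s for word in corpus for s in substrings(word, strict=True)}

# Frequency of each vocabulary token, weighted by word frequency.
freqs = Counter()
for word, count in corpus.items():
    for s in substrings(word):
        if s in vocab:
            freqs[s] += count

total = sum(freqs.values())

# Probabilities of the two segmentations of "pug" discussed in this thread.
p_p_u_g = freqs["p"] / total * freqs["u"] / total * freqs["g"] / total
p_pu_g = freqs["pu"] / total * freqs["g"] / total

print(freqs["p"], freqs["pu"], total)
print(p_p_u_g, p_pu_g)
```

Under these assumptions, p and pu both come out at 17 and the total is 210, so the factors in the formula would be 17 / 210 rather than 5 / 210.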