Closed — CsAbdulelah closed this issue 1 year ago
Tokenization is done by simply converting each character to its Unicode codepoint with Python's `ord()`. Obviously, having an actual embedding for every single Unicode codepoint would be too many embeddings, so CANINE (and SHIBA) use a hashing/bucketing strategy in the embedder to reduce the total number of embeddings. Tokenization code can be seen here, and the hashing embedder can be seen here. For further detail, your best bet is the CANINE paper. One thing I would add, though, is that in retrospect we probably could have supported only Japanese and English characters, in which case we might not have needed the hashing. In that sense, if you're training on only one or two languages, you may not need the hashing either. Closing for now; feel free to open a new issue if you have more questions.
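To make the hashing/bucketing idea concrete, here is a minimal sketch of that style of embedder. The sizes, prime multipliers, and function names below are hypothetical illustrations, not the actual SHIBA or CANINE configuration: each codepoint is hashed K ways into small bucket tables, and the K looked-up slices are concatenated into one character embedding.

```python
import random

# Hypothetical sizes for illustration -- see the CANINE paper for the real config.
NUM_HASHES = 8      # K independent hash functions
NUM_BUCKETS = 1024  # buckets per table (vs. ~1.1M possible Unicode codepoints)
SLICE_DIM = 8       # each hash contributes a small slice of the final embedding

PRIMES = [31, 43, 59, 61, 73, 97, 103, 113]  # one multiplier per hash function

rng = random.Random(0)
# K small tables instead of one enormous codepoint -> vector table
tables = [
    [[rng.gauss(0.0, 0.02) for _ in range(SLICE_DIM)] for _ in range(NUM_BUCKETS)]
    for _ in range(NUM_HASHES)
]

def embed_char(ch):
    """Tokenize (ord) and embed one character via K hashed lookups."""
    codepoint = ord(ch)  # the entire "tokenizer" is this one call
    slices = []
    for prime, table in zip(PRIMES, tables):
        bucket = (codepoint * prime) % NUM_BUCKETS
        slices.extend(table[bucket])
    return slices  # concatenation of the K slices

vec = embed_char("日")
print(len(vec))  # 64 = NUM_HASHES * SLICE_DIM
```

The point of the trick is that collisions only affect one slice at a time: two codepoints would need to collide under all K hash functions to receive an identical embedding, which is vanishingly unlikely, while the parameter count stays small.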
Thanks for your response. I have a question about the difference between rand_span and rand_char. I found that CANINE uses rand_char; I would appreciate it if you could elaborate. Thank you in advance.
@CsAbdulelah CANINE uses random span masking - where are you seeing that it uses random character masking? Below is a screenshot from the CANINE paper.
As for the difference: both strategies mask characters. `rand_char` masks character positions entirely at random, whereas `rand_span` masks random consecutive spans of characters. Importantly, `rand_span` is the "harder" pretraining task, and we found it produced better models. If you have more detailed questions, please feel free to open a new issue.
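A toy sketch may make the contrast clearer. This is not the repo's actual implementation; the `MASK_ID` sentinel, function names, and span-sampling loop are simplified assumptions.

```python
import random

MASK_ID = 0  # hypothetical sentinel id standing in for the [MASK] character

def rand_char_mask(ids, n_mask, rng):
    """Mask n_mask character positions chosen independently at random."""
    ids = list(ids)
    for pos in rng.sample(range(len(ids)), n_mask):
        ids[pos] = MASK_ID
    return ids

def rand_span_mask(ids, n_mask, span_len, rng):
    """Mask random consecutive spans until at least n_mask positions are covered."""
    ids = list(ids)
    masked = 0
    while masked < n_mask:
        start = rng.randrange(0, len(ids) - span_len + 1)
        for pos in range(start, start + span_len):
            if ids[pos] != MASK_ID:
                ids[pos] = MASK_ID
                masked += 1
    return ids

rng = random.Random(0)
text = [ord(c) for c in "character level masking"]
print(rand_char_mask(text, 4, rng))
print(rand_span_mask(text, 4, 2, rng))
```

Intuitively, span masking is harder because a masked character's immediate neighbors are often masked too, so the model cannot recover it from surface context alone.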
Hi Joshua, thank you for your support.
I have a couple of questions that confuse me:
1. Is the codepoint tokenizer just using the `ord()` function in Python, nothing else?
2. Is the output of the `to_example` file just input ids, with no attention mask?
3. I see that in the training folder there are a `bbe_vocap` file and a tokenizer training file. What is their purpose, given that CANINE uses only characters as inputs? Could you elaborate more on this point?