octanove / shiba

Pytorch implementation and pre-trained Japanese model for CANINE, the efficient character-level transformer.

Tokenization questions #7

Closed CsAbdulelah closed 1 year ago

CsAbdulelah commented 1 year ago

Hi Joshua, thank you for your support.

I have a couple of questions that confuse me:

1. Is the codepoint tokenizer just using Python's ord() function, and nothing else?
2. Is the output of the to_example file just input ids, with no attention mask?
3. I see that the training folder contains a BPE vocab file and a tokenizer training file. What is their purpose? Since CANINE uses only characters as inputs, could you elaborate more on this point?

Mindful commented 1 year ago
  1. Yeah, this is correct: codepoints are computed using ord(). Obviously, having an actual embedding for every single Unicode codepoint would be far too many embeddings, so CANINE (and SHIBA) use a hashing/bucketing strategy in the embedder to reduce the total number of embeddings. The tokenization code can be seen here, and the hashing embedder can be seen here. For further detail, your best bet is the CANINE paper. One thing I would add, though, is that in retrospect we probably could have supported only Japanese and English characters, in which case we might not have needed the hashing. In that sense, if you're training for only one or two languages, you might not need the hashing either. (There's a rough sketch of the hashing idea after this list.)
  2. Yes, that's correct. In fact, it has to be this way, because we don't know how the examples will be batched during training, and consequently where the padding will be (unless you fix the batches and reuse the same batches every time, which is generally not how people train). Computing the attention mask from a batch of input ids/padding is trivial, so this isn't an issue; you can see code for this here, and a small sketch after this list.
  3. We experimented with a BPE masking strategy, but random span masking worked the best for us overall, so we decided to go with that. The training script lets you pick, but if you intend to use random span masking (or random character masking), you don't need to worry about the BPE info. If you want to use BPE masking, you'll need to train a BPE model on your own language.
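
As a rough illustration of point 1, here is a simplified sketch of ord() "tokenization" plus a hashing/bucketing embedder. This is not the actual SHIBA/CANINE code; the `HashBucketEmbedder` class, the primes, the bucket count, and the dimensions are placeholder values chosen for illustration.

```python
import torch
import torch.nn as nn

# "Tokenization" is just ord(): each character becomes its Unicode codepoint.
def encode(text: str) -> torch.Tensor:
    return torch.tensor([ord(c) for c in text], dtype=torch.long)

class HashBucketEmbedder(nn.Module):
    """Instead of one embedding row per possible codepoint (~1.1M of them),
    map each codepoint into a small number of buckets with several cheap
    hash functions and concatenate the per-hash embeddings.
    (Placeholder hash functions and sizes, not the real SHIBA values.)"""

    def __init__(self, num_buckets: int = 16000, dim_per_hash: int = 96):
        super().__init__()
        self.primes = [31, 43, 59, 61, 73, 97, 103, 113]  # one cheap hash per prime
        self.num_buckets = num_buckets
        self.embeddings = nn.ModuleList(
            [nn.Embedding(num_buckets, dim_per_hash) for _ in self.primes]
        )

    def forward(self, codepoints: torch.Tensor) -> torch.Tensor:
        # Hash each codepoint into a bucket and look up one slice per hash.
        pieces = [
            emb((codepoints * p) % self.num_buckets)
            for p, emb in zip(self.primes, self.embeddings)
        ]
        return torch.cat(pieces, dim=-1)  # (batch, seq_len, 8 * dim_per_hash)

ids = encode("こんにちは")                # tensor of 5 codepoints
embedder = HashBucketEmbedder()
print(embedder(ids.unsqueeze(0)).shape)  # torch.Size([1, 5, 768])
```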
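
And for point 2, a minimal sketch of deriving the attention mask from padded input ids at batching time. The `PAD_ID = 0` constant here is an assumption for illustration; check the actual padding value used in the repo.

```python
import torch

PAD_ID = 0  # assumed padding value for illustration; the repo defines its own constant

def pad_and_mask(examples: list[torch.Tensor]) -> tuple[torch.Tensor, torch.Tensor]:
    """Pad a batch of variable-length codepoint sequences to the same length and
    compute the attention mask from the padding: 1 = real character, 0 = pad."""
    max_len = max(len(ex) for ex in examples)
    input_ids = torch.full((len(examples), max_len), PAD_ID, dtype=torch.long)
    for i, ex in enumerate(examples):
        input_ids[i, : len(ex)] = ex
    attention_mask = (input_ids != PAD_ID).long()
    return input_ids, attention_mask

batch = [torch.tensor([ord(c) for c in s]) for s in ("犬", "こんにちは")]
input_ids, attention_mask = pad_and_mask(batch)
print(attention_mask)  # tensor([[1, 0, 0, 0, 0],
                       #         [1, 1, 1, 1, 1]])
```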
Mindful commented 1 year ago

Closing for now, feel free to open a new issue if you have more questions.

CsAbdulelah commented 1 year ago

Thanks for your response. I have a question about the difference between rand_span and rand_char. I found that CANINE uses rand_char; I would appreciate it if you could elaborate. Thank you in advance.

Mindful commented 1 year ago

@CsAbdulelah CANINE uses random span masking - where are you seeing that it uses random character masking? Below is a screenshot from the CANINE paper. [screenshot not reproduced here]

Regarding the difference: both strategies mask characters. rand_char masks characters entirely at random, whereas rand_span masks random consecutive spans of characters. Importantly, rand_span is the "harder" pretraining task, and we found it produced better models. There's a toy sketch of the two strategies below. If you have more detailed questions, please feel free to open a new issue.
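
To make the difference concrete, here is a toy sketch of the two strategies. This is not the repo's actual masking code; the 15% masking rate and the span-length distribution are made-up values for illustration.

```python
import random

MASK_RATIO = 0.15  # illustrative masking rate, not the value used in training

def rand_char_mask(length: int) -> list[bool]:
    """Mask each character position independently at random."""
    return [random.random() < MASK_RATIO for _ in range(length)]

def rand_span_mask(length: int, mean_span: int = 3) -> list[bool]:
    """Mask random consecutive spans of characters until roughly
    MASK_RATIO of the sequence is covered."""
    mask = [False] * length
    budget = int(length * MASK_RATIO)
    while sum(mask) < budget:
        span = max(1, round(random.gauss(mean_span, 1)))
        start = random.randrange(length)
        for i in range(start, min(start + span, length)):
            mask[i] = True
    return mask

random.seed(0)
text = "自然言語処理はおもしろい"
print([c for c, m in zip(text, rand_char_mask(len(text))) if m])  # scattered single characters
print([c for c, m in zip(text, rand_span_mask(len(text))) if m])  # contiguous runs of characters
```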