Hi, congrats on the paper. I really like the idea. I was wondering, what is your approach for masking tokens? Do you mask individual tokens independently, regardless of whether they might be units of a multi-token word, or do you mask all the tokens of a given word?

Let's say that we have a tokenized sentence in which `shareholder` is split into 3 tokens, and we want to mask it:

- **Independent masking:** `shareholder` consists of 3 tokens and you allow for one of them to be masked, without masking the other 2.
- **Whole word masking:** all tokens of `shareholder` have to be masked.

Which one is it? Or do you have a different approach?
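For concreteness, here is a minimal sketch of the two strategies in plain Python, assuming a BERT-style WordPiece split of `shareholder` into `share`, `##hold`, `##er` (the split, the example sentence, and the 15% rate are assumptions for illustration, not from the thread):

```python
import random

MASK = "[MASK]"
# Hypothetical WordPiece tokenization; "##" marks word-internal pieces.
tokens = ["the", "share", "##hold", "##er", "voted", "."]

def independent_masking(tokens, p=0.15):
    # Each position flips its own coin, so a single piece of
    # "shareholder" can be masked while the other two stay visible.
    return [MASK if random.random() < p else t for t in tokens]

def whole_word_masking(tokens, p=0.15):
    # Group "##"-continuation pieces with the piece that starts the word,
    # then flip one coin per group, so all pieces of "shareholder" are
    # masked together or not at all.
    out, i = [], 0
    while i < len(tokens):
        j = i + 1
        while j < len(tokens) and tokens[j].startswith("##"):
            j += 1
        group = tokens[i:j]
        out.extend([MASK] * len(group) if random.random() < p else group)
        i = j
    return out

print(independent_masking(tokens))
print(whole_word_masking(tokens))
```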
We do independent masking of tokens. I'm pretty sure whole-word masking would improve results a bit, but it is hard to implement whole-word masking when the masking is done dynamically in TensorFlow instead of in a pre-processing step.

Thanks!
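As an illustration of what "dynamic" independent masking looks like inside a TensorFlow input pipeline, here is a minimal sketch (a generic example under assumed values, not this repository's actual code; `MASK_ID`, the masking rate, and the token ids are placeholders):

```python
import tensorflow as tf

MASK_ID = 4       # hypothetical id of the [MASK] token
MASK_PROB = 0.15  # hypothetical masking rate

def dynamically_mask(token_ids):
    # Fresh independent Bernoulli(MASK_PROB) draw per position on every
    # call, so each training pass sees a different masking of the sequence.
    rand = tf.random.uniform(tf.shape(token_ids), 0.0, 1.0)
    masked = rand < MASK_PROB
    return tf.where(masked,
                    tf.fill(tf.shape(token_ids), MASK_ID),
                    token_ids)

# Applied on the fly in the data pipeline rather than in pre-processing.
ds = tf.data.Dataset.from_tensors(tf.constant([17, 523, 881, 904, 62]))
ds = ds.map(dynamically_mask)
```

Because each position draws its own sample, nothing ties the pieces of one word together; whole-word masking would additionally need word-boundary information threaded through the tensor pipeline so grouped positions flip jointly, which is why it is easier to do in a pre-processing step.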