google-research / electra

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Apache License 2.0

Token-masking method: whole words or sub-words? #57

Closed cbaziotis closed 4 years ago

cbaziotis commented 4 years ago

Hi, congrats on the paper. I really like the idea. I was wondering what your approach is for masking tokens. Do you mask individual tokens independently, regardless of whether they might be units of a multi-token word, or do you mask all the tokens of a given word?

Let's say that we have this tokenized sentence and we want to mask shareholder:

<s> ▁Meanwhile , ▁share hold er ▁funds ▁have ▁climbed ▁160 ▁per ▁cent ▁since ▁2009 , ▁when ▁Miss ▁Moore ▁first ▁became ▁a ▁fully - fl ed ged ▁BBC ▁commissioner , </s>
  1. Independent masking: shareholder consists of 3 tokens, and any one of them can be masked without masking the other 2.

    <s> ▁Meanwhile , ▁share <mask> er ▁funds ▁have ▁climbed ▁160 ▁per ▁cent ▁since ▁2009 , ▁when ▁Miss ▁Moore ▁first ▁became ▁a ▁fully - fl ed ged ▁BBC ▁commissioner , </s>
  2. Whole-word masking: all the tokens of shareholder have to be masked together.

    <mask> <mask> <mask> ▁funds ▁have ▁climbed ▁160 ▁per ▁cent ▁since ▁2009 , ▁when ▁Miss ▁Moore ▁first ▁became ▁a ▁fully - fl ed ged ▁BBC ▁commissioner , </s>

    Which one is it? Or do you have a different approach?
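
For concreteness, here is a rough sketch of the two options (a hypothetical helper, not code from this repo; it assumes SentencePiece-style "▁" word-boundary markers and ignores special-token handling):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, whole_word=False, mask_token="<mask>"):
    # Group token indices into words: a token starting with "▁" opens a new word,
    # continuation pieces (e.g. "hold", "er") attach to the preceding word.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("▁") or not words:
            words.append([i])
        else:
            words[-1].append(i)

    # Independent masking treats every token as its own unit;
    # whole-word masking masks all pieces of a sampled word together.
    units = words if whole_word else [[i] for i in range(len(tokens))]
    masked = list(tokens)
    for unit in units:
        if random.random() < mask_prob:
            for i in unit:
                masked[i] = mask_token
    return masked
```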

clarkkev commented 4 years ago

We do independent masking of tokens. I'm pretty sure whole-word masking would improve results a bit, but it is hard to implement when the masking is done dynamically in TensorFlow instead of in a pre-processing step.
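
A rough sketch of what dynamic independent masking looks like in TensorFlow (an illustration, not the actual ELECTRA pretraining code; the function name and signature are hypothetical):

```python
import tensorflow as tf

def dynamic_independent_mask(input_ids, mask_id, mask_prob=0.15):
    # input_ids: [batch, seq_len] int32 token ids.
    # One uniform draw per position; each token is masked independently.
    probs = tf.random.uniform(tf.shape(input_ids))
    mask_positions = probs < mask_prob
    return tf.where(mask_positions,
                    tf.fill(tf.shape(input_ids), mask_id),
                    input_ids)
```

Whole-word masking would additionally need each token's word id available inside the graph so that all pieces of a word are masked together, which is why it is easier to do in a pre-processing step.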

cbaziotis commented 4 years ago

Thanks!