google-research / albert

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Be careful if you use a foreign language with the spm model when you create pretraining data #161

Open akakakakakaa opened 4 years ago

akakakakakaa commented 4 years ago

When you create pretraining data, the script treats every foreign-language (non-English) piece as a whole word. For example, where an English word splits into two pieces like (_care, ful), is_start_piece(_care) is true and is_start_piece(ful) is false; but for a non-English word split the same way, is_start_piece returns true for both pieces, even though the second piece continues the same word.
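To make the reported behavior concrete, here is a minimal sketch of the kind of check involved. This is paraphrased, not the verbatim code from create_pretraining_data.py; the real function also handles special tokens and punctuation pieces.

```python
# Minimal sketch of the start-piece check described above; paraphrased,
# not the verbatim repository code.
ENGLISH_CHARS = set("abcdefghijklmnopqrstuvwxyz")

def _is_start_piece_sp(piece):
    """Return True if `piece` should count as the start of a new word."""
    if piece.startswith("▁"):  # SentencePiece word-boundary marker
        return True
    # This all()-based clause is the one discussed below: any piece
    # containing a character outside a-z is flagged as a start piece,
    # so every piece of a non-English word returns True.
    if not all(ch.lower() in ENGLISH_CHARS for ch in piece):
        return True
    return False

print(_is_start_piece_sp("▁care"), _is_start_piece_sp("ful"))
# True False  -> "ful" is correctly seen as continuing the word
print(_is_start_piece_sp("▁안녕"), _is_start_piece_sp("하세요"))
# True True   -> the continuation piece is also flagged as a word start
```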

So, if you want to use a foreign language, you have to remove the all() clause in the _is_start_piece_sp function.
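In terms of the sketch above, the suggested fix amounts to dropping the all()-based clause so that only the ▁ marker decides word boundaries. This is a hypothetical patched version; the real function would likely need to keep its special-token and punctuation handling.

```python
def _is_start_piece_sp(piece):
    """Patched sketch: rely only on the SentencePiece boundary marker."""
    return piece.startswith("▁")
```

This matters because is_start_piece feeds the n-gram masking logic when pretraining data is created, so when every piece is flagged as a start, masking spans for non-English text presumably collapse to single pieces.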