Open akakakakakaa opened 4 years ago
When you create pretraining data,
the spm (SentencePiece) model treats every non-English token as a whole piece.
For example, if a non-English word is split into two pieces, analogous to (_care, ful) in English,
is_start_piece returns true for both pieces, so the second piece is never treated as a continuation of the first.
So, if you want to use a non-English language, you have to remove the all() check in the _is_start_piece_sp function.
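To illustrate the behavior described above, here is a minimal sketch. It is not the repository's actual code: the function name is_start_piece_sp, the flag treat_non_english_as_whole, and the Korean example pieces are all hypothetical, and the character sets are my assumption about what the check looks like.

```python
# Simplified, hypothetical reconstruction of the start-piece check.
ENGLISH_CHARS = set("abcdefghijklmnopqrstuvwxyz")
SPECIAL_PIECES = set('!"#$%&\'()*+,-./:;?@[\\]^_`{|}~')
WORD_START = "\u2581"  # the SentencePiece word-boundary marker


def is_start_piece_sp(piece, treat_non_english_as_whole=True):
    """Return True if `piece` should be treated as the start of a word."""
    if piece.startswith(WORD_START) or piece.startswith("<"):
        return True
    if piece in SPECIAL_PIECES:
        return True
    if treat_non_english_as_whole:
        allowed = ENGLISH_CHARS | SPECIAL_PIECES
        # Any character outside a-z/punctuation makes the piece a "start" piece.
        if not all(ch.lower() in allowed for ch in piece):
            return True
    return False


english = [WORD_START + "care", "ful"]
korean = [WORD_START + "조심", "스럽게"]  # hypothetical split of one word

english_flags = [is_start_piece_sp(p) for p in english]
korean_flags = [is_start_piece_sp(p) for p in korean]
korean_flags_fixed = [
    is_start_piece_sp(p, treat_non_english_as_whole=False) for p in korean
]
print(english_flags, korean_flags, korean_flags_fixed)
```

With the all() check active, the second Korean piece is flagged as a start piece, so the two pieces are never grouped into one word. With the check disabled, only the piece that carries the word-boundary marker counts as a start.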