huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Issue in BertWordPieceTokenizer #756

Closed: KyloRen1 closed this issue 3 years ago

KyloRen1 commented 3 years ago

I am trying to train a custom BertWordPieceTokenizer for the Ukrainian language.

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer(lowercase=True, unicode_normalizer='nfkc')

tokenizer.train(
    files=paths[0],
    vocab_size=31000,
    min_frequency=2,
    special_tokens=['[PAD]', '[UNK]', '[CLS]', '[MASK]', '[SEP]'],
)

And after it is trained, I tried to tokenize one of the samples and decode each token with the ByteLevel decoder:

from tokenizers.decoders import ByteLevel

decoder = ByteLevel()
[decoder.decode([tok]) for tok in tokenizer.encode('Тарас Шевченко – великий українсьский').tokens]

['�', '�', 'ара', 'с', ' �', '�', 'евчен', 'ко', ' –', ' великий', ' україн', 'сь', 'ский']

Why does this token '�' occur?

Another problem occurred after training a BertWordPieceTokenizer:

from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=True,
    strip_accents=True,
    lowercase=True,
)
tokenizer.train(
    paths[0],
    vocab_size=31000,
    min_frequency=2,
    show_progress=True,
    special_tokens=['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'],
    limit_alphabet=1000,
    wordpieces_prefix="##",
)

The tokenized text replaces 'ї' with 'і' and 'й' with 'и'. Why can that happen?

['[CLS]', 'тарас', 'шевченко', '–', 'великии', 'украін', '##сь', '##скии', '[SEP]']

Thanks!

Narsil commented 3 years ago

Hi, ByteLevel replaces some unicode codepoints for display reasons, but the decode should be fine (it doesn't matter if individual pieces are not readable; you should look at the ids instead anyway).

So '�' is probably fine (I would have to check a bit more, but if decode is fine, then yes, this is intended).
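For example, a quick sanity check (a sketch, reusing the ByteLevelBPETokenizer trained in the first snippet) is to decode the whole id sequence instead of decoding token strings one by one:

# Sketch: decode the full id sequence rather than per-token pieces.
encoding = tokenizer.encode('Тарас Шевченко – великий українсьский')
print(encoding.tokens)                 # individual pieces may look unreadable
print(tokenizer.decode(encoding.ids))  # the full decode should be readable again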

Regarding 'ї' being replaced by 'і' and 'й' by 'и': you set strip_accents=True, so it is stripping accents.
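For illustration, BERT-style accent stripping works on the NFD decomposition: 'ї' decomposes into 'і' plus a combining diaeresis, 'й' into 'и' plus a combining breve, and the combining marks are then dropped. A minimal plain-Python sketch of that behaviour (not the library's internal code):

import unicodedata

def strip_accents(text):
    # NFD splits each character into a base letter plus combining marks
    # (category 'Mn'); dropping the marks is what strip_accents=True does.
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')

print(strip_accents('ї'))  # -> 'і'
print(strip_accents('й'))  # -> 'и'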

Cheers.

KyloRen1 commented 3 years ago

@Narsil thanks for the clarification. The thing is, I am trying to convert the ByteLevel tokenizer vocab file into a BertWordPiece tokenizer vocab file, and I am struggling with '�'.

stefan-it commented 3 years ago

Hi @KyloRen1, regarding your other question about the 'ї' to 'і' conversion: this comes from the accent-stripping option (strip_accents=True), so if you want to train an uncased model, you should try disabling it.

(I'm not sure for which downstream tasks or languages accent stripping is a good idea, btw ...)
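A minimal sketch of that change, keeping the rest of the configuration from the snippet above (paths and hyperparameters are the original poster's values, not verified):

from tokenizers import BertWordPieceTokenizer

# Same setup as before, but with accent stripping disabled so 'ї' and 'й'
# survive normalization while lowercasing is kept.
tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=True,
    strip_accents=False,
    lowercase=True,
)
tokenizer.train(
    paths[0],
    vocab_size=31000,
    min_frequency=2,
    special_tokens=['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'],
)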

KyloRen1 commented 3 years ago

@stefan-it Thanks a lot, setting strip_accents=False improves the tokenisation.

Narsil commented 3 years ago

Regarding converting the ByteLevel tokenizer vocab file into a BertWordPiece tokenizer vocab file and struggling with '�':

I don't think you can convert it that way. ByteLevel emulates something like a GPT-2/RoBERTa tokenizer; BERT, if I'm not mistaken, is NOT ByteLevel, so you can't "convert" one into the other. Internally, the ByteLevel strategy rewrites unicode characters to make sure every token is printable and has a size, to ease debugging/printing. With that rewrite, you're going to have a tough time reverting it. The code that does the unicode rewrite is here, if you really want to do that: https://github.com/huggingface/tokenizers/blob/master/tokenizers/src/pre_tokenizers/byte_level.rs#L11
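For reference, the table in that Rust file is essentially the GPT-2 byte-to-unicode mapping. A rough Python equivalent of the mapping and its inverse (for illustration only, not the library's API) would be:

def bytes_to_unicode():
    # Map every byte value 0-255 to a printable unicode character, GPT-2 style.
    bs = (list(range(ord('!'), ord('~') + 1))
          + list(range(ord('¡'), ord('¬') + 1))
          + list(range(ord('®'), ord('ÿ') + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_to_char = bytes_to_unicode()
char_to_byte = {c: b for b, c in byte_to_char.items()}

def token_to_text(token):
    # Reverse the rewrite: map each character of a ByteLevel token back to its
    # raw byte and decode as UTF-8. A multi-byte character (e.g. Cyrillic) can
    # be split across two tokens, and the incomplete half decodes to '�'.
    return bytes(char_to_byte[c] for c in token).decode('utf-8', errors='replace')

That splitting of multi-byte UTF-8 sequences across tokens is also why the '�' pieces showed up in the per-token decode earlier in this thread.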

Why can't you just train a BertWordPiece in the first place ?

KyloRen1 commented 3 years ago

@Narsil Thanks for the link. I tried to train a BertWordPiece tokenizer at first, but because of strip_accents=True I couldn't figure out the 'ї' to 'і' conversion. Now I will use BertWordPiece.

KyloRen1 commented 3 years ago

My approach for converting the ByteLevel tokenizer to BertWordPiece relied on iterating over the ByteLevel vocab file, converting those bytes back to tokens using the ByteLevel decoder, and adding ## in front of subword tokens.
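Roughly, that conversion looks like the sketch below (a hypothetical helper, assuming every ByteLevel vocab entry round-trips cleanly through the decoder, which is not guaranteed):

from tokenizers.decoders import ByteLevel

def bytelevel_vocab_to_wordpiece(vocab):
    # vocab: ByteLevel token string -> id, e.g. from tokenizer.get_vocab().
    decoder = ByteLevel()
    wordpiece_tokens = []
    for token, _id in sorted(vocab.items(), key=lambda kv: kv[1]):
        text = decoder.decode([token])
        if text.startswith(' '):
            # A leading space marks a word-initial token in ByteLevel output.
            wordpiece_tokens.append(text.strip())
        else:
            # Everything else is treated as a continuation piece with the ## prefix.
            wordpiece_tokens.append('##' + text)
    return wordpiece_tokens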

Narsil commented 3 years ago

Ok, it might have worked, but that seems flaky. You would be better off using BertWordPiece directly.

KyloRen1 commented 3 years ago

Thanks again for the clarification. I will close the issue.