Closed: KyloRen1 closed this issue 3 years ago
Hi, ByteLevel will replace some unicode codepoints for display reasons. But the decode should be fine (it doesn't matter if the parts are not readable, you should look at the ids instead anyway).
So '�' is probably fine (I would have to check a bit more, but if decode is fine, then yes, this is intended).
"Tokenized text replaces 'ї' by 'і', and 'й' by 'и'. Why can that happen?"

You set strip_accents=True, so it is stripping accents.
Cheers.
@Narsil thanks for the clarification. The thing is, I am trying to convert the ByteLevel tokenizer vocab file to a BertWordPiece tokenizer vocab file, and I am struggling with '�'.
Hi @KyloRen1, regarding your other question about the 'ї' to 'і' conversion: this comes from the accent stripping option (strip_accents=True), so if you want to train an uncased model, you should first try to disable it.
(I'm not sure for what downstream tasks or languages accent stripping is a good idea, btw...)
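For what it's worth, the stripping itself is just Unicode NFD decomposition followed by dropping combining marks, which is essentially what the BERT-style normalizer does under the hood. A minimal Python sketch using only the standard unicodedata module shows why exactly those two letters change:

```python
import unicodedata

def strip_accents(text):
    # NFD splits precomposed characters into base char + combining marks
    decomposed = unicodedata.normalize("NFD", text)
    # drop the combining marks (Unicode category "Mn"), keep everything else
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# 'ї' (U+0457) decomposes to 'і' (U+0456) + combining diaeresis (U+0308),
# and 'й' (U+0439) to 'и' (U+0438) + combining breve (U+0306),
# so stripping the marks collapses them into the base letters:
assert strip_accents("ї") == "\u0456"  # Cyrillic 'і'
assert strip_accents("й") == "\u0438"  # Cyrillic 'и'
```

This is why disabling strip_accents keeps 'ї' and 'й' intact.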
@stefan-it Thanks a lot, setting strip_accents=False improves the tokenisation.
"Byte level tokenizer vocab file to BertWordPiece tokenizer vocab file, and I am struggling with '�'"
I don't think you can convert that way. ByteLevel is emulating something like a gpt2/roberta tokenizer. Bert if I'm not mistaken is NOT ByteLevel, so you can't "convert" one into the other. Internally the ByteLevel strategy rewrites unicode characters to make sure every token is printable and has a size to ease debugging/printing. With that rewrite, you're going to have a tough time reverting it. The code that does the unicode rewrite is here if you really want to do that: https://github.com/huggingface/tokenizers/blob/master/tokenizers/src/pre_tokenizers/byte_level.rs#L11
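For the curious, that rewrite is the same byte-to-unicode table GPT-2 uses: every one of the 256 byte values is mapped to a printable character (the space byte, for instance, becomes 'Ġ'). Below is a Python sketch of the table and how to revert it; it is an approximation of what byte_level.rs does, not the library's public API:

```python
def bytes_to_unicode():
    # printable latin-1 bytes map to themselves...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    # ...and every other byte is shifted up into the printable range 256+n
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

table = bytes_to_unicode()
# 'ї' is two UTF-8 bytes, so it shows up as two odd-looking chars in the vocab
token = "".join(table[b] for b in "ї".encode("utf-8"))
# reverting: map each char back to its byte, then decode the bytes as UTF-8
inverse = {c: b for b, c in table.items()}
assert bytes(inverse[c] for c in token).decode("utf-8") == "ї"
```

This also explains the '�' in the per-token output above: a single token can hold only part of a multi-byte character, which is unreadable on its own but decodes fine once the whole id sequence is joined.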
Why can't you just train a BertWordPiece in the first place?
@Narsil Thanks for the link. I had tried to train BertWordPiece first, but because of strip_accents=True I couldn't figure out where the 'ї' to 'і' conversion was coming from. Now I will use BertWordPiece.
My approach to converting the ByteLevel tokenizer to BertWordPiece relied on iterating over the ByteLevel vocab file, converting those bytes back to tokens using the ByteLevel decoder, and adding ## before subword tokens.
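Sketched out, that approach might look roughly like this (to_wordpiece is a hypothetical helper, and the byte table mirrors the one in byte_level.rs rather than calling the library):

```python
def byte_table():
    # same GPT-2-style byte -> printable-char table as byte_level.rs
    bs = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

INVERSE = {c: b for b, c in byte_table().items()}

def to_wordpiece(token):
    # map each printable char back to its byte, then decode as UTF-8
    raw = bytes(INVERSE[c] for c in token).decode("utf-8", errors="replace")
    if raw.startswith(" "):   # 'Ġ' decodes to a space, i.e. a word-initial piece
        return raw[1:]
    return "##" + raw         # anything else becomes a continuation subword

assert to_wordpiece("Ġhello") == "hello"
assert to_wordpiece("ing") == "##ing"
```

Note that tokens which split a multi-byte character in the middle decode to '�' here, which is exactly where this conversion runs into trouble.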
Ok, it might have worked, but that seems flaky. You would be better off using BertWordPiece directly.
Thanks again for the clarification. I will close the issue.
I am trying to train a custom BertWordPieceTokenizer for the Ukrainian language.
tokenizer = ByteLevelBPETokenizer(lowercase=True, unicode_normalizer='nfkc')
tokenizer.train(files=paths[0], vocab_size=31000, min_frequency=2, special_tokens=['[PAD]', '[UNK]', '[CLS]', '[MASK]', '[SEP]'])
And after it was trained, I tried to tokenize one of the samples using the ByteLevel decoder:
[decoder.decode([tok]) for tok in tokenizer.encode('Тарас Шевченко – великий українсьский').tokens]
['�', '�', 'ара', 'с',' �', '�', 'евчен', 'ко',' –',' великий',' україн','сь', 'ский']
Why does this token '�' occur?
Another problem occurred after training BertWordPieceTokenizer:
tokenizer = BertWordPieceTokenizer(clean_text=True, handle_chinese_chars=True, strip_accents=True, lowercase=True)
tokenizer.train(paths[0], vocab_size=31000, min_frequency=2, show_progress=True, special_tokens=['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'], limit_alphabet=1000, wordpieces_prefix="##")
Tokenized text replaces 'ї' by 'і', and 'й' by 'и'. Why can that happen?
['[CLS]', 'тарас', 'шевченко', '–', 'великии', 'украін', '##сь', '##скии', '[SEP]']
Thanks!