>>> print( tokenizer.tokenize(text))
['ɡ', 'ei2', 'j', 'a4', 'n', '|', 'ɡ', 'eɜ', 'p', 'anɡ4', 'j', 'au5', 't', 'aaɜ', 'n', '|', 'z', 'o2', 'j', 'a1', 't', '|', 'h', 'au2', 'h', 'eiɜ']
The [UNK] comes from the '|', which does not seem to be part of your vocab:
>>> '|' in tokenizer.get_vocab()
False
But why does it insert '|' at somewhat random locations? As you can see, I have no word boundaries in text.
Why would it insert it at a random location? It inserts it in place of '|'. You may not have boundaries, but the text is tokenized anyway.
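One way to check this (a minimal sketch, assuming the tokenizer instance from the snippet above): a token that is absent from the vocab is converted to the unknown token id.
# '|' is not in the vocab, so converting it falls back to the unk id
print(tokenizer.convert_tokens_to_ids('|'))
print(tokenizer.unk_token_id)  # should print the same id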
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I guess I had a misconception about huggingface tokenizers. It seems that, in general, these tokenizers are themselves neural networks and their outputs may be somewhat unpredictable.
No no, they are not. You sometimes have dropout, but in this specific case I was pointing out that the unk token is not placed at random positions at all! It is placed in place of '|', which is not part of the vocab.
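For what it's worth, a simple determinism check (a sketch, reusing the same tokenizer and text as above) shows that nothing stochastic is involved:
out1 = tokenizer.tokenize(text)
out2 = tokenizer.tokenize(text)
print(out1 == out2)  # True: the same input always produces the same tokens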
But how does it determine where to insert the '|'? The output here does not make sense:
ɡ ei2 j a4 n [UNK] ɡ eɜ p anɡ4 j au5 t aaɜ n [UNK] z o2 j a1 t [UNK] h au2 h eiɜ
If the "words" here correspond to Chinese characters, then the word boundaries should be as follows:
ɡ ei2 | j a4 n | ɡ eɜ | p anɡ4 | j au5 | t aaɜ n | z o2 | j a1 t | h au2 | h eiɜ
So I suspect there must be a network of some kind (wrongly trained) that outputs these boundaries at the wrong locations, which appear somewhat random to me.
Actually, to give a bit more context, the phones here are sub-units of jyutping, which is like pinyin for Mandarin Chinese, except that it is for Cantonese. There's no reason to expect the transformers Wav2Vec2CTCTokenizer to have been pretrained on Cantonese jyutping, so it's not surprising that it got it wrong.
I suspect, then, that Wav2Vec2CTCTokenizer was pretrained on English data, or perhaps even multilingual data. Because these Cantonese phones are not part of the training data, it parses them in an apparently random way. I just ran some tests:
text = "應中共中央政治局委員兼外長王毅邀請,南韓外長趙兌烈將於下星期一至星期星期二訪華。" # Chinese characters
print(''.join(tokenizer.tokenize(text)))
text = "This is a good day. Let's go picnic!" # English
print(''.join(tokenizer.tokenize(text)))
The tokenizer correctly inserts the word boundaries for English:
"This|is|a|good|day.|Let's|go|picnic!"
And for Chinese, it correctly leaves out the word boundaries:
'應中共中央政治局委員兼外長王毅邀請,南韓外長趙兌烈將於下星期一至星期星期二訪華。'
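If I understand the default Wav2Vec2CTCTokenizer settings correctly (an assumption on my part), the '|' is the tokenizer's word delimiter: spaces appear to be mapped to word_delimiter_token before the vocab lookup, which would explain both results above.
print(tokenizer.word_delimiter_token)  # '|' in the default configuration
print(tokenizer.tokenize("a b"))       # the space between the two characters shows up as the delimiter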
So I think the conclusion of the matter is that Wav2Vec2CTCTokenizer is not designed for arbitrary sequences/languages (e.g. jyutping phones).
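For anyone running into the same thing, one possible workaround (a sketch only; the phone list and file name below are made up for illustration) is to build a custom vocab that explicitly contains the word delimiter, so that it never falls back to [UNK]:
import json
from transformers import Wav2Vec2CTCTokenizer

# Hypothetical phone inventory; a real vocab would list every jyutping sub-unit used.
phones = ["ɡ", "j", "n", "p", "t", "z", "h", "ei2", "a4", "eɜ", "anɡ4", "au5", "aaɜ", "o2", "a1", "au2", "eiɜ"]
vocab = {p: i for i, p in enumerate(phones)}
vocab["|"] = len(vocab)      # keep the word delimiter inside the vocab
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    word_delimiter_token="|",
)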
Yep, as it was trained / designed specifically for English, it makes sense that it is not optimal for Chinese!
System Info
transformers version: 4.29.2
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Example script:
Output:
Expected behavior
The unknown token (157, [UNK]) in this case should not be part of the encoded input.
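One way to express that check programmatically (a sketch, reusing the tokenizer and text from the example script):
ids = tokenizer(text).input_ids
print(tokenizer.unk_token_id in ids)  # expected: False, i.e. no [UNK] in the encoded input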