huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Wav2Vec2CTCTokenizer adds random unknown tokens to encoded input #30561

Open tshmak opened 2 months ago

tshmak commented 2 months ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

Example script:

import json
from transformers import Wav2Vec2CTCTokenizer

phones = 'ɔː,(en),oɜ,inɡɜ,o4,b,f,enɡ1,oi2,aa7,eɪ,eː,au7,aaiɜ,onɡ4,oe6,uiɜ,ɒ,iə,c,aa2,oenɡ1,ei7,oenɡ6,au1,ŋ5,iu5,aɪə,ou4,d,ai7,k,i2,eoi5,aai2,j,oenɡɜ,u1,ŋ4,i,m,oi6,unɡɜ,ou2,au2,p,yu1,a,yu4,onɡ1,ɛ,e5,əʊ,ou6,yu5,aɜ,oi1,onɡ5,ai5,aau5,inɡ5,ai1,eɜ,ei5,uɜ,o2,i5,nɡ6,enɡ4,ɐ,l,o1,iu4,enɡ6,ou5,onɡ7,anɡ1,tʃ,aau2,eo6,aa6,iː,enɡ7,oenɡ5,ŋ,aau1,u5,eo5,yu7,oi7,aaɜ,oiɜ,yu2,aa5,ɑː,oe1,n,eoi2,ui2,oenɡ2,inɡ1,anɡ4,t,au4,ei4,u2,aanɡ2,ui4,dʒ,[PAD],a1,e,oenɡ7,aau4,onɡɜ,eoi6,unɡ5,ɹ,e6,yu6,ɪ,ʃ,ei2,aauɜ,enɡɜ,unɡ1,aɪ,i6,eiɜ,aanɡ1,inɡ6,iu1,o5,ui1,inɡ2,unɡ4,eoi4,eo4,uː,ei1,oenɡ4,aa4,aanɡ7,a2,e4,enɡ2,a5,auɜ,iɜ,əl,ai6,iu2,a4,e2,ouɜ,eoi1,anɡ2,[UNK],h,onɡ6,aau6,nɡ5,nɡ4,enɡ5,oeɜ,inɡ4,a6,eoiɜ,e1,ʊ,i1,o7,z,au6,ai4,anɡ6,aai1,oi5,aʊ,v,iu6,unɡ7,au5,eoɜ,aanɡ6,ou1,aanɡ5,(zhy),anɡɜ,oi4,onɡ2,a7,w,ui5,ui6,oe5,unɡ6,aanɡ4,ɔɪ,inɡ7,ɡ,s,o6,aa1,u6,aai4,ʌ,ou7,yuɜ,ɜː,ei6,aiɜ,ə,anɡ7,ai2,u4,iu7,iuɜ,eo1,aai6,eo2,i4,i7,aai5,unɡ2'.split(',')

# Write the phone-to-id mapping to a vocab file
phones_dict = {x: i for i, x in enumerate(phones)}
with open('test.json', 'w') as f:
    json.dump(phones_dict, f, indent=4, ensure_ascii=False)

tokenizer = Wav2Vec2CTCTokenizer('test.json', unk_token='[UNK]', pad_token='[PAD]')

# Space-separated jyutping phones, with no word boundaries marked
text = 'ɡ ei2 j a4 n ɡ eɜ p anɡ4 j au5 t aaɜ n z o2 j a1 t h au2 h eiɜ'
print(tokenizer(text))
print(tokenizer.decode(tokenizer(text)['input_ids'], spaces_between_special_tokens=True))

Output:

{'input_ids': [200, 122, 35, 152, 96, 157, 200, 62, 45, 101, 35, 182, 102, 90, 96, 157, 172, 65, 35, 110, 102, 157, 158, 44, 158, 128], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
ɡ ei2 j a4 n [UNK] ɡ eɜ p anɡ4 j au5 t aaɜ n [UNK] z o2 j a1 t [UNK] h au2 h eiɜ
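
For reference, mapping the returned ids back to tokens shows where the unexpected entries sit (a quick check using the tokenizer's standard convert_ids_to_tokens method; enc here is just the encoding produced by the script above):

enc = tokenizer(text)
# Every position holding id 157 comes back as the unk token
print(tokenizer.convert_ids_to_tokens(enc['input_ids']))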

Expected behavior

The unknown token (id 157, i.e. [UNK]) should not be part of the encoded input.

ArthurZucker commented 2 months ago
>>> print( tokenizer.tokenize(text))
['ɡ', 'ei2', 'j', 'a4', 'n', '|', 'ɡ', 'eɜ', 'p', 'anɡ4', 'j', 'au5', 't', 'aaɜ', 'n', '|', 'z', 'o2', 'j', 'a1', 't', '|', 'h', 'au2', 'h', 'eiɜ']

The [UNK] comes from the '|', which does not seem to be part of your vocab:

>>> '|' in tokenizer.get_vocab()
False
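
(If the goal is only to avoid the [UNK], one option is to make the delimiter resolvable: either add '|' to the vocab file, or pass a word_delimiter_token that already exists in the vocab. word_delimiter_token is a regular constructor argument of Wav2Vec2CTCTokenizer and defaults to '|'. A minimal sketch of the first option, reusing phones_dict from the script above:)

# Sketch: include the default word delimiter '|' in the vocab so it never maps to [UNK]
phones_dict['|'] = len(phones_dict)
with open('test.json', 'w') as f:
    json.dump(phones_dict, f, indent=4, ensure_ascii=False)

tokenizer = Wav2Vec2CTCTokenizer('test.json', unk_token='[UNK]', pad_token='[PAD]',
                                 word_delimiter_token='|')
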
tshmak commented 2 months ago

But why does it insert '|' at somewhat random locations? As you can see, there are no word boundaries in my text.

ArthurZucker commented 1 month ago

Why would it insert it at a random location? The [UNK] is inserted in place of '|'. You may not have word boundaries, but the text is tokenized anyway.
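
(For reference, the delimiter in question is the tokenizer's word_delimiter_token, which defaults to '|' when none is passed; this can be checked directly on the tokenizer built in the original script:)

>>> tokenizer.word_delimiter_token
'|'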

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

tshmak commented 1 month ago

I guess I had a misconception about huggingface tokenizers. It seems in general, these tokenizers are themselves neural networks and their outputs may be somewhat unpredictable.

ArthurZucker commented 4 weeks ago

No no, they are not. You sometimes have dropout, but in this specific case I was pointing out that the unk token is not placed at random positions at all! It is placed in place of '|', which is not part of the vocab.
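
(A quick way to see this: with the vocab from the original script, which has no '|' entry, any token that is not in test.json resolves to the unk id, so '|' and '[UNK]' should map to the same id, 157 here.)

>>> tokenizer.convert_tokens_to_ids('|')
157
>>> tokenizer.convert_tokens_to_ids('[UNK]')
157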

tshmak commented 3 weeks ago

But how does it determine where to insert the '|'? The output here,

ɡ ei2 j a4 n [UNK] ɡ eɜ p anɡ4 j au5 t aaɜ n [UNK] z o2 j a1 t [UNK] h au2 h eiɜ

does not make sense. If the "words" here correspond to Chinese characters, then the word boundaries should be as follows:

ɡ ei2 | j a4 n | ɡ eɜ | p anɡ4 | j au5 | t aaɜ n | z o2 | j a1 t | h au2 | h eiɜ

So I suspect there must be a network of some kind (wrongly trained) that outputs these boundaries at the wrong locations, which appear somewhat random to me.

Actually, to give a bit more context, these phones are sub-units of jyutping, which is like pinyin for Mandarin Chinese, except for Cantonese. There's no reason to expect the transformers Wav2Vec2CTCTokenizer to have been pretrained on Cantonese jyutping, so it's not surprising it got it wrong.

I suspect, then, that Wav2Vec2CTCTokenizer was pretrained on English data, or perhaps even multilingual data. And because these Cantonese phones are not part of the training data, it parses them in an apparently random way. I ran some tests:

text = "應中共中央政治局委員兼外長王毅邀請,南韓外長趙兌烈將於下星期一至星期星期二訪華。" # Chinese characters 
print(''.join(tokenizer.tokenize(text)))
text = "This is a good day. Let's go picnic!" # English
print(''.join(tokenizer.tokenize(text)))

The tokenizer correctly inserts the word boundaries for English:

"This|is|a|good|day.|Let's|go|picnic!"

And for Chinese, it correctly leaves out the word boundaries:

'應中共中央政治局委員兼外長王毅邀請,南韓外長趙兌烈將於下星期一至星期星期二訪華。'

So I think the conclusion of the matter is that Wav2Vec2CTCTokenizer is not designed for arbitrary sequences/languages (e.g. jyutping phones).
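
(For what it's worth, if syllable-level word boundaries are actually wanted, one workaround is to mark them in the text yourself rather than expecting the tokenizer to infer them. A sketch under the assumption that '|' has been added to test.json as in the earlier snippet, using the boundaries given above:)

# Sketch: insert the word delimiter between syllables explicitly,
# assuming '|' was added to test.json so it no longer maps to [UNK]
syllables = ['ɡ ei2', 'j a4 n', 'ɡ eɜ', 'p anɡ4', 'j au5',
             't aaɜ n', 'z o2', 'j a1 t', 'h au2', 'h eiɜ']
text = ' | '.join(syllables)
print(tokenizer.tokenize(text))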