huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Wav2Vec2CTCTokenizer adds random unknown tokens to encoded input #30561

Open tshmak opened 2 months ago

tshmak commented 2 months ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

Example script:

import json
from transformers import Wav2Vec2CTCTokenizer

phones = 'ɔː,(en),oɜ,inɡɜ,o4,b,f,enɡ1,oi2,aa7,eɪ,eː,au7,aaiɜ,onɡ4,oe6,uiɜ,ɒ,iə,c,aa2,oenɡ1,ei7,oenɡ6,au1,ŋ5,iu5,aɪə,ou4,d,ai7,k,i2,eoi5,aai2,j,oenɡɜ,u1,ŋ4,i,m,oi6,unɡɜ,ou2,au2,p,yu1,a,yu4,onɡ1,ɛ,e5,əʊ,ou6,yu5,aɜ,oi1,onɡ5,ai5,aau5,inɡ5,ai1,eɜ,ei5,uɜ,o2,i5,nɡ6,enɡ4,ɐ,l,o1,iu4,enɡ6,ou5,onɡ7,anɡ1,tʃ,aau2,eo6,aa6,iː,enɡ7,oenɡ5,ŋ,aau1,u5,eo5,yu7,oi7,aaɜ,oiɜ,yu2,aa5,ɑː,oe1,n,eoi2,ui2,oenɡ2,inɡ1,anɡ4,t,au4,ei4,u2,aanɡ2,ui4,dʒ,[PAD],a1,e,oenɡ7,aau4,onɡɜ,eoi6,unɡ5,ɹ,e6,yu6,ɪ,ʃ,ei2,aauɜ,enɡɜ,unɡ1,aɪ,i6,eiɜ,aanɡ1,inɡ6,iu1,o5,ui1,inɡ2,unɡ4,eoi4,eo4,uː,ei1,oenɡ4,aa4,aanɡ7,a2,e4,enɡ2,a5,auɜ,iɜ,əl,ai6,iu2,a4,e2,ouɜ,eoi1,anɡ2,[UNK],h,onɡ6,aau6,nɡ5,nɡ4,enɡ5,oeɜ,inɡ4,a6,eoiɜ,e1,ʊ,i1,o7,z,au6,ai4,anɡ6,aai1,oi5,aʊ,v,iu6,unɡ7,au5,eoɜ,aanɡ6,ou1,aanɡ5,(zhy),anɡɜ,oi4,onɡ2,a7,w,ui5,ui6,oe5,unɡ6,aanɡ4,ɔɪ,inɡ7,ɡ,s,o6,aa1,u6,aai4,ʌ,ou7,yuɜ,ɜː,ei6,aiɜ,ə,anɡ7,ai2,u4,iu7,iuɜ,eo1,aai6,eo2,i4,i7,aai5,unɡ2'.split(',')

# Write the phone-to-id mapping to a vocab file
phones_dict = {x: i for i, x in enumerate(phones)}
with open('test.json', 'w') as f:
    json.dump(phones_dict, f, indent=4, ensure_ascii=False)

tokenizer = Wav2Vec2CTCTokenizer('test.json', unk_token='[UNK]', pad_token='[PAD]')

# Space-separated jyutping phones, with no word boundaries marked
text = 'ɡ ei2 j a4 n ɡ eɜ p anɡ4 j au5 t aaɜ n z o2 j a1 t h au2 h eiɜ'
print(tokenizer(text))
print(tokenizer.decode(tokenizer(text)['input_ids'], spaces_between_special_tokens=True))

Output:

{'input_ids': [200, 122, 35, 152, 96, 157, 200, 62, 45, 101, 35, 182, 102, 90, 96, 157, 172, 65, 35, 110, 102, 157, 158, 44, 158, 128], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
ɡ ei2 j a4 n [UNK] ɡ eɜ p anɡ4 j au5 t aaɜ n [UNK] z o2 j a1 t [UNK] h au2 h eiɜ
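
For reference, mapping the returned ids back to tokens shows where the unexpected entries sit (a quick check using the tokenizer's standard convert_ids_to_tokens method; enc here is just the encoding produced by the script above):

enc = tokenizer(text)
# Every position holding id 157 comes back as the unk token
print(tokenizer.convert_ids_to_tokens(enc['input_ids']))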

Expected behavior

The unknown token (id 157, i.e. [UNK]) should not be part of the encoded input.

ArthurZucker commented 2 months ago
>>> print( tokenizer.tokenize(text))
['ɡ', 'ei2', 'j', 'a4', 'n', '|', 'ɡ', 'eɜ', 'p', 'anɡ4', 'j', 'au5', 't', 'aaɜ', 'n', '|', 'z', 'o2', 'j', 'a1', 't', '|', 'h', 'au2', 'h', 'eiɜ']

The [UNK] comes from the '|', which does not seem to be part of your vocab:

>>> '|' in tokenizer.get_vocab()
False
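
(If the goal is only to avoid the [UNK], one option is to make the delimiter resolvable: either add '|' to the vocab file, or pass a word_delimiter_token that already exists in the vocab. word_delimiter_token is a regular constructor argument of Wav2Vec2CTCTokenizer and defaults to '|'. A minimal sketch of the first option, reusing phones_dict from the script above:)

# Sketch: include the default word delimiter '|' in the vocab so it never maps to [UNK]
phones_dict['|'] = len(phones_dict)
with open('test.json', 'w') as f:
    json.dump(phones_dict, f, indent=4, ensure_ascii=False)

tokenizer = Wav2Vec2CTCTokenizer('test.json', unk_token='[UNK]', pad_token='[PAD]',
                                 word_delimiter_token='|')
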
tshmak commented 2 months ago

But why does it insert '|' at somewhat random locations? As you can see, there are no word boundaries in my text.

ArthurZucker commented 1 month ago

Why would it insert it at a random location? The [UNK] is inserted in place of '|'. You may not have word boundaries, but the text is tokenized anyway.
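
(For reference, the delimiter in question is the tokenizer's word_delimiter_token, which defaults to '|' when none is passed; this can be checked directly on the tokenizer built in the original script:)

>>> tokenizer.word_delimiter_token
'|'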

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

tshmak commented 1 month ago

I guess I had a misconception about huggingface tokenizers. It seems in general, these tokenizers are themselves neural networks and their outputs may be somewhat unpredictable.

ArthurZucker commented 4 weeks ago

No no, they are not. You sometimes have dropout, but in this specific case I was pointing out that the unk token is not placed at random positions at all! It is placed in place of '|', which is not part of the vocab.
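
(A quick way to see this: with the vocab from the original script, which has no '|' entry, any token that is not in test.json resolves to the unk id, so '|' and '[UNK]' should map to the same id, 157 here.)

>>> tokenizer.convert_tokens_to_ids('|')
157
>>> tokenizer.convert_tokens_to_ids('[UNK]')
157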

tshmak commented 3 weeks ago

But how does it determine where to insert the '|'? The output here,

ɡ ei2 j a4 n [UNK] ɡ eɜ p anɡ4 j au5 t aaɜ n [UNK] z o2 j a1 t [UNK] h au2 h eiɜ

does not make sense. If the "words" here correspond to Chinese characters, then the word boundaries should be as follows:

ɡ ei2 | j a4 n | ɡ eɜ | p anɡ4 | j au5 | t aaɜ n | z o2 | j a1 t | h au2 | h eiɜ

So I suspect there must be a network of some kind (wrongly trained) that outputs these boundaries at the wrong locations, which appear somewhat random to me.

Actually, to give a bit more context, these phones are sub-units of jyutping, which is like pinyin for Mandarin Chinese, except for Cantonese. There's no reason to expect the transformers Wav2Vec2CTCTokenizer to have been pretrained on Cantonese jyutping, so it's not surprising it got it wrong.

I suspect, then, that Wav2Vec2CTCTokenizer was pretrained on English data, or perhaps even multilingual data. And because these Cantonese phones are not part of the training data, it parses them in an apparently random way. I ran some tests:

text = "應中共中央政治局委員兼外長王毅邀請,南韓外長趙兌烈將於下星期一至星期星期二訪華。" # Chinese characters 
print(''.join(tokenizer.tokenize(text)))
text = "This is a good day. Let's go picnic!" # English
print(''.join(tokenizer.tokenize(text)))

The tokenizer correctly inserts the word boundaries for English:

"This|is|a|good|day.|Let's|go|picnic!"

And for Chinese, it correctly leaves out the word boundaries:

'應中共中央政治局委員兼外長王毅邀請,南韓外長趙兌烈將於下星期一至星期星期二訪華。'

So I think the conclusion of the matter is that Wav2Vec2CTCTokenizer is not designed for arbitrary sequences/languages (e.g. jyutping phones).
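
(For what it's worth, if syllable-level word boundaries are actually wanted, one workaround is to mark them in the text yourself rather than expecting the tokenizer to infer them. A sketch under the assumption that '|' has been added to test.json as in the earlier snippet, using the boundaries given above:)

# Sketch: insert the word delimiter between syllables explicitly,
# assuming '|' was added to test.json so it no longer maps to [UNK]
syllables = ['ɡ ei2', 'j a4 n', 'ɡ eɜ', 'p anɡ4', 'j au5',
             't aaɜ n', 'z o2', 'j a1 t', 'h au2', 'h eiɜ']
text = ' | '.join(syllables)
print(tokenizer.tokenize(text))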