mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Mozilla Public License 2.0
25.29k stars 3.96k forks

KeyError on Unicode character #1099

Closed kjanko closed 6 years ago

kjanko commented 6 years ago

Traceback (most recent call last):
  File "/usr/local/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/local/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/kjanko/DeepSpeech/util/feeding.py", line 148, in _populate_batch_queue
    target = text_to_char_array(transcript, self._alphabet)
  File "/home/kjanko/DeepSpeech/util/text.py", line 40, in text_to_char_array
    return np.asarray([alphabet.label_from_string(c) for c in original])
  File "/home/kjanko/DeepSpeech/util/text.py", line 30, in label_from_string
    return self._str_to_label[string]
KeyError: u'\u0441'

Exception in thread Thread-8:
Traceback (most recent call last):
  File "/usr/local/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/local/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/kjanko/DeepSpeech/util/feeding.py", line 148, in _populate_batch_queue
    target = text_to_char_array(transcript, self._alphabet)
  File "/home/kjanko/DeepSpeech/util/text.py", line 40, in text_to_char_array
    return np.asarray([alphabet.label_from_string(c) for c in original])
  File "/home/kjanko/DeepSpeech/util/text.py", line 30, in label_from_string
    return self._str_to_label[string]
KeyError: u'\u0442'

I've modified the code to use a UTF-8 Macedonian alphabet. Did I do something wrong in the process? I'm receiving these exceptions when running DeepSpeech.
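For context, the failure above boils down to a plain dictionary lookup. Below is a simplified sketch of the alphabet lookup (hypothetical, modeled on util/text.py, not the exact upstream code): each transcript character must map to an integer label, so any character absent from alphabet.txt raises a KeyError.

```python
# Simplified sketch of the alphabet lookup (modeled on util/text.py):
# characters map to integer labels; looking up a character that is not
# in the alphabet raises KeyError, as in the tracebacks above.
class Alphabet:
    def __init__(self, chars):
        # chars: the characters listed in alphabet.txt, one per line
        self._str_to_label = {c: i for i, c in enumerate(chars)}

    def label_from_string(self, string):
        # KeyError here is exactly the failure shown above
        return self._str_to_label[string]

alphabet = Alphabet([' ', 'а', 'б', 'в'])   # tiny Cyrillic subset
print(alphabet.label_from_string('б'))      # -> 2
try:
    alphabet.label_from_string('с')         # U+0441, not in this subset
except KeyError as exc:
    print('missing from alphabet:', exc)
```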

lissyx commented 6 years ago

Can you share your modifications ?

kjanko commented 6 years ago

@lissyx Vocab.txt > https://pastebin.com/v1BESQH4 Alphabet.txt > https://pastebin.com/YZYSDYDB (The same sentences from the CSV are inside the Vocab.txt, trie and language model created on this data)

kjanko commented 6 years ago

Any news on this? I've triple-checked our setup, but I have no idea what's going on. Characters U+0441 and U+0442 are both valid and are present in the alphabet.

reuben commented 6 years ago

Can you post the output of the following command? Replace /path/to/alphabet.txt with the actual path to your alphabet.txt, and run it from the folder where DeepSpeech.py is located.

python -c "from util.text import Alphabet; a = Alphabet('/path/to/alphabet.txt'); print('\n'.join([u'\'{}\' ({}) -> {}'.format(s, ':'.join('{:02x}'.format(ord(c)) for c in s), a._str_to_label[s]) for s in a._str_to_label]))"
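The same diagnostic, unpacked for readability. Here `str_to_label` stands in for the Alphabet class's private `_str_to_label` dictionary used by the one-liner above; the entries shown are illustrative examples, not the real alphabet.

```python
# Print each alphabet entry as: 'char' (hex codepoints) -> label.
# str_to_label stands in for Alphabet._str_to_label; example entries only.
str_to_label = {' ': 0, 'а': 1, 'с': 22, 'т': 23}

for s in sorted(str_to_label, key=str_to_label.get):
    codepoints = ':'.join('{:02x}'.format(ord(c)) for c in s)
    print("'{}' ({}) -> {}".format(s, codepoints, str_to_label[s]))
# -> ' ' (20) -> 0, 'а' (430) -> 1, 'с' (441) -> 22, 'т' (442) -> 23
```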

kjanko commented 6 years ago

@reuben

'ш' (448) -> 31
'ж' (436) -> 8
'ѕ' (455) -> 10
'ф' (444) -> 26
'у' (443) -> 25
'њ' (45a) -> 18
'о' (43e) -> 19
'р' (440) -> 21
'љ' (459) -> 15
'г' (433) -> 4
'х' (445) -> 27
'к' (43a) -> 13
'и' (438) -> 11
'т' (442) -> 23
'ѓ' (453) -> 6
'а' (430) -> 1
'ц' (446) -> 28
'ј' (458) -> 12
'з' (437) -> 9
'ќ' (45c) -> 24
'м' (43c) -> 16
'с' (441) -> 22
'л' (43b) -> 14
'ч' (447) -> 29
'б' (431) -> 2
' ' (20) -> 0
'н' (43d) -> 17
'д' (434) -> 5
'в' (432) -> 3
'п' (43f) -> 20
'е' (435) -> 7
''' (27) -> 32
'џ' (45f) -> 30

reuben commented 6 years ago

And you're definitely passing that same file in the --alphabet_config_path parameter? No typos? If so, I don't know what's going on. The data structure matches the file, and has the relevant characters…

kjanko commented 6 years ago

@reuben It's correct.

lissyx commented 6 years ago

@kjanko Do you think you could share a small subset example of your training data (.csv files, audio/text material, alphabet, ...) that exposes the issue? I could try to run that on my desktop at home and investigate more.

kjanko commented 6 years ago

@lissyx I'll share all the data, give me a moment.

kjanko commented 6 years ago

https://www.dropbox.com/s/snswh28djdmr9bz/jargon-data.zip?dl=0

kjanko commented 6 years ago

My data apparently contains characters missing from the alphabet, due to strange UTF-8 encoding issues.
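One way to catch this class of problem up front is to diff the transcript characters against the alphabet. A hedged sketch, assuming a typical DeepSpeech-style CSV with a `transcript` column and a one-character-per-line alphabet file (the paths and column name are assumptions):

```python
import csv

def missing_characters(csv_path, alphabet_path):
    """Return transcript characters that are not in the alphabet file."""
    with open(alphabet_path, encoding='utf-8') as f:
        # one character per line; lines starting with '#' are comments
        alphabet = {line.rstrip('\n') for line in f if not line.startswith('#')}
    seen = set()
    with open(csv_path, encoding='utf-8', newline='') as f:
        for row in csv.DictReader(f):
            seen.update(row['transcript'])
    return seen - alphabet

# Example: print(missing_characters('train.csv', 'data/alphabet.txt'))
```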

imranm12 commented 5 years ago

@lissyx Hi, I am training an Urdu dataset but getting this error. Please help me with this:

Traceback (most recent call last):
  File "/home/hashim/Desktop/Hashim/UrduCorpus/DeepSpeech/util/text.py", line 33, in label_from_string
    return self._str_to_label[string]
KeyError: ' '

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 941, in <module>
    tf.app.run(main)
  File "/home/hashim/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "DeepSpeech.py", line 893, in main
    train()
  File "DeepSpeech.py", line 388, in train
    hdf5_cache_path=FLAGS.train_cached_features_path)
  File "/home/hashim/Desktop/Hashim/UrduCorpus/DeepSpeech/util/preprocess.py", line 69, in preprocess
    out_data = pmap(step_fn, source_data.iterrows())
  File "/home/hashim/Desktop/Hashim/UrduCorpus/DeepSpeech/util/preprocess.py", line 13, in pmap
    results = pool.map(fun, iterable)
  File "/home/hashim/anaconda3/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/hashim/anaconda3/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/home/hashim/anaconda3/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/hashim/anaconda3/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/hashim/Desktop/Hashim/UrduCorpus/DeepSpeech/util/preprocess.py", line 23, in process_single_file
    transcript = text_to_char_array(file.transcript, alphabet)
  File "/home/hashim/Desktop/Hashim/UrduCorpus/DeepSpeech/util/text.py", line 64, in text_to_char_array
    return np.asarray([alphabet.label_from_string(c) for c in original])
  File "/home/hashim/Desktop/Hashim/UrduCorpus/DeepSpeech/util/text.py", line 64, in <listcomp>
    return np.asarray([alphabet.label_from_string(c) for c in original])
  File "/home/hashim/Desktop/Hashim/UrduCorpus/DeepSpeech/util/text.py", line 44, in label_from_string
    ).with_traceback(e.__traceback__)

  File "/home/hashim/Desktop/Hashim/UrduCorpus/DeepSpeech/util/text.py", line 33, in label_from_string
    return self._str_to_label[string]
KeyError: '\n ERROR: You have characters in your transcripts\n which do not occur in your data/alphabet.txt\n file. Please verify that your alphabet.txt\n contains all neccessary characters. Use\n util/check_characters.py to see what characters are in\n your train / dev / test transcripts.\n '

lissyx commented 5 years ago

@imranm12 so? what's not clear in the error message?

waqasr6 commented 5 years ago

@lissyx Hi dear, I am getting the same issue while training for my Urdu language model. I am sharing a small subset example of my training data so you can help me figure it out.

https://drive.google.com/open?id=1_875pTb1YVDcWVYfgkR8A9QnFk_qPv06

lissyx commented 5 years ago

Hi dear, I am getting the same issue while training for my Urdu language model. I am sharing a small subset example of my training data so you can help me figure it out.

I'm asking again, what is not clear in the error message ?

lissyx commented 5 years ago

Especially now that we have https://github.com/mozilla/DeepSpeech/blob/master/util/check_characters.py

lissyx commented 5 years ago

@waqasr6 Check util/check_characters.py ?

waqasr6 commented 5 years ago

@waqasr6 Check util/check_characters.py ?

Yes, I've checked it earlier. It's giving me some strange results:

Reading in the following transcript files:

['trans_urdu/cv-valid-train.csv']

The following unique characters were found in your transcripts:

['\x81', '\x82', '\x85', '\x84', '\x86', '\x88', '\x8c', '\x91', '\x92', ' ', '\xa2', '\xa7', '\xa9', '\xa8', '\xab', '\xaa', '\xad', '\xac', '\xaf', '\xae', '\xb1', '\xb0', '\xb3', '\xb2', '\xb5', '\xb4', '\xb7', '\xb6', '\xb9', '\xb8', '\xba', '\xbe', '\xd9', '\xd8', '\xdb', '\xda', 'a', 'c', 'i', 'n', 'p', 's', 'r', 't']

All these characters should be in your data/alphabet.txt file

waqasr6 commented 5 years ago

@lissyx
Can you check my CSV file to confirm it is properly UTF-8 encoded?

waqasr6 commented 5 years ago

And while training, it's giving me this error:

Preprocessing ['trans_urdu/cv-valid-train.csv']
Traceback (most recent call last):
  File "DeepSpeech.py", line 1959, in <module>
    tf.app.run(main)
  File "/home/waqas/.local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))
  File "DeepSpeech.py", line 1915, in main
    train()
  File "DeepSpeech.py", line 1464, in train
    hdf5_cache_path=FLAGS.train_cached_features_path)
  File "/home/waqas/DeepSpeech/DeepSpeech/util/preprocess.py", line 68, in preprocess
    out_data = pmap(step_fn, source_data.iterrows())
  File "/home/waqas/DeepSpeech/DeepSpeech/util/preprocess.py", line 13, in pmap
    results = pool.map(fun, iterable)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 288, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 670, in get
    raise self._value
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/waqas/DeepSpeech/DeepSpeech/util/preprocess.py", line 22, in process_single_file
    transcript = text_to_char_array(file.transcript, alphabet)
  File "/home/waqas/DeepSpeech/DeepSpeech/util/text.py", line 41, in text_to_char_array
    return np.asarray([alphabet.label_from_string(c) for c in original])
  File "/home/waqas/DeepSpeech/DeepSpeech/util/text.py", line 41, in <listcomp>
    return np.asarray([alphabet.label_from_string(c) for c in original])
  File "/home/waqas/DeepSpeech/DeepSpeech/util/text.py", line 31, in label_from_string
    return self._str_to_label[string]
KeyError: 'ک'

lissyx commented 5 years ago

@lissyx Can you check my CSV file to confirm it is properly UTF-8 encoded?

Why ? You already have all the elements to fix it ...

waqasr6 commented 5 years ago

@lissyx Can you check my CSV file to confirm it is properly UTF-8 encoded?

Why ? You already have all the elements to fix it ...

What am I supposed to do to fix it? The error shows it's reading the Urdu characters properly (KeyError: 'ک'), and all the CSV transcript characters are present in my alphabet.txt file, so I don't expect UTF-8 is the problem here.

lissyx commented 5 years ago

@lissyx Can you check my CSV file to confirm it is properly UTF-8 encoded?

Why ? You already have all the elements to fix it ...

What am I supposed to do to fix it? The error shows it's reading the Urdu characters properly (KeyError: 'ک'), and all the CSV transcript characters are present in my alphabet.txt file, so I don't expect UTF-8 is the problem here.

Did you read the error and the output correctly ? It states that you have missing characters in your alphabet. And util/check_characters.py just gave you the list ... Add this list to your alphabet, not much to do.

There's no UTF-8 issue here.

waqasr6 commented 5 years ago

@lissyx I've added the list I got from util/check_characters.py to my alphabet, but I'm still getting the same output.

reuben commented 5 years ago

Looks like util/check_characters.py is broken for multi-byte characters. For example, 'ک' is \xDA\xA9 in UTF-8, but the output is suggesting you add \xDA and \xA9 individually, which is incorrect.

reuben commented 5 years ago

Looks like it's a Python 2 specific problem, Python 2 strings are treated as byte sequences, while Python 3 strings are treated as Unicode codepoint sequences. (Are there multi-codepoint graphemes in any languages?)

Try running util/check_characters.py with Python 3 on the same data.
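The difference is easy to see from Python 3 itself, by comparing iteration over the raw UTF-8 bytes (which is effectively what Python 2 did) with iteration over the decoded string:

```python
# 'ک' (U+06A9) is two bytes in UTF-8. Python 2 iterated strings byte by
# byte, splitting it into \xda and \xa9; Python 3 iterates codepoints.
text = 'ک'
utf8 = text.encode('utf-8')

print([hex(b) for b in utf8])   # ['0xda', '0xa9'] - two byte "characters"
print(list(text))               # ['ک']            - one codepoint
```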

waqasr6 commented 5 years ago

Looks like it's a Python 2 specific problem, Python 2 strings are treated as byte sequences, while Python 3 strings are treated as Unicode codepoint sequences. (Are there multi-codepoint graphemes in any languages?)

Try running util/check_characters.py with Python 3 on the same data.

@reuben Yes, it was an issue with Python 2. I tried with Python 3 and below are the results, which look OK, but I'm still getting the error while training.

waqas@SID-245:~/DeepSpeech/DeepSpeech$ python3 util/check_characters.py trans_urdu/cv-valid-train.csv

Reading in the following transcript files:

['trans_urdu/cv-valid-train.csv']

The following unique characters were found in your transcripts:

['ق', 'r', 'ح', 'ے', 'ں', 'ک', 'گ', 'ص', 'غ', 's', 'ا', 'پ', 'ہ', 'ذ', 'i', 'ل', 'ھ', 'م', 'ز', 'ر', 'n', 'c', 'چ', 'ڑ', 'ع', 'خ', 'ٹ', 'ن', 'ث', 'ظ', 'س', 'p', 't', 'a', 'ڈ', 'ف', 'و', 'ض', 'ط', 'ش', 'ی', 'ج', 'د', 'ب', ' ', 'ت', 'آ']

All these characters should be in your data/alphabet.txt file

lissyx commented 5 years ago

I tried with Python 3 and below are the results, which look OK, but I'm still getting the error while training.

Have you added this list of characters to your alphabet? Have you checked all your CSVs with this tool ?

['ق', 'r', 'ح', 'ے', 'ں', 'ک', 'گ', 'ص', 'غ', 's', 'ا', 'پ', 'ہ', 'ذ', 'i', 'ل', 'ھ', 'م', 'ز', 'ر', 'n', 'c', 'چ', 'ڑ', 'ع', 'خ', 'ٹ', 'ن', 'ث', 'ظ', 'س', 'p', 't', 'a', 'ڈ', 'ف', 'و', 'ض', 'ط', 'ش', 'ی', 'ج', 'د', 'ب', ' ', 'ت', 'آ']

waqasr6 commented 5 years ago

I tried with Python 3 and below are the results, which look OK, but I'm still getting the error while training.

Have you added this list of characters to your alphabet? Have you checked all your CSVs with this tool ?

['ق', 'r', 'ح', 'ے', 'ں', 'ک', 'گ', 'ص', 'غ', 's', 'ا', 'پ', 'ہ', 'ذ', 'i', 'ل', 'ھ', 'م', 'ز', 'ر', 'n', 'c', 'چ', 'ڑ', 'ع', 'خ', 'ٹ', 'ن', 'ث', 'ظ', 'س', 'p', 't', 'a', 'ڈ', 'ف', 'و', 'ض', 'ط', 'ش', 'ی', 'ج', 'د', 'ب', ' ', 'ت', 'آ']

Yes, I've carefully added all the characters to my alphabet file and checked all my CSVs as well. Here is what my alphabet file looks like: https://pastebin.com/jbWnMfxC

lissyx commented 5 years ago

Okay, since there was some Python 2 in play that I missed (I should not reply when I need to sleep), can you make sure you run everything under Python 3 only?

Have you run the character-checking tool on all the CSV files you are passing? Are you sure you are not passing the wrong files? (It happened to me...)

waqasr6 commented 5 years ago

@lissyx Yes, now I am running everything under Python 3. Yes, I ran the character check many times on all CSV files, and I am 100% sure I am passing the correct files. I've also shared a small subset example of my training data so you can believe me :)

https://drive.google.com/open?id=1_875pTb1YVDcWVYfgkR8A9QnFk_qPv06

lissyx commented 5 years ago

@lissyx Yes, now I am running everything under Python 3. Yes, I ran the character check many times on all CSV files, and I am 100% sure I am passing the correct files. I've also shared a small subset example of my training data so you can believe me :)

https://drive.google.com/open?id=1_875pTb1YVDcWVYfgkR8A9QnFk_qPv06

I'm sorry, but I really don't have any time to investigate your dataset. Urdu is no more special than any other language; we can train with a lot of languages using non-ASCII chars, including Kabyle, for example, in my case :)

lissyx commented 5 years ago

Your alphabet file has two blank lines at the top, not sure this is expected nor what it might produce ?

waqasr6 commented 5 years ago

Your alphabet file has two blank lines at the top, not sure this is expected nor what it might produce ?

i've also checked by removing blank lines.

lissyx commented 5 years ago

Your alphabet file has two blank lines at the top, not sure this is expected nor what it might produce ?

i've also checked by removing blank lines.

Care to share full command lines of your:

  • util/check_characters.py use
  • training

waqasr6 commented 5 years ago

@lissyx Yes, now I am running everything under Python 3. Yes, I ran the character check many times on all CSV files, and I am 100% sure I am passing the correct files. I've also shared a small subset example of my training data so you can believe me :) https://drive.google.com/open?id=1_875pTb1YVDcWVYfgkR8A9QnFk_qPv06

I'm sorry, but I really don't have any time to investigate your dataset. Urdu is no more special than any other language; we can train with a lot of languages using non-ASCII chars, including Kabyle, for example, in my case :)

So did you need to change anything in the code, like util/text.py, for the Kabyle language?

lissyx commented 5 years ago

No, I just had to ensure I had all characters in the alphabet, and I ran into some tricky ones, but no code change at all.

waqasr6 commented 5 years ago

Your alphabet file has two blank lines at the top, not sure this is expected nor what it might produce ?

i've also checked by removing blank lines.

Care to share full command lines of your:

  • util/check_characters.py use
  • training

Sure, I'll share it once I reach home; I am currently traveling out of station.

waqasr6 commented 5 years ago

@lissyx @reuben Thanks for helping. I've resolved the error. The problem was in my alphabet file: I am using a Linux terminal on Windows and was editing the alphabet file in a Windows text editor, and when I checked it in the Linux terminal, the character arrangement showed up entirely different. It's very strange to me, but at least it's done. Now I am getting an error in decoding; hopefully I will resolve that too.

lissyx commented 5 years ago

Oh, that makes sense :-). Maybe we should check line endings? @reuben

reuben commented 5 years ago

Or at least mention line endings in the messages printed by check_characters.py? Sounds like a good idea.
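A minimal sketch of the kind of check being suggested (a hypothetical helper, not the actual check_characters.py code): flag a UTF-8 BOM, Windows CRLF line endings, and leading blank lines, all of which silently change the characters an alphabet file yields.

```python
def alphabet_file_warnings(path):
    """Return a list of formatting problems likely to corrupt an alphabet file."""
    with open(path, 'rb') as f:
        data = f.read()
    warnings = []
    if data.startswith(b'\xef\xbb\xbf'):
        warnings.append('file starts with a UTF-8 BOM')
    if b'\r\n' in data:
        warnings.append('file contains Windows (CRLF) line endings')
    if data.startswith(b'\n') or data.startswith(b'\r'):
        warnings.append('file starts with blank lines')
    return warnings
```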

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.