Closed kjanko closed 6 years ago
Can you share your modifications ?
@lissyx Vocab.txt > https://pastebin.com/v1BESQH4 Alphabet.txt > https://pastebin.com/YZYSDYDB (The same sentences from the CSV are inside the Vocab.txt, trie and language model created on this data)
Any news on this? I've triple-checked our setup, but I have no idea what's going on. Characters U+0441 and U+0442 are both valid characters and are present in the alphabet.
Can you post the output of the following command? Replace /path/to/alphabet.txt with the actual path of your alphabet.txt, and run it from the folder where DeepSpeech.py is located.
python -c "from util.text import Alphabet; a = Alphabet('/path/to/alphabet.txt'); print('\n'.join([u'\'{}\' ({}) -> {}'.format(s, ':'.join('{:02x}'.format(ord(c)) for c in s), a._str_to_label[s]) for s in a._str_to_label]))"
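For readability, the one-liner can be expanded into a short script. The `MinimalAlphabet` class below is a simplified stand-in for DeepSpeech's `util.text.Alphabet` (an assumption: the real class maps each non-comment line of alphabet.txt to a consecutive integer label), so this sketch runs without the repo on the path:

```python
import io
import os
import tempfile

class MinimalAlphabet:
    """Simplified stand-in for util.text.Alphabet (an assumption, not
    the real class): maps each non-comment line of alphabet.txt to a
    consecutive integer label."""
    def __init__(self, path):
        self._str_to_label = {}
        with io.open(path, 'r', encoding='utf-8') as fin:
            for line in fin:
                if line.startswith('#'):
                    continue
                # Strip only the line ending, so a space-only label survives.
                self._str_to_label[line.rstrip('\r\n')] = len(self._str_to_label)

def dump_alphabet(path):
    # Print each entry with its codepoints and label, like the one-liner.
    a = MinimalAlphabet(path)
    for s, label in a._str_to_label.items():
        codepoints = ':'.join('{:02x}'.format(ord(c)) for c in s)
        print(u"'{}' ({}) -> {}".format(s, codepoints, label))

# Demo on a tiny three-entry alphabet (space, 'а', 'б').
fd, path = tempfile.mkstemp(suffix='.txt')
with io.open(fd, 'w', encoding='utf-8') as f:
    f.write(u' \nа\nб\n')
dump_alphabet(path)
os.remove(path)
```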
@reuben
'ш' (448) -> 31
'ж' (436) -> 8
'ѕ' (455) -> 10
'ф' (444) -> 26
'у' (443) -> 25
'њ' (45a) -> 18
'о' (43e) -> 19
'р' (440) -> 21
'љ' (459) -> 15
'г' (433) -> 4
'х' (445) -> 27
'к' (43a) -> 13
'и' (438) -> 11
'т' (442) -> 23
'ѓ' (453) -> 6
'а' (430) -> 1
'ц' (446) -> 28
'ј' (458) -> 12
'з' (437) -> 9
'ќ' (45c) -> 24
'м' (43c) -> 16
'с' (441) -> 22
'л' (43b) -> 14
'ч' (447) -> 29
'б' (431) -> 2
' ' (20) -> 0
'н' (43d) -> 17
'д' (434) -> 5
'в' (432) -> 3
'п' (43f) -> 20
'е' (435) -> 7
''' (27) -> 32
'џ' (45f) -> 30
And you're definitely passing that same file in the --alphabet_config_path
parameter? No typos? If so, I don't know what's going on. The data structure matches the file, and has the relevant characters…
@reuben It's correct.
@kjanko Do you think you could share a small subset example of your training data (.csv
files, audio/text material, alphabet, ...) that exposes the issue? I could try to run that on my desktop at home and investigate more.
@lissyx I'll share all the data, give me a moment.
My data apparently contains missing characters from the alphabet due to strange utf-8 encoding issues.
@lissyx Hi, I am training an Urdu dataset but I'm getting this error. Can you help me with this?
Traceback (most recent call last):
File "/home/hashim/Desktop/Hashim/UrduCorpus/DeepSpeech/util/text.py", line 33, in label_from_string
return self._str_to_label[string]
KeyError: ' '
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "DeepSpeech.py", line 941, in <module>
tf.app.run(main)
File "/home/hashim/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "DeepSpeech.py", line 893, in main
train()
File "DeepSpeech.py", line 388, in train
hdf5_cache_path=FLAGS.train_cached_features_path)
File "/home/hashim/Desktop/Hashim/UrduCorpus/DeepSpeech/util/preprocess.py", line 69, in preprocess
out_data = pmap(step_fn, source_data.iterrows())
File "/home/hashim/Desktop/Hashim/UrduCorpus/DeepSpeech/util/preprocess.py", line 13, in pmap
results = pool.map(fun, iterable)
File "/home/hashim/anaconda3/lib/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/hashim/anaconda3/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
File "/home/hashim/anaconda3/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/home/hashim/anaconda3/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/home/hashim/Desktop/Hashim/UrduCorpus/DeepSpeech/util/preprocess.py", line 23, in process_single_file
transcript = text_to_char_array(file.transcript, alphabet)
File "/home/hashim/Desktop/Hashim/UrduCorpus/DeepSpeech/util/text.py", line 64, in text_to_char_array
return np.asarray([alphabet.label_from_string(c) for c in original])
File "/home/hashim/Desktop/Hashim/UrduCorpus/DeepSpeech/util/text.py", line 64, in <listcomp>
return np.asarray([alphabet.label_from_string(c) for c in original])
File "/home/hashim/Desktop/Hashim/UrduCorpus/DeepSpeech/util/text.py", line 44, in label_from_string
).with_traceback(e.__traceback__)
File "/home/hashim/Desktop/Hashim/UrduCorpus/DeepSpeech/util/text.py", line 33, in label_from_string
return self._str_to_label[string]
KeyError: 'ERROR: You have characters in your transcripts which do not occur in your data/alphabet.txt file. Please verify that your alphabet.txt contains all neccessary characters. Use util/check_characters.py to see what characters are in your train / dev / test transcripts.'
@imranm12 So, what's not clear in the error message?
@lissyx Hi dear, I am getting the same issue while training for my Urdu language model. I am sharing a small subset example of my training data so you can help me figure it out.
https://drive.google.com/open?id=1_875pTb1YVDcWVYfgkR8A9QnFk_qPv06
I'm asking again: what is not clear in the error message?
Especially now that we have https://github.com/mozilla/DeepSpeech/blob/master/util/check_characters.py
@waqasr6 Check util/check_characters.py?
Yes, I've checked it earlier. It's giving me some strange results:
['trans_urdu/cv-valid-train.csv']
['\x81', '\x82', '\x85', '\x84', '\x86', '\x88', '\x8c', '\x91', '\x92', ' ', '\xa2', '\xa7', '\xa9', '\xa8', '\xab', '\xaa', '\xad', '\xac', '\xaf', '\xae', '\xb1', '\xb0', '\xb3', '\xb2', '\xb5', '\xb4', '\xb7', '\xb6', '\xb9', '\xb8', '\xba', '\xbe', '\xd9', '\xd8', '\xdb', '\xda', 'a', 'c', 'i', 'n', 'p', 's', 'r', 't']
@lissyx Can you check whether my csv file is properly UTF-8 encoded? While training, it's giving me this error:
Preprocessing ['trans_urdu/cv-valid-train.csv']
Traceback (most recent call last):
File "DeepSpeech.py", line 1959, in <module>
tf.app.run(main)
File "/home/waqas/.local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 124, in run
_sys.exit(main(argv))
File "DeepSpeech.py", line 1915, in main
train()
File "DeepSpeech.py", line 1464, in train
hdf5_cache_path=FLAGS.train_cached_features_path)
File "/home/waqas/DeepSpeech/DeepSpeech/util/preprocess.py", line 68, in preprocess
out_data = pmap(step_fn, source_data.iterrows())
File "/home/waqas/DeepSpeech/DeepSpeech/util/preprocess.py", line 13, in pmap
results = pool.map(fun, iterable)
File "/usr/lib/python3.6/multiprocessing/pool.py", line 288, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.6/multiprocessing/pool.py", line 670, in get
raise self._value
File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/home/waqas/DeepSpeech/DeepSpeech/util/preprocess.py", line 22, in process_single_file
transcript = text_to_char_array(file.transcript, alphabet)
File "/home/waqas/DeepSpeech/DeepSpeech/util/text.py", line 41, in text_to_char_array
return np.asarray([alphabet.label_from_string(c) for c in original])
File "/home/waqas/DeepSpeech/DeepSpeech/util/text.py", line 41, in <listcomp>
return np.asarray([alphabet.label_from_string(c) for c in original])
File "/home/waqas/DeepSpeech/DeepSpeech/util/text.py", line 31, in label_from_string
return self._str_to_label[string]
KeyError: 'ک'
@lissyx Can you check whether my csv file is properly UTF-8 encoded?
Why? You already have all the elements to fix it...
What am I supposed to do to fix it? The error shows it's reading Urdu characters properly (KeyError: 'ک'), and all the csv transcript characters are available in my alphabet.txt file, so I don't expect UTF-8 is the problem here.
Did you read the error and the output correctly? It states that you have missing characters in your alphabet, and util/check_characters.py just gave you the list... Add this list to your alphabet, not much to do.
There's no UTF-8 issue here.
@lissyx I've added the list from util/check_characters.py to my alphabet, but I'm still getting the same error.
Looks like util/check_characters.py is broken for multi-byte characters. For example, 'ک' is \xDA\xA9 in UTF-8, but the output is suggesting you add \xDA and \xA9 individually, which is incorrect.
Looks like it's a Python 2 specific problem, Python 2 strings are treated as byte sequences, while Python 3 strings are treated as Unicode codepoint sequences. (Are there multi-codepoint graphemes in any languages?)
Try running util/check_characters.py with Python 3 on the same data.
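The difference is easy to demonstrate in Python 3, where iterating a str yields codepoints while iterating its UTF-8 encoding yields raw bytes (effectively what the Python 2 script was doing):

```python
# 'ک' (ARABIC LETTER KEHEH, U+06A9) is one codepoint but two UTF-8 bytes.
s = '\u06a9'

# Python 3 strings iterate by codepoint: one character.
assert list(s) == ['\u06a9']

# Its UTF-8 encoding iterates by byte: 0xDA, 0xA9. Treating these bytes
# as "characters" is exactly the bug in the Python 2 output above.
assert list(s.encode('utf-8')) == [0xDA, 0xA9]

# And yes, some graphemes span multiple codepoints: 'e' plus
# U+0301 COMBINING ACUTE ACCENT renders as a single 'é' glyph.
assert list('e\u0301') == ['e', '\u0301']
```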
@reuben Yes, it was an issue with Python 2. I tried with Python 3, and below are the results, which look OK, but I'm still getting the error while training.
waqas@SID-245:~/DeepSpeech/DeepSpeech$ python3 util/check_characters.py trans_urdu/cv-valid-train.csv
['trans_urdu/cv-valid-train.csv']
['ق', 'r', 'ح', 'ے', 'ں', 'ک', 'گ', 'ص', 'غ', 's', 'ا', 'پ', 'ہ', 'ذ', 'i', 'ل', 'ھ', 'م', 'ز', 'ر', 'n', 'c', 'چ', 'ڑ', 'ع', 'خ', 'ٹ', 'ن', 'ث', 'ظ', 'س', 'p', 't', 'a', 'ڈ', 'ف', 'و', 'ض', 'ط', 'ش', 'ی', 'ج', 'د', 'ب', ' ', 'ت', 'آ']
I tried with python 3 and below are results which are ok. but still getting error while training.
Have you added this list of characters to your alphabet? Have you checked all your CSVs with this tool?
Yes, I've carefully added all characters to my alphabet file, and checked all my csv's as well. Here is what my alphabet file looks like: https://pastebin.com/jbWnMfxC
Okay, since there was some Python 2 at play that I missed (I should not reply when I need to sleep), can you make sure you run everything under Python 3 only?
Have you run the character checking tool on all the CSV files you are passing? Are you sure you are not passing the wrong files? (It happened to me...)
@lissyx Yes, now I am running everything under Python 3. Yes, I've run the character checking many times on all CSV files, and I am 100% sure I am passing the correct files. I've also shared a small subset example of my training data so you can believe me :)
https://drive.google.com/open?id=1_875pTb1YVDcWVYfgkR8A9QnFk_qPv06
I'm sorry, but I really don't have any time to investigate your dataset, and Urdu is no more special than any other language; we can train with a lot of languages using non-ASCII characters, including Kabyle, for example, in my case :)
Your alphabet file has two blank lines at the top; I'm not sure whether this is expected, or what it might produce.
I've also checked by removing the blank lines.
Care to share the full command lines of your:
- util/check_characters.py use
- training
So do you need to change anything in the code, like in util/text.py, for the Kabyle language?
No, I just had to ensure I had all characters in the alphabet, and I ran into some tricky ones, but no code change at all.
Sure, I'll share it once I reach home, as I am currently travelling out of station.
@lissyx @reuben Thanks for helping. I've resolved that error. The problem was in my alphabet file: I am using a Linux terminal on Windows and was editing the alphabet file in a Windows text editor; when I checked it in the Linux terminal, it showed an entirely different character arrangement. It's very strange to me, but at least it's done. Now I am getting an error in decoding; hopefully I will resolve it as well.
Oh, that makes sense :-). Maybe we should check line endings? @reuben
Or at least mention line endings in the messages printed by check_characters.py? Sounds like a good idea.
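As a sketch of what such a check might look like (a hypothetical helper, not part of DeepSpeech), one could read the alphabet file in binary mode and warn about CRLF line endings and leading blank lines:

```python
import os
import tempfile

def check_alphabet_file(path):
    """Hypothetical helper: return warnings about line endings and
    blank lines in an alphabet file."""
    with open(path, 'rb') as f:
        raw = f.read()
    warnings = []
    if b'\r\n' in raw:
        warnings.append("Windows (CRLF) line endings detected; each '\\r' "
                        "would end up inside a label")
    # Strip only '\r', so a legitimate space-only label is not flagged.
    first_line = raw.split(b'\n', 1)[0].rstrip(b'\r')
    if first_line == b'':
        warnings.append('file starts with a blank line, which would be '
                        'read as an empty-string label')
    return warnings

# Demo: a CRLF-encoded alphabet file with a leading blank line.
fd, path = tempfile.mkstemp(suffix='.txt')
with open(fd, 'wb') as f:
    f.write(b'\r\na\r\nb\r\n')
for w in check_alphabet_file(path):
    print('WARNING:', w)
os.remove(path)
```

Note this only catches the two symptoms discussed in this thread; a Unix-formatted file with a leading space label passes cleanly.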
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Traceback (most recent call last):
File "/usr/local/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/local/anaconda2/lib/python2.7/threading.py", line 754, in run
self.target(*self.args, **self.__kwargs)
File "/home/kjanko/DeepSpeech/util/feeding.py", line 148, in _populate_batch_queue
target = text_to_char_array(transcript, self._alphabet)
File "/home/kjanko/DeepSpeech/util/text.py", line 40, in text_to_char_array
return np.asarray([alphabet.label_from_string(c) for c in original])
File "/home/kjanko/DeepSpeech/util/text.py", line 30, in label_from_string
return self._str_to_label[string]
KeyError: u'\u0441'
Exception in thread Thread-8:
Traceback (most recent call last):
File "/usr/local/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/local/anaconda2/lib/python2.7/threading.py", line 754, in run
self.target(*self.args, **self.__kwargs)
File "/home/kjanko/DeepSpeech/util/feeding.py", line 148, in _populate_batch_queue
target = text_to_char_array(transcript, self._alphabet)
File "/home/kjanko/DeepSpeech/util/text.py", line 40, in text_to_char_array
return np.asarray([alphabet.label_from_string(c) for c in original])
File "/home/kjanko/DeepSpeech/util/text.py", line 30, in label_from_string
return self._str_to_label[string]
KeyError: u'\u0442'
I've modified the code to use a UTF-8 Macedonian alphabet. Did I do something wrong in the process? I'm receiving these exceptions when running DeepSpeech.