Closed: Edresson closed this issue 4 years ago.
This is not a DeepSpeech bug, please use discourse for support-related questions.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Hi,
I am trying to train the DeepSpeech model for Brazilian Portuguese. Few datasets are available for Brazilian Portuguese (here is a work that used 14 hours of speech). In 2017 @reuben ran some experiments with the DeepSpeech model on the LapsBM dataset (a small Portuguese dataset), apparently without success. Could we get an update on this, @reuben?
I was able to obtain a 109-hour Brazilian Portuguese dataset, and I am trying to train DeepSpeech on it. The dataset consists of spontaneous speech collected from sociolinguistic interviews, and it was fully transcribed by hand.
To create the LM and trie I followed the documentation's recommendations. I created words.arpa with the following command (RawText.txt contains all the transcripts, with the wav file paths removed):
./lmplz --text ../../datasets/ASR-Portuguese-Corpus-V1/RawText.txt --arpa /tmp/words.arpa --order 5 --temp_prefix /tmp/
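For reference, a RawText.txt like this can be produced directly from the training CSV. The sketch below assumes the standard DeepSpeech CSV columns (wav_filename, wav_filesize, transcript) and uses a toy in-memory CSV in place of the real file:

```python
import csv
import io

def extract_transcripts(rows):
    """Keep only the transcript column; wav paths and sizes are dropped."""
    return [row["transcript"].strip() for row in csv.DictReader(rows)]

# Toy stand-in for metadata_train.csv (standard DeepSpeech column layout).
sample = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "clips/a.wav,32000,bom dia\n"
    "clips/b.wav,48000,tudo bem\n"
)
lines = extract_transcripts(sample)
print("\n".join(lines))  # what would go into RawText.txt
```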
I generated lm.binary:
kenlm/build/bin/build_binary -a 255 -q 8 trie lm.arpa lm.binary
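One cheap sanity check on the language model is to inspect the ARPA header before binarizing: a model built with --order 5 should list n-gram counts for orders 1 through 5, and the unigram count should roughly match the corpus vocabulary size. A minimal header parser (the sample ARPA text below is made up for illustration):

```python
def arpa_ngram_counts(arpa_lines):
    """Parse the ARPA \\data\\ header into {order: n-gram count}."""
    counts = {}
    in_data = False
    for line in arpa_lines:
        line = line.strip()
        if line == "\\data\\":
            in_data = True
            continue
        if in_data:
            if line.startswith("ngram "):
                order, count = line[len("ngram "):].split("=")
                counts[int(order)] = int(count)
            elif line:  # first non-empty, non-"ngram" line ends the header
                break
    return counts

sample = """\\data\\
ngram 1=4
ngram 2=7
ngram 3=9

\\1-grams:
""".splitlines()
print(arpa_ngram_counts(sample))  # {1: 4, 2: 7, 3: 9}
```

For the binary file itself, loading it with the kenlm Python bindings and scoring a few in-domain sentences against scrambled ones is another quick check: the in-domain sentences should get noticeably higher log scores.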
I installed the native client:
python util/taskcluster.py --arch gpu --target native_client --branch v0.6.0
I created the file alphabet.txt with the following content:

```
# Each line in this file represents the Unicode codepoint (UTF-8 encoded)
# associated with a numeric label.
# A line that starts with # is a comment. You can escape it with \# if you wish
# to use '#' as a label.
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
ç
ã
à
á
â
ê
é
í
ó
ô
õ
ú
û
```
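One thing worth verifying is that every character appearing in the transcripts is covered by alphabet.txt (note that DeepSpeech alphabets normally also list the space character as their first label). A minimal coverage check, with a made-up offending transcript:

```python
def uncovered_chars(transcripts, alphabet):
    """Characters used in the transcripts but missing from the alphabet.
    Training data containing characters outside alphabet.txt is a common
    source of broken models, so this should return an empty list."""
    allowed = set(alphabet) | {" "}  # the space label is assumed here
    return sorted(set("".join(transcripts)) - allowed)

alphabet = "abcdefghijklmnopqrstuvwxyzçãàáâêéíóôõúû"
print(uncovered_chars(["bom dia", "ação nº 3"], alphabet))  # ['3', 'º']
```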
Then I generated the trie:
DeepSpeech/native_client/generate_trie ../datasets/ASR-Portuguese-Corpus-V1/alphabet.txt lm.binary trie
Then I trained the model with the following command:

python DeepSpeech.py \
  --train_files ../../datasets/ASR-Portuguese-Corpus-V1/metadata_train.csv \
  --checkpoint_dir ../deepspeech_v6-0-0/checkpoints/ \
  --test_files ../../datasets/ASR-Portuguese-Corpus-V1/metadata_test_200.csv \
  --alphabet_config_path ../../datasets/ASR-Portuguese-Corpus-V1/alphabet.txt \
  --lm_binary_path ../../datasets/deepspeech-data/lm.binary \
  --lm_trie_path ../../datasets/deepspeech-data/trie \
  --train_batch_size 2 \
  --test_batch_size 2 \
  --dev_batch_size 2 \
  --export_batch_size 2 \
  --epochs 200 \
  --early_stop False
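As a cross-check on the 109 hours, the wav_filesize column can be used to estimate the total audio duration. This sketch assumes 16 kHz, 16-bit mono WAVs (the format DeepSpeech expects) and ignores the small WAV header:

```python
import csv
import io

BYTES_PER_SECOND = 16000 * 2  # 16 kHz, 16-bit samples, mono

def total_hours(rows):
    """Estimate audio hours from the wav_filesize column."""
    seconds = sum(int(r["wav_filesize"]) / BYTES_PER_SECOND
                  for r in csv.DictReader(rows))
    return seconds / 3600

# Toy stand-in for metadata_train.csv.
sample = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "clips/a.wav,320000,bom dia\n"   # 10 s of audio
    "clips/b.wav,640000,tudo bem\n"  # 20 s of audio
)
print(round(total_hours(sample) * 3600))  # prints 30 (seconds in this toy sample)
```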
Previously I trained the model with early stopping (specifying dev_files), but it stopped training after 4 epochs, so I removed early stopping. The 4-epoch and 50-epoch models produce the same results. I ran the test using the following command:
The result was:
The model very often transcribes just the letter "e"; this letter is very frequent in the dataset.
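A model that collapses to the most frequent label is a classic symptom of under-training or a data/alphabet mismatch, so it is worth confirming how dominant "e" actually is in the transcripts. A quick frequency count (toy corpus below):

```python
from collections import Counter

def letter_frequencies(transcripts):
    """Relative frequency of each non-space character in the corpus."""
    counts = Counter(c for t in transcripts for c in t if c != " ")
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

freqs = letter_frequencies(["este teste", "ele es"])
print(max(freqs, key=freqs.get))  # prints: e
```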
Am I doing something wrong?
How can I check whether my lm.binary and trie are correct?
Does anyone have any suggestions?
Best Regards,