uhh-lt / kaldi-tuda-de

Scripts for training general-purpose large vocabulary German acoustic models for ASR with Kaldi.
Apache License 2.0
172 stars, 36 forks

run.sh failing: text contains 103359 lines with non-printable characters #45

Closed prvit closed 4 years ago

prvit commented 4 years ago

I am trying to run the ./run.sh script to build a model. I haven't changed any of the scripts or data. At stage 8, while computing MFCC features, the following command is called (line 399):

steps/make_mfcc.sh --cmd utils/run.pl --nj 28 data/swc_train exp/make_mfcc/swc_train mfcc

Inside steps/make_mfcc.sh, line 76 calls utils/validate_data_dir.sh, which exits with code 1 at line 131 because it found 103359 lines with non-printable characters:

utils/validate_data_dir.sh: text contains 103359 lines with non-printable characters

How do I fix or avoid these non-printable characters so that I can continue building the model? There is a boolean variable, non_print=false, that is false by default, but I don't know what changing it to true would affect.

prvit commented 4 years ago

@bmilde @milde

Hello, I figured out that this error is related to the German (DE) locale. I installed the German language packages with commands like these:

sudo apt-get install $(check-language-support)
sudo update-locale LANG=de_DE.UTF-8 

I also changed variables in the ~/profile file and tried everything I could find on the internet about German language support on Ubuntu.

Now when I run the following command on one of the files (data/dev/text, data/test/text, data/swc_train/text)

grep -c '[^[:print:][:space:]]' data/swc_train/text

the result is 0.

If I search for a plain word instead of a regex, I get the correct result.

So the grep command doesn't find any non-printable characters when I run it manually, but running the ./run.sh script still fails with the same error:

utils/validate_data_dir.sh: text contains 103352 lines with non-printable characters

The count (103352) differs from the one in my initial post because I changed some ö and ß characters by hand to investigate the problem.

Could you, please, tell me how to configure my environment correctly as you do, or do you have any suggestions about the issue?
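One way to check whether the locale explains the mismatch is to run the same grep under different LC_ALL values. This is a minimal sketch, assuming GNU grep on a glibc system. The temporary file and its three sample lines are made up for illustration, and the idea that the script runs grep under a plain C locale is an assumption, not something shown in the output above.

```shell
#!/usr/bin/env bash
# Hypothetical sample file: two lines with UTF-8 umlauts, one pure ASCII line.
# \303\266 is "ö" and \303\237 is "ß" encoded as UTF-8 bytes.
tmp=$(mktemp)
printf 'utt1 sch\303\266n\nutt2 stra\303\237e\nutt3 hallo\n' > "$tmp"

# In the plain C locale, bytes above 0x7F are not printable,
# so both umlaut lines are counted.
# grep -c exits with status 1 when the count is 0, hence the "|| true".
count_c=$(LC_ALL=C grep -c '[^[:print:][:space:]]' "$tmp" || true)
echo "LC_ALL=C: $count_c"

# Under a UTF-8 locale the same bytes decode to printable letters.
# Only run this part if the C.UTF-8 locale is installed.
if locale -a 2>/dev/null | grep -qiEx 'C\.utf-?8'; then
  count_utf8=$(LC_ALL=C.UTF-8 grep -c '[^[:print:][:space:]]' "$tmp" || true)
  echo "LC_ALL=C.UTF-8: $count_utf8"
fi
rm -f "$tmp"
```

If the two counts differ, the lines are most likely being flagged because of the locale in effect when the check runs, not because of the data itself.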

prvit commented 4 years ago

Furthermore, I created an empty .sh file containing only the following:

n_non_print=$(grep -c '[^[:print:][:space:]]' data/test/text) && \
echo "$0: text contains $n_non_print lines with non-printable characters" &&\
exit 1;

Running it does not finish with exit code 1. So it looks like I've installed the German locale correctly, but the ./run.sh script still exits with code 1. Any ideas how to fix this?
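Note that a snippet of this shape only reaches `exit 1` when grep finds at least one matching line: `grep -c` prints 0 and returns exit status 1 when nothing matches, so the `&&` chain stops there. A small sketch of that behavior follows; the file and its contents are illustrative only.

```shell
#!/usr/bin/env bash
# Hypothetical one-line transcript with ASCII text only.
tmp=$(mktemp)
printf 'utt1 hallo welt\n' > "$tmp"

# No line matches, so grep -c prints 0 and returns exit status 1.
n=$(LC_ALL=C grep -c '[^[:print:][:space:]]' "$tmp") && status=0 || status=$?
echo "count=$n status=$status"

# Because of the non-zero status, the branch that would report the
# error and exit is never taken.
if n=$(LC_ALL=C grep -c '[^[:print:][:space:]]' "$tmp"); then
  echo "would report $n lines and exit 1 here"
else
  echo "chain stopped after grep; nothing reported"
fi
rm -f "$tmp"
```

So a clean run of the snippet shows that grep found nothing in the shell where it was started, which may use a different locale than the one active inside ./run.sh.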

milde commented 4 years ago

I am running the scripts successfully on a server that has this LANG conf:

declare -x LANG="C.UTF-8"
declare -x LANGUAGE="en_US:en"
declare -x LC_ADDRESS="C.UTF-8"
declare -x LC_IDENTIFICATION="C.UTF-8"
declare -x LC_MEASUREMENT="C.UTF-8"
declare -x LC_MONETARY="C.UTF-8"
declare -x LC_NAME="C.UTF-8"
declare -x LC_NUMERIC="C.UTF-8"
declare -x LC_PAPER="C.UTF-8"
declare -x LC_TELEPHONE="C.UTF-8"
declare -x LC_TIME="C.UTF-8"

so I don't think you necessarily need the de locales (it doesn't hurt to install them, though), as long as you have a UTF-8 locale. The python3 script needs this; otherwise it can't properly read the UTF-8 texts. Note that you shouldn't run ./run.sh after you've sourced cmd.sh or path.sh manually, because this sets LANG="C" (without UTF-8 support; Kaldi needs this). Hope this helps!
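To see which character encoding the current shell would pass on to ./run.sh, `locale charmap` can help. This is a minimal sketch, assuming a glibc-based system; the function name `is_utf8_locale` is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the effective locale uses UTF-8.
is_utf8_locale() {
  [ "$(locale charmap 2>/dev/null)" = "UTF-8" ]
}

if is_utf8_locale; then
  echo "UTF-8 locale active"
else
  echo "no UTF-8 locale active; LC_ALL=${LC_ALL:-unset} LANG=${LANG:-unset}"
fi

# LC_ALL overrides LANG and every other LC_* variable, so after
# "export LC_ALL=C" the check fails even if LANG is still a UTF-8 locale.
# The subshell keeps the override from leaking into the current shell.
if (export LC_ALL=C; is_utf8_locale); then
  c_is_utf8=yes
else
  c_is_utf8=no
fi
echo "UTF-8 under LC_ALL=C: $c_is_utf8"
```

Running the first check in a fresh terminal and again after sourcing path.sh should show whether the shell has been switched away from UTF-8.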

prvit commented 4 years ago

Closing this. Thank you @milde, your answer helped me a lot. There were a few places besides path.sh where the LC_ALL variable was set to C. I changed them to C.UTF-8 and got past that step.
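For anyone hitting the same error, a recursive grep is one way to locate such overrides. This is a rough sketch: the directory tree below is a throwaway example with hypothetical file names, and in a real checkout the grep would be run from the recipe directory instead.

```shell
#!/usr/bin/env bash
# Throwaway demo tree standing in for a recipe checkout (hypothetical files).
demo=$(mktemp -d)
printf 'export LC_ALL=C\n' > "$demo/path.sh"
printf 'export LC_ALL=C.UTF-8\n' > "$demo/local_step.sh"

# List shell scripts that set LC_ALL to plain C, quoted or unquoted,
# while skipping values such as C.UTF-8.
hits=$(grep -rnE --include='*.sh' 'LC_ALL="?C"?([^.[:alnum:]]|$)' "$demo" || true)
echo "$hits"
rm -rf "$demo"
```

Each hit is printed as file:line:content, which makes it easy to review the overrides one by one before changing anything.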

Also, I have some general questions about a project, could you please take a look