Closed prvit closed 4 years ago
@bmilde @milde
Hello, as far as I can tell, this error is related to the German (DE) locale. I installed German language support using commands like these:
sudo apt-get install $(check-language-support)
sudo update-locale LANG=de_DE.UTF-8
I also changed the variables in the ~/profile file and did everything I was able to find on the internet about German language support for Ubuntu.
Now when I run the following command on one of the files (data/dev/text, data/test/text, data/swc_train/text)
grep -c '[^[:print:][:space:]]' data/swc_train/text
the result is 0
If I search for a plain word instead of a regex, I get the correct result.
So the grep command doesn't find any non-printable characters when I run it manually,
but running the ./run.sh script still fails with the same error: utils/validate_data_dir.sh: text contains 103352 lines with non-printable characters
The count (103352) differs from the one I posted in the initial issue because I manually changed some ö and ß characters to check the problem.
Could you please tell me how to configure my environment the way yours is set up, or do you have any other suggestions about the issue?
Furthermore, I created a new .sh file with the following contents:
n_non_print=$(grep -c '[^[:print:][:space:]]' data/test/text) && \
echo "$0: text contains $n_non_print lines with non-printable characters" &&\
exit 1;
Running it does not finish with exit code 1. So it looks like I've installed the German locale correctly, but the ./run.sh script is still exiting with code 1. Any ideas how to fix this?
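One detail worth keeping in mind when reading a snippet like the one above: `grep -c` prints the count, but it exits with status 1 when no line matches, so an `&&` chain stops before `echo` and `exit 1` in that case. The sketch below shows the count and the exit status side by side. The sample file and the `status` variable are made up for illustration and are not part of the recipe.

```shell
#!/usr/bin/env bash
# Hypothetical sample file containing only printable ASCII.
sample=$(mktemp)
printf 'utt1 plain ascii line\n' > "$sample"

# grep -c prints the number of matching lines, but its exit status is 1
# when that number is zero, so anything chained with && is skipped.
status=0
n_non_print=$(LC_ALL=C grep -c '[^[:print:][:space:]]' "$sample") || status=$?

echo "count=$n_non_print grep_exit=$status"
rm -f "$sample"
```

The printed message, rather than the exit code alone, is therefore the more reliable sign that the check actually found something.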
I am running the scripts successfully on a server that has this LANG conf:
declare -x LANG="C.UTF-8"
declare -x LANGUAGE="en_US:en"
declare -x LC_ADDRESS="C.UTF-8"
declare -x LC_IDENTIFICATION="C.UTF-8"
declare -x LC_MEASUREMENT="C.UTF-8"
declare -x LC_MONETARY="C.UTF-8"
declare -x LC_NAME="C.UTF-8"
declare -x LC_NUMERIC="C.UTF-8"
declare -x LC_PAPER="C.UTF-8"
declare -x LC_TELEPHONE="C.UTF-8"
declare -x LC_TIME="C.UTF-8"
So I don't think you necessarily need the de locales (it doesn't hurt to install them, though), as long as you have a UTF-8 locale. The python3 script needs a UTF-8 locale; otherwise it can't properly read the UTF-8 texts. Note that you shouldn't run ./run.sh after you've sourced cmd.sh or path.sh manually, because this sets LANG="C" (without UTF-8 support; Kaldi needs this). Hope this helps!
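A minimal sketch of why the same `grep` can give different counts: the `[:print:]` class is locale-dependent, so in the plain C locale the bytes of a character such as ö are not printable. The sample file and variable names below are hypothetical. The count for the inherited locale depends on your environment, so no particular value is claimed for it.

```shell
#!/usr/bin/env bash
# Hypothetical two-line sample; \303\266 is the UTF-8 byte sequence for "ö".
sample=$(mktemp)
printf 'utt1 sch\303\266n\nutt2 plain ascii\n' > "$sample"

# Forced C locale, as a script sees it after LANG/LC_ALL has been set to C.
# "|| true" is needed because grep exits with 1 when the count is zero.
count_c=$(LC_ALL=C grep -c '[^[:print:][:space:]]' "$sample" || true)
echo "LC_ALL=C: $count_c line(s) flagged"

# Whatever locale the current shell inherited (for example C.UTF-8).
count_env=$(grep -c '[^[:print:][:space:]]' "$sample" || true)
echo "inherited locale ($(locale charmap 2>/dev/null)): $count_env line(s) flagged"

rm -f "$sample"
```

If the two counts differ, the locale in effect when the check runs is the likely cause, not the text file itself.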
Closing this.
Thank you @milde, your answer helped me a lot. There were a few places where the LC_ALL variable was set to C, even besides the path.sh file. I changed them to C.UTF-8 and successfully passed that step.
Also, I have some general questions about the project. Could you please take a look?
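As a rough sketch of how such places could be located, a recursive `grep` over the shell scripts lists every line that sets LC_ALL or LANG to plain C. The temporary directory and its two files below are stand-ins for illustration only. In a real checkout you would point the search at the recipe directory instead.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a recipe directory.
workdir=$(mktemp -d)
printf 'export LC_ALL=C\n' > "$workdir/path.sh"
printf 'export LC_ALL=C.UTF-8\n' > "$workdir/cmd.sh"

# Match LC_ALL or LANG set to plain C (optionally quoted), but not C.UTF-8:
# after the C there must be whitespace, a semicolon, or the end of the line.
matches=$(grep -rnE --include='*.sh' \
  '(LC_ALL|LANG)="?C"?([[:space:];]|$)' "$workdir" || true)
echo "$matches"

rm -rf "$workdir"
```

Each reported line is only a candidate. Whether changing it is safe depends on what the surrounding script expects, since the reply above notes that Kaldi relies on the C locale.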
I am trying to run the ./run.sh script to build a model. I didn't make any changes to any of the scripts or data. At stage 8, while making the MFCC features, steps/make_mfcc.sh --cmd utils/run.pl --nj 28 data/swc_train exp/make_mfcc/swc_train mfcc is called (line 399). Then, inside steps/make_mfcc.sh, utils/validate_data_dir.sh is called on line 76. It exits with code 1 at line 131 because it found 103359 lines with non-printable characters: utils/validate_data_dir.sh: text contains 103359 lines with non-printable characters
How do I fix or avoid these non-printable characters so that I can continue building the model? There is a boolean variable that is false by default, non_print=false, but I don't know what it would affect if I changed it to true.