Shreeshrii closed this issue 5 years ago.
One problem with the generated $lang.plus.training_text is that it is biased in favor of words at the beginning of the wordlist. I will update those files later in a separate commit.
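As a stopgap for that positional bias, the wordlist could be shuffled before any fixed-size sample is taken, so the draw is uniform over the whole file. A small sketch with made-up file names (not the committed fix):

```shell
# Toy wordlist standing in for $lang.wordlist (illustrative only).
printf '%s\n' alpha beta gamma delta epsilon zeta eta theta > wordlist.txt

# shuf randomizes line order, so a head-of-file sample is no longer
# biased toward the top of the original list.
shuf wordlist.txt | head -n 4 > sample.txt

wc -l < sample.txt   # prints 4
```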
The other problem is the lack of punctuation in the training_texts, especially for languages which do not have an existing training_text in langdata.
Once @theraysmith / @jbreiden update langdata repo for 4.0.0, this won't matter. Meanwhile these files provide a sample for users.
I have regenerated the $lang.plus.training_text files.
Final version of the script used:
#!/bin/bash
fast_files=${fast_files}' '$(ls ./tessdata_fast/*.traineddata)
for fast_file in ${fast_files}; do
lang=$(basename "$fast_file" .traineddata)
echo -e "\n ********************** " $lang " ****"
combine_tessdata -u ${fast_file} ./tessdata_fast/$lang.
mkdir -p ./langdata_fast/$lang
dawg2wordlist ./tessdata_fast/$lang.lstm-unicharset ./tessdata_fast/$lang.lstm-word-dawg ./langdata_fast/$lang/$lang.wordlist
dawg2wordlist ./tessdata_fast/$lang.lstm-unicharset ./tessdata_fast/$lang.lstm-number-dawg ./langdata_fast/$lang/$lang.numbers
dawg2wordlist ./tessdata_fast/$lang.lstm-unicharset ./tessdata_fast/$lang.lstm-punc-dawg ./langdata_fast/$lang/$lang.punc
cp ./tessdata_fast/$lang.lstm-unicharset ./langdata_fast/$lang/$lang.unicharset
cp ./tessdata_fast/$lang.config ./langdata_fast/$lang/$lang.config
cp ./tessdata_fast/$lang.version ./langdata_fast/$lang/$lang.version
## shuf ./langdata_fast/$lang/$lang.wordlist > tmpwords.txt
awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01 || FNR==1) print > "tmpwords.txt"}' ./langdata_fast/$lang/$lang.wordlist
sed -f langdata.sed ./langdata_fast/$lang/$lang.unicharset > tmp0.txt
sort tmp0.txt > ./langdata_fast/$lang/$lang.unichars_sorted.txt
# grep -F matches fixed strings, so the characters must not be
# regex-escaped (escaped entries such as \. would never match);
# just drop entries containing '*'.
sed '/*/d' tmp0.txt > tmp.txt
while read -r target; do grep -F -m 20 -- "$target" tmpwords.txt; done < tmp.txt > tmp2.txt
shuf tmp2.txt > tmp3.txt
cat ./langdata/$lang/$lang.training_text tmp3.txt > tmp4.txt
shuf tmp4.txt > tmp5.txt
fmt -w 150 < tmp5.txt > ./langdata_fast/$lang/$lang.plus.training_text
rm tmp*.*
done
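The awk step in the script keeps each non-empty line with probability 0.01 (and always keeps the first line). The sampling can be sanity-checked on synthetic input like this; file names here are illustrative:

```shell
# Synthetic 10,000-word list standing in for $lang.wordlist.
seq -f 'word%.0f' 1 10000 > words_in.txt

# Same sampling idea as in the script: keep ~1% of non-empty lines,
# always keeping line 1; srand() seeds from the current time.
awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01 || FNR==1) print }' \
    words_in.txt > words_sample.txt

head -n 1 words_sample.txt   # always word1
wc -l < words_sample.txt     # roughly 100 lines
```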
The data from tessdata_fast has a lot of problems – at least for deu – like bugs in the wordlists and in the unicharset. What would we gain if we replaced the current status with the new one?
Those people who are interested in the extracted data can extract them on demand from tessdata_fast or tessdata_best.
I am still working on the langdata thing.
@stweil Thanks for bringing the problem with tessdata_fast to attention.
If the wordlists have bugs, should the corresponding dawgs in the traineddata files in tessdata_fast be replaced with corrected ones? Would that improve recognition?
Regarding the unicharset, the new additions via updates to desired_characters are not reflected in these traineddata files, which date from June 2017.
I used the following script to generate updated langdata from 4.00.00alpha upload of tessdata_fast files.
The dawg files are unpacked using the lstm-unicharset for the wordlist, punc and numbers.
Also copied are config, version and unicharset files and a sorted list of unichars.
I have made an attempt to create plus.training_text, which takes up to 20 words from a subset of the wordlist for each character in the unicharset and combines the result with the existing training_text from 3.04, if available.
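The per-character selection boils down to "take up to 20 words containing each unichar, then shuffle the pool". A self-contained toy version of that loop (all file names and words here are made up):

```shell
# Toy stand-ins for the extracted unichar list and sampled wordlist.
printf '%s\n' a b z > chars.txt
printf '%s\n' apple banana cab zebra buzz > words.txt

# For each character, grab up to 20 fixed-string matches (quoted, with
# -- so a character like '-' is not parsed as an option), then shuffle.
while read -r target; do
    grep -F -m 20 -- "$target" words.txt
done < chars.txt | shuf > pool.txt

sort -u pool.txt   # every word matches at least one character
```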