tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Do not merge: update based on tessdata_fast at 7274cfa #127

Closed. Shreeshrii closed this 5 years ago.

Shreeshrii commented 6 years ago

I used the following script to generate updated langdata from the 4.00.00alpha upload of the tessdata_fast files.

The dawg files for the wordlist, punc and numbers are unpacked back to plain text using the lstm-unicharset.

Also copied are the config, version and unicharset files; a sorted list of unichars is generated as well.

I have made an attempt to create a plus.training_text which takes up to 20 words from a subset of the wordlist for each character in the unicharset and combines them with the existing training_text from 3.04, where available.
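
The heart of that selection step, as a rough sketch (the complete script is in a comment below; the unichars file name here is illustrative):

    # For each unichar, take up to 20 words from the wordlist that contain it.
    while read -r char; do
        grep -F -m 20 "$char" $lang.wordlist
    done < $lang.unichars > $lang.plus.training_text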

Shreeshrii commented 6 years ago

One problem with the generated $lang.plus.training_text is that it is biased in favor of words at the beginning of the wordlist. I will update those files later in a separate commit.
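
Randomizing the wordlist before sampling would remove that bias; a minimal sketch of the idea:

    # Shuffle word order first, then sample, so early entries are not favored.
    shuf ./langdata_fast/$lang/$lang.wordlist > tmpwords.txt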

The other problem is the lack of punctuation in the training_texts, especially for languages which do not have an existing training_text in langdata.

Once @theraysmith / @jbreiden update langdata repo for 4.0.0, this won't matter. Meanwhile these files provide a sample for users.

Shreeshrii commented 6 years ago

I have regenerated the $lang.plus.training_text files.

Final version of the script used:

    #!/bin/bash
    # Build langdata_fast from the 4.00.00alpha tessdata_fast traineddata files.
    fast_files=${fast_files}' '$(ls ./tessdata_fast/*.traineddata)
    for fast_file in ${fast_files}; do
        lang=$(basename "${fast_file}" .traineddata)
        echo -e "\n ********************** ${lang} ****"
        # Unpack the traineddata file into its component files.
        combine_tessdata -u "${fast_file}" ./tessdata_fast/${lang}.
        mkdir -p ./langdata_fast/${lang}
        # Convert the LSTM dawgs back to plain-text lists using the lstm-unicharset.
        dawg2wordlist ./tessdata_fast/${lang}.lstm-unicharset ./tessdata_fast/${lang}.lstm-word-dawg ./langdata_fast/${lang}/${lang}.wordlist
        dawg2wordlist ./tessdata_fast/${lang}.lstm-unicharset ./tessdata_fast/${lang}.lstm-number-dawg ./langdata_fast/${lang}/${lang}.numbers
        dawg2wordlist ./tessdata_fast/${lang}.lstm-unicharset ./tessdata_fast/${lang}.lstm-punc-dawg ./langdata_fast/${lang}/${lang}.punc
        # Copy the unicharset, config and version files unchanged.
        cp ./tessdata_fast/${lang}.lstm-unicharset ./langdata_fast/${lang}/${lang}.unicharset
        cp ./tessdata_fast/${lang}.config ./langdata_fast/${lang}/${lang}.config
        cp ./tessdata_fast/${lang}.version ./langdata_fast/${lang}/${lang}.version
        ## Earlier approach, replaced by the random sample below:
        ## shuf ./langdata_fast/${lang}/${lang}.wordlist > tmpwords.txt
        # Take a ~1% random sample of the wordlist (always keeping the first line)
        # so the per-character word selection is not biased toward the top of the list.
        awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01 || FNR==1) print > "tmpwords.txt"}' ./langdata_fast/${lang}/${lang}.wordlist
        # Extract the unichars from the unicharset and keep a sorted copy.
        sed -f langdata.sed ./langdata_fast/${lang}/${lang}.unicharset > tmp0.txt
        sort tmp0.txt > ./langdata_fast/${lang}/${lang}.unichars_sorted.txt
        # Escape regex metacharacters and drop lines containing '*'.
        sed 's/[]\.$^]/\\&/g' tmp0.txt > tmp1.txt
        sed '/*/d' tmp1.txt > tmp.txt
        # For each unichar, pick up to 20 sampled words that contain it.
        while read -r target; do grep -F -m 20 "$target" tmpwords.txt; done < tmp.txt > tmp2.txt
        shuf tmp2.txt > tmp3.txt
        # Append to the existing 3.04 training_text (if any), shuffle, and wrap lines.
        cat ./langdata/${lang}/${lang}.training_text tmp3.txt > tmp4.txt
        shuf tmp4.txt > tmp5.txt
        fmt -w 150 < tmp5.txt > ./langdata_fast/${lang}/${lang}.plus.training_text
        rm tmp*.*
    done
stweil commented 6 years ago

The data from tessdata_fast has a lot of problems, at least for deu: there are bugs in the wordlists and in the unicharset. What would we gain by replacing the current state with the new one?

Anyone interested in the extracted data can extract it on demand from tessdata_fast or tessdata_best.
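
For reference, extracting the wordlist on demand from a single traineddata file takes only two commands (a sketch, using deu as an example):

    # Unpack the components, then convert the word dawg back to plain text.
    combine_tessdata -u deu.traineddata deu.
    dawg2wordlist deu.lstm-unicharset deu.lstm-word-dawg deu.wordlist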

jbreiden commented 6 years ago

I am still working on the langdata thing.

Shreeshrii commented 6 years ago

@stweil Thanks for bringing the problems with tessdata_fast to our attention.

If the wordlists have bugs, should the corresponding dawgs in the tessdata_fast traineddata files be replaced with corrected ones? Would that improve recognition?
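
If so, a corrected dawg could in principle be rebuilt and repacked along these lines (a sketch, assuming a fixed wordlist in a hypothetical deu.wordlist.fixed; combine_tessdata -o overwrites the named component inside the traineddata file):

    # Rebuild the word dawg from the corrected wordlist, then repack it.
    wordlist2dawg deu.wordlist.fixed deu.lstm-word-dawg deu.lstm-unicharset
    combine_tessdata -o deu.traineddata deu.lstm-word-dawg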

Regarding the unicharset: the new additions made via updates to desired_characters are not reflected in these traineddata files, which are from June 2017.