tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

user_words_suffix not working #1538

Closed devendrasr closed 6 years ago

devendrasr commented 6 years ago

We are trying to supply a user words file through the available control parameters, but the command fails with the error described below.


Environment

Current Behavior:

I am using the parameters below to supply the user words file:

tesseract --user_words_file /usr/share/tesseract-ocr/4.00/tessdata/eng.user-words    -psm=1 -l=eng source.ppm res11 txt 

I get this error:

read_params_file: parameter not found: P6

Is this supported in the Tesseract version above? The help output lists these parameters:

user_words_file     A filename of user-provided words.
user_words_suffix            A suffix of user-provided words located in tessdata.
user_patterns_file            A filename of user-provided patterns.
user_patterns_suffix        A suffix of user-provided patterns located in tessdata.

Please note that I have tried every option for supplying the file: user_words | user_words_file | user_words_suffix | user_patterns_file | user_patterns_suffix

Please suggest the right way to achieve this.

Thanks,

Shreeshrii commented 6 years ago

Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]

Try giving the parameters in the following order:

tesseract imagename outputbase [options...] [configfile...]

Use the latest beta version from GitHub rather than alpha.

Shreeshrii commented 6 years ago

Also see LSTM: User patterns do not work #403

devendrasr commented 6 years ago

@Shreeshrii Here is the output of the command below in my environment: tesseract --help | --help-extra | --version

root@9cadf37d2e9c:/work/tess-words-research# tesseract --help | --help-extra | --version
bash: --help-extra: command not found
bash: --version: command not found
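
A note on the output above: in the usage synopsis the `|` separates alternative flags, but typed into bash it becomes a pipe, so the shell tries to execute `--help-extra` and `--version` as programs. A minimal sketch of the intended invocations follows. The commands are only printed here (remove the `echo` to run them), and whether `--help-extra` and `--version` are accepted depends on the installed build.

```shell
# Each alternative from the synopsis is a separate command, not a pipeline stage.
cmds=$(for flag in --help --help-extra --version; do
  echo "tesseract $flag"   # printed only; drop "echo" to actually run it
done)
printf '%s\n' "$cmds"
```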

tesseract --list-langs

List of available languages (160):
por
ceb
chi_tra_vert
sun
Hangul
pan
hrv
srp
slv
HanT_vert
tha
yid
Bengali
hun
Tamil
ton
kaz
Lao
kat_old
fin
nep
fao
ita_old
mlt
enm
mar
khm
Kannada
hin
aze
Tibetan
cat
bod
hat
isl
bel
uzb
kir
Thaana
Myanmar
chr
Gujarati
bul
bre
kor_vert
pus
msa
Syriac
ell
Canadian_Aboriginal
asm
jpn_vert
kat
kan
dan
fil
Cherokee
spa
cym
cos
tel
pol
iku
Fraktur
Devanagari
kur_ara
frk
rus
Malayalam
yor
amh
tur
guj
Arabic
lav
sqi
gle
afr
osd
tat
jpn
Cyrillic
Japanese_vert
chi_sim
Japanese
mon
syr
glg
fra
ltz
Sinhala
nld
mya
hye
Thai
snd
ron
jav
ukr
ori
tgk
que
aze_cyrl
uig
spa_old
bos
ita
HanS
lat
chi_sim_vert
gla
san
Gurmukhi
Khmer
est
srp_latn
deu
nor
tir
chi_tra
Armenian
Georgian
epo
vie
Telugu
dzo
fas
tam
div
urd
eng
sin
Latin
lit
mkd
HanS_vert
uzb_cyrl
fry
ces
Hangul_vert
Ethiopic
heb
ara
ind
kor
mal
Hebrew
Vietnamese
mri
Oriya
swa
eus
lao
oci
ben
slk
frm
Greek
swe
HanT

@Shreeshrii I am able to get user-words working with the Tesseract version below:

tesseract 4.00.00dev-672-g7e4f5fa
 leptonica-1.74.4
  libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8

 Found AVX2
 Found AVX
 Found SSE

Can you please point me to instructions for installing this version? It would be a great help!

Please note that the Tesseract version I am having the problem with is:

tesseract -v
tesseract 4.00.00alpha
 leptonica-1.74.4
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.1) : libpng 1.6.34 : libtiff 4.0.8 : zlib 1.2.11 : libwebp 0.6.0 : libopenjp2 2.2.0

 Found AVX
 Found SSE

Shreeshrii commented 6 years ago

--version etc. are part of the latest code on GitHub (tesseract-4.0.0-beta); they are not available in older versions.

tesseract 4.00.00dev-672-g7e4f5fa

That is https://github.com/tesseract-ocr/tesseract/commit/7e4f5faa72244955f8dcb81b516ce11b7a12a959

amitdo commented 6 years ago

-psm=1 -l=eng

=>

--psm 1 -l eng
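
Applying this correction together with the synopsis order (imagename outputbase [options...] [configfile...]) to the command from the original report might look like the sketch below. The file names are taken from the report. Using the `--user-words` long option for the word list is an assumption here, and whether the word list is actually honoured depends on the build and engine, which is the open question in this thread. The command is only assembled and printed, not executed.

```shell
# File names from the original report.
image=source.ppm
outputbase=res11
user_words=/usr/share/tesseract-ocr/4.00/tessdata/eng.user-words

# Order: image, output base, then options, then config file names (here: txt).
# Options take a space-separated value ("--psm 1"), not "-psm=1".
cmd="tesseract $image $outputbase --psm 1 -l eng --user-words $user_words txt"
echo "$cmd"
```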

Shreeshrii commented 6 years ago

@devendrasr Please let us know the differences in user-words usage between

tesseract 4.00.00dev-672-g7e4f5fa

tesseract 4.00.00alpha

and

tesseract 4.0.0-beta-1 or current GitHub code

Is there any regression in how it works?

Shreeshrii commented 6 years ago

read_params_file: Can't open 6

Use --psm 6 instead of -psm 6

devendrasr commented 6 years ago

Thanks @Shreeshrii. After modifying the command to

tesseract source.ppm with-user-words  -l eng -psm 1 --user-words=eng.user-words

the command ran successfully, but the results were not as expected. Somehow the same commit set up in another Docker container does not produce the expected results.

--user-words works with Tesseract version 3 only.

Can you please clarify whether Tesseract 4, with or without LSTM, supports --user-words?

Apart from this, I have also tried the steps below (in a separate install) to inject the user words into the traineddata, but that doesn't seem to work either :(

Convert user-words to a dawg file:
wordlist2dawg eng.user-words eng.word-dawg eng.unicharset
Combine the word list into eng.traineddata:
combine_tessdata -o eng.traineddata eng.word-dawg

amitdo commented 6 years ago

You might want to try 4.0.0-beta from master.

devendrasr commented 6 years ago

It doesn't seem to work for me with the above branch. Somehow it works when we install it using a mix and match of the instructions at the links below - link1 and link2. It must be working because we are installing in DEBUG MODE with this commit: 7e4f5fa

Shreeshrii commented 6 years ago

OK, I think I have figured out why it is NOT working.

See https://github.com/tesseract-ocr/tesseract/blob/2645f72c4a9fecbb19dfe1dc04a94baf723c61b9/src/dict/dict.cpp#L214-295

and

https://github.com/tesseract-ocr/tesseract/blob/2645f72c4a9fecbb19dfe1dc04a94baf723c61b9/src/dict/dict.cpp#L297-316

void Dict::Load(const STRING &lang, TessdataManager *data_file) is for legacy tesseract and needs to use lang.unicharset for converting wordlist to dawg

void Dict::LoadLSTM(const STRING &lang, TessdataManager *data_file) is for LSTM model and needs to use lang.lstm-unicharset for converting wordlist to dawg

Lines https://github.com/tesseract-ocr/tesseract/blob/2645f72c4a9fecbb19dfe1dc04a94baf723c61b9/src/dict/dict.cpp#L249-287 apply to user-words and user-patterns, which should be loaded in both Dict::Load and Dict::LoadLSTM
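
To make the unicharset distinction concrete for anyone converting a word list to a dawg by hand, here is a minimal sketch that prints the wordlist2dawg command for each engine. The component names eng.word-dawg and eng.lstm-word-dawg, and the prior unpack step with combine_tessdata -u, are assumptions about a standard eng.traineddata rather than something stated in this thread. The helper name dawg_cmd is hypothetical.

```shell
# Print the wordlist2dawg command matching the engine ("legacy" or "lstm").
# Assumes the unicharsets were unpacked first, e.g.:
#   combine_tessdata -u eng.traineddata eng.
dawg_cmd() {
  case "$1" in
    # Legacy engine (Dict::Load) pairs the word list with eng.unicharset.
    legacy) echo "wordlist2dawg eng.user-words eng.word-dawg eng.unicharset" ;;
    # LSTM engine (Dict::LoadLSTM) pairs it with eng.lstm-unicharset.
    lstm)   echo "wordlist2dawg eng.user-words eng.lstm-word-dawg eng.lstm-unicharset" ;;
    *)      echo "unknown engine: $1" >&2; return 1 ;;
  esac
}

dawg_cmd legacy
dawg_cmd lstm
```

A dawg built against the wrong unicharset encodes the words with mismatched character ids, which is one plausible reason a hand-built word list has no visible effect.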

@stweil Is this right?

Shreeshrii commented 6 years ago

OK, I added the lines to Dict::LoadLSTM and it works with the following:

# tesseract test.png stdout --tessdata-dir ./tessdata --oem 0 -c page_separator=''
Dnline

# tesseract test.png stdout --tessdata-dir ./tessdata --oem 0 -c page_separator='' bazaar
Online

test

Dnline is corrected to Online

EDIT: While it works for this particular image, I haven't got it to work with others yet.

I now need to find an image that does not work with tessdata_best and tessdata_fast in order to test further.

amitdo commented 6 years ago

Nice find!

tesseract test.png stdout --tessdata-dir ./tessdata --oem 0 bazaar

--oem 0 is supposed to use Dict::Load().

Shreeshrii commented 6 years ago

--oem 0 is supposed to use Dict::Load().

You are right. Haven't got it to work with --oem 1 yet.

devendrasr commented 6 years ago

@Shreeshrii Thanks for the help! It seems to work for me using --oem 0 with the latest commit, but only a few words from my list are being picked up. Say I have 10 spelling mistakes in my text; only a few of them get corrected.

Is there any documentation or link describing the expected behaviour?

Thanks,

amitdo commented 6 years ago

The user_words file is just a hint given to the OCR engine.

codingforpleasure commented 6 years ago

I have been working on:

tesseract 4.0.0-beta.3-249-g607e
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 :

and I have searched the web for a few hours regarding user words. These are the steps I took:

Step 1: I added the eng.user-words file to <path-to-dir>/tessdata, with a single word on each line.

Step 2: I added <path-to-dir>/tessdata/configs/bazaar to disable the default dictionary and use user-words. File content:

load_system_dawg     F
load_freq_dawg       F
user_words_suffix    user-words

Step 3: I ran the command (I'm not sure it's the correct syntax; I think I tried them all):

tesseract example.png stdout  with-user-words 
                                       -l eng
                                       --oem 1
                                       --psm 6 
                                       --user-words <path-to-dir>tessdata/eng.user-words

But it fails, so something is apparently wrong. @amitdo, @Shreeshrii, any suggestions would be helpful. :+1:
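
For reference, the file layout from Steps 1 and 2 can be reproduced with the sketch below. It rests on several assumptions: a temporary directory stands in for <path-to-dir>, the two sample words are placeholders, and eng.traineddata would also have to be present in that tessdata directory. The last line prints one plausible invocation that passes the config by its name (bazaar) as the trailing configfile argument, following the usage synopsis quoted earlier in the thread. Whether the word list is honoured with --oem 1 is exactly what remains open here.

```shell
# Stand-in for <path-to-dir>/tessdata (assumption: any writable directory works).
tessdata=$(mktemp -d)/tessdata
mkdir -p "$tessdata/configs"

# Step 1: one word per line; with user_words_suffix=user-words and -l eng,
# the file is expected at <tessdata>/eng.user-words.
printf '%s\n' Online bazaar > "$tessdata/eng.user-words"   # placeholder words

# Step 2: config that disables the built-in dictionaries and enables user words.
cat > "$tessdata/configs/bazaar" <<'EOF'
load_system_dawg     F
load_freq_dawg       F
user_words_suffix    user-words
EOF

# Step 3 (printed only): options after the output base, config name last.
echo "tesseract example.png stdout --tessdata-dir $tessdata -l eng --oem 1 --psm 6 bazaar"
```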

Shreeshrii commented 6 years ago

https://github.com/tesseract-ocr/tesseract/issues/1538#issuecomment-387703855

zdenop commented 6 years ago

Closed as a duplicate of issue #403.

jtlz2 commented 5 years ago

--oem 0 is supposed to use Dict::Load().

You are right. Haven't got it to work with --oem 1 yet.

@Shreeshrii Is that still the case?