devendrasr closed 6 years ago
Usage:
tesseract --help | --help-extra | --version
tesseract --list-langs
tesseract imagename outputbase [options...] [configfile...]
Try giving the parameters in the following order:
tesseract imagename outputbase [options...] [configfile...]
Use the latest beta version from GitHub rather than alpha.
Also see LSTM: User patterns do not work #403
@Shreeshrii Here is the output of the command below in my environment: tesseract --help | --help-extra | --version
root@9cadf37d2e9c:/work/tess-words-research# tesseract --help | --help-extra | --version
bash: --help-extra: command not found
bash: --version: command not found
tesseract --list-langs
List of available languages (160):
por
ceb
chi_tra_vert
sun
Hangul
pan
hrv
srp
slv
HanT_vert
tha
yid
Bengali
hun
Tamil
ton
kaz
Lao
kat_old
fin
nep
fao
ita_old
mlt
enm
mar
khm
Kannada
hin
aze
Tibetan
cat
bod
hat
isl
bel
uzb
kir
Thaana
Myanmar
chr
Gujarati
bul
bre
kor_vert
pus
msa
Syriac
ell
Canadian_Aboriginal
asm
jpn_vert
kat
kan
dan
fil
Cherokee
spa
cym
cos
tel
pol
iku
Fraktur
Devanagari
kur_ara
frk
rus
Malayalam
yor
amh
tur
guj
Arabic
lav
sqi
gle
afr
osd
tat
jpn
Cyrillic
Japanese_vert
chi_sim
Japanese
mon
syr
glg
fra
ltz
Sinhala
nld
mya
hye
Thai
snd
ron
jav
ukr
ori
tgk
que
aze_cyrl
uig
spa_old
bos
ita
HanS
lat
chi_sim_vert
gla
san
Gurmukhi
Khmer
est
srp_latn
deu
nor
tir
chi_tra
Armenian
Georgian
epo
vie
Telugu
dzo
fas
tam
div
urd
eng
sin
Latin
lit
mkd
HanS_vert
uzb_cyrl
fry
ces
Hangul_vert
Ethiopic
heb
ara
ind
kor
mal
Hebrew
Vietnamese
mri
Oriya
swa
eus
lao
oci
ben
slk
frm
Greek
swe
HanT
@Shreeshrii I am able to get user-words working with the Tesseract version below:
tesseract 4.00.00dev-672-g7e4f5fa
leptonica-1.74.4
libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8
Found AVX2
Found AVX
Found SSE
Can you please point me to installation instructions for this version? It would be a great help!
Please note that the Tesseract version I am having the problem with is:
tesseract -v
tesseract 4.00.00alpha
leptonica-1.74.4
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.1) : libpng 1.6.34 : libtiff 4.0.8 : zlib 1.2.11 : libwebp 0.6.0 : libopenjp2 2.2.0
Found AVX
Found SSE
--version etc. are part of the latest code on GitHub (tesseract-4.0.0-beta); they are not available in older versions.
tesseract 4.00.00dev-672-g7e4f5fa
That is https://github.com/tesseract-ocr/tesseract/commit/7e4f5faa72244955f8dcb81b516ce11b7a12a959
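In the usage text, the | between --help, --help-extra and --version separates alternatives. Typed literally, bash treats it as a pipe and tries to run --help-extra and --version as commands, hence the "command not found" messages. A minimal sketch of running each flag on its own follows; the guard for a missing tesseract binary and the checked counter are additions for illustration, and older builds may reject some of these flags.

```shell
# Run each alternative from the usage line as a separate command.
# The "|" in "--help | --help-extra | --version" is notation, not a pipe.
checked=0
for flag in --help --help-extra --version; do
    if command -v tesseract >/dev/null 2>&1; then
        # Older builds may not recognise every flag, so report failures
        # instead of stopping.
        tesseract "$flag" || echo "tesseract $flag is not supported by this build"
    else
        # Guard added for illustration: tesseract is not installed here.
        echo "tesseract not found; would run: tesseract $flag"
    fi
    checked=$((checked + 1))
done
```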
-psm=1 -l=eng
=>
--psm 1 -l eng
@devendrasr Please let us know the differences in user-words usage between
tesseract 4.00.00dev-672-g7e4f5fa
tesseract 4.00.00alpha
and
tesseract 4.0.0-beta-1 or current GitHub code
Is there any regression in how it works?
read_params_file: Can't open 6
Use --psm 6 instead of -psm 6.
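The error message suggests that, because -psm was not recognised as an option, the 6 that followed was taken as a config file name (config files come last in the usage line). A small sketch of the corrected invocation; the image and output names are placeholders, and the command is only printed, not run.

```shell
# Placeholder file names for illustration.
image="source.ppm"
outputbase="with-user-words"

# Options use a double dash and a space-separated value: "--psm 6",
# not "-psm 6" or "-psm=6". Config file names, if any, go last.
cmd="tesseract $image $outputbase -l eng --psm 6"
echo "$cmd"
```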
Thanks @Shreeshrii. After modifying the command to
tesseract source.ppm with-user-words -l eng -psm 1 --user-words=eng.user-words
the command ran successfully, but the results were not as expected. Somehow, the same commit set up in another Docker container does not produce the expected results.
--user-words works with Tesseract version 3 only.
Can you please clarify whether Tesseract 4, with or without LSTM, supports --user-words?
Apart from this, I have also tried the steps below (in a separate install) to inject the user words into the traineddata, but that doesn't seem to work either :(
convert user-words to dawg file -
wordlist2dawg eng.user-words eng.word-dawg eng.unicharset
combine word list to eng.traineddata -
combine_tessdata -o eng.traineddata eng.word-dawg
You might want to try 4.0.0-beta from master.
It doesn't seem to work for me with the above branch. Somehow it works when we install it by mixing and matching the instructions from the links below: link1 and link2. It must be working because we are installing in DEBUG MODE with this commit: 7e4f5fa
OK, I think I have figured out why it is NOT working.
and
void Dict::Load(const STRING &lang, TessdataManager *data_file) is for legacy Tesseract and needs to use lang.unicharset for converting the wordlist to a dawg.
void Dict::LoadLSTM(const STRING &lang, TessdataManager *data_file) is for the LSTM model and needs to use lang.lstm-unicharset for converting the wordlist to a dawg.
Lines https://github.com/tesseract-ocr/tesseract/blob/2645f72c4a9fecbb19dfe1dc04a94baf723c61b9/src/dict/dict.cpp#L249-287 handle user-words and user-patterns, and they should be applied in both Dict::Load and Dict::LoadLSTM.
@stweil Is this right?
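To illustrate the pairing of word list and unicharset per engine, here is a sketch that picks the dawg and unicharset file names before calling wordlist2dawg. The helper name dawg_inputs is made up for illustration, and the lstm-word-dawg component name is not mentioned in this thread, so treat both as assumptions. The commands are printed rather than executed.

```shell
lang="eng"
wordlist="$lang.user-words"   # plain text, one word per line

# Hypothetical helper: print "<dawg file> <unicharset file>" for an engine.
dawg_inputs() {
    case "$1" in
        legacy) echo "$lang.word-dawg $lang.unicharset" ;;
        lstm)   echo "$lang.lstm-word-dawg $lang.lstm-unicharset" ;;
        *)      echo "unknown engine: $1" >&2; return 1 ;;
    esac
}

for engine in legacy lstm; do
    # Printed only; run the line by hand once the unicharset files have
    # been unpacked from the traineddata.
    echo "wordlist2dawg $wordlist $(dawg_inputs "$engine")"
done
```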
Ok, I added the lines to Dict::LoadLSTM and it works, with the following:
# tesseract test.png stdout --tessdata-dir ./tessdata --oem 0 -c page_separator=''
Dnline
# tesseract test.png stdout --tessdata-dir ./tessdata --oem 0 -c page_separator='' bazaar
Online
Dnline is corrected to Online.
EDIT: While it works for this particular image, haven't got it to work with others yet.
I now need to find an image that does not work with tessdata_best and tessdata_fast in order to test further.
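One way to test further images is to run each one with and without the config and compare the two outputs. This is a sketch under assumptions: the helper name ocr_changed is made up for illustration, the image list is a placeholder, and the tesseract options mirror the commands shown earlier in this comment.

```shell
# Hypothetical helper: succeed if the two OCR output files differ.
ocr_changed() {
    ! cmp -s "$1" "$2"
}

tmp="$(mktemp -d)"
for img in test.png; do            # placeholder image list
    if command -v tesseract >/dev/null 2>&1 && [ -f "$img" ]; then
        # Same image, once without and once with the bazaar config.
        if tesseract "$img" stdout --tessdata-dir ./tessdata --oem 0 -c page_separator='' > "$tmp/plain.txt" &&
           tesseract "$img" stdout --tessdata-dir ./tessdata --oem 0 -c page_separator='' bazaar > "$tmp/bazaar.txt"; then
            if ocr_changed "$tmp/plain.txt" "$tmp/bazaar.txt"; then
                echo "$img: output changed with the bazaar config"
            else
                echo "$img: no change"
            fi
        else
            echo "$img: tesseract run failed"
        fi
    else
        echo "$img: skipped (tesseract or image not available)"
    fi
done
```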
Nice find!
tesseract test.png stdout --tessdata-dir ./tessdata --oem 0 bazaar
--oem 0 is supposed to use Dict::Load().
You are right. Haven't got it to work with --oem 1 yet.
@Shreeshrii Thanks for the help! It seems to be working for me using --oem 0 with the latest commit, but only a few of the words in my list are detected. Say I have 10 spelling mistakes in my text; only a few of them get corrected.
Is there any document or link that describes the expected behaviour?
Thanks,
The user_words file is just a hint given to the OCR engine.
I have been working on:
tesseract 4.0.0-beta.3-249-g607e
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 :
and I have searched the web for a few hours regarding user words.
I have written below the steps I took:
Step 1: I added the eng.user-words file to <path-to-dir>/tessdata, with a single word on each line.
Step 2: I added <path-to-dir>/tessdata/configs/bazaar to disable the default dictionary and use user-words. File content:
load_system_dawg F
load_freq_dawg F
user_words_suffix user-words
Step 3: I ran the command (I'm not sure it's the correct syntax; I think I tried them all):
tesseract example.png stdout with-user-words -l eng --oem 1 --psm 6 --user-words <path-to-dir>tessdata/eng.user-words
But it seems to fail; something is apparently wrong. @amitdo, @Shreeshrii, do you have any suggestions? Anything would be helpful :+1:
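The three steps above can be sketched as a script that builds the directory layout and prints a command with the options placed before the config name, following the usage line (imagename outputbase [options...] [configfile...]). The scratch directory stands in for <path-to-dir>, the sample words are placeholders, and eng.traineddata would still have to be copied into the tessdata directory. Whether --oem 1 honours the word list is exactly what this thread leaves open, so the command is only printed.

```shell
# Hypothetical scratch location standing in for <path-to-dir>.
workdir="$(mktemp -d)"
mkdir -p "$workdir/tessdata/configs"

# Step 1: word list, one word per line (placeholder words).
printf '%s\n' Online bazaar > "$workdir/tessdata/eng.user-words"

# Step 2: config file named "bazaar" with the content from the comment.
cat > "$workdir/tessdata/configs/bazaar" <<'EOF'
load_system_dawg F
load_freq_dawg F
user_words_suffix user-words
EOF

# Step 3: options first, config name last. With user_words_suffix set,
# the engine is expected to look for <lang>.user-words in the tessdata
# directory, so no explicit --user-words path is given here.
cmd="tesseract example.png stdout --tessdata-dir $workdir/tessdata -l eng --oem 1 --psm 6 bazaar"
echo "$cmd"
```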
Closed as a duplicate of issue #403.
--oem 0 is supposed to use Dict::Load().
You are right. Haven't got it to work with --oem 1 yet.
@Shreeshrii Is that still the case?
We are trying to provide a user-words file via the available control parameters. Unfortunately, I am getting the error below -
Environment
Current Behavior:
Using below params to supply user words file -
I am getting error as -
Is this supported in the above Tesseract version? I can see that support is mentioned in the help.
Please note that I have tried all the possible options for supplying the file: user_words | user_words_file | user_words_suffix | user_patterns_file | user_patterns_suffix
Please suggest the right way to achieve this.
Thanks,