ghwn opened this issue 5 years ago (status: Open)
the characters recognized by lstmeval and tesseract are different
I can confirm this with a test for Devanagari:
Loaded 1/1 lines (1-1) of document data/deva-lstmf/162.deva1.Sanskrit_2003,.exp0.lstmf
Loaded 1/1 lines (1-1) of document data/deva-lstmf/2214.deva1.Aksharyogini.exp0.lstmf
Truth:गूहितुं चित्रांश कुक्कुटनाडीयन्त्र असम्भवस्तु सतोऽनुपपत्तेः ॐ ॥२.३.९॥
OCR :गृहितुं चित्रांश कुकुटनाडीयन्त्र असम्भवस्तु सतोऽनुपपत्तेः ॐ ॥२.३.९॥
Truth:। ददाशदस्मै दारिद्र्याद्ध्रियम् यकृत्कोपः महाराज धिष्ण्येमे सर्वा
OCR :। ददाशदस्मै दारिद्र्याद्ध्रियम् यकृत्कोपः महाराज धिष्ण्येमे सर्वा
At iteration 0, stage 0, Eval Char error rate=2.8985507, Word error rate=14.285714
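For context, the character error rate lstmeval reports is essentially an edit distance over the truth/OCR pair, normalized by the truth length. A minimal sketch of that idea (a plain Levenshtein CER, not Tesseract's exact accounting):

```python
def levenshtein(a: str, b: str) -> int:
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(truth: str, ocr: str) -> float:
    # character error rate in percent: edits relative to truth length
    return 100.0 * levenshtein(truth, ocr) / len(truth)

print(cer("kitten", "sitting"))  # 3 edits over 6 chars -> 50.0
```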
ubuntu@tesseract-ocr:~/tesstrain$ tesseract data/deva-boxtiff/162.deva1.Sanskrit_2003,.exp0.tif - -l devaLayer2.131 --tessdata-dir ./
Page 1
गूहितुं चित्रांश कुकुटनाडीयन्त्र असम्भवस्तु सतोऽनुपपत्तेः ॐ ॥२.३.९॥
ubuntu@tesseract-ocr:~/tesstrain$ tesseract data/deva-boxtiff/2214.deva1.Aksharyogini.exp0.tif - -l devaLayer2.131 --tessdata-dir ./
Page 1
। ददाशदस्मै दारिद्र्याद्ध्रियम् यकृत्कोपः महाराज धिष्ण्येमे सर्वा
lstmeval OCR:
OCR :गृहितुं चित्रांश कुकुटनाडीयन्त्र असम्भवस्तु सतोऽनुपपत्तेः ॐ ॥२.३.९॥
tesseract OCR:
गूहितुं चित्रांश कुकुटनाडीयन्त्र असम्भवस्तु सतोऽनुपपत्तेः ॐ ॥२.३.९॥
The word that differs: गृहितुं vs गूहितुं
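The two words differ only in a single combining vowel sign, which is easy to miss visually. A quick way to pinpoint such differences (standard-library Python, independent of Tesseract):

```python
import unicodedata

ocr, truth = "गृहितुं", "गूहितुं"
for i, (a, b) in enumerate(zip(ocr, truth)):
    if a != b:
        print(f"index {i}: {unicodedata.name(a)} vs {unicodedata.name(b)}")
# index 1: DEVANAGARI VOWEL SIGN VOCALIC R vs DEVANAGARI VOWEL SIGN UU
```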
Thank you!
@stweil Should this issue be kept open? Should lstmeval and tesseract give same output?
Yes, I think this needs more examination.
@stweil I bet that this is a PSM issue again. @JihwanLim @Shreeshrii Could you rerun your tests on the line level setting PSM to 13?
@wrznr I tried it with --psm 13, but it still gives the same result. Only '^L' (FORM FEED) is shown in most cases.
@JihwanLim Could you provide us with a snippet from your test data? I'd like to reproduce your results...
@wrznr Sorry for making you wait. I attached my Makefile. You can see new targets such as prepare, images, labels, fonts, etc. in the file, but you probably don't need to care about them because they are just for generating new TIFF images.
Makefile.zip
@JihwanLim Many thanks. I'll have a look into your data set within the week and get back to you here.
@wrznr Thank you!
@JihwanLim What I meant with "snippet from your test data" was a small set of image-text pairs, in order to examine the deviant behavior of lstmeval.
(Even if I download hangul.ttf, I am missing the scripts get_data.py and, most importantly, hangul-image-generator.py to reproduce your training setup.)
Okay, I attached my project: tesstrain.tar.gz and fonts.tar.gz. Place malgun.ttf into tesstrain/fonts (the tesseract and lstmeval binaries are under tesstrain/usr), then run make unicharset. If you want only the images and gt.txt files, run make prepare. And let me know if something is missing at any time, thank you. tesstrain.tar.gz fonts.tar.gz
Have you solved this problem? I have met the same problem: I get quite different OCR output compared with using the tesseract command, with the same model. Could you please respond to me? Thanks
No. We have not. Thanks for pinging. I'll try to find some time for it next week!
Thanks. I've been struggling with this problem for such a long time, but cannot find the reason or a solution. Can you help me with this? Thank you so much
I think the problem arises when you stop lstmtraining and convert to integer traineddata: you may have used the traineddata generated by tesstrain.sh. You should use the traineddata from tessdata_best instead:
lstmtraining \
  --stop_training \
  --convert_to_int \
  --continue_from ../tesstutorial/impact_from_full/impact_checkpoint \
  --traineddata tessdata/best/eng.traineddata \
  --model_output ../tesstutorial/impact_from_full/eng_impact_int.traineddata
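For background: --convert_to_int rounds the float model weights to 8-bit integers, so small score shifts between the float and int models are expected. A toy illustration of that rounding error (a generic symmetric quantizer, not Tesseract's actual scheme):

```python
def quantize(weights, bits=8):
    # symmetric linear quantization to signed integers (toy version)
    scale = max(abs(w) for w in weights) / (2 ** (bits - 1) - 1)
    return [round(w / scale) for w in weights], scale

weights = [0.123, -0.456, 0.789]
q, scale = quantize(weights)
dequantized = [qi * scale for qi in q]
# each weight moves by at most half a quantization step
error = max(abs(w - d) for w, d in zip(weights, dequantized))
print(error, "<=", scale / 2)
```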
any updates on this? I have the same problem: lstmeval and tesseract with --psm 13 and the same traineddata do not match.
It seems I have the same issue training a couple of old Russian glyphs (which makes it plus-character training on top of Russian). Honestly, this issue takes all the fun out of Tesseract, since I suspect that with this bug fixed, recognition would be dramatically better. Can this issue be prioritized, please?
By the way, is there any embedded debug support in the tesseract app which can be activated?
Yes, you can: build with debugging enabled and then enable any of the debug parameters you can see in tesseract --print-parameters (the most important of which is debug_file, which must be non-empty to see any debug messages).
Why are the characters recognized by lstmeval and tesseract different? Is it normal?
Yes, differences are not unlikely, since the latter is much more complex: it contains image preprocessing, page segmentation, multi-model/multi-language support and legacy engine code, among other things. The basic function is the same, though:
lstmeval → lstmtester → lstmtrainer → LSTMTrainer::PrepareForwardBackward → LSTMRecognizer::RecognizeLine + LabelsFromOutputs
tesseractmain → TessBaseAPI::ProcessPage → TessBaseAPI::Recognize → Tesseract::recog_all_words → Tesseract::classify_word_and_language → Tesseract::classify_word_pass1 → Tesseract::LSTMRecognizeWord → LSTMRecognizer::RecognizeLine + LabelsFromOutputs
I concur with @wrznr in surmising this is a PSM issue, but the OP already refuted that by trying PSM 13 to no effect.
@bilal-rachik brings in model finalization (esp. float→int conversion) which could play a role, esp. since the differences IIUC are rather small. Can anyone confirm this by trying without --convert_to_int?
It could also be related to thresholding or image normalization...
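For reference, thresholding decides which pixels count as ink before the recognizer ever sees the line, so two binarization methods can hand the LSTM slightly different inputs. A toy global Otsu threshold illustrating the idea (pure Python; Tesseract's actual methods are selected via thresholding_method):

```python
def otsu_threshold(pixels):
    # brute-force Otsu: pick the threshold that maximizes
    # between-class variance of background vs. foreground
    n = len(pixels)
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        bg = [p for p in pixels if p < t]
        fg = [p for p in pixels if p >= t]
        if not bg or not fg:
            continue
        w_bg, w_fg = len(bg) / n, len(fg) / n
        m_bg, m_fg = sum(bg) / len(bg), sum(fg) / len(fg)
        var = w_bg * w_fg * (m_bg - m_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

pixels = [10, 12, 11, 200, 210, 205, 120]  # dark text, bright paper, one midtone
t = otsu_threshold(pixels)
binary = [int(p >= t) for p in pixels]     # the midtone pixel lands on one side
```

A different method (or different normalization) can flip pixels like the midtone one, which is enough to nudge the recognizer's output.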
@bilal-rachik @bertsky Is this really a tesstrain issue?
Is this really a tesstrain issue?
You are right, this should probably be transferred to the tesseract repo.
is there any update here? I'm having this issue where I'm using eng.traineddata and I'm getting accurate results for some test .png's using tesseract, but nonsense using lstmeval. This is messing up my training. I'm wondering if, like mentioned above, I have some configs set incorrectly
@jhartungBE all we have at this point are suspicions (what to look for). Have you tried …
- PSM=13 / --psm 13
- tessdata_best / without --convert_to_int
- --configfile <(echo thresholding_method 2) / -c thresholding_method=2
… yet?
Thanks for the quick response. Yes I have tried the first two options, but not sure what you mean on the latter two. Here's a simple example that explains my issue. I have these two example image/text pairs in test-ground-truth. I can generate the box/lstmf files using "make lists MODEL_NAME=test PSM=7". I can then run "lstmeval --model eng_tessdata_best/eng.traineddata --eval_listfile data/test/all-lstmf" and I get
Loaded 1/1 lines (1-1) of document data/test-ground-truth/Message_20210917_163.lstmf
Loaded 1/1 lines (1-1) of document data/test-ground-truth/Message_20211012_281.lstmf
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Truth:any 19.25 legs?
OCR :L
Truth:I got nothing better then 1
OCR :BE te e c
At iteration 0, stage 0, Eval Char error rate=94.444444, Word error rate=100
When I run in my tesseract repo, using, for example:
tesseract Message_20211012_281.png test.txt --psm 7
I get perfect match of I got nothing better then 1
@jhartungBE, like I said in my first comment, the Tesseract standalone CLI has much more than just the bare recognition of lstmeval – and that includes a check and compensation for inverse colours, like in your example.
So that's another issue (in fact, it's no issue IMO).
Okay, understood. Thank you. However, I'm failing to understand how I can train Tesseract if this is the case and LSTM training doesn't apply to my images the same way the tesseract engine will. Does this just mean I have to modify my images before passing them to LSTM training, so that they match what LSTMRecognizer will receive when I'm running tesseract?
Yes, that's what it means. Just install ImageMagick and run convert input.png -negate output.png
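If you'd rather do the inversion in your own preprocessing script than via ImageMagick, inverting an 8-bit grayscale image is just 255 minus each pixel value. A minimal pure-Python sketch (with Pillow you'd use PIL.ImageOps.invert instead):

```python
# invert an 8-bit grayscale image given as nested lists of pixel values
def negate(image):
    return [[255 - p for p in row] for row in image]

img = [[0, 128, 255],
       [30, 60, 90]]
print(negate(img))  # [[255, 127, 0], [225, 195, 165]]
```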
Great, thanks. Appreciate your help
My system info:
Hi.
I am a beginner and am trying to train some Korean character images for Korean recognition.
To understand how to train with the Tesseract 4.0 LSTM engine, I trained my data from scratch by following the lines of the Makefile in this tesstrain repo step by step, and most of the steps seemed to work fine until creating the traineddata.
These steps are what I did until now. I manually followed the steps instead of running make:
1. I made box files and the unicharset by following these lines.
2. I made lstmf files by following these lines.
3. I made two split file lists for training and evaluation by following these lines.
4. Before combining the lang model, I downloaded radical-stroke.txt by following this line, and 3 langdata files (kor.punc, kor.numbers, and kor.wordlist) from this link. I didn't download the kor.config file because it caused an error saying that chi_tra.traineddata is needed.
5. I combined the lang model by following these lines.
6. Then I started LSTM training by following these lines.
7. I tested them. The results are like:
There seems to be no problem with the results.
I know the WER is abnormally high, but I thought it doesn't matter, because I just wanted to check whether the characters recognized by usr/bin/lstmeval are equal to the characters recognized by usr/bin/tesseract for the same image.
I made the traineddata output file.
Then I used tesseract with kor.malgun.exp197.tif. The TIF file was recognized as '이' when I followed step 7 (testing with lstmeval), so I expected the same result.
lim@ubuntu:~/tools/tesstrain$ usr/bin/tesseract data/ground-truth/kor.malgun.exp197.tif stdout -l kor --psm 6 > result
But the real result was a total mess. As I was concerned, the recognized characters differed from each other.
Why are the characters recognized by lstmeval and tesseract different? Is it normal?
Thank you...