tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Tesseract prints characters differ from lstmeval #110

Open ghwn opened 5 years ago

ghwn commented 5 years ago

My system info:

Hi.

I am a beginner and am trying to train on some Korean character images for Korean recognition.

To understand how to train with Tesseract 4.0 LSTM, I trained on my data from scratch by following the lines of the Makefile in this tesstrain repository step by step, and most of the steps seemed to work fine until creating the traineddata.

These are the steps I have done so far. I followed them manually instead of running make:

  1. I made box files and a unicharset by following these lines.

  2. I made lstmf files by following these lines.

  3. I made two split file lists for training and evaluation by following these lines.

  4. Before combining the lang model, I downloaded radical-stroke.txt by following this line, and 3 langdata files (kor.punc, kor.numbers, and kor.wordlist) from this link.

    I didn't download the kor.config file because it causes an error saying that chi_tra.traineddata is needed.

  5. I combined the lang model by following these lines.

  6. Then I started LSTM training by following these lines.

  7. I tested them. The results are like:

    lim@ubuntu:~/tools/tesstrain$ usr/bin/lstmeval --traineddata data/kor/kor.traineddata --model data/kor/checkpoints/kor_checkpoint --eval_listfile data/kor/list.eval
    data/kor/checkpoints/kor_checkpoint is not a recognition model, trying training checkpoint...
    Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp249.lstmf
    Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp228.lstmf
    Truth:먹
    OCR  :이
    Truth:독
    OCR  :이
    Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp197.lstmf
    Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp41.lstmf
    Truth:파
    OCR  :이
    Truth:신
    OCR  :열
    ... (skip)
    At iteration 0, stage 0, Eval Char error rate=133.33333, Word error rate=96.875

There seems to be no problem with the results.

I know the WER is abnormally high, but I thought it didn't matter because I just wanted to check whether the characters recognized by usr/bin/lstmeval are the same as the characters recognized by usr/bin/tesseract for the same image.

  8. I made the traineddata output file.

    lim@ubuntu:~/tools/tesstrain$ usr/bin/lstmtraining --stop_training \
    --continue_from data/kor/checkpoints/kor_checkpoint \
    --traineddata data/kor/kor.traineddata \
    --model_output usr/share/tessdata/kor.traineddata

  9. Then I ran tesseract on kor.malgun.exp197.tif. That TIF file was recognized as '이' in step 7 (testing with lstmeval), so I expected the same result.

    lim@ubuntu:~/tools/tesstrain$ usr/bin/tesseract data/ground-truth/kor.malgun.exp197.tif stdout -l kor --psm 6 > result

But the actual result was a total mess. As I feared, the recognized characters differed from each other.

Why are the characters recognized by lstmeval and tesseract different? Is this normal?

Thank you...

Shreeshrii commented 5 years ago

the characters recognized by lstmeval and tesseract are different

I can confirm this with a test for Devanagari:

Loaded 1/1 lines (1-1) of document data/deva-lstmf/162.deva1.Sanskrit_2003,.exp0.lstmf
Loaded 1/1 lines (1-1) of document data/deva-lstmf/2214.deva1.Aksharyogini.exp0.lstmf
Truth:गूहितुं चित्रांश कुक्कुटनाडीयन्त्र असम्भवस्तु सतोऽनुपपत्तेः ॐ ॥२.३.९॥
OCR  :गृहितुं चित्रांश कुकुटनाडीयन्त्र असम्भवस्तु सतोऽनुपपत्तेः ॐ ॥२.३.९॥
Truth:। ददाशदस्मै दारिद्र्याद्ध्रियम् यकृत्कोपः महाराज धिष्ण्येमे सर्वा
OCR  :। ददाशदस्मै दारिद्र्याद्ध्रियम् यकृत्कोपः महाराज धिष्ण्येमे सर्वा
At iteration 0, stage 0, Eval Char error rate=2.8985507, Word error rate=14.285714

ubuntu@tesseract-ocr:~/tesstrain$ tesseract data/deva-boxtiff/162.deva1.Sanskrit_2003,.exp0.tif - -l devaLayer2.131 --tessdata-dir ./
Page 1
गूहितुं चित्रांश कुकुटनाडीयन्त्र असम्भवस्तु सतोऽनुपपत्तेः ॐ ॥२.३.९॥
ubuntu@tesseract-ocr:~/tesstrain$ tesseract data/deva-boxtiff/2214.deva1.Aksharyogini.exp0.tif - -l devaLayer2.131 --tessdata-dir ./
Page 1
। ददाशदस्मै दारिद्र्याद्ध्रियम् यकृत्कोपः महाराज धिष्ण्येमे सर्वा

lstmeval OCR:

OCR  :गृहितुं चित्रांश कुकुटनाडीयन्त्र असम्भवस्तु सतोऽनुपपत्तेः ॐ ॥२.३.९॥

tesseract OCR:

गूहितुं चित्रांश कुकुटनाडीयन्त्र असम्भवस्तु सतोऽनुपपत्तेः ॐ ॥२.३.९॥

Word which is different: गृहितुं vs गूहितुं
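
A minimal sketch of how such a word-level comparison could be automated, assuming the two output lines have been copied into shell variables. The helper name `diff_words` is made up for illustration; it only relies on the standard `tr`, `diff`, and `mktemp` tools.

```shell
# Sketch: print the words that differ between two OCR output lines.
# Each line is split on spaces so that diff compares word by word.
diff_words() {
    a=$(mktemp)
    b=$(mktemp)
    printf '%s\n' "$1" | tr ' ' '\n' > "$a"
    printf '%s\n' "$2" | tr ' ' '\n' > "$b"
    status=0
    diff "$a" "$b" || status=$?    # diff exits with 1 when the inputs differ
    rm -f "$a" "$b"
    return "$status"
}
```

Called as `diff_words "$lstmeval_line" "$tesseract_line"` (both variable names are placeholders), it prints only the differing words in diff's usual `<` / `>` notation.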

ghwn commented 5 years ago

Thank you!

Shreeshrii commented 5 years ago

@stweil Should this issue be kept open? Should lstmeval and tesseract give same output?

stweil commented 5 years ago

Yes, I think this needs more examination.

wrznr commented 5 years ago

@stweil I bet that this is a PSM issue again. @JihwanLim @Shreeshrii Could you rerun your tests on the line level setting PSM to 13?

ghwn commented 5 years ago

@wrznr I tried it with --psm 13, but it still gives the same result. In most cases only '^L' (FORM FEED) is shown.

wrznr commented 5 years ago

@JihwanLim Could you provide us with a snippet from your test data? I'd like to reproduce your results...

ghwn commented 5 years ago

@wrznr Sorry for making you wait. I attached my Makefile. You can see new targets such as prepare, images, labels, fonts, etc. in the file, but you probably don't need to care about them because they are just for generating new TIFF images. Makefile.zip

wrznr commented 5 years ago

@JihwanLim Many thanks. I'll have a look into your data set within the week and get back to you here.

ghwn commented 5 years ago

@wrznr Thank you!

wrznr commented 5 years ago

@JihwanLim What I meant by a snippet from your test data was a small set of image-text pairs, so that I can examine the deviant behavior of lstmeval.

(Even if I download hangul.ttf I am missing the scripts get_data.py and most importantly hangul-image-generator.py to reproduce your training setup.)

ghwn commented 5 years ago

Okay I attached my project.

  1. Extract tesstrain.tar.gz
  2. Required packages are here.
  3. Extract fonts.tar.gz and place malgun.ttf into tesstrain/fonts.
  4. Build your Tesseract into tesstrain/usr.
  5. Start from make unicharset. If you want only images and gt.txt files, enter make prepare.

And let me know if something is missing at any time, thank you. tesstrain.tar.gz fonts.tar.gz

MoleImg commented 4 years ago

Have you solved this problem? I have run into the same problem: with the same model, I get quite different OCR output from lstmeval than when I use the "tesseract" command. Could you please respond to me? Thanks

wrznr commented 4 years ago

No. We have not. Thanks for pinging. I try to find some time for it next week!

MoleImg commented 4 years ago

No. We have not. Thanks for pinging. I try to find some time for it next week!

Thanks. I've been struggling with this problem for a long time, but I cannot find the reason or a solution. Can you help me with this? Thank you so much

bilal-rachik commented 4 years ago

I think the problem occurs when you stop lstmtraining and convert to an integer traineddata. You may have used the traineddata generated by tesstrain.sh. It is better to use the "best" traineddata:

lstmtraining \
  --stop_training \
  --convert_to_int \
  --continue_from ../tesstutorial/impact_from_full/impact_checkpoint \
  --traineddata tessdata/best/eng.traineddata \
  --model_output ../tesstutorial/impact_from_full/eng_impact_int.traineddata

red-canoe commented 3 years ago

Any updates on this? I have the same problem: lstmeval and tesseract with --psm 13 and the same traineddata do not match.

dvrogozh commented 3 years ago

It seems I have the same issue when training a couple of old Russian glyphs (which makes it plus-char training starting from Russian). This issue really takes all the fun out of Tesseract, since I suspect that recognition would be dramatically better with this bug fixed. Can this issue be prioritized, please?

By the way, is there any embedded debug support in the tesseract app that can be activated?

bertsky commented 3 years ago

By the way, is there any embedded debug support in the tesseract app that can be activated?

yes, you can: build with debugging enabled and then enable any of the debug parameters you can see in tesseract --print-parameters (the most important of which is debug_file – must be non-empty to see any debug messages).
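
A minimal sketch of the above, assuming a Tesseract build with debugging enabled. The helper name `tess_debug`, the default log file name, and the `DRY_RUN` switch are made up for illustration; `--print-parameters`, `-c` and `debug_file` are the CLI features named above. Which debug parameter is useful depends on the build, so the parameter name is left to the caller, and the sketch assumes it accepts the value 1.

```shell
# List the debug-related parameters this build knows about:
#   tesseract --print-parameters | grep -i debug

# Sketch: run tesseract on one line image with a single debug parameter
# switched on and debug_file set to a non-empty path.
# Set DRY_RUN=1 to print the command instead of running it.
tess_debug() {
    image=$1
    param=$2                          # a name taken from --print-parameters
    log=${3:-tesseract-debug.log}     # hypothetical default log file name
    ${DRY_RUN:+echo} tesseract "$image" stdout --psm 13 \
        -c debug_file="$log" -c "$param=1"
}
```

For example, `tess_debug line.tif "$PARAM"` writes the debug messages for the chosen parameter to tesseract-debug.log.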

bertsky commented 3 years ago

Why are the characters recognized by lstmeval and tesseract different? Is this normal?

Yes, it's not unlikely, since the latter is much more complex – e.g. because it contains image preprocessing, page segmentation, multi-model/lang and legacy engine code. The basic function is the same though:

I concur with @wrznr in surmising this is a PSM issue, but the OP already refuted that by trying PSM 13 to no effect.

@bilal-rachik brings in model finalization (esp. float→int conversion) which could play a role, esp. since the differences IIUC are rather small. Can anyone confirm this by trying without --convert_to_int?
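
One way to run that comparison, as a sketch only: finalize the same checkpoint twice, once with and once without --convert_to_int, and then recognize the same line image with both models. The helper name `finalize_both`, the output file names, and the `DRY_RUN` switch are assumptions for illustration; the lstmtraining flags are the ones already used in this thread.

```shell
# Sketch: write a float and an integer model from one training checkpoint.
# Set DRY_RUN=1 to print the commands instead of running them.
finalize_both() {
    checkpoint=$1   # e.g. data/kor/checkpoints/kor_checkpoint
    start=$2        # the traineddata the training was started from
    outdir=$3       # directory that receives both models
    ${DRY_RUN:+echo} lstmtraining --stop_training \
        --continue_from "$checkpoint" --traineddata "$start" \
        --model_output "$outdir/model_float.traineddata"
    ${DRY_RUN:+echo} lstmtraining --stop_training --convert_to_int \
        --continue_from "$checkpoint" --traineddata "$start" \
        --model_output "$outdir/model_int.traineddata"
}
```

Running tesseract with `--tessdata-dir "$outdir"` and `-l model_float` or `-l model_int` on the same image would then show whether the integer conversion changes the output.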

It could also be related to thresholding or image normalization...

wrznr commented 3 years ago

@bilal-rachik @bertsky Is this really a tesstrain issue?

bertsky commented 3 years ago

Is this really a tesstrain issue?

You are right, this should probably be transferred to the tesseract repo.

jhartungBE commented 3 years ago

Is there any update here? I'm having this issue where I'm using eng.traineddata and I get accurate results for some test .png files using tesseract, but nonsense using lstmeval. This is messing up my training. I'm wondering if, as mentioned above, I have some configs set incorrectly.

bertsky commented 3 years ago

@jhartungBE all we have at this point are suspicions (what to look for). Have you tried …

… yet?

jhartungBE commented 3 years ago

Thanks for the quick response. Yes, I have tried the first two options, but I'm not sure what you mean by the latter two. Here's a simple example that shows my issue. I have these two example image/text pairs in test-ground-truth. I can generate the box/lstmf files using "make lists MODEL_NAME=test PSM=7". I can then run "lstmeval --model eng_tessdata_best/eng.traineddata --eval_listfile data/test/all-lstmf" and I get

Loaded 1/1 lines (1-1) of document data/test-ground-truth/Message_20210917_163.lstmf
Loaded 1/1 lines (1-1) of document data/test-ground-truth/Message_20211012_281.lstmf
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Truth:any 19.25 legs?
OCR  :L
Truth:I got nothing better then 1
OCR  :BE te e  c
At iteration 0, stage 0, Eval Char error rate=94.444444, Word error rate=100

When I run tesseract in my tesseract repo, for example with "tesseract Message_20211012_281.png test.txt --psm 7", I get a perfect match: "I got nothing better then 1".

Message_20211012_281.gt.txt Message_20211012_281 Message_20210917_163.gt.txt Message_20210917_163

bertsky commented 3 years ago

@jhartungBE, like I said in my first comment, the Tesseract standalone CLI has much more than just the bare recognition of lstmeval – and that includes a check and compensation for inverse colours, like in your example.

So that's another issue (in fact, it's no issue IMO).

jhartungBE commented 3 years ago

Okay understood. Thank you. However, I'm failing to understand how I can train tesseract if this is the case and lstm training doesn't really apply to my images the same way that the tesseract engine will? Does this just mean I have to modify my images to pass to lstm training so that they are received the same way the LSTMRecognizer will receive them when I'm running tesseract?

bertsky commented 3 years ago

Yes, that's what it means. Just install ImageMagick and do a convert input.png -negate output.png
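
For a whole ground-truth directory this could be scripted roughly as follows. This is a sketch that assumes ImageMagick's `convert` is on the PATH and that every image in the directory has light text on a dark background, as in the example above. The helper name `negate_dir` and the `DRY_RUN` switch are made up for illustration.

```shell
# Sketch: write negated copies of every PNG in a ground-truth directory
# and copy the matching .gt.txt transcriptions alongside them.
# Set DRY_RUN=1 to print the commands instead of running them.
negate_dir() {
    src=$1
    dst=$2
    mkdir -p "$dst"
    for img in "$src"/*.png; do
        [ -e "$img" ] || continue      # the glob matched nothing
        name=$(basename "$img")
        ${DRY_RUN:+echo} convert "$img" -negate "$dst/$name"
        gt="${img%.png}.gt.txt"
        if [ -f "$gt" ]; then
            ${DRY_RUN:+echo} cp "$gt" "$dst/"
        fi
    done
}
```

For example, `negate_dir data/test-ground-truth data/test-negated-ground-truth` (the second directory name is a placeholder) leaves the originals untouched, so training can be pointed at the negated copies.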

jhartungBE commented 3 years ago

Great, thanks. Appreciate your help