tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Training parts lists #131

Closed L1800Turbo closed 4 years ago

L1800Turbo commented 4 years ago

Hello, I hope this is the right place for my question.

I've got huge lists of part assignments that I plan to import into a database. These lists are on microfiche, so I had to scan them with a microfiche scanner. Unfortunately, even the best quality I can get is far from perfect, so I plan to train Tesseract to reduce the error rate.

Using tesstrain, the results already look quite good, but letters often get recognized as numbers, maybe because of the ratio of numbers to letters. To make sure I did it the right way, I'll list what I've done, in case I made mistakes.

1) "Clean" Part lists and convert into monochrome 000407

2) Cut the lists into one column each and add a border around each cut17

3) Produced training data by this script: https://github.com/tesseract-ocr/tesstrain/issues/7#issuecomment-419714852 790364130 2-J07 049 -> cut2-002 exp0

4) Correct the text files and create pairs with file.tif -> file.gt.txt

5) Start tesstrain with make training START_MODEL=eng MODEL_NAME=microfiche
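Step 4 can be sanity-checked before training starts. This is only a minimal sketch, assuming all pairs sit together in one directory (by default tesstrain looks in data/MODEL_NAME-ground-truth) and follow the file.tif -> file.gt.txt naming from step 4. The function name find_unpaired is made up for illustration.

```python
from pathlib import Path


def find_unpaired(ground_truth_dir):
    """Return (images without text, texts without image, empty texts) as sorted name lists."""
    root = Path(ground_truth_dir)
    # Base names of line images: "cut2-001.tif" -> "cut2-001"
    images = {p.name[:-len(".tif")] for p in root.glob("*.tif")}
    # Base names of transcriptions: "cut2-001.gt.txt" -> "cut2-001"
    texts = {p.name[:-len(".gt.txt")]: p for p in root.glob("*.gt.txt")}
    missing_text = sorted(images - texts.keys())
    missing_image = sorted(texts.keys() - images)
    # A transcription containing only whitespace is most likely a forgotten correction
    empty = sorted(
        name for name, p in texts.items()
        if not p.read_text(encoding="utf-8").strip()
    )
    return missing_text, missing_image, empty
```

If all three lists come back empty, every line image has a non-empty transcription and the make call from step 5 should see the complete set.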

The output training file I get already improves the recognition a lot, although Tesseract barely recognizes the letters in the middle column, like the "J" in 2-J07 mentioned above.

I read about a valid letters list, although I couldn't find it so far. I also get a warning about a missing dictionary. I'm not sure whether this really affects the recognition.

Is there any tuning possibility to get the letters recognized better, or do I need more data? I've got around 300 lines for training so far.

Thank you!

The samples are not in the highest resolution; I scanned the images at 600 dpi.

Jertlok commented 4 years ago

I am not sure if you are going to find this answer useful, but I will try to reply to what I know so far.

tesseract barely recognizes the letters in the middle column

Have you tried using another page segmentation mode (--psm)?

I read about a valid letters list

If I understood correctly, as a last resort you could set a character whitelist that includes all the characters you may find in your images. The configuration variable is tessedit_char_whitelist, so you could set it to include only upper-case letters plus a few special characters.
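To make the whitelist idea concrete, here is a rough sketch that assembles such a call from Python. It assumes the model is named microfiche (as in the make call above) and that digits, upper-case letters and the hyphen are the only characters in the lists. The function name whitelist_command and its defaults are hypothetical.

```python
import string


def whitelist_command(image, lang, psm=6, extra_chars="-"):
    """Build a tesseract call that restricts output to digits, A-Z and extra_chars."""
    whitelist = string.digits + string.ascii_uppercase + extra_chars
    return [
        "tesseract", image, "stdout",   # "stdout" as output base prints the text
        "-l", lang,                     # name of the traineddata file to use
        "--psm", str(psm),              # page segmentation mode
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]
```

The returned list can be passed to subprocess.run(cmd, capture_output=True, text=True). Building a list instead of a shell string avoids quoting problems with the whitelist characters.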

And I get a warning about no dictionary.

The models that tesstrain outputs come without a dictionary by default. To add one, you might want to check this useful comment from Shreeshrii.

wrznr commented 4 years ago

Apart from what Jertlok wrote, do the metadata of your images contain information on the resolution (i.e. 600 dpi, check it e.g. with exiftool)? If not, you may want to set it manually using --dpi (undocumented Tesseract option).

Try setting PSM to 13.

L1800Turbo commented 4 years ago

Thank you for the answers! I usually used psm 6, as the default one only recognizes the first column.

I will check the char_whitelist this evening, although I don't get any characters apart from the ones I would whitelist anyway.

Did a check on the dpi by exiftool, got 600 dpi.

PSM 13 seems to make it a little worse. As an example, cut2-001 exp0 produces:

PSM 6: 790364120 2-007 048
PSM 13: 790364120 2-307 0348

The numbers are mostly recognized well. Only the letters seem to cause problems.
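To see which characters get confused, and how often, the OCR output can be compared with the ground truth character by character. A small sketch using Python's difflib, assuming the correct text for the example line is 790364120 2-J07 048. The helper name confusions is made up.

```python
from difflib import SequenceMatcher


def confusions(truth, ocr):
    """List (expected, recognized) fragments where the OCR text departs from the truth."""
    ops = SequenceMatcher(None, truth, ocr, autojunk=False).get_opcodes()
    # Keep only replaced, deleted or inserted spans; "equal" spans were recognized correctly
    return [(truth[i1:i2], ocr[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


print(confusions("790364120 2-J07 048", "790364120 2-007 048"))
```

Collecting these pairs over all lines (for example with collections.Counter) would show whether the errors are really concentrated on the single letter in the middle column.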

Jertlok commented 4 years ago

I just tried to scan your image with a model I am currently training (yeah, that font is pretty similar to what I have in my various ground-truth images) and I get a perfect match.

tesseract .\img.png stdout -l micraPlus_5.837_4429_16100
Failed to load any lstm-specific dictionaries for lang micraPlus_5.837_4429_16100!!

790364120 2-J07 048

The model has been derived from ita (https://github.com/tesseract-ocr/tessdata_best/blob/master/ita.traineddata).

What I can suggest is trying to improve your ground-truth images, as the letters you've got there are pretty ambiguous and not really good for training (IMHO).

Here's my model (integer and float), just in case you might find it useful for your training: micraPlus_model.zip (https://github.com/tesseract-ocr/tesstrain/files/3939824/micraPlus_model.zip)

Also, please note that this test has been done with the latest version of tesseract (master).

Shreeshrii commented 4 years ago

You should consider uploading your models to tessdata_contrib.

L1800Turbo commented 4 years ago

@Jertlok Your training file already recognizes the letters much better, although it makes more mistakes on the numbers. Maybe it's a statistical thing, as I only have one letter between the numbers for training? Is it maybe even possible to give Tesseract a pattern, since I know in advance where numbers occur and where they don't?

wrznr commented 4 years ago

Maybe it's a statistical thing ...

The optimal distribution of a training set in relation to the materials to be recognized is still an open question. A systematic evaluation on data like yours would be very, very helpful!

possible to give a pattern to tesseract

Maybe not directly. But if you have a model which performs better on certain parts of your input and you know in advance where those parts are located you may just apply the more appropriate model there, right?

Shreeshrii commented 4 years ago

You can try user_patterns to see if it helps in your case. See https://github.com/tesseract-ocr/tesseract/pull/2328

L1800Turbo commented 4 years ago

Maybe not directly. But if you have a model which performs better on certain parts of your input and you know in advance where those parts are located you may just apply the more appropriate model there, right?

So I'd cut the picture into further parts (columns), to especially recognize the middle part, or is there a more intelligent way?

The optimal distribution of a training set in relation to the materials to be recognized is still an open question. A systematical evaluation on data like you have would be very, very helpful!

Currently I have a small Perl script to analyze the data afterwards with regular expressions and point out whenever a line doesn't match, so that I correct it manually. This would be a great feature, to tell tesseract about this in advance and to let it "look a second time" if the pattern doesn't match.

wrznr commented 4 years ago

This would be a great feature

Not very likely to happen. Sorry.

is there a more intelligent way?

Use your Perl script? I.e. let tesseract “look a second time” with the other model if the pattern doesn't match and extract the text only for those parts.
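The "look a second time" idea could be sketched like this, here in Python rather than Perl. It assumes both models were run on the same image and returned the same number of lines in the same order, and that every valid line looks like 790364120 2-J07 048 (nine digits, digit-hyphen-letter-two digits, three digits). The regular expression and the name merge_passes are illustrative guesses, not taken from the actual script.

```python
import re

# Assumed line format, derived from the sample line "790364120 2-J07 048"
LINE_PATTERN = re.compile(r"\d{9} \d-[A-Z]\d{2} \d{3}")


def merge_passes(primary, fallback, pattern=LINE_PATTERN):
    """Prefer the primary model's line; use the fallback model's line if only that one matches."""
    merged, unresolved = [], []
    for index, (first, second) in enumerate(zip(primary, fallback)):
        if pattern.fullmatch(first):
            merged.append(first)
        elif pattern.fullmatch(second):
            merged.append(second)
        else:
            # Neither model produced a valid line: keep the first and flag it for manual correction
            merged.append(first)
            unresolved.append(index)
    return merged, unresolved
```

Only the indices in unresolved would then need manual correction. Note that a line matching the pattern can still be wrong (a misread digit is still a digit), so this reduces the manual work but does not replace spot checks.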

bertsky commented 4 years ago

Maybe not directly. But if you have a model which performs better on certain parts of your input and you know in advance where those parts are located you may just apply the more appropriate model there, right?

So I'd cut the picture into further parts (columns), to especially recognize the middle part, or is there a more intelligent way?

Exactly. If you know what pattern of numbers and letters to expect for a certain segment of your document, and you use the Tesseract API anyway (or split up the page into segment images and use the CLI), then you can tell Tesseract what to look for with the user_patterns feature mentioned above. It's just a hint in the current implementation though, not exclusive. (It acts like a dictionary.)
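For what it's worth, a pattern can be derived mechanically from a known-good sample. The sketch below assumes the user-patterns escapes \d (digit), \A (upper-case letter) and \a (lower-case letter) and copies every other character literally. The helper to_user_pattern is hypothetical, and it does not handle samples that themselves contain a backslash.

```python
def to_user_pattern(sample):
    """Turn a sample token such as "2-J07" into a user-patterns line."""
    parts = []
    for ch in sample:
        if ch.isdigit():
            parts.append(r"\d")       # any digit
        elif ch.isalpha() and ch.isupper():
            parts.append(r"\A")       # any upper-case letter
        elif ch.isalpha():
            parts.append(r"\a")       # any lower-case letter
        else:
            parts.append(ch)          # punctuation and spaces stay literal
    return "".join(parts)


print(to_user_pattern("2-J07"))
```

Each generated pattern would go on its own line of the file passed via --user-patterns. Since the patterns act as a hint only, the output should still be validated afterwards.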

L1800Turbo commented 4 years ago

Yes, letting tesseract look into the data with another model is a good idea. I will try that.

Also, I was trying to use user_patterns as described in the wiki (https://github.com/tesseract-ocr/tesseract/wiki/APIExample-user_patterns). Unfortunately, I couldn't get this to work with v4.1.1-rc2-17-g6343

My command is tesseract schnitt4.tif schnitt4 --user-patterns ../../Microfiche.pattern -c lstm_use_matrix=1 -l Microfiche --psm 6 bazaar

Microfiche.pattern looks like this: \d\d\d\d\d\d\d\d\d \d-\A\d\d\d \d\d\d \d-\A\d\d\d \d\d\d

Setting the parameters makes no difference to the output. I did some research and also tried it with a config file and tesseract schnitt4.tif schnitt4 -l Microfiche --psm 6 bazaar, but there was no difference. Typing a wrong pattern file path on purpose gave me an error message, so this parameter does seem to be evaluated at some point.

bertsky commented 4 years ago

My command is tesseract schnitt4.tif schnitt4 --user-patterns ../../Microfiche.pattern -c lstm_use_matrix=1 -l Microfiche --psm 6 bazaar

What is bazaar in here? (If you copied it from the recipe in the man-page, it's meant as the (file) name of a config file, but you don't need a config file on the command line, since you can use --user-patterns. In fact, that config file could easily override that setting by referencing other pattern files – though I'm not certain of this.)

Also, you don't need lstm_use_matrix=1, since it's the default. (I just updated the wiki to reflect this.)

Additional parameters you could try are -c load_system_dawg=F -c load_freq_dawg=F – this disables the built-in dictionaries (if your model even contains them).

Microfiche.pattern looks like this: \d\d\d\d\d\d\d\d\d \d-\A\d\d\d \d\d\d \d-\A\d\d\d \d\d\d

This looks good for --psm 6.

So if this does not make any difference at all, then I'm afraid there's not much more you can do currently at runtime. (You must understand that user patterns, like any dictionary/dawg in Tesseract, are not applied exclusively, but as a hint only. I know how to make them exclusive, but in the current state of affairs all this would get us are rejections, i.e. missing characters. I have tried combining this with deep beam alternatives, but have not succeeded so far.)

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.