tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Feature Request: Add support for 16-bit quantized LSTM models #4331

Open lackner-codefy opened 1 month ago

lackner-codefy commented 1 month ago

Your Feature Request

For the LSTM engine, there are currently the 'fast' 8-bit integer models, as well as the 'best' models, which probably use 32-bit floating-point values.

While the fast models are indeed fast, they make a lot of errors in my specific use case (with tesseract 5.3.0 and 5.4.1, mostly German language). I tested with the best models and they don't have this problem. However, they are also much slower, increasing the processing time considerably.

I'd like to have a better compromise between performance and accuracy. Something like a 16-bit integer model, which would (hopefully) still be pretty fast, but wouldn't suffer from these random quality issues.

Would it be possible to implement support for 16-bit integer models? I'm aware that it's not a trivial task, since int_mode() is checked all over the place, and it's also not trivial to write architecture-specific code to handle vector / matrix operations efficiently.

If it's not within the scope of this project, what other tricks could be used to speed up the "best" model?

amitdo commented 1 month ago

While the fast models are indeed fast, they make a lot of errors in my specific use-case

The 'fast' models are not based on the 'best' models. They were trained with a smaller network and converted to int8.

There is an option to convert a 'best' model to an int8 model. This will give you better accuracy than the 'fast' model.

amitdo commented 1 month ago

https://github.com/tesseract-ocr/tesseract/blob/main/doc/lstmtraining.1.asc
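Roughly, the conversion looks like the sketch below (deu.traineddata is only a placeholder for whichever 'best' model you use; see the manual page above for the details): first extract the float LSTM network with combine_tessdata, then repackage it as int8 with lstmtraining.

```sh
# Extract the float LSTM network from a 'best' traineddata
# (deu.traineddata is just an example file name)
combine_tessdata -e deu.traineddata deu.lstm

# Repackage it as an int8 model; --convert_to_int trades a little
# accuracy for speed (see lstmtraining.1.asc)
lstmtraining --stop_training --convert_to_int \
  --continue_from deu.lstm \
  --traineddata deu.traineddata \
  --model_output deu_int8.traineddata
```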

stweil commented 1 month ago

@lackner-codefy, did you also test with models from tessdata? Do they produce results similar to the "best" models?

And can you say more about your specific use case? For some use cases (especially for historic German prints) my models might be better than the official models: https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/.

stweil commented 1 month ago

probably using 32-bit floating point values

Tesseract 4 used double-precision (64-bit) values. The current "best" models therefore still contain 64-bit values, which are converted to float (32-bit) by Tesseract 5 (unless it was built to use double).
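If I remember correctly, the float/double choice is made at build time; the option name below is from memory, so please verify it against CMakeLists.txt before relying on it.

```sh
# Assumed CMake option (verify in CMakeLists.txt): build Tesseract with
# double precision for the LSTM code instead of the default float
cmake -DFAST_FLOAT=OFF ..
```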

amitdo commented 1 month ago

About the tessdata repo stweil mentioned: the models there are a combination of two models, a model for the legacy OCR engine and an LSTM model based on the 'best' model that was converted to int8.

With that model you can use the command-line option --oem 1, which tells tesseract to use only the LSTM model.
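For example (input.png and output are placeholder names):

```sh
# Run OCR with only the LSTM engine (--oem 1) and the German model
tesseract input.png output --oem 1 -l deu
```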

lackner-codefy commented 1 month ago

@amitdo @stweil Thanks for all of your suggestions. Really appreciated! :pray: