tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Tesseract 3.05 should reject incompatible traineddata from 4.00 #1226

Closed jbarlow83 closed 6 years ago

jbarlow83 commented 6 years ago

Environment

Current Behavior:

If supplied with Tesseract 4.x's .traineddata files, Tesseract 3.x will attempt to use them and fail with a variety of error messages. The error messages give no clue as to the problem or solution.

Some of them are short, such as "read_params_file: parameter not found:".

In other cases Tesseract will spam the terminal with what appears to be a line-by-line dump of the entire .traineddata file:

[truncated for sanity]
ParamsModel::Incomplete line ConvNL\x84
ParamsModel::Incomplete line ?\xeaBN\xb8H\x8f\xd1?sd\xaf\xfc\xc5?|5Lƾ\x95?LNH\xe7\xfe\xee忈!\x9e\xc9\xcb-\xb1\xbfwP߸\xce~\x9c?\x94H\xf4W;\xc6?\xe9\x89\xe1\xe9\xab\xc2?'\xd0=\x9b\x81V\xab?\xe6\x81\xdc\xeaq\xe5\xbd?;2\xeb,\xa3s\xd3?\xa9ᛦ\x95\x98\xd0?S\xd7
...

Given the alpha status of Tesseract 4.x, it seems some people are manually downloading Tesseract 4 data files and installing them in the wrong places by hand, or trying Tess4 and then reverting to Tess3.

Steps to Reproduce:

  1. Begin with a clean install of Tesseract 3.05.01

  2. Manually replace deu.traineddata with the Tesseract 4.00.xx version, such as https://github.com/tesseract-ocr/tessdata_best/blob/master/deu.traineddata

  3. Run tesseract -l deu testing/phototest.tif _ pdf

  4. Output is

    read_params_file: parameter not found: 

Expected Behavior:

Tesseract 3.x should refuse to use 4.x .traineddata files with a clear error message that the .traineddata files are incompatible.
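For illustration only, here is a rough sketch of the kind of up-front check that could produce such an error. It is not Tesseract's code. It assumes that a .traineddata file begins with a little-endian 32-bit count of component slots, and that a 3.05 build knows about at most 17 slots while 4.x files declare more; both numbers and all function names below are assumptions made for the example.

```python
import struct

# Assumption for this sketch: how many component slots a 3.05 build knows.
ASSUMED_V3_MAX_ENTRIES = 17


def read_entry_count(header: bytes) -> int:
    """Return the slot count stored in the first 4 bytes (assumed little-endian)."""
    if len(header) < 4:
        raise ValueError("file too short to be a traineddata container")
    (count,) = struct.unpack("<i", header[:4])
    return count


def check_compatible(header: bytes, max_entries: int = ASSUMED_V3_MAX_ENTRIES) -> None:
    """Raise ValueError with a readable message if the header looks too new."""
    count = read_entry_count(header)
    if count <= 0:
        raise ValueError(f"implausible entry count {count}")
    if count > max_entries:
        raise ValueError(
            f"traineddata declares {count} components but this engine "
            f"understands at most {max_entries}; the file was probably "
            "built for a newer Tesseract version"
        )


def check_file(path: str) -> None:
    """Hypothetical helper: check only the first 4 bytes of a file on disk."""
    with open(path, "rb") as f:
        check_compatible(f.read(4))
```

A caller would run check_file on the language file before handing it to the engine and print the exception text instead of the raw parser errors shown above.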

zdenop commented 6 years ago

This is a known problem: there has always been a problem if you use data files from a higher version of Tesseract. At the moment it is the responsibility of the packager or user to install the correct version of the data files. I do not think there will be a fix for 3.x, since development is focused on 4.x (master branch).

amitdo commented 6 years ago

I assume 4.00 has the same issue with '--oem 0'.