Missing big.traineddata

seeebek / EliteOCR

OCR tool for market screenshots in Elite: Dangerous

Missing big.traineddata #25

Closed beldougie closed 9 years ago

beldougie commented 9 years ago

Hi @seeebek, do you have a file named big.traineddata in <tesseract>/share/tessdata on your system? I am getting errors that it doesn't exist (and it doesn't; I only have the English and OSD data). Was it created by the Windows build, or do I need to obtain it from somewhere?

Cheers

demonbane commented 9 years ago

You can get around this by running it with TESSDATA_PREFIX=./ python EliteOCR.py. Unfortunately, I haven't been able to figure out how to fix it in code yet. But this does get the OCR working, and it's smooth sailing from there.

seeebek commented 9 years ago

Normally Tesseract should look for big.traineddata in the EliteOCR path. It might be that the Mac version is hardcoded to the preset locations only. It would be useful to find out where it's looking and why it ignores the path set by the app (maybe there is some Mac-specific difference where a small correction could solve this).

demonbane commented 9 years ago

I was able to track this down to the fact that Windows automatically includes ./ in its search paths, while OSX/Linux don't. That's why manually passing TESSDATA_PREFIX=./ worked. So I just added it to the startup process. It'll continue to work on Windows as before, but now it'll also work on OSX. (and possibly Linux as well, though that's still to be tested)
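A minimal sketch of what setting the variable at startup might look like. The helper name `ensure_tessdata_prefix` is hypothetical and this is not necessarily the code that was committed. It assumes the variable is set before Tesseract is initialised, and it uses an absolute path instead of `./` so the result does not depend on the current working directory.

```python
import os

def ensure_tessdata_prefix(app_dir, environ=os.environ):
    """Point TESSDATA_PREFIX at app_dir and return the value that was set.

    Hypothetical helper: app_dir is assumed to be the directory that
    contains the "tessdata" folder shipped with the application.
    """
    # Keep a trailing separator so "tessdata/" can be appended safely.
    prefix = os.path.join(os.path.abspath(app_dir), "")
    # Assigning to os.environ only affects this process and its children;
    # it does not change any system-wide setting.
    environ["TESSDATA_PREFIX"] = prefix
    return prefix
```

Calling `ensure_tessdata_prefix(".")` early in the startup code would have roughly the same effect as launching with `TESSDATA_PREFIX=./`.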

seeebek commented 9 years ago

That doesn't sound right yet. In ocrmethods.py there is this line:

api.Init(self.path.encode('windows-1252'), "big", tesseract.OEM_DEFAULT)

This call sets up Tesseract. The first argument is a string with the path to the directory that contains "tessdata/big.traineddata". The path comes from settings.py (method: getPathToSelf).

If I were you, I would check which string is available at every step of the chain (ocrmethods.py -> ocr.py -> settings.py). The problem might be as simple as an encoding/decoding issue, or the mentioned method not finding the proper path.
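One way to run that check is a small diagnostic that prints the path seen at each step and whether the trained data file actually exists under it. This is only a sketch: the function names are hypothetical, and it assumes the file is expected at `<path>/tessdata/big.traineddata`, as described above.

```python
import os

def traineddata_path(base_path, lang="big"):
    """Build the location where <lang>.traineddata is expected under base_path."""
    return os.path.join(base_path, "tessdata", lang + ".traineddata")

def report_path(label, base_path, lang="big"):
    """Print the path seen at one step of the chain and whether the file exists."""
    candidate = traineddata_path(base_path, lang)
    found = os.path.isfile(candidate)
    status = "found" if found else "missing"
    print("%s: %r -> %s (%s)" % (label, base_path, candidate, status))
    return found
```

Calling `report_path("settings.py", path)` in settings.py, ocr.py and ocrmethods.py with the value each module holds would show where the string changes or stops pointing at the right directory.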

P.S. I really recommend not setting any system variables if you can avoid it.

demonbane commented 9 years ago

I did some more research on this, and it's a combination of two factors:

1. Tesseract 3.02 and earlier will use TESSDATA_PREFIX if it's set, regardless of any other paths that are passed to the module.
2. Some platforms (including Mac) set TESSDATA_PREFIX to the parent directory of the module at build time, meaning the only way to change the path at runtime is to set TESSDATA_PREFIX manually.

So even though the API call in ocrmethods.py is:

api.Init(self.path.encode('windows-1252'), "big", tesseract.OEM_DEFAULT)

the first argument is ignored if TESSDATA_PREFIX has been set anywhere. This is why the only way to fix this currently is to set TESSDATA_PREFIX manually in EliteOCR. The good news is that this does not break Windows compatibility in any way.
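The precedence described above can be illustrated with a toy model. This is not Tesseract's actual code; the function `effective_datapath` and its `honors_init_path` flag are hypothetical and exist only to show why the path passed to `api.Init` has no effect while TESSDATA_PREFIX is set.

```python
def effective_datapath(init_path, environ, honors_init_path=False):
    """Toy model of which data path wins, as described in this thread.

    honors_init_path=False models the reported 3.02 behaviour, where the
    environment variable overrides the path passed at init time.
    """
    env_prefix = environ.get("TESSDATA_PREFIX")
    if env_prefix and not honors_init_path:
        # The environment variable wins; init_path is ignored.
        return env_prefix
    # Otherwise prefer the explicit path and fall back to the environment.
    return init_path or env_prefix
```

With the default flag, `effective_datapath("/app/", {"TESSDATA_PREFIX": "/usr/share/"})` returns `"/usr/share/"`, which matches the missing-file error seen on Mac.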

Once python-tesseract is updated to Tesseract 3.03, it should start respecting the path that's passed in at init time, at which point the change can be reverted if necessary.

seeebek commented 9 years ago

OK, great to hear. At the end of this week I will try to merge all your fixes into the EliteOCR master branch.