openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to Gnome's Gitlab
https://gitlab.gnome.org/World/OpenPaperwork/pyocr

Using libtesseract on Windows #90

Open ghost opened 6 years ago

ghost commented 6 years ago

I tried to use libtesseract302.dll (from https://github.com/mnadeem/ocr-tess4j-example), but got:

AttributeError: function 'TessBaseAPIGetDatapath' not found

Then I tried libtesseract400.dll (from https://github.com/nguyenq/tess4j, which depends on https://github.com/nguyenq/lept4j),

but it seems that libtesseract400.dll is not in libtesseract.tesseract_raw.libnames.

By the way, ctypes.cdll.LoadLibrary searches for DLLs in the directories listed in the PATH environment variable, at least on Windows:

https://github.com/openpaperwork/pyocr/blob/ce23c2492739bef2b5313d257b1705e605d8ebcd/src/pyocr/libtesseract/tesseract_raw.py#L31
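
The lookup described above can be sketched roughly as follows. This is a minimal illustration, not pyocr's actual code: `CANDIDATE_NAMES` and `load_first_available` are hypothetical names, and the DLL names are simply the ones proposed later in this thread.

```python
import ctypes

# Hypothetical candidate list: the names suggested later in this thread,
# not necessarily the contents of pyocr's own libnames.
CANDIDATE_NAMES = [
    "libtesseract304.dll",
    "libtesseract305.dll",
    "libtesseract400.dll",
    "libtesseract.dll",
]


def load_first_available(names):
    """Return (name, handle) for the first library that loads, else (None, None)."""
    for name in names:
        try:
            # A bare file name leaves the lookup to the platform's loader.
            return name, ctypes.cdll.LoadLibrary(name)
        except OSError:
            # Library not found or not loadable: try the next candidate.
            continue
    return None, None


name, lib = load_first_available(CANDIDATE_NAMES)
print("loaded:", name)
```

Note that this relies on the loader's search path. As far as I know, Python 3.8 and later no longer consult PATH by default when loading DLLs through ctypes on Windows, so passing a full path or calling `os.add_dll_directory` first may be needed there.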

I think it's easy to fix, but why not bundle libtesseract with pyocr? That might make it easier to use.

jflesch commented 6 years ago

I tried to use libtesseract302.dll (from https://github.com/mnadeem/ocr-tess4j-example)

1) Windows support for libtesseract is based on contributions. I personally don't use it (I use pyocr.tesseract for my project on Windows). So the list of .dll files to try to load is probably not up to date at all. Please don't hesitate to tell me if you need some new ones to be added.

2) Tesseract 3.02 is known for not working well with Pyocr (on GNU/Linux anyway). Even if the binding did work, is_available() would have returned False. You should try with Tesseract >= 3.0.4.

3) I don't know where those repositories come from, but they seem intended to be used with tess4j (Java) (are they patched specifically for tess4j?). Anyway, I think you should use some more official/direct sources for your Tesseract installation: https://github.com/tesseract-ocr/tesseract/wiki/Downloads ; https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

4) AFAIK, Tesseract 4 is still in alpha. Pyocr supports it on Linux, but I cannot yet guarantee good support on Windows.
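
For reference, the AttributeError reported with libtesseract302.dll is what ctypes raises when a loaded library does not export a requested function. A minimal sketch of probing for that up front, assuming a hypothetical helper `missing_symbols` (not part of pyocr) and using only the function name from the error message above:

```python
import ctypes


def missing_symbols(lib, required):
    """Return the names in `required` that `lib` does not expose.

    ctypes resolves exported functions lazily on attribute access and raises
    AttributeError for an unknown symbol, so hasattr() works as a probe.
    """
    return [name for name in required if not hasattr(lib, name)]


# Hypothetical usage, assuming the DLL can be found by the loader:
try:
    lib = ctypes.cdll.LoadLibrary("libtesseract302.dll")
except OSError:
    lib = None

if lib is not None:
    print(missing_symbols(lib, ["TessBaseAPIGetDatapath"]))
```

A non-empty result would suggest the library is too old or was built without the C API functions the binding expects.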

I think it's easy to fix, but why not bundle libtesseract with pyocr? That might make it easier to use.

Because if we go this way, for consistency, I would also have to package Tesseract.exe, Cuneiform, and the language data files of both Tesseract and Cuneiform.

ghost commented 6 years ago

Thank you very much for granting me so much of your valuable time.

I don't know where those repositories come from

I was just too lazy to compile libtesseract myself, so I searched GitHub...

I tried to use the win32 build (3rd party - @parrot-office) listed at https://github.com/tesseract-ocr/tesseract/wiki/Downloads, but it has to be used together with many pvt.cppan.demo.xxx.dll files _(:з」∠)_ Maybe I should try to compile it myself...

Please don't hesitate to tell me if you need some new ones to be added.

These names could perhaps be added:

libtesseract304.dll
libtesseract305.dll
libtesseract400.dll
libtesseract.dll

jflesch commented 6 years ago

These names could perhaps be added:

Done: 2d6ead7e9e3031d7b2efa3ccfdb37ece291a9b66