sirfz / tesserocr

A Python wrapper for the tesseract-ocr API
MIT License

Building on Windows #19

Closed ghost closed 1 year ago

ghost commented 8 years ago

I have been working on making some bots for programs that only run in windows and I was wondering if you had any pointers on compiling on windows. I was actually able to build tesserocr.lib but I cannot get past that step.

I used https://github.com/peirick/VS2015_Tesseract to build libtesseract and used that to satisfy all of the imports.

Thank you.

sirfz commented 8 years ago

Windows support is something I've wanted for tesserocr but since I haven't used Windows for a few years now I didn't get to work on it.

I would've assumed that if you successfully compiled tesseract and leptonica then tesserocr should be able to compile and install normally. What failed exactly in tesserocr's setup?

ghost commented 8 years ago

I compiled tesseract and leptonica easily just because there was a visual studio solution for it. I am not sure how to create something similar for this project. In a very messy way I was able to run the visual studio command line compiler and include the headers but I am having trouble linking.

I try:

C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -I. -IC:\Users\Sean\Desktop\tesseract\VS2015_Tesseract -IC:\Users\Sean\Desktop\tesseract\VS2015_Tesseract\leptonica\src -IC:\Users\Sean\Desktop\tesseract\VS2015_Tesseract\tesseract_3.04\vs2010\port -IC:\Users\Sean\Desktop\tesseract\VS2015_Tesseract\tesseract_3.04\ccutil -IC:\Users\Sean\Desktop\tesseract\VS2015_Tesseract\tesseract_3.04\ccstruct -IC:\Users\Sean\Desktop\tesseract\VS2015_Tesseract\tesseract_3.04\ccmain -IC:\Anaconda2\include -IC:\Anaconda2\PC /Tptesserocr.cpp /Fobuild\temp.win-amd64-2.7\Release\tesserocr.obj

And that step fails.

sirfz commented 8 years ago

The compile arguments on Linux look like this:

gcc parameters (outputs object file):

-I/usr/local/include -I/usr/include/python2.7 -c tesserocr.cpp -o build/temp.linux-x86_64-2.7/tesserocr.o

/usr/local/include contains the tesseract and leptonica folders which contain their respective header files, and /usr/include/python2.7 contains the Python header files (either Python 2 or 3).

c++ parameters (outputs binary file):

-L/usr/local/lib -llept -ltesseract -o tesserocr.so

/usr/local/lib contains liblept.so and libtesseract.so (DLLs in your case) while lept and tesseract are the names of the libraries.

Perhaps you can try to translate these parameters into your command (or the VS solution) to achieve a successful build. Hope this helps.
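For anyone doing that translation inside setup.py rather than on the command line, here is a minimal sketch of how the gcc flags above line up with the keyword names a setuptools/distutils Extension accepts (include_dirs, library_dirs, libraries). The helper name gcc_flags_to_build_config is made up for illustration and is not part of tesserocr's setup.py.

```python
import shlex


def gcc_flags_to_build_config(flags):
    """Sort -I/-L/-l flags into the matching Extension keyword lists.

    Hypothetical helper for illustration only; other flags are ignored.
    """
    config = {"include_dirs": [], "library_dirs": [], "libraries": []}
    prefixes = {"-I": "include_dirs", "-L": "library_dirs", "-l": "libraries"}
    for token in shlex.split(flags):
        key = prefixes.get(token[:2])
        if key and len(token) > 2:
            config[key].append(token[2:])
    return config


# The two flag strings quoted in the comment above.
compile_flags = ("-I/usr/local/include -I/usr/include/python2.7 "
                 "-c tesserocr.cpp -o build/temp.linux-x86_64-2.7/tesserocr.o")
link_flags = "-L/usr/local/lib -llept -ltesseract -o tesserocr.so"

config = gcc_flags_to_build_config(compile_flags + " " + link_flags)
print(config)
```

On Windows the same three lists would hold the directories with the tesseract/leptonica headers, the directories with the .lib import libraries, and the library names without the .lib extension.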

jeweinberg commented 7 years ago

I am very interested in using this package for Windows as well. Do you have plans in the future for fixing the problem with the cpp compiling with VS?

jeweinberg commented 7 years ago
(C:\Users\jeweinberg\AppData\Local\Continuum\Anaconda3) C:\Users\jeweinberg>pip
install tesserocr
Collecting tesserocr
  Using cached tesserocr-2.1.3.tar.gz
Building wheels for collected packages: tesserocr
  Running setup.py bdist_wheel for tesserocr ... error
  Complete output from command C:\Users\jeweinberg\AppData\Local\Continuum\Anaco
nda3\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\JEWEIN~1
\\AppData\\Local\\Temp\\pip-build-0rwemyrs\\tesserocr\\setup.py';f=getattr(token
ize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(
compile(code, __file__, 'exec'))" bdist_wheel -d C:\Users\JEWEIN~1\AppData\Local
\Temp\tmpraznvglgpip-wheel- --python-tag cp35:
  running bdist_wheel
  running build
  running build_ext
  Supporting tesseract v3.02
  Building with configs: {'libraries': ['tesseract', 'lept'], 'cython_compile_ti
me_env': {'TESSERACT_VERSION': 770}}
  cythoning tesserocr.pyx to tesserocr.cpp
  building 'tesserocr' extension
  creating build
  creating build\temp.win-amd64-3.5
  creating build\temp.win-amd64-3.5\Release
  C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\cl.exe /c /nologo /
Ox /W3 /GL /DNDEBUG /MD -IC:\Users\jeweinberg\AppData\Local\Continuum\Anaconda3\
include -IC:\Users\jeweinberg\AppData\Local\Continuum\Anaconda3\include /EHsc /T
ptesserocr.cpp /Fobuild\temp.win-amd64-3.5\Release\tesserocr.obj
  tesserocr.cpp
  c:\users\jeweinberg\appdata\local\continuum\anaconda3\include\pyconfig.h(68):
fatal error C1083: Cannot open include file: 'io.h': No such file or directory
  error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\bin
\\cl.exe' failed with exit status 2

  ----------------------------------------
  Failed building wheel for tesserocr
  Running setup.py clean for tesserocr
Failed to build tesserocr
Installing collected packages: tesserocr
  Running setup.py install for tesserocr ... error
    Complete output from command C:\Users\jeweinberg\AppData\Local\Continuum\Ana
conda3\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\JEWEIN
~1\\AppData\\Local\\Temp\\pip-build-0rwemyrs\\tesserocr\\setup.py';f=getattr(tok
enize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exe
c(compile(code, __file__, 'exec'))" install --record C:\Users\JEWEIN~1\AppData\L
ocal\Temp\pip-zia9yqa6-record\install-record.txt --single-version-externally-man
aged --compile:
    running install
    running build
    running build_ext
    Supporting tesseract v3.02
    Building with configs: {'cython_compile_time_env': {'TESSERACT_VERSION': 770
}, 'libraries': ['tesseract', 'lept']}
    skipping 'tesserocr.cpp' Cython extension (up-to-date)
    building 'tesserocr' extension
    creating build
    creating build\temp.win-amd64-3.5
    creating build\temp.win-amd64-3.5\Release
    C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\cl.exe /c /nologo
 /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\jeweinberg\AppData\Local\Continuum\Anaconda
3\include -IC:\Users\jeweinberg\AppData\Local\Continuum\Anaconda3\include /EHsc
/Tptesserocr.cpp /Fobuild\temp.win-amd64-3.5\Release\tesserocr.obj
    tesserocr.cpp
    c:\users\jeweinberg\appdata\local\continuum\anaconda3\include\pyconfig.h(68)
: fatal error C1083: Cannot open include file: 'io.h': No such file or directory

    error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\b
in\\cl.exe' failed with exit status 2

    ----------------------------------------
Command "C:\Users\jeweinberg\AppData\Local\Continuum\Anaconda3\python.exe -u -c
"import setuptools, tokenize;__file__='C:\\Users\\JEWEIN~1\\AppData\\Local\\Temp
\\pip-build-0rwemyrs\\tesserocr\\setup.py';f=getattr(tokenize, 'open', open)(__f
ile__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__
, 'exec'))" install --record C:\Users\JEWEIN~1\AppData\Local\Temp\pip-zia9yqa6-r
ecord\install-record.txt --single-version-externally-managed --compile" failed w
ith error code 1 in C:\Users\JEWEIN~1\AppData\Local\Temp\pip-build-0rwemyrs\tess
erocr\

(C:\Users\jeweinberg\AppData\Local\Continuum\Anaconda3) C:\Users\jeweinberg>

sirfz commented 7 years ago

@jeweinberg I personally don't use Windows and am not planning to any time soon, so, as much as I'd like to, I don't have the time to work on it.

I'm hoping someone who really needs it can do it and share their knowledge here (and of course merge whatever necessary changes, if any). I've shared what I know in this post based on my Linux experience with the setup.

Looking at the error you posted, it seems you're missing the Python header files (fatal error C1083: Cannot open include file: 'io.h': No such file or directory), you need to include those as well as the headers of leptonica and tesseract.
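One way to narrow this kind of error down is to check which include directories actually contain the header the compiler cannot find. The sketch below assumes cl.exe picks up its default search path from the INCLUDE environment variable; the function names are made up for illustration. Note that the log shows pyconfig.h being opened, and io.h is a C runtime header that ships with the Visual Studio / Windows SDK headers rather than with Python, so it may be worth checking that the compiler environment (for example a Visual Studio developer command prompt) is set up before running pip.

```python
import os
from pathlib import Path


def include_dirs_from_env(environ=os.environ):
    """Split the INCLUDE variable into its directories (empty entries dropped)."""
    return [d for d in environ.get("INCLUDE", "").split(os.pathsep) if d]


def find_header(name, search_dirs):
    """Return the directories from search_dirs that contain the header `name`."""
    return [str(Path(d)) for d in search_dirs if (Path(d) / name).is_file()]


if __name__ == "__main__":
    # An empty list means no directory on INCLUDE provides io.h.
    print(find_header("io.h", include_dirs_from_env()))
```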

dgvogol commented 7 years ago

I managed to have a windows build in a very messy way, here is what I did:

  1. Follow the instructions here to build the tesseract library, using cmake and Visual Studio 2015. Note that cppan or cmake sometimes hangs for some reason, so you will probably need to try it several times, see here

  2. Manually copy out the header files from tesseract and leptonica, and group them into the following structure: include/tesseract/ include/leptonica/

  3. The tesseract305.exp and tesseract305.lib are in tesseract/build/Release; you can copy them out or point the build to this place

  4. patch the setup.py package_config() method to manually return the right compiling/linking options, like below:

    import sys

    if sys.platform == "win32":
        def package_config():
            # Hard-coded paths from my machine; adjust them to your own build locations.
            config = {}
            config['library_dirs'] = ['C:\\Sandbox\\tesseract\\build\\Release',
                                      'C:\\Users\\david\\.cppan\\storage\\lib\\04a83184\\Release']
            config['include_dirs'] = ['C:\\Sandbox\\tesserocr\\include']
            config['libraries'] = ['tesseract305',
                                   'pvt.cppan.demo.leptonica-master',
                                   "pvt.cppan.demo.gif-5.1.4",
                                   "pvt.cppan.demo.jpeg-9.2.0",
                                   "pvt.cppan.demo.openjpeg.openjp2-2.1.2",
                                   "pvt.cppan.demo.png-1.6.26",
                                   "pvt.cppan.demo.tiff-4.0.6",
                                   "pvt.cppan.demo.zlib-1.2.8",
                                   "pvt.cppan.demo.webp-0.5.1"]
            # version_to_int is the helper already defined in setup.py
            config['cython_compile_time_env'] = {'TESSERACT_VERSION': version_to_int("3.05.00")}
            return config
    else:
        def package_config():
            ...  # original package_config body, elided
  5. The tesseract/vs2010/port/gettimeofday.h will cause a compile failure due to some kind of name conflict, but I don't know how to fix it decently. I did an ugly patch to rename "struct timezone" to "struct mytimezone". Need to figure out a proper way to fix it, but the ugly patch will work for now

  6. I'm also failing to include the "tesseract305.dll" into the final wheel. Not an expert of python distutils. Need to find the solution.

  7. Attached my wheel file and the associated DLL tesserocr-2.1.3-cp35-cp35m-win_amd64.zip tesseract305.dll.zip

sirfz commented 7 years ago

Great effort @dgvogol thanks for sharing! Regarding the inclusion of tesseract305.dll have you added the file in the MANIFEST.in file? It should automatically be included in the wheel if you add it to the manifest afaik. Would be cool if we can release Windows wheels for Python 2 and 3.

Jim-Salmons commented 7 years ago

UPDATE: SUCCESS!!!! In frustration I went looking for "Plan B" and thought to start at the Tesseract-ocr GitHub (https://github.com/tesseract-ocr). Looking at the repositories there, I had an "A-ha!" moment when I saw the Tessdata repo and remembered that my initial wheel/dll installation did not report any available languages. So I downloaded that repo and kept poking around. Assuming that Tesserocr still might not work, I decided to "start from the ground up" and just get the Tesseract-ocr engine installed. So I found the reference to the Windows binary, Tesseract 3.05-dev available at UB Mannheim (https://github.com/UB-Mannheim/tesseract/wiki). All I did next was run the binary installer and place the 'eng' Tessdata in the directory where Tesserocr was looking. I then ran my "ppg2leaf_ferret" app and the threads started spewing (some) print page numbers for the leafs it was loading! I obviously have some tweaking to do, but AWESOME progress nonetheless!

Sorry that this note is so TL;DR, but I am hoping that Fayez or David G can help me...

I am a 66-year old, post-cancer Bonus Round independent Citizen Scientist doing applied research at the intersection of #DigitalHumanities and #CognitiveComputing. I don't know who David G is (@dgvogol - awesome work, thank you) but his contribution of a Windows-based wheel for Tesserocr is exactly what I need for a project I am working on to support eResearch and machine-learning at the Internet Archive. Unfortunately, the wheel and dll supplied here come "this close" to working for me. Perhaps Fayez (@sirfz) or David can provide some insights...

The wxPython/PIL app I have developed is to support the discovery and curation of "ppg2leaf" metadata -- that is, print-page numbers to "leafs" being the ItemID for image files of pages of documents (computer magazines are my primary focus) at the Archive. This physical page number to ItemID is the most foundational metadata for traceability between physical documents and their digital manifestations. I currently have two papers submitted to #DATeCH2017 related to this work. (http://ddays.digitisation.eu/datech-2017/)

I have a nice multi-threaded app going that queues the high-resolution page images and, based on LEFT/RIGHT "handside", also creates a zoomed image of the corner of the page where the page number is expected to be found. A few simple keystrokes and this critical metadata is created when missing (the all-too-frequent standard) or validated (when the Archive's regional scanning centers do the digitization and human scanners "spot"/assert page numbers as they work).

Of course it would be totally awesome -- and somehow I have to figure out how to get this working -- if I could OCR these zoomed page corner views to see if we can identify and pre-populate the printed-page number into its respective text-widget, if found.

I am on Windows 10 x64, with a nice Anaconda-based environment that I use for my various FactMiners research projects. I was able to install David's wheel and dll and was hoping for the best but have hit a no-go wall.

Trying to "go for the gold" as soon as I had the wheel installed, I added the import and these two "see if it is there" lines at the top of my file:

import tesserocr

print(tesserocr.tesseract_version())  # print tesseract-ocr version
print(tesserocr.get_languages())  # prints tessdata path and list of available languages

which produces these message lines as my app cranks up:

tesseract 3.05.00dev
 leptonica-1.74 (Dec 15 2016, 08:36:24) [MSC v.1900 LIB Release x64]
  libgif 5.1.4 : libjpeg 9b : libpng 1.6.26 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.5.1 : libopenjp2 2.1.2

('C:\\Dev\\Anaconda\\envs\\fmtk\\tessdata/', [])

One thing that looks a bit hinky is that the available languages shows an empty list...

At any rate, in the worker thread where I download into memory the jpeg images via PIL from the Archive and create the scaled down full-page image for the main panel and crop the full-resolution image down to the appropriate corner where the page number may be found, I added this line to see if I could OCR the zoomed corner image:

print("Zoom image OCR: " + tesserocr.image_to_text(cornerzoom_img))

Unfortunately, when I run this I get the following error for each of the threads:

  File "[snip path]/ia_ppg2leaf_ferret_app.py", line 87, in run
    print("Zoom image OCR: " + tesserocr.image_to_text(cornerzoom_img))
  File "tesserocr.pyx", line 2288, in tesserocr.image_to_text (tesserocr.cpp:20796)
RuntimeError: Failed recognize picture

Also unfortunately, since this is a compiled wheel/dll, (using PyCharm) I cannot interactively debug what is going on other than to go look at the referenced line in the source.

Also note, that when I try to run the unit test, tests fail on the api.Init calls to a PyTessBaseAPI instance.

Like David (but surely moreso -- I spent the bulk of my career in Programming Nirvana as a Smalltalk developer), I know enough to be dangerous to myself in terms of Python and compiling C++ for Windows, etc. But I have great patience and persistence, so if there is anything folks can do to help me figure out how to get this working, that would be INCREDIBLE.

Finally, I believe that Tesserocr is a brilliant design (being threading/PIL friendly) and that its value and use will only increase as the domains of Digital Humanities and Cognitive Computing converge. And I believe David G's contribution of a "ready-to-rumble" wheel for Windows is a fantastic and much-needed contribution to this project.

Thank you both and any others for doing this valuable work.

Happy-Healthy Vibes, Jim

Jim-Salmons commented 7 years ago

A quick short follow-up... Tesserocr seems to be working pretty well to identify print page numbers in my zoomed corner images. The accuracy seems to go up considerably once we get to double-digit page numbers. Single digit page numbers seem to be almost invisible.

I saw a mention of "digits only" mode for Tesseract, maybe that will help if anyone has thoughts or pointers on this, including how to take advantage of this via Tesserocr. Also, I have not done any binarization or other filtering/tweaking to the retrieved leaf images from the Archive. Perhaps that would help, too.

Any suggestions about best practices for PIL Image prep would be greatly appreciated. Getting better recognition for single digits would be a big help, too.

ITMT, I am digging into various Tesseract docs and cross-referencing the Tesserocr source to do some configuration setting via SetPageSegMode and SetVariable. Will update here when I determine the most effective setup.

UPDATE QUESTION: Where/how to set the config for Tesserocr instances in a worker thread?

I know my questions/insights are naive based on my jumping into "doing" without adequate prep, but here goes... It looks like something close to the following would configure Tesserocr to do its best job on print page numbers (some of this seemingly redundant, but WTH if it works):

from tesserocr import PyTessBaseAPI, PSM

baseApi = PyTessBaseAPI()
baseApi.Init()
baseApi.SetPageSegMode(PSM.SINGLE_LINE)
baseApi.SetVariable("tessedit_char_blacklist", ".,!?@#$%&*()<>_-+=/:;'\"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")
baseApi.SetVariable("tessedit_char_whitelist", "0123456789")
baseApi.SetVariable("classify_bln_numeric_mode", "1")

Since my initial code simply calls the image_to_text helper method in the worker threads, I am not sure when/how to config the API for each thread.

sirfz commented 7 years ago

Hello Jim, glad you're using tesserocr and finding it useful :)

Regarding your initial issue, you can always specify the tessdata directory of your choice by passing it to the get_languages function or PyTessBaseAPI init (and pretty much every other helper function). Example:

tesserocr.get_languages('C:\\Dev\\Anaconda\\envs\\fmtk\\tessdata/')
api = tesserocr.PyTessBaseAPI(path='C:\\Dev\\Anaconda\\envs\\fmtk\\tessdata/')

As for your updated question, you can create a config file and load it using the ReadConfigFile method:

with tesserocr.PyTessBaseAPI(psm=tesserocr.PSM.SINGLE_LINE) as api:
    api.ReadConfigFile('C:\\path\\to\\config')
    # use api as usual...

You can check examples of config files here. Hope this helps.

update: you can use api.ReadConfigFile('digits') which reads the digits file from tessdata-path/configs/digits.

Extra tip: No need to call api.Init after initializing the api object since it calls it automatically with the given parameters passed while initializing the api instance. Only call Init if you wish to re-initialize with different parameters.
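For the worker-thread part of the question, one possible pattern is to keep one configured API object per thread and reuse it for every image that thread handles. The sketch below uses threading.local for that; get_thread_api, make_digits_api and ocr_digits are illustrative names, not tesserocr functions, and giving each thread its own instance is an assumption about what is safest rather than a documented requirement. Whether the whitelist is honored may also depend on the Tesseract version and engine in use.

```python
import threading

_local = threading.local()  # holds one API object per thread


def make_digits_api():
    """Build a PyTessBaseAPI tuned for page numbers (needs tesserocr installed)."""
    from tesserocr import PyTessBaseAPI, PSM  # imported lazily, on first use
    api = PyTessBaseAPI(psm=PSM.SINGLE_LINE)  # the constructor already calls Init
    api.SetVariable("tessedit_char_whitelist", "0123456789")
    return api


def get_thread_api(factory=make_digits_api):
    """Return the calling thread's API object, creating it on first use."""
    api = getattr(_local, "api", None)
    if api is None:
        api = factory()
        _local.api = api
    return api


def ocr_digits(image, factory=make_digits_api):
    """OCR a PIL image with the calling thread's configured API object."""
    api = get_thread_api(factory)
    api.SetImage(image)
    return api.GetUTF8Text().strip()
```

In a worker's run method this would replace the tesserocr.image_to_text(cornerzoom_img) call with ocr_digits(cornerzoom_img).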

Since these issues are no longer related to "Building on Windows", please open a new issue for future questions or if you wish to further discuss your updated question.

Jim-Salmons commented 7 years ago

Hi Fayez!

Thank you for the quick and helpful reply. I'll be using the load config file tip next. At the moment, I am having greatest success with a PSM setting of RAW_LINE on the call to the helper method, and pass that result through a regex to grab just the digits. I then evaluate this OCR page number assertion digit string within the context of its nearest neighbors in the page-image queue.
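Roughly, that post-processing could look like the sketch below. Both function names are hypothetical, taking the first run of digits is an assumption about what "grab just the digits" means, and the neighbor check assumes consecutive leafs carry consecutive page numbers, which will not hold for every scan.

```python
import re


def extract_page_number(ocr_text):
    """Return the first run of digits in the OCR output as an int, or None."""
    match = re.search(r"\d+", ocr_text)
    return int(match.group()) if match else None


def fits_neighbors(candidate, previous_page, next_page):
    """Check a candidate against the page numbers of the adjacent leafs.

    A neighbor given as None is treated as unknown and skipped.
    """
    if candidate is None:
        return False
    if previous_page is not None and candidate != previous_page + 1:
        return False
    if next_page is not None and candidate != next_page - 1:
        return False
    return True


page = extract_page_number("Page 42 |")
print(page, fits_neighbors(page, 41, 43))
```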

In the #DATeCH2017 paper I mentioned above, my wife and I looked at about 1.4M "leaf" images of computer magazines at the Internet Archive and found only 26 instances of print-page numbers to leaf IDs in the issues' associated _scandata.xml metadata files. So getting FactMiners' ppg2leaf_ferret as "bot-able" as possible is a goal along with the user interface being a "smart assistant" tool for doing ground-truth metadata discovery and curation.

I cannot tell you how much of a "power boost" your framework and DavidG's wheel/dll gave me for our current project.

Thank you for doing Tesserocr, and a special shout-out to the elusive DavidG (@dgvogol) who did the heavy lifting of creating the Python wheel of this super framework.

Happy-Healthy Vibes, Jim

dgvogol commented 7 years ago

Finally got some time to work on the tesserocr Windows build. I have a patch here: https://github.com/dgvogol/tesserocr/tree/windows_build

The solution is not ideal: The tesseract executable and library will be built before the tesserocr build, using the "cppan" method suggested by the tesseract-ocr team. The problem is that I cannot know the tesseract-ocr DLL details before it is actually built, which I need to set up the "package_data" to include these DLL files in the binary distribution. The current workaround is to build "tesseract" before the distutils setup(...) is called. The side effect is that "tesseract" will always be built, even with the "python setup.py --help" command. Need a better solution.

With that being said, the windows build should work for python version 3.5+. You will need the cppan and cmake executable ready.

Update: Implemented a better way to include windows DLLs into the binary distribution.

sirfz commented 7 years ago

Hey @dgvogol, that's great work! Could you please integrate your changes into the tesseract4-branch's setup.py script? I believe you'll need to just add the windows-specific code inside the make_extension function, let me know if you need any help with that.

Jim-Salmons commented 7 years ago

Hello again @sirfz and @dgvogol ! :-)

I am in the process of trying to upgrade my Anaconda environment for my FactMiners/SoftalkApple projects as the leading edge of Python moves to the 3.6+ level. Of course this bites me on the Tesserocr Windows work-around of using David's binary wheel and DLL from above.

It appears that David is making interesting progress on the issues with the Windows build. Unfortunately, I am not knowledgeable enough nor do I have a Windows build pipeline to try this updated approach. Also, it appears that the supported Python versions do not include 3.6 yet.

ITMT, it looks like I will roll back my Anaconda environments to 3.5.3 to keep using David's wheel.

I truly believe that Tesserocr will be widely useful to folks with Windows machines who are not able to build it as hoped. The binary wheel with pip installation for Anaconda data science installs works okay. SO PLEASE, could you Good Folks (most likely David) compile a Windows wheel when Tesserocr is updated to Python 3.6+? I will gladly document my installation and testing process here and wherever it makes sense to help spread the word about what a great feature-set Tesserocr provides.

Thanks again to you both for your past and on-going efforts to evolve Tesserocr, especially for Windows users.

BTW, if there is an idiot-proof cookbook for doing a Windows build, I am happy, too, to take the time and effort to shake down the process as Tesserocr moves into 3.6+ territory.

Happy-Healthy Vibes, Jim

The-Gupta commented 7 years ago

@sirfz @dgvogol @Jim-Salmons Do we have a solution yet?

Also, maybe this is not appropriate here, but I'm asking. I want to extract texts from an image and convert into editable text while retaining text orientation and font attributes on Windows with Python/Anaconda. I'm using pytesseract (imageToString) with a little image preprocessing to get most of the texts correctly. I'm getting bounding box coordinates using hOCR. I googled about font attributes and found tesserOCR but I could not set it up. Could you please guide me? (can email me @ vishal.gupta@nitdelhi.ac.in)

thiagofmam commented 7 years ago

@dgvogol I followed all the steps to build for the Windows platform, but I got this error:

LINK : warning LNK4098: defaultlib 'LIBCMT' conflicts with use of other libs; use /NODEFAULTLIB:library
tesserocr.obj : error LNK2001: unresolved external symbol "public: __thiscall tesseract::TessPDFRenderer::TessPDFRenderer(char const *,char const *)" (??0TessPDFRenderer@tesseract@@QAE@PBD0@Z)
tesserocr.obj : error LNK2001: unresolved external symbol "int __cdecl gettimeofday(struct timeval *,struct mytimezone *)" (?gettimeofday@@YAHPAUtimeval@@PAUmytimezone@@@Z)
build\lib.win32-3.6\tesserocr.cp36-win32.pyd : fatal error LNK1120: 2 unresolved externals

Can you help me?

simonflueckiger commented 6 years ago

Thanks to the groundwork of @dgvogol, I was able to build tesserocr 2.2.2 for windows:

To install them, unzip and run pip install <package_name>.whl or use conda install -c simonflueckiger tesserocr

The timezone naming conflict is a real pain in the a**. In addition to patching the gettimeofday.h as it's done in the setup.py file from @dgvogol I also had to patch the gettimeofday.cpp, otherwise I would get the same linker error as @thiagofmam.

For some reason the Python 2.7 builds make the python.exe crash with a heap access violation in the api call GetAvailableLanguagesAsVector, which renders the wheel files for 2.7 useless... not sure if this is an issue on my side though (I built for 2.7 on 2 different machines both builds result in the same runtime error). All the other builds pass all of the 18 unit tests.

The libraries that are included in the .whl files:

  • libtesseract 3.5.1
  • leptonica 1.74.4
  • jpeg 9.2.0
  • png 1.6.30
  • tiff 4.0.8
  • webp 0.6.0
  • zlib 1.2.11
  • openjp2 2.1.2
  • lzma 5.2.3

NozomiIto commented 6 years ago

@simonflueckiger Thanks for your great work! Has the source code used for the whl already been merged to the tesserocr master? If not merged yet, could you merge it? I strongly want to use tesserocr on Windows with Python2.7 and tesseract4, so I'd like to try Windows build on my PC and try to investigate the Python2.7 whl build failure.

jeweinberg commented 6 years ago

I was able to build the whl on Windows 8 with Python 3.5. The API is working like a charm! Thanks a lot

On Oct 11, 2017 11:55 AM, "simonflueckiger" notifications@github.com wrote:

Thanks to the groundwork of @dgvogol https://github.com/dgvogol, I was able to build tesserocr 2.2.2 for windows:

To install them, unzip and run pip install .whl

The timezone naming conflict is a real pain in the a**. In addition to patching the gettimeofday.h as it's done in the setup.py file from @dgvogol https://github.com/dgvogol I also had to patch the gettimeofday.cpp, otherwise I would get the same linker error as @thiagofmam https://github.com/thiagofmam.

For some reason the Python 2.7 builds make the python.exe crash with a heap access violation in the api call GetAvailableLanguagesAsVector https://github.com/sirfz/tesserocr/blob/c8464d13a0b3f5f33f40b274d4b8a0299b0c2841/tesserocr.pyx#L1407, which renders the wheel files for 2.7 useless... not sure if this is an issue on my side though (I built for 2.7 on 2 different machines both builds result in the same runtime error). All the other builds pass all of the 18 unit tests.

The libraries that are included in the .whl files:

  • libtesseract 3.5.1
  • leptonica 1.74.4
  • jpeg 9.2.0
  • png 1.6.30
  • tiff 4.0.8
  • webp 0.6.0
  • zlib 1.2.11
  • openjp2 2.1.2
  • lzma 5.2.3

rleonard11 commented 6 years ago

@simonflueckiger Awesome. I conda installed the Windows tesserocr. However, while testing I receive the following error: 'Failed to init API, possibly an invalid tessdata path: C:\ProgramData\Anaconda3\'. Is this a path issue or am I missing something?

update: Tried changing path to where tessdata was located and received the same error.

simonflueckiger commented 6 years ago

@NozomiIto have a look at https://github.com/simonflueckiger/tesserocr-windows_build and https://ci.appveyor.com/project/simonflueckiger/tesserocr-windows-build. I'm really glad that you want to help me out with this issue! Would you still be interested to look into this together in a couple of weeks, when I have some spare time again?

@jeweinberg glad to hear it's working 😀

@rleonard11 can you try to point it to the parent directory of where your tessdata folder is located? E.g. with PyTessBaseAPI(path='C:/Program Files (x86)/Tesseract-OCR/') as api: ... when your tessdata is located in 'C:/Program Files (x86)/Tesseract-OCR/tessdata'

If this doesn't work, could you please list the contents of your tessdata folder?

rleonard11 commented 6 years ago

@simonflueckiger Tried pointing it to the folder with the following code:

with PyTessBaseAPI(psm = PSM.AUTO_OSD, path = 'C:\Program Files (x86)\Tesseract-OCR\tessdata') as api:

Here is the contents of my tessdata folder:

configs (folder) tessconfigs (folder) eng.cub.bigrams eng.cube.fold eng.cube.lm eng.cube.nn eng.cube.params eng.cube.size eng.cube.word-freq eng.tesseract_cube.nn eng.traineddata eng.user-patterns eng.user-words osd.traineddata pdf.ttf

simonflueckiger commented 6 years ago

@rleonard11 can you give it a try without the tessdata? 'C:\Program Files (x86)\Tesseract-OCR\tessdata' -> 'C:\Program Files (x86)\Tesseract-OCR'

If this still doesn't work, let's talk on [tlk.io url expired]
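A side note on the path strings in the last few comments: in an ordinary Python string literal, the "\t" in "...\tessdata" is read as a tab character, so the path that reaches tesseract may not be the one that was typed. This may or may not be what went wrong here, but a raw string (or doubled backslashes, or forward slashes) rules it out. The sketch below also shows one way to get the parent folder of tessdata; parent_of_tessdata is an illustrative name only.

```python
from pathlib import PureWindowsPath

broken = "Tesseract-OCR\tessdata"  # "\t" is read as a TAB character
raw = r"Tesseract-OCR\tessdata"    # raw string keeps the backslash


def parent_of_tessdata(tessdata_dir):
    """Return the folder that contains the given tessdata directory."""
    return str(PureWindowsPath(tessdata_dir).parent)


print(repr(broken))
print(repr(raw))
print(parent_of_tessdata(r"C:\Program Files (x86)\Tesseract-OCR\tessdata"))
```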

NozomiIto commented 6 years ago

@simonflueckiger Thanks. I've already found your source code by myself and compiled it. I could compile with Python3.5 and tesseract4 by using pvt.cppan.demo.google.tesseract.tesseract-master and -DUSE_STD_NAMESPACE by tweaking your code a little. I also tried a Python2 compile, but I couldn't do it due to a compilation error. I think Python2 cython compilation requires the Visual C++ compiler for Python (virtually VC++ 2008), but tesseract4 requires a C++11 library. So I gave up on the Python2-tesseract4 combination. I'm still not sure if the Python2-tesseract3 combination works or not.

yissachar commented 6 years ago

@simonflueckiger I am having the same problem as above:

Failed to init API, possibly an invalid tessdata path: C:\mypath\tessdata

Tried pointing to the directory above the tessdata path, but that fails as well.

mcs07 commented 6 years ago

I have found that using the path parameter to PyTessBaseAPI never works for me on Windows. I either have to use the TESSDATA_PREFIX environment variable or make sure the tessdata directory is in the default 'compiled in' location expected by tesseract.

I managed to put together conda recipes for leptonica, tesseract, and tesserocr that support windows/mac/linux. You can install using:

conda install -c mcs07 tesserocr

This works well for me on Windows, even if you don't set TESSDATA_PREFIX, but with one minor issue: The API (i.e. tesserocr) expects to find tessdata at the root of the conda environment, i.e. C:\Anaconda\envs\myenv\tessdata - so the recipe installs it there. However, when running tesseract on the command line, it expects to find tessdata alongside the executable in C:\Anaconda\envs\myenv\Library\bin - so (for now) the recipe duplicates tessdata to this location also.

yissachar commented 6 years ago

Thanks, setting TESSDATA_PREFIX worked for me.

sirfz commented 6 years ago

The path parameter isn't working with tesseract v4 on Linux either (also need to set TESSDATA_PREFIX environment variable).
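For reference, a minimal sketch of setting the variable from Python before any API object is created. set_tessdata_prefix is a hypothetical helper, not part of tesserocr. Which directory TESSDATA_PREFIX has to name appears to differ between tesseract versions (the parent of tessdata for 3.x, the tessdata folder itself for 4.x, as far as I know), so that choice is left as a parameter, and setting the variable in the shell before launching Python is probably the safer route.

```python
import os
from pathlib import Path


def set_tessdata_prefix(tessdata_dir, point_to_parent=True):
    """Set TESSDATA_PREFIX after checking that tessdata_dir holds language data.

    point_to_parent=True uses the folder containing tessdata; False uses
    tessdata itself. Which one is needed is an assumption to verify locally.
    """
    tessdata = Path(tessdata_dir)
    if not any(tessdata.glob("*.traineddata")):
        raise FileNotFoundError("no .traineddata files in %s" % tessdata)
    prefix = tessdata.parent if point_to_parent else tessdata
    os.environ["TESSDATA_PREFIX"] = str(prefix)
    return os.environ["TESSDATA_PREFIX"]
```

Call it once at program start, before the first PyTessBaseAPI is constructed, for example set_tessdata_prefix(r"C:\mypath\tessdata").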

Jim-Salmons commented 6 years ago

With too many great folks contributing to this important issue, I will just chime in that I am back from a post #DATeCH2017 hiatus and busy Summer. Now that I am getting back into the groove, I am rebuilding my Windows 10 dev box and will report my experience with the latest developments detailed above. Once I have a clear sense of what/how to recommend, I look forward to promoting this progress to my friends in the #digitalhumanities, especially folks in the #TDM (text- and data-mining) community.

BTW, as an unaffiliated #CitizenScientist, Tesserocr was instrumental to the metadata discovery and curation "ferret" scripts which were featured in my two papers accepted for a combined poster at #DATeCH2017. I have attached an image of the poster.

Thanks, again, to all for the attention and effort that has gone into this issue.

hsmallbone commented 6 years ago

Have been struggling to get this to compile for tesseract 4, even with @NozomiIto and @simonflueckiger's code/hints. Is it possible to get a wheel for tesseract 4?

NozomiIto commented 6 years ago

@hsmallbone It is very complicated work to compile with tesseract4, but it is possible. I put the compiled whl for Python3.5 and Tesseract4 at https://github.com/NozomiIto/tesserocr_whl. You can try it if you wish.

hsmallbone commented 6 years ago

@NozomiIto thanks for your wheel, but it's not for my platform unfortunately. I guess I will wait for a more reliable setup.py.

dgvogol commented 6 years ago

Finally have some time to work on the Windows build. Thanks to @simonflueckiger for the gettimeofday.cpp patch. Details about the build:

  1. Create a dummy cppan project, and generate the build files using command "cppan --generate ."
  2. Search in the dummy project generated build files for libtesseract source code location
  3. patch the gettimeofday.h and gettimeofday.cpp to work around the "timezone" name conflict
  4. Build the tesseract.exe using cppan command "cppan --build-packages pvt.cppan.demo.google.tesseract.tesseract-"
  5. Build the dummy project so I can find required DLLs at known location
  6. Collect header files from cppan source storage
  7. Build the tesserocr cython module, then put everything together

There can still be many improvements, but I'm happy with the overall build procedure. The build works at least for Python 3.6.

A pull request is sent.

ricardomga commented 6 years ago

@simonflueckiger can you please update the wheels to tesseract 4.0. It would be very helpful. Thank you a lot for the contribution.

simonflueckiger commented 6 years ago

@ricardomga I will look into it and keep you posted.

ricardomga commented 6 years ago

@simonflueckiger thank you a lot!

simonflueckiger commented 6 years ago

@hsmallbone @ricardomga et voilà, freshly baked tesseract 4 wheels are here! Download them while they're still hot :cookie: And please let me know if they are working as intended!

from simonflueckiger/tesserocr-windows_build/releases:

To install them with pip

pip install <package_name>.whl

or install them directly from my Anaconda repository

conda install -c simonflueckiger/label/tesseract-4.0.0-master tesserocr

Contains the following binaries:

ricardomga commented 6 years ago

@simonflueckiger @sirfz Please add the windows .whl to Pypi. It would be really helpful, and I think it is simple to do it.

dickreuter commented 1 year ago

any solution for macos metal? Facing the same problem