Closed ghost closed 1 year ago
Windows support is something I've wanted for tesserocr but since I haven't used Windows for a few years now I didn't get to work on it.
I would've assumed that if you successfully compiled tesseract and leptonica then tesserocr should be able to compile and install normally. What failed exactly in tesserocr's setup?
I compiled tesseract and leptonica easily just because there was a visual studio solution for it. I am not sure how to create something similar for this project. In a very messy way I was able to run the visual studio command line compiler and include the headers but I am having trouble linking.
I try:
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -I. -IC:\Users\Sean\Desktop\tesseract\VS2015_Tesseract -IC:\Users\Sean\Desktop\tesseract\VS2015_Tesseract\leptonica\src -IC:\Users\Sean\Desktop\tesseract\VS2015_Tesseract\tesseract_3.04\vs2010\port -IC:\Users\Sean\Desktop\tesseract\VS2015_Tesseract\tesseract_3.04\ccutil -IC:\Users\Sean\Desktop\tesseract\VS2015_Tesseract\tesseract_3.04\ccstruct -IC:\Users\Sean\Desktop\tesseract\VS2015_Tesseract\tesseract_3.04\ccmain -IC:\Anaconda2\include -IC:\Anaconda2\PC /Tptesserocr.cpp /Fobuild\temp.win-amd64-2.7\Release\tesserocr.obj
And that step fails
The compile arguments on Linux look like this:
gcc parameters (outputs object file):
-I/usr/local/include -I/usr/include/python2.7 -c tesserocr.cpp -o build/temp.linux-x86_64-2.7/tesserocr.o
/usr/local/include contains the tesseract and leptonica folders which contain their respective header files, and /usr/include/python2.7 contains the Python header files (either Python 2 or 3).
c++ parameters (outputs binary file):
-L/usr/local/lib -llept -ltesseract -o tesserocr.so
/usr/local/lib contains liblept.so and libtesseract.so (DLLs in your case) while lept and tesseract are the names of the libraries.
Perhaps you can try translating these parameters into your command (or the VS solution) to achieve a successful build. Hope this helps.
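As a rough illustration of that translation, the sketch below splits the gcc-style flags into the three lists that a setuptools/distutils extension takes as include_dirs, library_dirs and libraries. The function name flags_to_config is made up for this example. With MSVC the same information ends up as -I (or /I) options for cl.exe, /LIBPATH: options for the linker, and lept.lib / tesseract.lib style library names, which the Python build machinery derives from these lists.

```python
import shlex


def flags_to_config(flags):
    """Split gcc-style -I/-L/-l flags into setuptools-style build options."""
    config = {"include_dirs": [], "library_dirs": [], "libraries": []}
    prefixes = {"-I": "include_dirs", "-L": "library_dirs", "-l": "libraries"}
    for token in shlex.split(flags):
        key = prefixes.get(token[:2])
        # Ignore flags we do not care about (-c, -o, ...) and bare prefixes.
        if key and len(token) > 2:
            config[key].append(token[2:])
    return config


# The Linux flags quoted above, combined into one string.
linux_flags = ("-I/usr/local/include -I/usr/include/python2.7 "
               "-L/usr/local/lib -llept -ltesseract")
print(flags_to_config(linux_flags))
```

On Windows the directory entries would be replaced by wherever the tesseract and leptonica headers and .lib files were built.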
I am very interested in using this package for Windows as well. Do you have plans in the future for fixing the problem with the cpp compiling with VS?
(C:\Users\jeweinberg\AppData\Local\Continuum\Anaconda3) C:\Users\jeweinberg>pip
install tesserocr
Collecting tesserocr
Using cached tesserocr-2.1.3.tar.gz
Building wheels for collected packages: tesserocr
Running setup.py bdist_wheel for tesserocr ... error
Complete output from command C:\Users\jeweinberg\AppData\Local\Continuum\Anaco
nda3\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\JEWEIN~1
\\AppData\\Local\\Temp\\pip-build-0rwemyrs\\tesserocr\\setup.py';f=getattr(token
ize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(
compile(code, __file__, 'exec'))" bdist_wheel -d C:\Users\JEWEIN~1\AppData\Local
\Temp\tmpraznvglgpip-wheel- --python-tag cp35:
running bdist_wheel
running build
running build_ext
Supporting tesseract v3.02
Building with configs: {'libraries': ['tesseract', 'lept'], 'cython_compile_ti
me_env': {'TESSERACT_VERSION': 770}}
cythoning tesserocr.pyx to tesserocr.cpp
building 'tesserocr' extension
creating build
creating build\temp.win-amd64-3.5
creating build\temp.win-amd64-3.5\Release
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\cl.exe /c /nologo /
Ox /W3 /GL /DNDEBUG /MD -IC:\Users\jeweinberg\AppData\Local\Continuum\Anaconda3\
include -IC:\Users\jeweinberg\AppData\Local\Continuum\Anaconda3\include /EHsc /T
ptesserocr.cpp /Fobuild\temp.win-amd64-3.5\Release\tesserocr.obj
tesserocr.cpp
c:\users\jeweinberg\appdata\local\continuum\anaconda3\include\pyconfig.h(68):
fatal error C1083: Cannot open include file: 'io.h': No such file or directory
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\bin
\\cl.exe' failed with exit status 2
----------------------------------------
Failed building wheel for tesserocr
Running setup.py clean for tesserocr
Failed to build tesserocr
Installing collected packages: tesserocr
Running setup.py install for tesserocr ... error
Complete output from command C:\Users\jeweinberg\AppData\Local\Continuum\Ana
conda3\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\JEWEIN
~1\\AppData\\Local\\Temp\\pip-build-0rwemyrs\\tesserocr\\setup.py';f=getattr(tok
enize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exe
c(compile(code, __file__, 'exec'))" install --record C:\Users\JEWEIN~1\AppData\L
ocal\Temp\pip-zia9yqa6-record\install-record.txt --single-version-externally-man
aged --compile:
running install
running build
running build_ext
Supporting tesseract v3.02
Building with configs: {'cython_compile_time_env': {'TESSERACT_VERSION': 770
}, 'libraries': ['tesseract', 'lept']}
skipping 'tesserocr.cpp' Cython extension (up-to-date)
building 'tesserocr' extension
creating build
creating build\temp.win-amd64-3.5
creating build\temp.win-amd64-3.5\Release
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\cl.exe /c /nologo
/Ox /W3 /GL /DNDEBUG /MD -IC:\Users\jeweinberg\AppData\Local\Continuum\Anaconda
3\include -IC:\Users\jeweinberg\AppData\Local\Continuum\Anaconda3\include /EHsc
/Tptesserocr.cpp /Fobuild\temp.win-amd64-3.5\Release\tesserocr.obj
tesserocr.cpp
c:\users\jeweinberg\appdata\local\continuum\anaconda3\include\pyconfig.h(68)
: fatal error C1083: Cannot open include file: 'io.h': No such file or directory
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\b
in\\cl.exe' failed with exit status 2
----------------------------------------
Command "C:\Users\jeweinberg\AppData\Local\Continuum\Anaconda3\python.exe -u -c
"import setuptools, tokenize;__file__='C:\\Users\\JEWEIN~1\\AppData\\Local\\Temp
\\pip-build-0rwemyrs\\tesserocr\\setup.py';f=getattr(tokenize, 'open', open)(__f
ile__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__
, 'exec'))" install --record C:\Users\JEWEIN~1\AppData\Local\Temp\pip-zia9yqa6-r
ecord\install-record.txt --single-version-externally-managed --compile" failed w
ith error code 1 in C:\Users\JEWEIN~1\AppData\Local\Temp\pip-build-0rwemyrs\tess
erocr\
(C:\Users\jeweinberg\AppData\Local\Continuum\Anaconda3) C:\Users\jeweinberg>
@jeweinberg I personally don't use Windows and am not planning to any time soon so, as much as I'd like to, I don't have the time to work on it.
I'm hoping someone who really needs it can do it and share their knowledge here (and of course merge whatever necessary changes, if any). I've shared what I know in this post based on my Linux experience with the setup.
Looking at the error you posted, it seems you're missing the Python header files (fatal error C1083: Cannot open include file: 'io.h': No such file or directory); you need to include those as well as the headers of leptonica and tesseract.
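One detail that may help when reading that log: the error is raised from inside pyconfig.h, so the Python headers themselves were found, and io.h is the file that is missing. With Visual Studio 2015 that header normally comes from the Universal CRT part of the Windows SDK, not from Python. cl.exe looks in the -I directories and in the directories listed in the INCLUDE environment variable, which the Visual Studio developer command prompt sets up. A minimal sketch to check where (or whether) a header is visible; find_header is a name invented for this example:

```python
import os


def find_header(name, search_path):
    """Return the directories in an os.pathsep-separated list that contain `name`."""
    hits = []
    for directory in search_path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, name)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    # INCLUDE is empty unless the compiler environment has been initialised.
    found = find_header("io.h", os.environ.get("INCLUDE", ""))
    print(found if found else "io.h not found on INCLUDE")
```

If nothing is found, running the build from a Visual Studio developer prompt or installing the Windows SDK component is the usual thing to try, though I have not verified it for this exact setup.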
I managed to have a windows build in a very messy way, here is what I did:
Follow the instructions here to build the tesseract library, using cmake and Visual Studio 2015. Note that cppan or cmake sometimes hangs for some reason, so you will probably need to try it several times, see here
Manually copy out the header files from tesseract and leptonica, and group them into the following structure:
include/tesseract/
The tesseract305.exp and tesseract305.lib files are in tesseract/build/Release; you can copy them out or point the build to this location
Patch the setup.py package_config() method to manually return the right compiling/linking options, like below:
if sys.platform == "win32":
    def package_config():
        config = {}
        config['library_dirs'] = ['C:\\Sandbox\\tesseract\\build\\Release',
                                  'C:\\Users\\david\\.cppan\\storage\\lib\\04a83184\\Release']
        config['include_dirs'] = ['C:\\Sandbox\\tesserocr\\include']
        config['libraries'] = ['tesseract305',
                               'pvt.cppan.demo.leptonica-master',
                               "pvt.cppan.demo.gif-5.1.4",
                               "pvt.cppan.demo.jpeg-9.2.0",
                               "pvt.cppan.demo.openjpeg.openjp2-2.1.2",
                               "pvt.cppan.demo.png-1.6.26",
                               "pvt.cppan.demo.tiff-4.0.6",
                               "pvt.cppan.demo.zlib-1.2.8",
                               "pvt.cppan.demo.webp-0.5.1"]
        config['cython_compile_time_env'] = {'TESSERACT_VERSION': version_to_int("3.05.00")}
        return config
else:
    def package_config():
        ......
The tesseract/vs2010/port/gettimeofday.h header causes a compile failure due to some kind of name conflict, and I don't know how to fix it properly. I did an ugly patch that renames "struct timezone" to "struct mytimezone". A cleaner fix is still needed, but the ugly patch works for now.
I'm also failing to include "tesseract305.dll" in the final wheel. I'm not an expert in Python distutils, so I still need to find a solution.
Attached my wheel file and the associated DLL tesserocr-2.1.3-cp35-cp35m-win_amd64.zip tesseract305.dll.zip
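A note on the version_to_int("3.05.00") call in the patched package_config(): the pip log earlier in this thread shows TESSERACT_VERSION 770 for tesseract v3.02, and 770 is 0x302, so the value looks like the version digits read as one hexadecimal number. The sketch below is a reconstruction under that assumption only; the real helper in setup.py may handle more cases.

```python
import re


def version_to_int(version):
    """Encode a dotted version string by reading its digits as one hex number."""
    match = re.search(r"(?:\d+\.)+\d+", version)
    if match is None:
        raise ValueError("no dotted version found in %r" % version)
    # "3.02" -> "302" -> 0x302
    return int(match.group().replace(".", ""), 16)


print(version_to_int("3.02"))     # 770, the value shown in the pip log
print(version_to_int("3.05.00"))  # 197888
```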
Great effort @dgvogol thanks for sharing! Regarding the inclusion of tesseract305.dll, have you added the file in the MANIFEST.in file? It should automatically be included in the wheel if you add it to the manifest afaik. Would be cool if we can release Windows wheels for Python 2 and 3.
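Whichever packaging route is tried, a wheel is an ordinary zip archive, so it is easy to check whether the DLL actually ended up inside it. A small sketch; the function name bundled_dlls is invented for this example:

```python
import zipfile


def bundled_dlls(wheel_path):
    """Return the names of .dll files stored inside a wheel (a zip archive)."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return sorted(name for name in wheel.namelist()
                      if name.lower().endswith(".dll"))
```

Calling bundled_dlls() on the freshly built .whl file and getting an empty list back would confirm that tesseract305.dll was not packaged.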
UPDATE: SUCCESS!!!! In frustration I went looking for "Plan B" and thought to start at the Tesseract-ocr GitHub (https://github.com/tesseract-ocr). Looking at the repositories there, I had an "A-ha!" moment when I saw the Tessdata repo and remembered that my initial wheel/dll installation did not report any available languages. So I downloaded that repo and kept poking around. Assuming that Tesserocr still might not work, I decided to "start from the ground up" and just get the Tesseract-ocr engine installed. So I found the reference to the Windows binary, Tesseract 3.05-dev available at UB Mannheim (https://github.com/UB-Mannheim/tesseract/wiki). All I did next was run the binary installer and place the 'eng' Tessdata in the directory where Tesserocr was looking. I then ran my "ppg2leaf_ferret" app and the threads started spewing (some) print page numbers for the leafs it was loading! I obviously have some tweaking to do, but AWESOME progress nonetheless!
Sorry that this note is so TL;DR, but I am hoping that Fayez or David G can help me...
I am a 66-year old, post-cancer Bonus Round independent Citizen Scientist doing applied research at the intersection of #DigitalHumanities and #CognitiveComputing. I don't know who David G is (@dgvogol - awesome work, thank you) but his contribution of a Windows-based wheel for Tesserocr is exactly what I need for a project I am working on to support eResearch and machine-learning at the Internet Archive. Unfortunately, the wheel and dll supplied here come "this close" to working for me. Perhaps Fayez (@sirfz) or David can provide some insights...
The wxPython/PIL app I have developed is to support the discovery and curation of "ppg2leaf" metadata -- that is, print-page numbers to "leafs" being the ItemID for image files of pages of documents (computer magazines are my primary focus) at the Archive. This physical page number to ItemID is the most foundational metadata for traceability between physical documents and their digital manifestations. I currently have two papers submitted to #DATeCH2017 related to this work. (http://ddays.digitisation.eu/datech-2017/)
I have a nice multi-threaded app going that queues the high-resolution page images and, based on LEFT/RIGHT "handside", also creates a zoomed image of the corner of the page where the page number is expected to be found. A few simple keystrokes and this critical metadata is created when missing (the all-too-frequent standard) or validated (when the Archive's regional scanning centers do the digitization and human scanners "spot"/assert page numbers as they work).
Of course it would be totally awesome -- and somehow I have to figure out how to get this working -- if I could OCR these zoomed page corner views to see if we can identify and pre-populate the printed-page number into its respective text-widget, if found.
I am on Windows 10 x64, with a nice Anaconda-based environment that I use for my various FactMiners research projects. I was able to install David's wheel and dll and was hoping for the best but have hit a no-go wall.
Trying to "go for the gold" as soon as I had the wheel installed, I added the import and these two "see if it is there" lines at the top of my file:
import tesserocr
print(tesserocr.tesseract_version()) # print tesseract-ocr version
print(tesserocr.get_languages()) # prints tessdata path and list of available languages
which produces these message lines as my app cranks up:
tesseract 3.05.00dev
leptonica-1.74 (Dec 15 2016, 08:36:24) [MSC v.1900 LIB Release x64]
libgif 5.1.4 : libjpeg 9b : libpng 1.6.26 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.5.1 : libopenjp2 2.1.2
('C:\\Dev\\Anaconda\\envs\\fmtk\\tessdata/', [])
One thing that looks a bit hinky is that the available languages shows an empty list...
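An empty list there most likely means that no language files were found in the reported tessdata directory. Tesseract language data ships as files named like eng.traineddata, so a quick check that does not involve tesserocr at all is to list those files. A minimal sketch; installed_languages is a name made up for this example:

```python
import os


def installed_languages(tessdata_dir):
    """List language codes that have a <lang>.traineddata file in tessdata_dir."""
    if not os.path.isdir(tessdata_dir):
        return []
    suffix = ".traineddata"
    return sorted(name[:-len(suffix)] for name in os.listdir(tessdata_dir)
                  if name.endswith(suffix))
```

For the output above one would call it with the directory printed by get_languages(), i.e. the tessdata folder inside the fmtk environment.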
At any rate, in the worker thread where I download into memory the jpeg images via PIL from the Archive and create the scaled down full-page image for the main panel and crop the full-resolution image down to the appropriate corner where the page number may be found, I added this line to see if I could OCR the zoomed corner image:
print("Zoom image OCR: " + tesserocr.image_to_text(cornerzoom_img))
Unfortunately, when I run this I get the following error for each of the threads:
File "[snip path]/ia_ppg2leaf_ferret_app.py", line 87, in run
print("Zoom image OCR: " + tesserocr.image_to_text(cornerzoom_img))
File "tesserocr.pyx", line 2288, in tesserocr.image_to_text (tesserocr.cpp:20796)
RuntimeError: Failed recognize picture
Also unfortunately, since this is a compiled wheel/dll, (using PyCharm) I cannot interactively debug what is going on other than to go look at the referenced line in the source.
Also note, that when I try to run the unit test, tests fail on the api.Init calls to a PyTessBaseAPI instance.
Like David (but surely moreso -- I spent the bulk of my career in Programming Nirvana as a Smalltalk developer), I know enough to be dangerous to myself in terms of Python and compiling C++ for Windows, etc. But I have great patience and persistence, so if there is anything folks can do to help me figure out how to get this working, that would be INCREDIBLE.
Finally, I believe that Tesserocr is a brilliant design (being threading/PIL friendly) and that its value and use will only increase as the domains of Digital Humanities and Cognitive Computing converge. And I believe David G's contribution of a "ready-to-rumble" wheel for Windows is a fantastic and much-needed contribution to this project.
Thank you both and any others for doing this valuable work.
Happy-Healthy Vibes, Jim
A quick short follow-up... Tesserocr seems to be working pretty well to identify print page numbers in my zoomed corner images. The accuracy seems to go up considerably once we get to double-digit page numbers. Single digit page numbers seem to be almost invisible.
I saw a mention of "digits only" mode for Tesseract, maybe that will help if anyone has thoughts or pointers on this, including how to take advantage of this via Tesserocr. Also, I have not done any binarization or other filtering/tweaking to the retrieved leaf images from the Archive. Perhaps that would help, too.
Any suggestions about best practices for PIL Image prep would be greatly appreciated. Getting better recognition for single digits would be a big help, too.
ITMT, I am digging into various Tesseract docs and cross-referencing the Tesserocr source to do some configuration setting via SetPageSegMode and SetVariable. Will update here when I determine the most effective setup.
UPDATE QUESTION: Where/how to set the config for Tesserocr instances in a worker thread?
I know my questions/insights are naive based on my jumping into "doing" without adequate prep, but here goes... It looks like something close to the following would configure Tesserocr to do its best job on print page numbers (some of this seemingly redundant, but WTH if it works):
baseApi = PyTessBaseAPI()
baseApi.Init()
baseApi.SetPageSegMode(PSM.SINGLE_LINE)
baseApi.SetVariable("tessedit_char_blacklist", ".,!?@#$%&*()<>_-+=/:;'\"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")
baseApi.SetVariable("tessedit_char_whitelist", "0123456789")
baseApi.SetVariable("classify_bln_numeric_mode", "1")
Since my initial code simply calls the image_to_text helper method in the worker threads, I am not sure when/how to config the API for each thread.
Hello Jim, glad you're using tesserocr and finding it useful :)
Regarding your initial issue, you can always specify the tessdata directory of your choice by passing it to the get_languages function or PyTessBaseAPI init (and pretty much every other helper function). Example:
tesserocr.get_languages('C:\\Dev\\Anaconda\\envs\\fmtk\\tessdata/')
api = tesserocr.PyTessBaseAPI(path='C:\\Dev\\Anaconda\\envs\\fmtk\\tessdata/')
As for your updated question, you can create a config file and load it using the ReadConfigFile method:
with tesserocr.PyTessBaseAPI(psm=tesserocr.PSM.SINGLE_LINE) as api:
    api.ReadConfigFile('C:\\path\\to\\config')
    # use api as usual...
You can check examples of config files here. Hope this helps.
update: you can use api.ReadConfigFile('digits') which reads the digits file from tessdata-path/configs/digits.
Extra tip: No need to call api.Init after initializing the api object since it calls it automatically with the given parameters passed while initializing the api instance. Only call Init if you wish to re-initialize with different parameters.
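For the worker-thread part of the question, one possible pattern (a sketch, not something the library prescribes) is to keep one configured API object per thread in a threading.local and build it on first use. The helper names get_thread_api, make_digits_api and ocr_corner are invented here; the whitelist setting is taken from the question above.

```python
import threading

_thread_state = threading.local()


def get_thread_api(make_api):
    """Return the calling thread's API object, building it on first use."""
    api = getattr(_thread_state, "api", None)
    if api is None:
        api = make_api()
        _thread_state.api = api
    return api


def make_digits_api():
    """Build a PyTessBaseAPI set up for page numbers (needs tesserocr installed)."""
    import tesserocr  # imported lazily so get_thread_api works without it
    api = tesserocr.PyTessBaseAPI(psm=tesserocr.PSM.SINGLE_LINE)
    api.SetVariable("tessedit_char_whitelist", "0123456789")
    return api


def ocr_corner(image):
    """OCR one PIL image with this thread's configured API."""
    api = get_thread_api(make_digits_api)
    api.SetImage(image)
    return api.GetUTF8Text()
```

Each worker would then call ocr_corner(cornerzoom_img) where it used to call the image_to_text helper, so the configuration happens once per thread instead of once per image.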
Since these issues are no longer related to "Building on Windows", please open a new issue for future questions or if you wish to further discuss your updated question.
Hi Fayez!
Thank you for the quick and helpful reply. I'll be using the load config file tip next. At the moment, I am having greatest success with a PSM setting of RAW_LINE on the call to the helper method, and pass that result through a regex to grab just the digits. I then evaluate this OCR page number assertion digit string within the context of its nearest neighbors in the page-image queue.
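The digit-grabbing and neighbor check described above could look roughly like this. It is only a sketch: the function names are invented, and the neighbor test assumes the simplest case where the printed page number goes up by one per leaf, which will not hold for covers, inserts or unnumbered pages.

```python
import re


def extract_page_number(ocr_text):
    """Return the first run of digits in the OCR output as an int, or None."""
    match = re.search(r"\d+", ocr_text)
    return int(match.group()) if match else None


def agrees_with_neighbor(candidate, neighbor_page, leaf_distance):
    """True if candidate is the page expected leaf_distance leafs after neighbor_page."""
    return candidate is not None and candidate == neighbor_page + leaf_distance


page = extract_page_number(" p. 42\n")
print(page, agrees_with_neighbor(page, 41, 1))
```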
In the #DATeCH2017 paper I mentioned above, my wife and I looked at about 1.4M "leaf" images of computer magazines at the Internet Archive and found only 26 instances of print-page numbers to leaf IDs in the issues' associated _scandata.xml metadata files. So getting FactMiners' ppg2leaf_ferret as "bot-able" as possible is a goal along with the user interface being a "smart assistant" tool for doing ground-truth metadata discovery and curation.
I cannot tell you how much of a "power boost" your framework and DavidG's wheel/dll gave me for our current project.
Thank you for doing Tesserocr, and a special shout-out to the elusive DavidG (@dgvogol) who did the heavy lifting of creating the Python wheel of this super framework.
Happy-Healthy Vibes, Jim
I finally got some time to work on the tesserocr Windows build. I have a patch here: https://github.com/dgvogol/tesserocr/tree/windows_build
The solution is not ideal: The tesseract executable and library will be built before the tesserocr build, using the "cppan" method suggested by the tesseract-ocr team. The problem is that I cannot know the tesseract-ocr DLL details before it is actually built, which I need to setup the "package_data" to include these DLL files into the binary distribution. The current work around is to build the "tesseract" before the distutils setup(...) is called. The side effect is that the "tesseract" will always be built even with the "python setup.py --help" command. Need a better solution.
With that being said, the windows build should work for python version 3.5+. You will need the cppan and cmake executable ready.
Update: Implemented a better way to include windows DLLs into the binary distribution.
Hey @dgvogol, that's great work! Could you please integrate your changes into the tesseract4-branch's setup.py script? I believe you'll need to just add the windows-specific code inside the make_extension function, let me know if you need any help with that.
Hello again @sirfz and @dgvogol ! :-)
I am in the process of trying to upgrade my Anaconda environment for my FactMiners/SoftalkApple projects as the leading edge of Python moves to the 3.6+ level. Of course this bites me on the Tesserocr Windows work-around of using David's binary wheel and DLL from above.
It appears that David is making interesting progress on the issues with the Windows build. Unfortunately, I am not knowledgeable enough nor do I have a Windows build pipeline to try this updated approach. Also, it appears that the supported Python versions do not include 3.6 yet.
ITMT, it looks like I will roll back my Anaconda environments to 3.5.3 to keep using David's wheel.
I truly believe that Tesserocr will be widely useful to folks with Windows machines who are not able to build it as hoped. The binary wheel with pip installation for Anaconda data science installs works okay. SO PLEASE, could you Good Folks (most likely David) compile a Windows wheel when Tesserocr is updated to Python 3.6+? I will gladly document my installation and testing process here and wherever it makes sense to help spread the word about what a great feature-set Tesserocr provides.
Thanks again to you both for your past and on-going efforts to evolve Tesserocr, especially for Windows users.
BTW, if there is an idiot-proof cookbook for doing a Windows build, I am happy, too, to take the time and effort to shake down the process as Tesserocr moves into 3.6+ territory.
Happy-Healthy Vibes, Jim
@sirfz @dgvogol @Jim-Salmons Do we have a solution yet?
Also, maybe this is not appropriate here, but I'm asking. I want to extract texts from an image and convert into editable text while retaining text orientation and font attributes on Windows with Python/Anaconda. I'm using pytesseract (imageToString) with a little image preprocessing to get most of the texts correctly. I'm getting bounding box coordinates using hOCR. I googled about font attributes and found tesserOCR but I could not set it up. Could you please guide me? (can email me @ vishal.gupta@nitdelhi.ac.in)
@dgvogol I followed all the steps to build for the Windows platform but I got this error:
LINK : warning LNK4098: defaultlib 'LIBCMT' conflicts with use of other libs; use /NODEFAULTLIB:library
tesserocr.obj : error LNK2001: unresolved external symbol "public: __thiscall tesseract::TessPDFRenderer::TessPDFRenderer(char const *,char const *)" (??0TessPDFRenderer@tesseract@@QAE@PBD0@Z)
tesserocr.obj : error LNK2001: unresolved external symbol "int __cdecl gettimeofday(struct timeval *,struct mytimezone *)" (?gettimeofday@@YAHPAUtimeval@@PAUmytimezone@@@Z)
build\lib.win32-3.6\tesserocr.cp36-win32.pyd : fatal error LNK1120: 2 unresolved externals
Can you help me?
Thanks to the groundwork of @dgvogol, I was able to build tesserocr 2.2.2 for windows:
To install them, unzip and run pip install <package_name>.whl
or use conda install -c simonflueckiger tesserocr
The timezone naming conflict is a real pain in the a**. In addition to patching the gettimeofday.h as it's done in the setup.py file from @dgvogol I also had to patch the gettimeofday.cpp, otherwise I would get the same linker error as @thiagofmam.
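For anyone repeating this, the rename has to be applied consistently to both files, otherwise the declaration and the definition of gettimeofday no longer match and the linker reports the unresolved symbol. A minimal sketch of such a patch step; the function name and the sample header text are made up for illustration, and only the "struct timezone" to "struct mytimezone" rename comes from this thread:

```python
import re


def rename_timezone_struct(source, new_name="mytimezone"):
    """Rename `struct timezone` to `struct <new_name>` in C/C++ source text."""
    return re.sub(r"\bstruct\s+timezone\b", "struct " + new_name, source)


# Hypothetical header contents, just to show the effect.
sample = ("struct timezone { int tz_minuteswest; };\n"
          "int gettimeofday(struct timeval *tp, struct timezone *tzp);\n")
print(rename_timezone_struct(sample))
```

The same function would be run over the text of gettimeofday.h and gettimeofday.cpp before building; it is safe to run twice because already renamed text no longer matches.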
For some reason the Python 2.7 builds make the python.exe crash with a heap access violation in the api call GetAvailableLanguagesAsVector, which renders the wheel files for 2.7 useless... not sure if this is an issue on my side though (I built for 2.7 on 2 different machines both builds result in the same runtime error). All the other builds pass all of the 18 unit tests.
The libraries that are included in the .whl files:
@simonflueckiger Thanks for your great work! Has the source code used for the whl already been merged to the tesserocr master? If not merged yet, could you merge it? I strongly want to use tesserocr on Windows with Python2.7 and tesseract4, so I'd like to try Windows build on my PC and try to investigate the Python2.7 whl build failure.
I was able to build the whl on Windows 8 with Python 3.5. The API is working like a charm! Thanks a lot
On Oct 11, 2017 11:55 AM, "simonflueckiger" notifications@github.com wrote:
Thanks to the groundwork of @dgvogol https://github.com/dgvogol, I was able to build tesserocr 2.2.2 for windows:
- Python 3.6
- tesserocr-2.2.2-cp36-cp36m-win_amd64.zip https://github.com/sirfz/tesserocr/files/1376493/tesserocr-2.2.2-cp36-cp36m-win_amd64.zip
- tesserocr-2.2.2-cp36-cp36m-win32.zip https://github.com/sirfz/tesserocr/files/1376494/tesserocr-2.2.2-cp36-cp36m-win32.zip
- Python 3.5
- tesserocr-2.2.2-cp35-cp35m-win_amd64.zip https://github.com/sirfz/tesserocr/files/1376502/tesserocr-2.2.2-cp35-cp35m-win_amd64.zip
- tesserocr-2.2.2-cp35-cp35m-win32.zip https://github.com/sirfz/tesserocr/files/1376503/tesserocr-2.2.2-cp35-cp35m-win32.zip
- Python 2.7
- tesserocr-2.2.2-cp27-cp27m-win_amd64.zip https://github.com/sirfz/tesserocr/files/1376483/tesserocr-2.2.2-cp27-cp27m-win_amd64.zip (broken)
- tesserocr-2.2.2-cp27-cp27m-win32.zip https://github.com/sirfz/tesserocr/files/1376491/tesserocr-2.2.2-cp27-cp27m-win32.zip (broken)
To install them, unzip and run pip install <package_name>.whl
The timezone naming conflict is a real pain in the a**. In addition to patching the gettimeofday.h as it's done in the setup.py file from @dgvogol https://github.com/dgvogol I also had to patch the gettimeofday.cpp, otherwise I would get the same linker error as @thiagofmam https://github.com/thiagofmam.
For some reason the Python 2.7 builds make the python.exe crash with a heap access violation in the api call GetAvailableLanguagesAsVector https://github.com/sirfz/tesserocr/blob/c8464d13a0b3f5f33f40b274d4b8a0299b0c2841/tesserocr.pyx#L1407, which renders the wheel files for 2.7 useless... not sure if this is an issue on my side though (I built for 2.7 on 2 different machines both builds result in the same runtime error). All the other builds pass all of the 18 unit tests.
The libraries that are included in the .whl files:
- libtesseract 3.5.1
- leptonica 1.74.4
- jpeg 9.2.0
- png 1.6.30
- tiff 4.0.8
- webp 0.6.0
- zlib 1.2.11
- openjp2 2.1.2
- lzma 5.2.3
@simonflueckiger Awesome. I conda installed the Windows tesserocr. However, while testing I receive the following error: 'Failed to init API, possibly an invalid tessdata path: C:\ProgramData\Anaconda3\'. Is this a path issue or am I missing something?
update: Tried changing the path to where tessdata was located and received the same error.
@NozomiIto have a look at https://github.com/simonflueckiger/tesserocr-windows_build and https://ci.appveyor.com/project/simonflueckiger/tesserocr-windows-build. I'm really glad that you want to help me out with this issue! Would you still be interested to look into this together in a couple of weeks, when I have some spare time again?
@jeweinberg glad to hear it's working 😀
@rleonard11 can you try to point it to the parent directory of where your tessdata folder is located? E.g.
with PyTessBaseAPI(path='C:/Program Files (x86)/Tesseract-OCR/') as api:
    ...
when your tessdata is located in 'C:/Program Files (x86)/Tesseract-OCR/tessdata'
If this doesn't work, could you please list the contents of your tessdata folder?
@simonflueckiger Tried pointing it to the folder with the following code:
with PyTessBaseAPI(psm = PSM.AUTO_OSD, path = 'C:\Program Files (x86)\Tesseract-OCR\tessdata') as api:
Here is the contents of my tessdata folder:
configs (folder) tessconfigs (folder) eng.cub.bigrams eng.cube.fold eng.cube.lm eng.cube.nn eng.cube.params eng.cube.size eng.cube.word-freq eng.tesseract_cube.nn eng.traineddata eng.user-patterns eng.user-words osd.traineddata pdf.ttf
@rleonard11 can you give it a try without the tessdata? 'C:\Program Files (x86)\Tesseract-OCR\tessdata' -> 'C:\Program Files (x86)\Tesseract-OCR'
If this still doesn't work, let's talk on [tlk.io url expired]
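One Python detail worth ruling out with Windows paths like the ones above: in a normal string literal the sequence \t is a tab character, so a path ending in \tessdata written with single backslashes does not contain the text "tessdata" at all. Whether that played a role here is only a guess. The sketch below uses a made-up directory, C:\tools\tessdata, to show the difference between a plain literal, a raw string and forward slashes.

```python
plain = "C:\tools\tessdata"    # "\t" is interpreted as a tab character
raw = r"C:\tools\tessdata"     # raw string: backslashes stay literal
forward = "C:/tools/tessdata"  # forward slashes need no escaping


def has_control_chars(path):
    """True if the path contains characters such as tab or newline."""
    return any(ord(ch) < 32 for ch in path)


print(has_control_chars(plain), has_control_chars(raw), has_control_chars(forward))
# prints: True False False
```

Using a raw string, doubled backslashes, or forward slashes avoids the problem.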
@simonflueckiger Thanks. I've already found your source code by myself and compiled it. I could compile with Python3.5 and tesseract4 by using pvt.cppan.demo.google.tesseract.tesseract-master and -DUSE_STD_NAMESPACE and by tweaking your code a little. I also tried a Python2 compile, but I couldn't do it due to a compilation error. I think Python2 cython compilation requires the Visual C++ compiler for Python (virtually VC++ 2008), but tesseract4 requires a C++11 library. So I gave up on the Python2-tesseract4 combination. I'm still not sure if the Python2-tesseract3 combination works or not.
@simonflueckiger I am having the same problem as above:
Failed to init API, possibly an invalid tessdata path: C:\mypath\tessdata
Tried pointing to the directory above the tessdata path but that fails as well.
I have found that using the path parameter to PyTessBaseAPI never works for me on Windows. I either have to use the TESSDATA_PREFIX environment variable or make sure the tessdata directory is in the default 'compiled in' location expected by tesseract.
I managed to put together conda recipes for leptonica, tesseract, and tesserocr that support windows/mac/linux. You can install using:
conda install -c mcs07 tesserocr
This works well for me on Windows, even if you don't set TESSDATA_PREFIX, but with one minor issue: The API (i.e. tesserocr) expects to find tessdata at the root of the conda environment, i.e. C:\Anaconda\envs\myenv\tessdata, so the recipe installs it there. However, when running tesseract on the command line, it expects to find tessdata alongside the executable in C:\Anaconda\envs\myenv\Library\bin, so (for now) the recipe duplicates tessdata to this location also.
Thanks, setting TESSDATA_PREFIX worked for me.
The path parameter isn't working with tesseract v4 on Linux either (also need to set the TESSDATA_PREFIX environment variable).
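Since several reports above settle on TESSDATA_PREFIX, here is a small sketch of setting it from Python before the API is initialized. It rests on an assumption that should be checked against the tesseract version in use: tesseract 3.x expects the variable to name the parent of the tessdata folder (which matches the advice earlier in this thread), while 4.x expects the tessdata folder itself. Both function names are invented for this example.

```python
import os
from pathlib import PurePath


def tessdata_prefix(tessdata_dir, tesseract_major):
    """Pick the TESSDATA_PREFIX value for a tessdata directory.

    Assumption: tesseract 3.x wants the parent of tessdata,
    while 4.x wants the tessdata directory itself.
    """
    path = PurePath(tessdata_dir)
    return path.parent if tesseract_major < 4 else path


def export_tessdata_prefix(tessdata_dir, tesseract_major):
    """Set TESSDATA_PREFIX for this process; call before initializing the API."""
    value = str(tessdata_prefix(tessdata_dir, tesseract_major))
    os.environ["TESSDATA_PREFIX"] = value
    return value
```

The variable only affects the current process and anything it starts, so it has to be set before tesserocr creates its first API object.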
With too many great folks contributing to this important issue, I will just chime in that I am back from a post #DATeCH2017 hiatus and busy Summer. Now that I am getting back into the groove, I am rebuilding my Windows 10 dev box and will report my experience with the latest developments detailed above. Once I have a clear sense of what/how to recommend, I look forward to promoting this progress to my friends in the #digitalhumanities, especially folks in the #TDM (text- and data-mining) community.
BTW, as an unaffiliated #CitizenScientist, Tesserocr was instrumental to the metadata discovery and curation "ferret" scripts which were featured in my two papers accepted for a combined poster at #DATeCH2017. I have attached an image of the poster.
Thanks, again, to all for the attention and effort that has gone into this issue.
Have been struggling to get this to compile for tesseract 4, even with @NozomiIto and @simonflueckiger's code/hints. Is it possible to get a wheel for tesseract 4?
@hsmallbone It is very complicated work to compile with tesseract4, but it is possible. I put the compiled whl for Python3.5 and Tesseract4 at https://github.com/NozomiIto/tesserocr_whl. You can try it if you wish.
@NozomiIto thanks for your wheel, but it's not for my platform unfortunately. I guess I will wait for a more reliable setup.py.
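Whether a given wheel fits a machine can be read off its filename, which follows the pattern name-version-pythontag-abitag-platformtag.whl (with an optional build tag after the version). A rough sketch for comparing those tags with the running interpreter; the function names are made up, and the "cp" prefix assumes CPython:

```python
import struct
import sys


def parse_wheel_name(filename):
    """Split a wheel filename into name, version and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel: %r" % filename)
    parts = filename[:-len(".whl")].split("-")
    if len(parts) not in (5, 6):  # 6 when an optional build tag is present
        raise ValueError("unexpected wheel filename: %r" % filename)
    return {"name": parts[0], "version": parts[1],
            "python": parts[-3], "abi": parts[-2], "platform": parts[-1]}


def current_python_tag():
    """CPython-style tag for the running interpreter, e.g. 'cp36'."""
    return "cp%d%d" % sys.version_info[:2]


def interpreter_bits():
    """32 or 64, from the size of a pointer in this interpreter."""
    return struct.calcsize("P") * 8


tags = parse_wheel_name("tesserocr-2.2.2-cp36-cp36m-win_amd64.whl")
print(tags["python"], tags["platform"], current_python_tag(), interpreter_bits())
```

A cp35 win_amd64 wheel, for example, only installs on 64-bit CPython 3.5 on Windows; pip refuses it elsewhere.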
I finally have some time to work on the Windows build. Thanks to @simonflueckiger for the gettimeofday.cpp patch. Details about the build:
There still can be many improvements, but I'm happy with the overall build procedure. The build at least for Python3.6
A pull request is sent.
@simonflueckiger can you please update the wheels to tesseract 4.0. It would be very helpful. Thank you a lot for the contribution.
@ricardomga I will look into it and keep you posted.
@simonflueckiger thank you a lot!
@hsmallbone @ricardomga et voilà, freshly baked tesseract 4 wheels are here! Download them while they're still hot :cookie: And please let me know if they are working as intended!
from simonflueckiger/tesserocr-windows_build/releases:
To install them with pip
pip install <package_name>.whl
or install them directly from my Anaconda repository
conda install -c simonflueckiger/label/tesseract-4.0.0-master tesserocr
Contains the following binaries:
@simonflueckiger @sirfz Please add the windows .whl to Pypi. It would be really helpful, and I think it is simple to do it.
any solution for macos metal? Facing the same problem
I have been working on making some bots for programs that only run in windows and I was wondering if you had any pointers on compiling on windows. I was actually able to build tesserocr.lib but I cannot get past that step.
I used https://github.com/peirick/VS2015_Tesseract to build libtesseract and used that to satisfy all of the imports.
Thank you.