Closed SB2020-eye closed 3 years ago
Hi, I'm sorry but we are currently underway with major refactoring, and unintentionally seem to have broken main doing so. I hope we can conclude the overhaul within the next couple of weeks. Soon after that, this tool will also be included in our ocrd-galley, which would allow usage via "stable" Docker images.
I believe https://github.com/qurator-spk/eynollah/tree/778a4197a5ee99e8bbcfc86e8ae75cec96a3435e was still working for me, but unfortunately no experience trying any of this on Windows.
main should be working after #12. Smoke test: does eynollah --help work? Can you share the image you're trying this on?
Thank you, @cneud and @kba.
@cneud , are you suggesting I download the version found at the link you gave?
@kba , I think you're asking me to run eynollah --help from terminal after cd-ing into the repo root folder. If so, the same behavior occurs: no output, and after 3 or so seconds, a new command line appears ready to go.
Just in case, I should probably ask something crucial towards my goal with eynollah, to make sure I don't waste everyone's time. I am assuming the -si argument results in image files for all the different segmented sections of the original image. Is that correct? And if so, are they lossless (ie, is anything lost in the process)?
Here's an example image.
@cneud , are you suggesting I download the version found at the link you gave?
@kba has kindly applied a fix to main, so (theoretically) the main branch should now build again.
I am assuming the -si argument results in image files for all the different segmented sections of the original image. Is that correct?
No, I believe the -si flag only extracts regions with image content, i.e. illustrations, pictures, photos or similar that were identified by the layout analysis as "graphical elements".
You can however cut out any regions from the image after layout analysis based on their pixel coordinates in the PAGE-XML output, which will give you the segment images in the same resolution as the source image.
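As a rough illustration of that idea, the sketch below reads region outlines from a PAGE-XML string and turns each into a crop box. It uses only the Python standard library and is not eynollah's or OCR-D's own code; the function names (`local_name`, `parse_points`, `region_boxes`) and the sample document are made up for the example. It assumes the usual PAGE-XML layout in which every region element carries a `Coords` child with a `points="x1,y1 x2,y2 ..."` attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, only for illustration.
SAMPLE_PAGE_XML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.tif" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1"><Coords points="100,200 400,200 400,350 100,350"/></TextRegion>
    <ImageRegion id="r2"><Coords points="500,600 900,620 880,1000 510,990"/></ImageRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Strip the XML namespace, e.g. '{ns}TextRegion' -> 'TextRegion'."""
    return tag.rsplit("}", 1)[-1]


def parse_points(points):
    """Turn a points string 'x1,y1 x2,y2 ...' into a list of (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def region_boxes(page_xml):
    """Yield (region_type, region_id, (left, upper, right, lower)) per region."""
    root = ET.fromstring(page_xml)
    for elem in root.iter():
        name = local_name(elem.tag)
        if not name.endswith("Region"):
            continue
        # Only look at the region's own Coords child, not those of nested lines.
        coords = next((c for c in elem if local_name(c.tag) == "Coords"), None)
        if coords is None or not coords.get("points"):
            continue
        xs, ys = zip(*parse_points(coords.get("points")))
        yield name, elem.get("id"), (min(xs), min(ys), max(xs), max(ys))


if __name__ == "__main__":
    for region_type, region_id, box in region_boxes(SAMPLE_PAGE_XML):
        print(region_type, region_id, box)
```

A box in this (left, upper, right, lower) order is what Pillow's `Image.crop` accepts, if Pillow is what you use for the actual cropping; the crop then has the same resolution as the source image because no resampling is involved.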
@cneud , are you suggesting I download the version found at the link you gave?
The main branch works for me, so no, just make sure you are at the latest commit in main.
@kba , I think you're asking me to run eynollah --help from terminal after cd-ing into the repo root folder. If so, the same behavior occurs: no output, and after 3 or so seconds, a new command line appears ready to go.
From the other issue, I infer you're using conda. If the conda env is active, you do not need to be in the repo folder. Are you sure you have installed eynollah including its dependencies (i.e. conda activate yourenv; pip install .)? Or can you try with a fresh environment to make sure this is not the issue?
I am assuming the -si argument results in image files for all the different segmented sections of the original image. Is that correct? And if so, are they lossless (ie, is anything lost in the process)?
Yes, with the -si option, cropped images of all the contours found by eynollah are written to that directory. GIGO, so this should not reduce image quality IIUC.
However, I see this as merely a debug function (@vahidrezanezhad correct me if I'm wrong), the important result is the PAGE-XML. From that (or any other) PAGE-XML you can use ocrd_segment, specifically ocrd-segment-extract-regions and *-lines to extract the cropped images afterwards. Even better would be, if you use this within a python project, to use the polygons in the PAGE-XML directly, so you don't lose that information in serialization, which must be a bounding box.
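To make the polygon-versus-bounding-box point concrete, here is a small pure-Python sketch. It computes the bounding box of a polygon together with a mask that marks which pixels of that box actually lie inside the outline. The helper names (`point_in_polygon`, `polygon_mask`) are invented for this example and the even-odd ray-casting test is a generic technique, not something taken from eynollah or ocrd_segment.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd ray casting: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def polygon_mask(polygon):
    """Return (bounding_box, mask); mask[row][col] is True inside the polygon.

    Each pixel is tested at its centre (col + 0.5, row + 0.5).
    """
    xs, ys = zip(*polygon)
    left, top, right, bottom = min(xs), min(ys), max(xs), max(ys)
    mask = [
        [point_in_polygon(x + 0.5, y + 0.5, polygon) for x in range(left, right)]
        for y in range(top, bottom)
    ]
    return (left, top, right, bottom), mask


if __name__ == "__main__":
    box, mask = polygon_mask([(0, 0), (4, 0), (0, 4)])
    print(box)
    for row in mask:
        print("".join("#" if cell else "." for cell in row))
```

For the triangle in the demo, only 6 of the 16 pixels in the bounding box belong to the region. A plain rectangular crop would keep the other 10 as well, which is the information loss the comment above refers to.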
Here's an example image
And here's how eynollah would segment that page:
And without the image for clearer visuals:
The ruler confused the detection so the reading order is shoddy, should have cropped the printspace more vertically. But the regions and esp. lines (which are essential for OCR) are tight and accurate AFAICS.
Yes, with the -si option, cropped images of all the contours found by eynollah are written to that directory. GIGO, so this should not reduce image quality IIUC.
I was wrong, @cneud had it right:
No I believe the -si flag only extracts regions with image content, i.e. illustrations, pictures, photos or similar that were identified by the layout analysis as "graphical elements".
Regarding -si, does this mean that I would need to work with PAGE-XML in order to get the cut-out images of text lines? I have some doubts about my abilities in that realm (never worked with XML, never heard of XSLT, can't even locate the dependencies needed for that repo, etc). Lol.
I actually don't need OCR per se -- just images of text lines (or, even better, words, if possible). This is toward a subsequent goal of cutting out images of just glyphs (with no background). eynollah is obviously constructed for purposes more sophisticated than just what I'm describing.
I actually already have something slicing out images of text lines for me -- docExtractor. But having found your sbb_binarization and getting such positive results, I came to eynollah since sbb_binarization doesn't seem to run in python 3.8.6, which the rest of my program (including docExtractor) is currently running in. And I just don't know how to get them to "talk" to each other. So I figured maybe I could replace docExtractor with eynollah and have everything run in python 3.7.0 environment. (Yes, @kba , it is indeed a conda environment.)
If this sounds like I'm making things overly complicated, I probably am! And I'd appreciate you saying so (plus any suggestions you might have). Or if eynollah seems to you like it's a rabbit trail for my particular purposes, please don't hesitate to say so. You are obviously doing good work here!
Hi there, the -si option gives you the capability to crop and save images inside the document. This can also be done using the output XML data, but to make it easier we have provided this option too (to crop and save them while you run eynollah).
I am assuming the -si argument results in image files for all the different segmented sections of the original image. Is that correct?
No I believe the -si flag only extracts regions with image content, i.e. illustrations, pictures, photos or similar that were identified by the layout analysis as "graphical elements". You can however cut out any regions from the image after layout analysis based on their pixel coordinate
Correct. Thank you
If this sounds like I'm making things overly complicated, I probably am!
IIUC you want to create some sort of glyph repository, so you're not interested in the text detection but in getting lines and glyphs from the lines in a bitonal format.
You want to preprocess your page to crop it to the print space (which should get rid of opposing pages, rulers etc.), deskew/dewarp it (if lines aren't perfectly orthogonal to the image or the page has water damage or a deep joint) and then segment the page into lines. We have multiple tools for that in OCR-D, see https://ocr-d.de/en/workflows. Then you can use an OCR engine like tesseract or calamari to do the recognition down to glyph level, disregard the actual detected text and just use the bounding boxes of the glyphs to cut them out of the original image.
Yes, this would involve working with PAGE-XML. We do have a pythonic API for that in OCR-D/core though that can make this a bit easier, at the end of the day it's a hierarchical data structure like any other: Page -> TextLine -> Word -> Glyph -> Coords -> points.
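For readers who have never touched XML, the hierarchy can be walked with a few lines of standard-library Python. The sketch below is only an illustration under assumptions: it does not use the OCR-D/core API mentioned above, the function name `glyph_polygons` and the sample document are invented, and the namespace URI must match the one in your own files (PAGE-XML exists in several dated versions). Word and Glyph elements are only present if a recognition step has written them.

```python
import xml.etree.ElementTree as ET

# Assumption: the 2019-07-15 PAGE namespace; adjust to the one in your file.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical sample document, only for illustration.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1">
        <Word id="w1">
          <Glyph id="g1"><Coords points="10,10 20,10 20,30 10,30"/></Glyph>
          <Glyph id="g2"><Coords points="22,10 31,10 31,30 22,30"/></Glyph>
        </Word>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def glyph_polygons(page_xml):
    """Walk TextLine -> Word -> Glyph -> Coords and collect the glyph outlines.

    Returns a list of (line_id, word_id, glyph_id, [(x, y), ...]) tuples.
    """
    root = ET.fromstring(page_xml)
    result = []
    for line in root.iterfind(".//pc:TextLine", NS):
        for word in line.iterfind("pc:Word", NS):
            for glyph in word.iterfind("pc:Glyph", NS):
                coords = glyph.find("pc:Coords", NS)
                if coords is None or not coords.get("points"):
                    continue
                points = [
                    tuple(int(v) for v in pair.split(","))
                    for pair in coords.get("points").split()
                ]
                result.append((line.get("id"), word.get("id"), glyph.get("id"), points))
    return result


if __name__ == "__main__":
    for entry in glyph_polygons(SAMPLE):
        print(entry)
```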
But I suggest you drop by our chat at https://gitter.im/OCR-D/Lobby, say hi and describe your use case, it's easier to discuss there than in an issue.
-si option gives you this capability to crop and save images inside the document
@vahidrezanezhad just to make sure: with "save images" you mean "save graphic regions", correct?
Yes :)
The ruler confused the detection so the reading order is shoddy, should have cropped the printspace more vertically. But the regions and esp. lines (which are essential for OCR) are tight and accurate AFAICS.
As you mentioned, the reason for a bad reading order is the page detector (this simply happens since in GT we did not have such documents). But this is a general problem for reading order detection that can occur for documents with multi-columns and footnotes even though you have extracted printspace correctly.
and have a look at reading order
you see reading order still is not correct :)
main should be working after #12.
I still had to do the following to get main working:
- [...] tensorflow-gpu 1.15 won't be found)
- install tqdm and seaborn via pip
- downgrade keras: pip install keras==2.3.1
With these changes, I can successfully run the tool (on Ubuntu, not Windows though).
-si option gives you this capability to crop and save images inside the document
@vahidrezanezhad just to make sure: with "save images" you mean "save graphic regions", correct?
Yes :)
And does that mean "save graphic regions...as image files", or something else? Thanks.
But I suggest you drop by our chat at https://gitter.im/OCR-D/Lobby, say hi and describe your use case, it's easier to discuss there than in an issue.
Thanks. I just posted something.
install tqdm and seaborn via pip
I wonder why you need those. Are you sure you're up-to-date? These have been removed in 9596a44 and 801ccac resp.
downgrade keras pip install keras==2.3.1
Oh, yes, that's fixed in the refactoring but should be in main too, ef1e32e
And does that mean "save graphic regions...as image files", or something else? Thanks.
Yes, the graphic regions are saved as JPEG image files.
Yes this was on a clean clone of https://github.com/qurator-spk/eynollah/commit/c7d509bb2cfe12703e3321b393f603a6a9f900b5 - I still had to install both packages manually or eynollah would not run.
I also could not get any images extracted using -si. Does this only work in combination with -fl=true? @vahidrezanezhad
Also I am getting an OOM exception due to Tensor shape... every time I try to run eynollah with the -fl=true parameter on my Geforce RTX2070S with 8 GB :(
No. -si has nothing to do with the -fl option. -si must be given a directory.
Hmm, when I tried using e.g. eynollah -i 00000015.tif -o . -si .
I did not get any images extracted to that directory? I was using this image https://content.staatsbibliothek-berlin.de/dms/PPN626696453/1200/0/00000015.tif?original=true.
That might well be a regression on my part, investigating.
seaborn and tqdm
I am still confused about this. Can you try pip uninstall tqdm seaborn and provide the stacktrace this causes please?
pipdeptree shows this dependency tree for me:
eynollah==0.0.1 - imutils [required: >=0.5.3, installed: 0.5.3] - keras [required: >=2.3.1, installed: 2.3.1] - h5py [required: Any, installed: 2.10.0] - numpy [required: >=1.7, installed: 1.18.5] - six [required: Any, installed: 1.15.0] - keras-applications [required: >=1.0.6, installed: 1.0.8] - h5py [required: Any, installed: 2.10.0] - numpy [required: >=1.7, installed: 1.18.5] - six [required: Any, installed: 1.15.0] - numpy [required: >=1.9.1, installed: 1.18.5] - keras-preprocessing [required: >=1.0.5, installed: 1.1.0] - numpy [required: >=1.9.1, installed: 1.18.5] - six [required: >=1.9.0, installed: 1.15.0] - numpy [required: >=1.9.1, installed: 1.18.5] - pyyaml [required: Any, installed: 5.3.1] - scipy [required: >=0.14, installed: 1.4.1] - numpy [required: >=1.13.3, installed: 1.18.5] - six [required: >=1.9.0, installed: 1.15.0] - matplotlib [required: Any, installed: 3.3.1] - certifi [required: >=2020.06.20, installed: 2020.6.20] - cycler [required: >=0.10, installed: 0.10.0] - six [required: Any, installed: 1.15.0] - kiwisolver [required: >=1.0.1, installed: 1.2.0] - numpy [required: >=1.15, installed: 1.18.5] - pillow [required: >=6.2.0, installed: 7.2.0] - pyparsing [required: >=2.0.3,!=2.1.6,!=2.1.2,!=2.0.4, installed: 2.4.7] - python-dateutil [required: >=2.1, installed: 2.8.1] - six [required: >=1.5, installed: 1.15.0] - ocrd [required: >=2.20.1, installed: 2.22.3] - bagit [required: >=1.7.0, installed: 1.7.0] - bagit-profile [required: >=1.3.0, installed: 1.3.1] - bagit [required: Any, installed: 1.7.0] - requests [required: Any, installed: 2.24.0] - certifi [required: >=2017.4.17, installed: 2020.6.20] - chardet [required: >=3.0.2,<4, installed: 3.0.4] - idna [required: >=2.5,<3, installed: 2.10] - urllib3 [required: >=1.21.1,<1.26,!=1.25.1,!=1.25.0, installed: 1.25.10] - click [required: >=7, installed: 7.1.2] - Deprecated [required: ==1.2.0, installed: 1.2.0] - wrapt [required: >=1,<2, installed: 1.12.1] - Flask [required: Any, installed: 
1.1.2] - click [required: >=5.1, installed: 7.1.2] - itsdangerous [required: >=0.24, installed: 1.1.0] - Jinja2 [required: >=2.10.1, installed: 2.11.2] - MarkupSafe [required: >=0.23, installed: 1.1.1] - Werkzeug [required: >=0.15, installed: 1.0.1] - jsonschema [required: Any, installed: 3.2.0] - attrs [required: >=17.4.0, installed: 20.2.0] - importlib-metadata [required: Any, installed: 2.0.0] - zipp [required: >=0.5, installed: 3.2.0] - pyrsistent [required: >=0.14.0, installed: 0.17.3] - setuptools [required: Any, installed: 50.3.0] - six [required: >=1.11.0, installed: 1.15.0] - lxml [required: Any, installed: 4.5.2] - ocrd-modelfactory [required: ==2.22.3, installed: 2.22.3] - lxml [required: Any, installed: 4.5.2] - ocrd-models [required: ==2.22.3, installed: 2.22.3] - lxml [required: Any, installed: 4.5.2] - ocrd-utils [required: ==2.22.3, installed: 2.22.3] - atomicwrites [required: >=1.3.0, installed: 1.4.0] - numpy [required: Any, installed: 1.18.5] - Pillow [required: >=7.2.0, installed: 7.2.0] - ocrd-utils [required: ==2.22.3, installed: 2.22.3] - atomicwrites [required: >=1.3.0, installed: 1.4.0] - numpy [required: Any, installed: 1.18.5] - Pillow [required: >=7.2.0, installed: 7.2.0] - ocrd-models [required: ==2.22.3, installed: 2.22.3] - lxml [required: Any, installed: 4.5.2] - ocrd-utils [required: ==2.22.3, installed: 2.22.3] - atomicwrites [required: >=1.3.0, installed: 1.4.0] - numpy [required: Any, installed: 1.18.5] - Pillow [required: >=7.2.0, installed: 7.2.0] - ocrd-utils [required: ==2.22.3, installed: 2.22.3] - atomicwrites [required: >=1.3.0, installed: 1.4.0] - numpy [required: Any, installed: 1.18.5] - Pillow [required: >=7.2.0, installed: 7.2.0] - ocrd-validators [required: ==2.22.3, installed: 2.22.3] - bagit [required: >=1.7.0, installed: 1.7.0] - bagit-profile [required: >=1.3.0, installed: 1.3.1] - bagit [required: Any, installed: 1.7.0] - requests [required: Any, installed: 2.24.0] - certifi [required: >=2017.4.17, installed: 
2020.6.20] - chardet [required: >=3.0.2,<4, installed: 3.0.4] - idna [required: >=2.5,<3, installed: 2.10] - urllib3 [required: >=1.21.1,<1.26,!=1.25.1,!=1.25.0, installed: 1.25.10] - click [required: >=7, installed: 7.1.2] - jsonschema [required: Any, installed: 3.2.0] - attrs [required: >=17.4.0, installed: 20.2.0] - importlib-metadata [required: Any, installed: 2.0.0] - zipp [required: >=0.5, installed: 3.2.0] - pyrsistent [required: >=0.14.0, installed: 0.17.3] - setuptools [required: Any, installed: 50.3.0] - six [required: >=1.11.0, installed: 1.15.0] - ocrd-modelfactory [required: ==2.22.3, installed: 2.22.3] - lxml [required: Any, installed: 4.5.2] - ocrd-models [required: ==2.22.3, installed: 2.22.3] - lxml [required: Any, installed: 4.5.2] - ocrd-utils [required: ==2.22.3, installed: 2.22.3] - atomicwrites [required: >=1.3.0, installed: 1.4.0] - numpy [required: Any, installed: 1.18.5] - Pillow [required: >=7.2.0, installed: 7.2.0] - ocrd-utils [required: ==2.22.3, installed: 2.22.3] - atomicwrites [required: >=1.3.0, installed: 1.4.0] - numpy [required: Any, installed: 1.18.5] - Pillow [required: >=7.2.0, installed: 7.2.0] - ocrd-models [required: ==2.22.3, installed: 2.22.3] - lxml [required: Any, installed: 4.5.2] - ocrd-utils [required: ==2.22.3, installed: 2.22.3] - atomicwrites [required: >=1.3.0, installed: 1.4.0] - numpy [required: Any, installed: 1.18.5] - Pillow [required: >=7.2.0, installed: 7.2.0] - ocrd-utils [required: ==2.22.3, installed: 2.22.3] - atomicwrites [required: >=1.3.0, installed: 1.4.0] - numpy [required: Any, installed: 1.18.5] - Pillow [required: >=7.2.0, installed: 7.2.0] - pyyaml [required: Any, installed: 5.3.1] - shapely [required: Any, installed: 1.7.1] - opencv-python-headless [required: Any, installed: 4.4.0.44] - numpy [required: >=1.13.3, installed: 1.18.5] - pyyaml [required: Any, installed: 5.3.1] - requests [required: Any, installed: 2.24.0] - certifi [required: >=2017.4.17, installed: 2020.6.20] - chardet 
[required: >=3.0.2,<4, installed: 3.0.4] - idna [required: >=2.5,<3, installed: 2.10] - urllib3 [required: >=1.21.1,<1.26,!=1.25.1,!=1.25.0, installed: 1.25.10] - scikit-learn [required: >=0.23.2, installed: 0.23.2] - joblib [required: >=0.11, installed: 0.17.0] - numpy [required: >=1.13.3, installed: 1.18.5] - scipy [required: >=0.19.1, installed: 1.4.1] - numpy [required: >=1.13.3, installed: 1.18.5] - threadpoolctl [required: >=2.0.0, installed: 2.1.0] - tensorflow-gpu [required: >=1.15,<2, installed: 1.15.3] - absl-py [required: >=0.7.0, installed: 0.10.0] - six [required: Any, installed: 1.15.0] - astor [required: >=0.6.0, installed: 0.8.1] - gast [required: ==0.2.2, installed: 0.2.2] - google-pasta [required: >=0.1.6, installed: 0.2.0] - six [required: Any, installed: 1.15.0] - grpcio [required: >=1.8.6, installed: 1.31.0] - six [required: >=1.5.2, installed: 1.15.0] - keras-applications [required: >=1.0.8, installed: 1.0.8] - h5py [required: Any, installed: 2.10.0] - numpy [required: >=1.7, installed: 1.18.5] - six [required: Any, installed: 1.15.0] - numpy [required: >=1.9.1, installed: 1.18.5] - keras-preprocessing [required: >=1.0.5, installed: 1.1.0] - numpy [required: >=1.9.1, installed: 1.18.5] - six [required: >=1.9.0, installed: 1.15.0] - numpy [required: >=1.16.0,<2.0, installed: 1.18.5] - opt-einsum [required: >=2.3.2, installed: 3.3.0] - numpy [required: >=1.7, installed: 1.18.5] - protobuf [required: >=3.6.1, installed: 3.13.0] - setuptools [required: Any, installed: 50.3.0] - six [required: >=1.9, installed: 1.15.0] - six [required: >=1.10.0, installed: 1.15.0] - tensorboard [required: >=1.15.0,<1.16.0, installed: 1.15.0] - absl-py [required: >=0.4, installed: 0.10.0] - six [required: Any, installed: 1.15.0] - grpcio [required: >=1.6.3, installed: 1.31.0] - six [required: >=1.5.2, installed: 1.15.0] - markdown [required: >=2.6.8, installed: 3.2.2] - importlib-metadata [required: Any, installed: 2.0.0] - zipp [required: >=0.5, installed: 3.2.0] - 
numpy [required: >=1.12.0, installed: 1.18.5] - protobuf [required: >=3.6.0, installed: 3.13.0] - setuptools [required: Any, installed: 50.3.0] - six [required: >=1.9, installed: 1.15.0] - setuptools [required: >=41.0.0, installed: 50.3.0] - six [required: >=1.10.0, installed: 1.15.0] - werkzeug [required: >=0.11.15, installed: 1.0.1] - wheel [required: >=0.26, installed: 0.36.2] - tensorflow-estimator [required: ==1.15.1, installed: 1.15.1] - termcolor [required: >=1.1.0, installed: 1.1.0] - wheel [required: >=0.26, installed: 0.36.2] - wrapt [required: >=1.11.1, installed: 1.12.1]
So I do the following:
- [...] venv and activate it
Now when I try to run eynollah it will complain about missing seaborn
eynollah -i PPN798786388_00000005.tif -o . -m ~/tmp/dev/qurator/models/eynollah ✔ 35s venv-qurator 12:43:50
Traceback (most recent call last):
File "/usr/local/bin/eynollah", line 11, in <module>
import seaborn as sns
ModuleNotFoundError: No module named 'seaborn'
So install seaborn with pip and run again:
eynollah -i PPN798786388_00000005.tif -o . -m ~/tmp/dev/qurator/models/eynollah ✔ 4s venv-qurator 12:47:04
Traceback (most recent call last):
File "/usr/local/bin/eynollah", line 14, in <module>
from tqdm import tqdm
ModuleNotFoundError: No module named 'tqdm'
After installation of tqdm, it runs fine.
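When chasing a ModuleNotFoundError like the two above, it can help to ask the interpreter directly which modules it can see, and which interpreter is actually running. The following is a minimal sketch using only the standard library; the function name `missing_modules` is made up for the example, and the module list in the demo is just the pair discussed here.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The interpreter path shows which environment is really in use.
    print("interpreter:", sys.executable)
    print("missing:", missing_modules(["seaborn", "tqdm"]))
```

Running this with the same interpreter that launches the tool shows whether a missing package is a problem of the environment or of the installed script.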
pip uninstall tqdm seaborn will give me
pip3 uninstall tqdm seaborn ✔ venv-qurator 12:47:43
Found existing installation: tqdm 4.56.0
Uninstalling tqdm-4.56.0:
Would remove:
/home/cnd/tmp/dev/qurator/tools/venv-qurator/bin/tqdm
/home/cnd/tmp/dev/qurator/tools/venv-qurator/lib/python3.6/site-packages/tqdm-4.56.0.dist-info/*
/home/cnd/tmp/dev/qurator/tools/venv-qurator/lib/python3.6/site-packages/tqdm/*
Proceed (y/n)? n
Found existing installation: seaborn 0.11.1
Uninstalling seaborn-0.11.1:
Would remove:
/home/cnd/tmp/dev/qurator/tools/venv-qurator/lib/python3.6/site-packages/seaborn-0.11.1.dist-info/*
/home/cnd/tmp/dev/qurator/tools/venv-qurator/lib/python3.6/site-packages/seaborn/*
Proceed (y/n)?
Output of pipdeptree -p eynollah:
File "/usr/local/bin/eynollah", line 11, in <module>
Wait, your venv is not /usr/local, is it? Looks like you installed eynollah before without a virtualenv to /usr/local/bin/eynollah - can you move/remove that file? which eynollah should point to $VIRTUAL_ENV/bin/eynollah.
Argh, you are right!
I deactivated the venv and uninstalled eynollah. Then I activated the venv again and installed again via pip, now which eynollah returns the correct path to the venv
/home/cnd/tmp/dev/qurator/tools/venv-qurator/bin/eynollah
but I am not getting any output anymore...(immediately exits with no message).
Apparently I had an older version installed to /usr/local/bin/eynollah - thanks to @kba's amazing debugging skills we were able to track this down eventually and now https://github.com/qurator-spk/eynollah/pull/18 works for me (without any need to install seaborn or tqdm and with working -si parameter!).
Here's an update on my recent experience installing eynollah natively on Windows 10:
If I have a chance today, I will try a clean install w/ tf 2+, keras 2.4.3, and a relaxed requirement for eynollah to accept this configuration. I suspect it not to work due to the major refactoring in tf 2+. If anyone has a better idea to suggest, please don't hesitate to advise me.
ITMT, I have updated my Windows dev box to the latest Docker using WSL2. I'm in the process of learning how to config PyCharm Pro to do remote/virtual debuggable coding from my Windows IDE working on a live Docker image. I want to get this going as it will let me work more easily with OCR-D and similar research projects while still having PyCharm under Windows which includes the Kite Pro coding assistance. Kite is super helpful due to my severely limited keyboarding abilities following a July spinal cord injury.
Hi @Jim-Salmons, thanks for sharing! I am in a bit of a hurry, but chances to get eynollah working will be much improved once we have completed the refactoring, which should hopefully only take a few more weeks. Meanwhile, one must use a version of keras <2.4 (cf. https://github.com/qurator-spk/eynollah/pull/18) as newer versions will pull in Tensorflow 2 whereas the tool only works with Tensorflow version 1.15.x. For TF2, the code would need to be adapted and the models retrained.
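The version constraint stated above (keras below 2.4, Tensorflow 1.15.x) can be written down as a small pre-flight check. This is a sketch, not part of eynollah: the function names `parse_version` and `check_versions` are invented, and the version strings are passed in by hand so the example runs without either library installed.

```python
def parse_version(text):
    """Parse the leading numeric components, e.g. '1.15.3' -> (1, 15, 3)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_versions(keras_version, tensorflow_version):
    """Return a list of problems; an empty list means the pair fits the constraint."""
    problems = []
    if parse_version(keras_version) >= (2, 4):
        problems.append("keras %s is not <2.4" % keras_version)
    if parse_version(tensorflow_version)[:2] != (1, 15):
        problems.append("tensorflow %s is not 1.15.x" % tensorflow_version)
    return problems


if __name__ == "__main__":
    print(check_versions("2.3.1", "1.15.3"))
    print(check_versions("2.4.3", "2.4.0"))
```

In a real environment the two strings would typically come from the `__version__` attributes of the installed packages.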
Hi Clemons @cneud - Thanks for the quick reply. Sounds like the best strategy is to wait for the next release. In the meantime I can get my sea-legs under me using PyCharm Pro under Windows on a Docker/WSL2 image. 🤪
File "/usr/local/bin/eynollah", line 11, in <module>
Wait, your venv is not /usr/local, is it? Looks like you installed eynollah before without a virtualenv to /usr/local/bin/eynollah - can you move/remove that file? which eynollah should point to $VIRTUAL_ENV/bin/eynollah.
Careful, which eynollah can absolutely give $VIRTUAL_ENV/bin/eynollah while you're still calling /usr/local/bin/eynollah, because your shell might still be caching that eynollah is /usr/local/bin/eynollah. You need a rehash or to open a fresh terminal in that case. I have been bitten by this more than once...
(There are a few subtleties: If this was @cneud's problem, he had called the /usr/local eynollah before activating the new venv/installing eynollah, so that this whole confusion is possible... 🔍)
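One way to take the shell's cache out of the picture is to resolve the command freshly and compare the result with the active virtualenv. The sketch below does that from Python with `shutil.which`, which searches PATH on every call; like `which`, it cannot see what the shell has cached, so it only tells you what a fresh lookup finds. The function names (`resolve_command`, `is_inside`, `runs_from_venv`) are made up for this example.

```python
import os
import shutil


def resolve_command(name, search_path=None):
    """Look the command up on PATH from scratch (no shell cache involved)."""
    return shutil.which(name, path=search_path)


def is_inside(path, root):
    """True if path is located in the directory tree below root."""
    path = os.path.realpath(path)
    root = os.path.realpath(root)
    return os.path.commonpath([path, root]) == root


def runs_from_venv(name="eynollah"):
    """True if a fresh PATH lookup of name lands inside $VIRTUAL_ENV."""
    venv = os.environ.get("VIRTUAL_ENV")
    found = resolve_command(name)
    return bool(venv and found and is_inside(found, venv))


if __name__ == "__main__":
    print("fresh lookup:", resolve_command("eynollah"))
    print("inside venv: ", runs_from_venv("eynollah"))
```

If the fresh lookup points into the venv but the old script still runs, calling the tool by the full path printed here sidesteps the cached entry until the shell has been rehashed or restarted.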
Thanks to everyone weighing in with input on this.
I made a fresh conda environment (Windows 10 OS) and took another go at it. I needed to install pip to install eynollah. But after some fails, I figured out that pip installed python 3.9. So I installed python 3.6. pip install . worked after that -- but not until I manually made the changes referenced above (#18).
I used msys64 + mingw64 to install the models successfully.
Running
eynollah -i C:/Users/Scott/Desktop/Python2/K/eyn_test/F073r.jpg -o C:/Users/Scott/Desktop/Python2/K/eyn_test/results -m C:/users/scott/desktop/python2/eynollah/models_eynollah -si C:/users/scott/desktop/python2/K/eyn_test/results
I got the following
File "C:\ProgramData\Miniconda3\envs\eenv\Scripts\eynollah.exe\__main__.py", line 4, in <module>
File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\sbb_newspapers_org_image\cli.py", line 2, in <module>
from sbb_newspapers_org_image.eynollah import eynollah
File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\sbb_newspapers_org_image\eynollah.py", line 31, in <module>
from shapely import geometry
File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\shapely\geometry\__init__.py", line 4, in <module>
from .base import CAP_STYLE, JOIN_STYLE
from shapely.coords import CoordinateSequence
File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\shapely\coords.py", line 8, in <module>
from shapely.geos import lgeos
File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\shapely\geos.py", line 154, in <module>
_lgeos = CDLL(os.path.join(sys.prefix, 'Library', 'bin', 'geos_c.dll'))
File "c:\programdata\miniconda3\envs\eenv\lib\ctypes\__init__.py", line 348, in __init__
self._handle = _dlopen(self._name, mode)
OSError: [WinError 126] The specified module could not be found
After some dead ends, I googled the error line plus "shapely," and found a suggestion at another repo to simply
conda install -c conda-forge shapely
I no longer got that specific error. And eynollah --help works. (Yay!)
However, running the same command above, now I get:
The system cannot find the path specified.
'identify' is not recognized as an internal or external command,
operable program or batch file.
Traceback (most recent call last):
File "c:\programdata\miniconda3\envs\eenv\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\programdata\miniconda3\envs\eenv\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\ProgramData\Miniconda3\envs\eenv\Scripts\eynollah.exe\__main__.py", line 7, in <module>
File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\click\core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\click\core.py", line 782, in main
rv = self.invoke(ctx)
File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\click\core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\click\core.py", line 610, in invoke
return callback(*args, **kwargs)
File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\sbb_newspapers_org_image\cli.py", line 102, in main
headers_off,
File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\sbb_newspapers_org_image\eynollah.py", line 2978, in run
is_image_enhanced, img_org, img_res, num_col_classifier, num_column_is_classified = self.resize_and_enhance_image_with_column_classifier(is_image_enhanced)
File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\sbb_newspapers_org_image\eynollah.py", line 419, in resize_and_enhance_image_with_column_classifier
dpi = self.check_dpi()
File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\sbb_newspapers_org_image\eynollah.py", line 298, in check_dpi
return int(float(dpi))
ValueError: could not convert string to float:
I've triple-checked my paths, and they're fine. And I've poked around to try to understand where the ValueError is coming from. But yet to no avail. Any suggestions?
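Reading the log above, one plausible interpretation (an inference from the output, not something confirmed in this thread) is that check_dpi calls an external identify command, which Windows reports as "not recognized", so the DPI text comes back empty and float() fails on an empty string. The sketch below shows that failure mode in isolation and a defensive way to parse such output. The names `parse_dpi` and `identify_available` are hypothetical and are not eynollah's code.

```python
import shutil


def identify_available():
    """True if an executable named 'identify' can be found on PATH."""
    return shutil.which("identify") is not None


def parse_dpi(raw, default=None):
    """Convert text printed by an external tool into an integer DPI.

    Returns default when the text is empty or not a usable number,
    instead of raising ValueError as int(float("")) would.
    """
    try:
        return int(float(raw.strip()))
    except (ValueError, OverflowError, AttributeError):
        return default


if __name__ == "__main__":
    print("identify on PATH:", identify_available())
    print(parse_dpi("300"))
    print(parse_dpi(""))
```

If the inference holds, the paths on the command line are not the problem; the missing external command is.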
Thanks to everyone weighing in with input on this.
I made a fresh conda environment (Windows 10 OS) and took another go at it. I needed to install pip to install eynollah. But after some fails, I figured out that pip installed python 3.9. So I installed python 3.6.
pip install .
worked after that -- but not until I manually made the changes referenced above (#18).I used msys64 + mingw64 to install the models successfully.
Running
eynollah -i C:/Users/Scott/Desktop/Python2/K/eyn_test/F073r.jpg -o C:/Users/Scott/Desktop/Python2/K/eyn_test/results -m C:/users/scott/desktop/python2/eynollah/models_eynollah -si C:/users/scott/desktop/python2/K/eyn_test/results
I got the followingFile "C:\ProgramData\Miniconda3\envs\eenv\Scripts\eynollah.exe\__main__.py", line 4, in <module> File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\sbb_newspapers_org_image\cli.py", line 2, in <module> from sbb_newspapers_org_image.eynollah import eynollah File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\sbb_newspapers_org_image\eynollah.py", line 31, in <module> from shapely import geometry File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\shapely\geometry\__init__.py", line 4, in <module> from .base import CAP_STYLE, JOIN_STYLE from shapely.coords import CoordinateSequence File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\shapely\coords.py", line 8, in <module> from shapely.geos import lgeos File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\shapely\geos.py", line 154, in <module> _lgeos = CDLL(os.path.join(sys.prefix, 'Library', 'bin', 'geos_c.dll')) File "c:\programdata\miniconda3\envs\eenv\lib\ctypes\__init__.py", line 348, in __init__ self._handle = _dlopen(self._name, mode) OSError: [WinError 126] The specified module could not be found
After some dead ends, I googled the error line plus "shapely," and found a suggestion at another repo to simply
conda install -c conda-forge shapely
I no longer got that specific error. And eynollah --help works. (Yay!)
However, running the same command above, now I get:
The system cannot find the path specified.
'identify' is not recognized as an internal or external command,
operable program or batch file.
Traceback (most recent call last):
  File "c:\programdata\miniconda3\envs\eenv\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\programdata\miniconda3\envs\eenv\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Miniconda3\envs\eenv\Scripts\eynollah.exe\__main__.py", line 7, in <module>
  File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\sbb_newspapers_org_image\cli.py", line 102, in main
    headers_off,
  File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\sbb_newspapers_org_image\eynollah.py", line 2978, in run
    is_image_enhanced, img_org, img_res, num_col_classifier, num_column_is_classified = self.resize_and_enhance_image_with_column_classifier(is_image_enhanced)
  File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\sbb_newspapers_org_image\eynollah.py", line 419, in resize_and_enhance_image_with_column_classifier
    dpi = self.check_dpi()
  File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\sbb_newspapers_org_image\eynollah.py", line 298, in check_dpi
    return int(float(dpi))
ValueError: could not convert string to float:
I've triple-checked my paths, and they're fine. And I've poked around to try to understand where the ValueError is coming from. But yet to no avail. Any suggestions?
I think this is caused by the way the image's DPI is read on Windows. I can temporarily add an exception handler to work around your problem (this can affect the result).
I just updated eynollah. please check if it works or not.
Note that this can also happen on Linux if the image path is wrong, so make sure that the given image path is correct.
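For readers following along: the traceback ends in `int(float(dpi))`, and `float("")` raises exactly this ValueError, which is what one would expect if the external `identify` call produced no output. A minimal sketch of a defensive fallback follows. The function name `parse_dpi` and the default of 300 are hypothetical and not taken from eynollah.

```python
def parse_dpi(raw, default=300):
    """Turn the text output of an external tool into an integer DPI.

    `default` is a hypothetical fallback value, not eynollah's.
    """
    try:
        # float("") raises ValueError, which matches the traceback above
        dpi = int(float(raw.strip()))
    except (ValueError, OverflowError, AttributeError):
        # empty string, non-numeric text, "inf"/"nan", or None
        return default
    return dpi if dpi > 0 else default


print(parse_dpi("72.0"))  # a normal value is parsed as usual
print(parse_dpi(""))      # empty output falls back to the default
```

A silent fallback like this keeps the run going, but a wrong DPI guess may change any scaling decisions that depend on it, so it is a stopgap at best.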
Thanks, @vahidrezanezhad . I checked the image path carefully again (fearful I had gotten it wrong and missed it over and over again), but it is correct. I might as well make sure -- are .jpg files okay? Any other restrictions on input images?
Of course jpg files are OK (all kinds of images are valid). Did you pull the latest eynollah? This error should be caused by reading the DPI on Windows. Please check with the latest version and give me feedback.
...working on a new install in a new conda environment now...
Just consider that this is a temporary solution and it will disturb performance of the code.
Understood. :) When I run a command (even eynollah --help), it just gives me a new command line after 4-5 seconds (no output). EDIT: Nevermind! Sorry. I haven't done make models yet.
2ND EDIT: I spoke too soon (twice now). Models are in; doesn't make a difference (and I believe models shouldn't make a difference for eynollah --help anyways, I now realize). There is still no output -- just a new command line.
(tl;dr the new version isn't working for me)
The DPI check is done using the identify CLI from ImageMagick. That seems unnecessary, since it could be done with Pillow or OpenCV. But https://github.com/qurator-spk/eynollah/commit/37431d4840b0486a001789b14856e211d36ff1ab should have given a workaround. I don't understand why eynollah --help stopped working for you. What did you change?
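To illustrate that the DPI can be read without an external program, here is a minimal standard-library sketch that reads the pixel density from a JPEG's JFIF header. It is not how eynollah does it, the function names are made up for illustration, and it only handles files whose first segment is a JFIF APP0 block (for example, EXIF-only JPEGs return None).

```python
import struct


def jfif_density(data):
    """Return (x_dpi, y_dpi) from the start of a JPEG file, or None."""
    # Expect SOI (FFD8) directly followed by an APP0 marker (FFE0)
    if len(data) < 18 or data[:4] != b"\xff\xd8\xff\xe0":
        return None
    if data[6:11] != b"JFIF\x00":
        return None
    # Byte 13: density units; bytes 14-17: X and Y density (big-endian)
    units, x_density, y_density = struct.unpack(">BHH", data[13:18])
    if units == 1:    # dots per inch
        return x_density, y_density
    if units == 2:    # dots per centimetre, converted to dots per inch
        return round(x_density * 2.54), round(y_density * 2.54)
    return None       # units == 0: aspect ratio only, no absolute DPI


def read_jpeg_dpi(path):
    """Read just enough of the file to inspect the JFIF header."""
    with open(path, "rb") as handle:
        return jfif_density(handle.read(18))
```

Calling `read_jpeg_dpi` on a scanned page would return a pair of DPI values or None. A caller still needs to decide what to do in the None case.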
I started from scratch with a new Anaconda environment. One difference this time was that instead of having to go back and install python 3.6 to replace 3.9, I did this from the outset:
conda create --name env e2 python=3.6
Then activate e2
git clone https://github.com/qurator-spk/eynollah.git
cd eynollah
(I can't recall if I had to conda install pip at this point or not; but I think I did)
pip install .
At this point, it didn't work. (No output; fresh command line in 4-5 seconds. This includes running eynollah --help.)
I used MSYS64/Mingw64 to cd into the eynollah folder and run make models. Same results.
I am glad to try it again. But before I do, I'll see if you see any red flags regarding what I did above.
Thanks!
No, that setup looks reasonable. Can you check out https://github.com/qurator-spk/eynollah/tree/refactor-cntd and install that? Among other things, this adds an overridable log level switch --log-level.
Then try running eynollah on some image with --log-level DEBUG and post the output here.
Feel free to send me a DM in gitter to debug this further.
@kba , thanks for responding. Writing out my previous post, I thought it well worth going ahead and setting things up exactly as I had before, without the "shortcut" of conda create --name env e2 python=3.6. It works now!
I will lay out my steps for installing (at least as of 2/17/2021) that evidently work for me, in case it helps anyone else:
(Windows 10 OS, Anaconda environment)
(For my example, I will call my conda environment "my_env".)
conda create --name my_env
conda activate my_env
git clone https://github.com/qurator-spk/eynollah.git
cd eynollah
conda install pip
pip install .
conda install python=3.6
conda install shapely
This gets you to the point of having eynollah running... eynollah --help should work.
(Nothing has changed for me to add the models to run eynollah. From the outset, I needed a way to run a make file in Windows 10. For me, I set up msys64 and used the mingw64 terminal to: 1. cd to the eynollah repository folder, and 2. run make models. I am not very versed at all in this non-Python side of things, or else I would say more; I just did lots of googling, plodded through, and eventually it worked.)
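The result of these steps can be sanity-checked from Python before running eynollah itself. The sketch below is only an illustration with a made-up function name: it reports the interpreter version, whether the eynollah command is on PATH, and whether shapely actually imports. The shapely check uses a real import because the earlier geos_c.dll failure was an OSError raised at import time.

```python
import importlib
import shutil
import sys


def check_environment(command="eynollah", modules=("shapely",)):
    """Collect a few facts about the active environment."""
    report = {
        "python": "{}.{}".format(*sys.version_info[:2]),
        "command_on_path": shutil.which(command) is not None,
    }
    for name in modules:
        try:
            # A real import also surfaces missing DLLs (OSError on Windows)
            importlib.import_module(name)
            report[name] = "ok"
        except (ImportError, OSError) as exc:
            report[name] = "failed: {}".format(exc)
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(key, "->", value)
```

Run inside the activated conda environment, this shows at a glance which of the three pieces (Python version, entry point, shapely) is missing.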
:tada: good to hear it's working for you now and thanks for documenting the steps you needed. Good to close an issue with an actual solution.
About make models, you don't need to go through make just for that; all that target does is
wget 'https://qurator-data.de/eynollah/models_eynollah.tar.gz'
tar xf models_eynollah.tar.gz
i.e. download the tarball and extract it, nothing fancy.
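On a Windows machine without wget or tar, the same two steps can be done from Python's standard library. This is a rough equivalent and not part of eynollah. The function names are invented for illustration, and the URL is the one quoted above.

```python
import tarfile
import urllib.request
from pathlib import Path

MODELS_URL = "https://qurator-data.de/eynollah/models_eynollah.tar.gz"


def extract_tarball(archive, dest="."):
    """Unpack a (possibly gzip-compressed) tarball into dest."""
    with tarfile.open(str(archive)) as tar:  # compression is auto-detected
        tar.extractall(str(dest))
    return Path(dest)


def fetch_models(url=MODELS_URL, dest="."):
    """Download the tarball next to dest, then extract it there."""
    archive = Path(dest) / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, str(archive))
    return extract_tarball(archive, dest)
```

Calling `fetch_models()` from the repository folder should leave a models directory there, which can then be passed to eynollah with -m.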
Once the OCR-D bindings are in place and https://github.com/OCR-D/core/pull/668 is merged, you will be able to download the models with ocrd resmgr download ocrd-eynollah '*'.
"About make models, you don't need to go through make just for that" Lol. Shows you what I know! (But now I know that it's worth looking at the file -- even in a "foreign language" to me -- before downloading heavier-lifting stuff.)
Great to hear about the anticipated bindings!
Lastly -- just fyi (and the issue can definitely stay closed):
The system cannot find the path specified.
'identify' is not recognized as an internal or external command,
operable program or batch file.
(In case it's helpful, my most recent run from terminal -- command and full output -- can be found here. You'll see this message come up twice.)
"About make models, you don't need to go through make just for that" Lol. Shows you what I know! (But now I know that it's worth looking at the file -- even in a "foreign language" to me -- before downloading heavier-lifting stuff.)
Always a good idea. Be bold :)
- when running, I still get
The system cannot find the path specified. 'identify' is not recognized as an internal or external command, operable program or batch file.
(In case it's helpful, my most recent run from terminal -- command and full output -- can be found here. You'll see this message come up twice.)
I've replaced the identify call with OcrdExif in https://github.com/qurator-spk/eynollah/commit/8c603ae16d1074ec247c9956134cfc4f2b75481f so there is no need to have ImageMagick installed anymore once that's been merged.
Just a quick placeholder note to be supplemented with a recipe detail to follow. I tried to do @SB2020-eye's recipe for getting eynollah running under Windows 10. I got close but no cigar. I incrementally addressed issues and eventually got this incredible library running on a page image from my project's interest in ground-truthing the 48 issues of Softalk magazine.
To update this issue, I wanted to ensure that I have a working/repeatable Windows installation process. With this goal of providing a repeatable install recipe, I also wanted to use @kba's latest refactor-cntd branch of the eynollah module to overcome the non-critical identify error.
The good news is that I do have a process with only one non-critical hiccup related to the 'si' cli parameter for saving extracted images from the target page. I'll detail the working recipe soon, but ITMT here's a quick question:
Q bkgnd: My current conda environment does not have tesseract OCR installed. I know I can easily get this into my current conda eynollah environment. The cli I have run is -
eynollah -i <path-to-target-img>\softalkv1n01sep1980_0007.jpg -o <path-to-target-img> -m <path-to-models_eynollah> -si <path-to-target-img>\imgs\
This runs and processes a PAGExml result (without the extracted saved images). The PAGExml file has the detailed coordinates for the text and image regions of the page, however as expected there is no OCR extracted text data in the TextLine/TextEquiv/Unicode elements that complement the Coords of those regions.
Here is a screenshot of the PAGExml result via PRImA's PAGE Viewer:
Q: With tesseract installed (which version BTW, or other recommended OCR engine), what cli statement will generate this current result with those TextLine/TextEquiv/Unicode elements' data included in the PAGExml file?
Thanks @kba, @bertsky, @SB2020-eye for your ongoing interest and assistance. :-)
one non-critical hiccup related to the 'si' cli parameter
There is a new switch --enable-plotting/-ep that must be set for all the intermediary images to be written out. This needs some more work and then an update to the README.
what cli statement will generate this current result with those TextLine/TextEquiv/Unicode elements' data included in the PAGExml file?
Once the OCR-D interface (ocrd-eynollah) is in place, you can use any of the OCR engines we wrapped to do the actual text recognition, though I am not sure how well they work on Windows. I don't think that any OCR engines support segmentation input in PAGE-XML natively at the moment.
If you don't want to wait for the OCR-D interface, you can take the output of eynollah and add it to an OCR-D workspace and then run any of the above-mentioned engines on it:
eynollah ... -o . -i image1.png # will write to image1.xml
ocrd workspace init
ocrd workspace add -G IMG -i IMG_1 -g page1 image1.png
ocrd workspace add -G SEG -i SEG_1 -g page1 image1.xml
ocrd-tesserocr-recognize -P segmentation_level none -P textequiv_level line
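After recognition has run, one way to confirm that the TextLine/TextEquiv/Unicode elements were filled in is to scan the resulting PAGE-XML. Below is a minimal standard-library sketch. The function name is made up for illustration, and it reads the namespace from the root element instead of assuming a particular PAGE schema version.

```python
import xml.etree.ElementTree as ET


def text_line_report(page_xml):
    """Return (line id, coords points, text) for every TextLine."""
    root = ET.fromstring(page_xml)
    # Reuse whatever namespace the root element declares, if any
    ns = root.tag[1:].split("}")[0] if root.tag.startswith("{") else ""

    def q(tag):
        return "{%s}%s" % (ns, tag) if ns else tag

    report = []
    for line in root.iter(q("TextLine")):
        coords = line.find(q("Coords"))
        # Only the line's own TextEquiv, not those of nested Word elements
        uni = line.find("{}/{}".format(q("TextEquiv"), q("Unicode")))
        text = uni.text if uni is not None and uni.text else ""
        points = coords.get("points") if coords is not None else None
        report.append((line.get("id"), points, text))
    return report
```

Feeding it the bytes of a PAGE-XML file (for example `open("image1.xml", "rb").read()`) gives a list in which segmentation-only output shows up as lines with coordinates but empty text.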
Great, thanks @kba, I'll try your recommendation. I'll also post the working Windows install recipe soon. Although I am super impressed with Calamari via my exposure to it through DATeCH, I'll stick to Tesseract ATM due to easier and previous successful use natively on Windows. BTW, I assume that the 5.0 alpha build is OCR-D compatible?
Hi. I am trying to get this running on Windows 10 using Visual Studio Code.
If I cd into the repo and run a command like:
eynollah -i C:/Users/Scott/Desktop/Python2/Kpages/Pages/076v.jpg -o C:/Users/Scott/Desktop/Python2/Kpages -m C:/Users/Scott/Desktop/Python2/eynollah/models_eynollah -si C:/Users/Scott/Desktop/Python2/Kpages
it doesn't appear to run. A new command prompt comes up after a couple of seconds -- but no output and no error message. Any guidance would be appreciated.