qurator-spk / eynollah

Document Layout Analysis
Apache License 2.0

trying to get running... #17

Closed SB2020-eye closed 3 years ago

SB2020-eye commented 3 years ago

Hi. I am trying to get this running on Windows 10 using Visual Studio Code.

If I cd into the repo and run a command like: eynollah -i C:/Users/Scott/Desktop/Python2/Kpages/Pages/076v.jpg -o C:/Users/Scott/Desktop/Python2/Kpages -m C:/Users/Scott/Desktop/Python2/eynollah/models_eynollah -si C:/Users/Scott/Desktop/Python2/Kpages it doesn't appear to run. A new command prompt comes up after a couple of seconds, but there is no output and no error message.

Any guidance would be appreciated.

cneud commented 3 years ago

Hi, I'm sorry but we are currently underway with major refactoring, and unintentionally seem to have broken main doing so. I hope we can conclude the overhaul within the next couple of weeks. Soon after that, this tool will also be included in our ocrd-galley, which would allow usage via "stable" Docker images.

I believe https://github.com/qurator-spk/eynollah/tree/778a4197a5ee99e8bbcfc86e8ae75cec96a3435e was still working for me, but unfortunately no experience trying any of this on Windows.

kba commented 3 years ago

main should be working after #12. Smoke test: does eynollah --help work? Can you share the image you're trying this on?

SB2020-eye commented 3 years ago

Thank you, @cneud and @kba.

@cneud , are you suggesting I download the version found at the link you gave?

@kba , I think you're asking me to run eynollah --help from terminal after cd-ing into the repo root folder. If so, the same behavior occurs: no output, and after 3 or so seconds, a new command line appears ready to go.

Just in case, I should probably ask something crucial towards my goal with eynollah, to make sure I don't waste everyone's time. I am assuming the -si argument results in image files for all the different segmented sections of the original image. Is that correct? And if so, are they lossless (ie, is anything lost in the process)?

Here's an example image. page181r-downsized

cneud commented 3 years ago

@cneud , are you suggesting I download the version found at the link you gave?

@kba has kindly applied a fix to main, so (theoretically) the main branch should now build again.

I am assuming the -si argument results in image files for all the different segmented sections of the original image. Is that correct?

No, I believe the -si flag only extracts regions with image content, i.e. illustrations, pictures, photos or similar that were identified by the layout analysis as "graphical elements".

You can however cut out any regions from the image after layout analysis based on their pixel coordinates in the PAGE-XML output, which will give you the segment images in the same resolution as the source image.
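As a rough illustration of the coordinate-based cropping described above, here is a minimal sketch that reads the `points` attribute of `Coords` elements from a PAGE-XML string and turns each polygon into a bounding box. It uses only the Python standard library; the helper names (`local_name`, `parse_points`, `bounding_box`, `region_boxes`) and the tiny `SAMPLE` document are made up for this example and are not part of eynollah.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_points(points):
    """Turn a PAGE 'points' string like '10,20 110,20' into [(10, 20), (110, 20)]."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def bounding_box(polygon):
    """Return (left, top, right, bottom) as the min/max of the polygon coordinates."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def region_boxes(page_xml, element="TextLine"):
    """Yield (id, box) for every element of the given kind that has a Coords child."""
    root = ET.fromstring(page_xml)
    for node in root.iter():
        if local_name(node.tag) != element:
            continue
        for child in node:
            if local_name(child.tag) == "Coords":
                yield node.get("id"), bounding_box(parse_points(child.get("points")))
                break


# Hypothetical, hand-written PAGE-XML fragment used only for this demonstration.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.jpg" imageWidth="200" imageHeight="100">
    <TextRegion id="r1">
      <Coords points="5,5 195,5 195,95 5,95"/>
      <TextLine id="l1"><Coords points="10,10 190,12 190,40 10,38"/></TextLine>
      <TextLine id="l2"><Coords points="10,50 190,50 190,80 10,80"/></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

boxes = dict(region_boxes(SAMPLE))
for line_id, box in boxes.items():
    print(line_id, box)
```

Each box is a (left, top, right, bottom) tuple in source-image pixels, which is the shape that, for example, Pillow's `Image.crop` expects. Because the crop is taken from the original image, the cut-out keeps the source resolution.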

kba commented 3 years ago

@cneud , are you suggesting I download the version found at the link you gave?

The main branch works for me, so no, just make sure you are at the latest commit in main.

@kba , I think you're asking me to run eynollah --help from terminal after cd-ing into the repo root folder. If so, the same behavior occurs: no output, and after 3 or so seconds, a new command line appears ready to go.

From the other issue, I infer you're using conda. If the conda env is active, you do not need to be in the repo folder. Are you sure you have installed eynollah including its dependencies, i.e. conda activate yourenv; pip install . or can you try with a fresh environment to make sure this is not the issue?
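When a command exits silently like this, it can help to check from Python which interpreter is active, whether the command is on the PATH, and what `--help` actually returns. The following is only a sketch using the standard library; the function name `diagnose` is invented here, and the module name passed to it is an assumption that has to match however the package is named in your checkout.

```python
import importlib.util
import shutil
import subprocess
import sys


def diagnose(command, module):
    """Collect basic facts about a console command and the module behind it."""
    try:
        module_found = importlib.util.find_spec(module) is not None
    except (ImportError, ValueError):
        module_found = False

    report = {
        "python": sys.executable,               # interpreter of the active environment
        "command_path": shutil.which(command),  # None if the command is not on PATH
        "module_found": module_found,
        "returncode": None,
        "stdout": "",
        "stderr": "",
    }
    if report["command_path"]:
        # Run '<command> --help' and keep everything it prints, including errors.
        result = subprocess.run(
            [report["command_path"], "--help"],
            capture_output=True,
            text=True,
            timeout=120,
        )
        report["returncode"] = result.returncode
        report["stdout"] = result.stdout
        report["stderr"] = result.stderr
    return report
```

Calling `diagnose("eynollah", "eynollah")` inside the activated environment and printing the result shows whether the executable resolves at all and whether it exits with a non-zero return code or writes anything to stderr.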

I am assuming the -si argument results in image files for all the different segmented sections of the original image. Is that correct? And if so, are they lossless (ie, is anything lost in the process)?

Yes, with the -si option, cropped images of all the contours found by eynollah are written to that directory. GIGO, so this should not reduce image quality IIUC.

However, I see this as merely a debug function (@vahidrezanezhad correct me if I'm wrong), the important result is the PAGE-XML. From that (or any other) PAGE-XML you can use ocrd_segment, specifically ocrd-segment-extract-regions and *-lines to extract the cropped images afterwards. Even better would be, if you use this within a python project, to use the polygons in the PAGE-XML directly, so you don't lose that information in serialization which must be a bounding box.
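To make the difference between a polygon and its bounding box concrete, here is a small pure-Python sketch that builds a boolean mask for a polygon inside its bounding box, so that pixels outside the outline can be blanked after cropping. It uses the even-odd ray casting test on pixel centres; the function names and the example polygon are invented for illustration, and a real project would more likely use an imaging library for this.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd ray casting: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # the edge spans the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def polygon_mask(polygon):
    """Return (box, mask); mask[row][col] is True where the pixel centre is inside."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    left, top, right, bottom = min(xs), min(ys), max(xs), max(ys)
    mask = [
        [point_in_polygon(x + 0.5, y + 0.5, polygon) for x in range(left, right)]
        for y in range(top, bottom)
    ]
    return (left, top, right, bottom), mask


# Hypothetical slanted outline: the bounding box is 4x4, the polygon covers less.
outline = [(0, 0), (4, 0), (4, 2), (0, 4)]
box, mask = polygon_mask(outline)
covered = sum(sum(row) for row in mask)
print(box, covered, "of", len(mask) * len(mask[0]), "pixels are inside the polygon")
```

The pixels where the mask is False are exactly the extra background that a bounding-box crop would drag along, which is the information lost when only the box is kept.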

Here's an example image

And here's how eynollah would segment that page:

image

And without the image for clearer visuals:

image

The ruler confused the detection so the reading order is shoddy, should have cropped the printspace more vertically. But the regions and esp. lines (which are essential for OCR) are tight and accurate AFAICS.

kba commented 3 years ago

Yes, with the -si option, cropped images of all the contours found by eynollah are written to that directory. GIGO, so this should not reduce image quality IIUC.

I was wrong, @cneud had it right:

No I believe the -si flag only extracts regions with image content, i.e. illustrations, pictures, fotos or similar that were identified by the layout analysis as "graphical elements".

SB2020-eye commented 3 years ago

Regarding -si, does this mean that I would need to work with PAGE-XML in order to get the cut-out images of text lines? I have some doubts about my abilities in that realm (never worked with XML, never heard of XSLT, can't even locate the dependencies needed for that repo, etc). Lol.

I actually don't need OCR per se -- just images of text lines (or, even better, words, if possible). This is toward a subsequent goal of cutting out images of just glyphs (with no background). eynollah is obviously constructed for purposes more sophisticated than just what I'm describing.

I actually already have something slicing out images of text lines for me -- docExtractor. But having found your sbb_binarization and getting such positive results, I came to eynollah since sbb_binarization doesn't seem to run in python 3.8.6, which the rest of my program (including docExtractor) is currently running in. And I just don't know how to get them to "talk" to each other. So I figured maybe I could replace docExtractor with eynollah and have everything run in python 3.7.0 environment. (Yes, @kba , it is indeed a conda environment.)

If this sounds like I'm making things overly complicated, I probably am! And I'd appreciate you saying so (plus any suggestions you might have). Or if eynollah seems to you like it's a rabbit trail for my particular purposes, please don't hesitate to say so. You are obviously doing good work here!

vahidrezanezhad commented 3 years ago

Hi there, the -si option lets you crop and save the images found inside the document. This can also be done using the output XML data, but to make it easier we have provided this option too (to crop and save them while you run eynollah).

vahidrezanezhad commented 3 years ago

I am assuming the -si argument results in image files for all the different segmented sections of the original image. Is that correct?

No I believe the -si flag only extracts regions with image content, i.e. illustrations, pictures, fotos or similar that were identified by the layout analysis as "graphical elements".

You can however cut out any regions from the image after layout analysis based on their pixel coordinate

Correct. Thank you

kba commented 3 years ago

If this sounds like I'm making things overly complicated, I probably am!

IIUC you want to create some sort of glyph repository, so you're not interested in the text detection but in getting lines and glyphs from the lines in a bitonal format.

You want to preprocess your page to crop it to the print space (which should get rid of opposing pages, rulers etc.), deskew/dewarp it (if lines aren't perfectly orthogonal to the image, or the page has water damage or a deep joint) and then segment the page into lines. We have multiple tools for that in OCR-D, see https://ocr-d.de/en/workflows. Then you can use an OCR engine like tesseract or calamari to do the recognition down to glyph level, disregard the actual detected text, and just use the bounding boxes of the glyphs to cut them out of the original image.

Yes, this would involve working with PAGE-XML. We do have a pythonic API for that in OCR-D/core though that can make this a bit easier, at the end of the day it's a hierarchical data structure like any other: Page -> TextLine -> Word -> Glyph -> Coords -> points.
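To show how little is needed to traverse that hierarchy, here is a sketch that walks a PAGE-XML document with the standard library and lists every TextLine, Word and Glyph together with its `Coords` points. It deliberately does not use the OCR-D/core API mentioned above; the `walk` helper and the `SAMPLE` fragment are invented for this example.

```python
import xml.etree.ElementTree as ET

LEVELS = ("TextLine", "Word", "Glyph")


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def walk(node, depth=0):
    """Yield (depth, element name, id, points) for each TextLine/Word/Glyph below node."""
    for child in node:
        name = local_name(child.tag)
        if name in LEVELS:
            coords = next((c for c in child if local_name(c.tag) == "Coords"), None)
            points = coords.get("points") if coords is not None else None
            yield depth, name, child.get("id"), points
            yield from walk(child, depth + 1)
        else:
            # Containers such as Page or TextRegion: descend without indenting.
            yield from walk(child, depth)


# Hypothetical, hand-written PAGE-XML fragment used only for this demonstration.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.jpg" imageWidth="100" imageHeight="50">
    <TextRegion id="r1">
      <Coords points="0,0 100,0 100,50 0,50"/>
      <TextLine id="l1">
        <Coords points="5,5 95,5 95,25 5,25"/>
        <Word id="w1">
          <Coords points="5,5 40,5 40,25 5,25"/>
          <Glyph id="g1"><Coords points="5,5 20,5 20,25 5,25"/></Glyph>
          <Glyph id="g2"><Coords points="22,5 40,5 40,25 22,25"/></Glyph>
        </Word>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

entries = list(walk(ET.fromstring(SAMPLE)))
for depth, name, element_id, points in entries:
    print("  " * depth + f"{name} {element_id}: {points}")
```

For a file on disk, `ET.parse(path).getroot()` can be passed to `walk` instead of the parsed sample string.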

But I suggest you drop by our chat at https://gitter.im/OCR-D/Lobby, say hi and describe your use case, it's easier to discuss there than in an issue.

kba commented 3 years ago

-si option gives you this capability to crop and save images inside the document

@vahidrezanezhad just to make sure: with "save images" you mean "save graphic regions", correct?

vahidrezanezhad commented 3 years ago

-si option gives you this capability to crop and save images inside the document

@vahidrezanezhad just to make sure: with "save images" you mean "save graphic regions", correct?

Yes :)

vahidrezanezhad commented 3 years ago

The ruler confused the detection so the reading order is shoddy, should have cropped the printspace more vertically. But the regions and esp. lines (which are essential for OCR) are tight and accurate AFAICS.

As you mentioned, the bad reading order here is caused by the page detector (this happens simply because the ground truth, GT, contained no such documents). But this is a general problem for reading order detection: it can also occur for documents with multiple columns and footnotes, even when the print space has been extracted correctly.

main-qimg-b6abcab9f04b2c29fe571802d348a973

and have a look at reading order

footnotes_disorder

you see reading order still is not correct :)

cneud commented 3 years ago

main should be working after #12.

I still had to do the following to get main working:

  - install tqdm and seaborn via pip
  - downgrade keras: pip install keras==2.3.1

With these changes, I can successfully run the tool (on Ubuntu, not Windows though).

SB2020-eye commented 3 years ago

-si option gives you this capability to crop and save images inside the document

@vahidrezanezhad just to make sure: with "save images" you mean "save graphic regions", correct?

Yes :)

And does that mean "save graphic regions...as image files", or something else? Thanks.

SB2020-eye commented 3 years ago

But I suggest you drop by our chat at https://gitter.im/OCR-D/Lobby, say hi and describe your use case, it's easier to discuss there than in an issue.

Thanks. I just posted something.

kba commented 3 years ago

install tqdm and seaborn via pip

I wonder why you need those. Are you sure you're up-to-date? These have been removed in 9596a44 and 801ccac resp.

downgrade keras pip install keras==2.3.1

Oh, yes, that's fixed in the refactoring but should be in main too, ef1e32e

And does that mean "save graphic regions...as image files", or something else? Thanks.

Yes, the graphic regions are saved as JPEG image files.

cneud commented 3 years ago

Yes this was on a clean clone of https://github.com/qurator-spk/eynollah/commit/c7d509bb2cfe12703e3321b393f603a6a9f900b5 - I still had to install both packages manually or eynollah would not run.

I also could not get any images extracted using -si. Does this only work in combination with -fl=true? @vahidrezanezhad

Also I am getting OOM exception due to Tensor shape... every time I try to run eynollah with the -fl=true parameter on my Geforce RTX2070S with 8 GB :(

vahidrezanezhad commented 3 years ago

Yes this was on a clean clone of c7d509b - I still had to install both packages manually or eynollah would not run.

I also could not get any images extracted using -si. Does this only work in combination with -fl=true? @vahidrezanezhad

Also I am getting OOM exception due to Tensor shape... every time I try to run eynollah with the -fl=true parameter on my Geforce RTX2070S with 8 GB :(

No, -si has nothing to do with the -fl option. With -si, a directory has to be given.

cneud commented 3 years ago

Hmm, when I tried using e.g. eynollah -i 00000015.tif -o . -si . I did not get any images extracted to that directory? I was using this image https://content.staatsbibliothek-berlin.de/dms/PPN626696453/1200/0/00000015.tif?original=true.

kba commented 3 years ago

Hmm, when I tried using e.g. eynollah -i 00000015.tif -o . -si . I did not get any images extracted to that directory? I was using this image https://content.staatsbibliothek-berlin.de/dms/PPN626696453/1200/0/00000015.tif?original=true.

That might well be a regression on my part, investigating.

seaborn and tqdm

I am still confused about this. Can you try pip uninstall tqdm seaborn and provide the stacktrace this causes please?

pipdeptree shows this dependency tree for me:

pipdeptree -p eynollah
eynollah==0.0.1
  - imutils [required: >=0.5.3, installed: 0.5.3]
  - keras [required: >=2.3.1, installed: 2.3.1]
    - h5py [required: Any, installed: 2.10.0]
      - numpy [required: >=1.7, installed: 1.18.5]
      - six [required: Any, installed: 1.15.0]
    - keras-applications [required: >=1.0.6, installed: 1.0.8]
      - h5py [required: Any, installed: 2.10.0]
        - numpy [required: >=1.7, installed: 1.18.5]
        - six [required: Any, installed: 1.15.0]
      - numpy [required: >=1.9.1, installed: 1.18.5]
    - keras-preprocessing [required: >=1.0.5, installed: 1.1.0]
      - numpy [required: >=1.9.1, installed: 1.18.5]
      - six [required: >=1.9.0, installed: 1.15.0]
    - numpy [required: >=1.9.1, installed: 1.18.5]
    - pyyaml [required: Any, installed: 5.3.1]
    - scipy [required: >=0.14, installed: 1.4.1]
      - numpy [required: >=1.13.3, installed: 1.18.5]
    - six [required: >=1.9.0, installed: 1.15.0]
  - matplotlib [required: Any, installed: 3.3.1]
    - certifi [required: >=2020.06.20, installed: 2020.6.20]
    - cycler [required: >=0.10, installed: 0.10.0]
      - six [required: Any, installed: 1.15.0]
    - kiwisolver [required: >=1.0.1, installed: 1.2.0]
    - numpy [required: >=1.15, installed: 1.18.5]
    - pillow [required: >=6.2.0, installed: 7.2.0]
    - pyparsing [required: >=2.0.3,!=2.1.6,!=2.1.2,!=2.0.4, installed: 2.4.7]
    - python-dateutil [required: >=2.1, installed: 2.8.1]
      - six [required: >=1.5, installed: 1.15.0]
  - ocrd [required: >=2.20.1, installed: 2.22.3]
    - bagit [required: >=1.7.0, installed: 1.7.0]
    - bagit-profile [required: >=1.3.0, installed: 1.3.1]
      - bagit [required: Any, installed: 1.7.0]
      - requests [required: Any, installed: 2.24.0]
        - certifi [required: >=2017.4.17, installed: 2020.6.20]
        - chardet [required: >=3.0.2,<4, installed: 3.0.4]
        - idna [required: >=2.5,<3, installed: 2.10]
        - urllib3 [required: >=1.21.1,<1.26,!=1.25.1,!=1.25.0, installed: 1.25.10]
    - click [required: >=7, installed: 7.1.2]
    - Deprecated [required: ==1.2.0, installed: 1.2.0]
      - wrapt [required: >=1,<2, installed: 1.12.1]
    - Flask [required: Any, installed: 1.1.2]
      - click [required: >=5.1, installed: 7.1.2]
      - itsdangerous [required: >=0.24, installed: 1.1.0]
      - Jinja2 [required: >=2.10.1, installed: 2.11.2]
        - MarkupSafe [required: >=0.23, installed: 1.1.1]
      - Werkzeug [required: >=0.15, installed: 1.0.1]
    - jsonschema [required: Any, installed: 3.2.0]
      - attrs [required: >=17.4.0, installed: 20.2.0]
      - importlib-metadata [required: Any, installed: 2.0.0]
        - zipp [required: >=0.5, installed: 3.2.0]
      - pyrsistent [required: >=0.14.0, installed: 0.17.3]
      - setuptools [required: Any, installed: 50.3.0]
      - six [required: >=1.11.0, installed: 1.15.0]
    - lxml [required: Any, installed: 4.5.2]
    - ocrd-modelfactory [required: ==2.22.3, installed: 2.22.3]
      - lxml [required: Any, installed: 4.5.2]
      - ocrd-models [required: ==2.22.3, installed: 2.22.3]
        - lxml [required: Any, installed: 4.5.2]
        - ocrd-utils [required: ==2.22.3, installed: 2.22.3]
          - atomicwrites [required: >=1.3.0, installed: 1.4.0]
          - numpy [required: Any, installed: 1.18.5]
          - Pillow [required: >=7.2.0, installed: 7.2.0]
      - ocrd-utils [required: ==2.22.3, installed: 2.22.3]
        - atomicwrites [required: >=1.3.0, installed: 1.4.0]
        - numpy [required: Any, installed: 1.18.5]
        - Pillow [required: >=7.2.0, installed: 7.2.0]
    - ocrd-models [required: ==2.22.3, installed: 2.22.3]
      - lxml [required: Any, installed: 4.5.2]
      - ocrd-utils [required: ==2.22.3, installed: 2.22.3]
        - atomicwrites [required: >=1.3.0, installed: 1.4.0]
        - numpy [required: Any, installed: 1.18.5]
        - Pillow [required: >=7.2.0, installed: 7.2.0]
    - ocrd-utils [required: ==2.22.3, installed: 2.22.3]
      - atomicwrites [required: >=1.3.0, installed: 1.4.0]
      - numpy [required: Any, installed: 1.18.5]
      - Pillow [required: >=7.2.0, installed: 7.2.0]
    - ocrd-validators [required: ==2.22.3, installed: 2.22.3]
      - bagit [required: >=1.7.0, installed: 1.7.0]
      - bagit-profile [required: >=1.3.0, installed: 1.3.1]
        - bagit [required: Any, installed: 1.7.0]
        - requests [required: Any, installed: 2.24.0]
          - certifi [required: >=2017.4.17, installed: 2020.6.20]
          - chardet [required: >=3.0.2,<4, installed: 3.0.4]
          - idna [required: >=2.5,<3, installed: 2.10]
          - urllib3 [required: >=1.21.1,<1.26,!=1.25.1,!=1.25.0, installed: 1.25.10]
      - click [required: >=7, installed: 7.1.2]
      - jsonschema [required: Any, installed: 3.2.0]
        - attrs [required: >=17.4.0, installed: 20.2.0]
        - importlib-metadata [required: Any, installed: 2.0.0]
          - zipp [required: >=0.5, installed: 3.2.0]
        - pyrsistent [required: >=0.14.0, installed: 0.17.3]
        - setuptools [required: Any, installed: 50.3.0]
        - six [required: >=1.11.0, installed: 1.15.0]
      - ocrd-modelfactory [required: ==2.22.3, installed: 2.22.3]
        - lxml [required: Any, installed: 4.5.2]
        - ocrd-models [required: ==2.22.3, installed: 2.22.3]
          - lxml [required: Any, installed: 4.5.2]
          - ocrd-utils [required: ==2.22.3, installed: 2.22.3]
            - atomicwrites [required: >=1.3.0, installed: 1.4.0]
            - numpy [required: Any, installed: 1.18.5]
            - Pillow [required: >=7.2.0, installed: 7.2.0]
        - ocrd-utils [required: ==2.22.3, installed: 2.22.3]
          - atomicwrites [required: >=1.3.0, installed: 1.4.0]
          - numpy [required: Any, installed: 1.18.5]
          - Pillow [required: >=7.2.0, installed: 7.2.0]
      - ocrd-models [required: ==2.22.3, installed: 2.22.3]
        - lxml [required: Any, installed: 4.5.2]
        - ocrd-utils [required: ==2.22.3, installed: 2.22.3]
          - atomicwrites [required: >=1.3.0, installed: 1.4.0]
          - numpy [required: Any, installed: 1.18.5]
          - Pillow [required: >=7.2.0, installed: 7.2.0]
      - ocrd-utils [required: ==2.22.3, installed: 2.22.3]
        - atomicwrites [required: >=1.3.0, installed: 1.4.0]
        - numpy [required: Any, installed: 1.18.5]
        - Pillow [required: >=7.2.0, installed: 7.2.0]
      - pyyaml [required: Any, installed: 5.3.1]
      - shapely [required: Any, installed: 1.7.1]
    - opencv-python-headless [required: Any, installed: 4.4.0.44]
      - numpy [required: >=1.13.3, installed: 1.18.5]
    - pyyaml [required: Any, installed: 5.3.1]
    - requests [required: Any, installed: 2.24.0]
      - certifi [required: >=2017.4.17, installed: 2020.6.20]
      - chardet [required: >=3.0.2,<4, installed: 3.0.4]
      - idna [required: >=2.5,<3, installed: 2.10]
      - urllib3 [required: >=1.21.1,<1.26,!=1.25.1,!=1.25.0, installed: 1.25.10]
  - scikit-learn [required: >=0.23.2, installed: 0.23.2]
    - joblib [required: >=0.11, installed: 0.17.0]
    - numpy [required: >=1.13.3, installed: 1.18.5]
    - scipy [required: >=0.19.1, installed: 1.4.1]
      - numpy [required: >=1.13.3, installed: 1.18.5]
    - threadpoolctl [required: >=2.0.0, installed: 2.1.0]
  - tensorflow-gpu [required: >=1.15,<2, installed: 1.15.3]
    - absl-py [required: >=0.7.0, installed: 0.10.0]
      - six [required: Any, installed: 1.15.0]
    - astor [required: >=0.6.0, installed: 0.8.1]
    - gast [required: ==0.2.2, installed: 0.2.2]
    - google-pasta [required: >=0.1.6, installed: 0.2.0]
      - six [required: Any, installed: 1.15.0]
    - grpcio [required: >=1.8.6, installed: 1.31.0]
      - six [required: >=1.5.2, installed: 1.15.0]
    - keras-applications [required: >=1.0.8, installed: 1.0.8]
      - h5py [required: Any, installed: 2.10.0]
        - numpy [required: >=1.7, installed: 1.18.5]
        - six [required: Any, installed: 1.15.0]
      - numpy [required: >=1.9.1, installed: 1.18.5]
    - keras-preprocessing [required: >=1.0.5, installed: 1.1.0]
      - numpy [required: >=1.9.1, installed: 1.18.5]
      - six [required: >=1.9.0, installed: 1.15.0]
    - numpy [required: >=1.16.0,<2.0, installed: 1.18.5]
    - opt-einsum [required: >=2.3.2, installed: 3.3.0]
      - numpy [required: >=1.7, installed: 1.18.5]
    - protobuf [required: >=3.6.1, installed: 3.13.0]
      - setuptools [required: Any, installed: 50.3.0]
      - six [required: >=1.9, installed: 1.15.0]
    - six [required: >=1.10.0, installed: 1.15.0]
    - tensorboard [required: >=1.15.0,<1.16.0, installed: 1.15.0]
      - absl-py [required: >=0.4, installed: 0.10.0]
        - six [required: Any, installed: 1.15.0]
      - grpcio [required: >=1.6.3, installed: 1.31.0]
        - six [required: >=1.5.2, installed: 1.15.0]
      - markdown [required: >=2.6.8, installed: 3.2.2]
        - importlib-metadata [required: Any, installed: 2.0.0]
          - zipp [required: >=0.5, installed: 3.2.0]
      - numpy [required: >=1.12.0, installed: 1.18.5]
      - protobuf [required: >=3.6.0, installed: 3.13.0]
        - setuptools [required: Any, installed: 50.3.0]
        - six [required: >=1.9, installed: 1.15.0]
      - setuptools [required: >=41.0.0, installed: 50.3.0]
      - six [required: >=1.10.0, installed: 1.15.0]
      - werkzeug [required: >=0.11.15, installed: 1.0.1]
      - wheel [required: >=0.26, installed: 0.36.2]
    - tensorflow-estimator [required: ==1.15.1, installed: 1.15.1]
    - termcolor [required: >=1.1.0, installed: 1.1.0]
    - wheel [required: >=0.26, installed: 0.36.2]
    - wrapt [required: >=1.11.1, installed: 1.12.1]

cneud commented 3 years ago

So I do the following:

  1. create a fresh venv and activate it
  2. update pip
  3. git clone https://github.com/qurator-spk/eynollah
  4. pip install .

Now when I try to run eynollah it will complain about missing seaborn

eynollah -i PPN798786388_00000005.tif -o . -m ~/tmp/dev/qurator/models/eynollah
Traceback (most recent call last):
  File "/usr/local/bin/eynollah", line 11, in <module>
    import seaborn as sns
ModuleNotFoundError: No module named 'seaborn'

So install seaborn with pip and run again:

eynollah -i PPN798786388_00000005.tif -o . -m ~/tmp/dev/qurator/models/eynollah
Traceback (most recent call last):
  File "/usr/local/bin/eynollah", line 14, in <module>
    from tqdm import tqdm
ModuleNotFoundError: No module named 'tqdm'

After installation of tqdm, it runs fine.
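The two tracebacks above are the typical symptom of a script importing modules that are not installed alongside it. As a hedged sketch of how one could list such imports up front instead of discovering them one run at a time, the following uses the standard `ast` module to collect top-level imports from a source string and `importlib.util.find_spec` to check which of them the current environment cannot resolve. The function names and the `SCRIPT` snippet are illustrative only.

```python
import ast
import importlib.util


def top_level_imports(source):
    """Return the set of top-level module names imported anywhere in the source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names.add(node.module.split(".")[0])
    return names


def missing_modules(source):
    """Return the sorted imported module names the current environment cannot find."""
    missing = []
    for name in sorted(top_level_imports(source)):
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing


# Illustrative snippet mirroring the two imports named in the tracebacks above.
SCRIPT = "import os\nimport seaborn as sns\nfrom tqdm import tqdm\n"

print("imports:", sorted(top_level_imports(SCRIPT)))
print("missing here:", missing_modules(SCRIPT))
```

Running `missing_modules` on the text of the installed entry-point script, inside the same environment, would list every unresolved import at once.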

pip uninstall tqdm seaborn will give me

pip3 uninstall tqdm seaborn
Found existing installation: tqdm 4.56.0
Uninstalling tqdm-4.56.0:
  Would remove:
    /home/cnd/tmp/dev/qurator/tools/venv-qurator/bin/tqdm
    /home/cnd/tmp/dev/qurator/tools/venv-qurator/lib/python3.6/site-packages/tqdm-4.56.0.dist-info/*
    /home/cnd/tmp/dev/qurator/tools/venv-qurator/lib/python3.6/site-packages/tqdm/*
Proceed (y/n)? n
Found existing installation: seaborn 0.11.1
Uninstalling seaborn-0.11.1:
  Would remove:
    /home/cnd/tmp/dev/qurator/tools/venv-qurator/lib/python3.6/site-packages/seaborn-0.11.1.dist-info/*
    /home/cnd/tmp/dev/qurator/tools/venv-qurator/lib/python3.6/site-packages/seaborn/*
Proceed (y/n)? 

Output of pipdeptree -p eynollah:

```
eynollah==0.0.1
  - imutils [required: >=0.5.3, installed: 0.5.4]
  - keras [required: >=2.3.1, installed: 2.4.3]
    - h5py [required: Any, installed: 2.10.0]
      - numpy [required: >=1.7, installed: 1.18.5]
      - six [required: Any, installed: 1.15.0]
    - numpy [required: >=1.9.1, installed: 1.18.5]
    - pyyaml [required: Any, installed: 5.4.1]
    - scipy [required: >=0.14, installed: 1.5.4]
      - numpy [required: >=1.14.5, installed: 1.18.5]
  - matplotlib [required: Any, installed: 3.3.4]
    - cycler [required: >=0.10, installed: 0.10.0]
      - six [required: Any, installed: 1.15.0]
    - kiwisolver [required: >=1.0.1, installed: 1.3.1]
    - numpy [required: >=1.15, installed: 1.18.5]
    - pillow [required: >=6.2.0, installed: 8.1.0]
    - pyparsing [required: >=2.0.3,!=2.1.6,!=2.1.2,!=2.0.4, installed: 2.4.7]
    - python-dateutil [required: >=2.1, installed: 2.8.1]
      - six [required: >=1.5, installed: 1.15.0]
  - ocrd [required: >=2.20.1, installed: 2.22.3]
    - bagit [required: >=1.7.0, installed: 1.8.0]
    - bagit-profile [required: >=1.3.0, installed: 1.3.1]
      - bagit [required: Any, installed: 1.8.0]
      - requests [required: Any, installed: 2.25.1]
        - certifi [required: >=2017.4.17, installed: 2020.12.5]
        - chardet [required: >=3.0.2,<5, installed: 4.0.0]
        - idna [required: >=2.5,<3, installed: 2.10]
        - urllib3 [required: >=1.21.1,<1.27, installed: 1.26.3]
    - click [required: >=7, installed: 7.1.2]
    - Deprecated [required: ==1.2.0, installed: 1.2.0]
      - wrapt [required: >=1,<2, installed: 1.12.1]
    - Flask [required: Any, installed: 1.1.2]
      - click [required: >=5.1, installed: 7.1.2]
      - itsdangerous [required: >=0.24, installed: 1.1.0]
      - Jinja2 [required: >=2.10.1, installed: 2.11.3]
        - MarkupSafe [required: >=0.23, installed: 1.1.1]
      - Werkzeug [required: >=0.15, installed: 1.0.1]
    - jsonschema [required: Any, installed: 3.2.0]
      - attrs [required: >=17.4.0, installed: 20.3.0]
      - importlib-metadata [required: Any, installed: 3.4.0]
        - typing-extensions [required: >=3.6.4, installed: 3.7.4.3]
        - zipp [required: >=0.5, installed: 3.4.0]
      - pyrsistent [required: >=0.14.0, installed: 0.17.3]
      - setuptools [required: Any, installed: 53.0.0]
      - six [required: >=1.11.0, installed: 1.15.0]
    - lxml [required: Any, installed: 4.6.2]
    - ocrd-modelfactory [required: ==2.22.3, installed: 2.22.3]
      - lxml [required: Any, installed: 4.6.2]
      - ocrd-models [required: ==2.22.3, installed: 2.22.3]
        - lxml [required: Any, installed: 4.6.2]
        - ocrd-utils [required: ==2.22.3, installed: 2.22.3]
          - atomicwrites [required: >=1.3.0, installed: 1.4.0]
          - numpy [required: Any, installed: 1.18.5]
          - Pillow [required: >=7.2.0, installed: 8.1.0]
      - ocrd-utils [required: ==2.22.3, installed: 2.22.3]
        - atomicwrites [required: >=1.3.0, installed: 1.4.0]
        - numpy [required: Any, installed: 1.18.5]
        - Pillow [required: >=7.2.0, installed: 8.1.0]
    - ocrd-models [required: ==2.22.3, installed: 2.22.3]
      - lxml [required: Any, installed: 4.6.2]
      - ocrd-utils [required: ==2.22.3, installed: 2.22.3]
        - atomicwrites [required: >=1.3.0, installed: 1.4.0]
        - numpy [required: Any, installed: 1.18.5]
        - Pillow [required: >=7.2.0, installed: 8.1.0]
    - ocrd-utils [required: ==2.22.3, installed: 2.22.3]
      - atomicwrites [required: >=1.3.0, installed: 1.4.0]
      - numpy [required: Any, installed: 1.18.5]
      - Pillow [required: >=7.2.0, installed: 8.1.0]
    - ocrd-validators [required: ==2.22.3, installed: 2.22.3]
      - bagit [required: >=1.7.0, installed: 1.8.0]
      - bagit-profile [required: >=1.3.0, installed: 1.3.1]
        - bagit [required: Any, installed: 1.8.0]
        - requests [required: Any, installed: 2.25.1]
          - certifi [required: >=2017.4.17, installed: 2020.12.5]
          - chardet [required: >=3.0.2,<5, installed: 4.0.0]
          - idna [required: >=2.5,<3, installed: 2.10]
          - urllib3 [required: >=1.21.1,<1.27, installed: 1.26.3]
      - click [required: >=7, installed: 7.1.2]
      - jsonschema [required: Any, installed: 3.2.0]
        - attrs [required: >=17.4.0, installed: 20.3.0]
        - importlib-metadata [required: Any, installed: 3.4.0]
          - typing-extensions [required: >=3.6.4, installed: 3.7.4.3]
          - zipp [required: >=0.5, installed: 3.4.0]
        - pyrsistent [required: >=0.14.0, installed: 0.17.3]
        - setuptools [required: Any, installed: 53.0.0]
        - six [required: >=1.11.0, installed: 1.15.0]
      - ocrd-modelfactory [required: ==2.22.3, installed: 2.22.3]
        - lxml [required: Any, installed: 4.6.2]
        - ocrd-models [required: ==2.22.3, installed: 2.22.3]
          - lxml [required: Any, installed: 4.6.2]
          - ocrd-utils [required: ==2.22.3, installed: 2.22.3]
            - atomicwrites [required: >=1.3.0, installed: 1.4.0]
            - numpy [required: Any, installed: 1.18.5]
            - Pillow [required: >=7.2.0, installed: 8.1.0]
        - ocrd-utils [required: ==2.22.3, installed: 2.22.3]
          - atomicwrites [required: >=1.3.0, installed: 1.4.0]
          - numpy [required: Any, installed: 1.18.5]
          - Pillow [required: >=7.2.0, installed: 8.1.0]
      - ocrd-models [required: ==2.22.3, installed: 2.22.3]
        - lxml [required: Any, installed: 4.6.2]
        - ocrd-utils [required: ==2.22.3, installed: 2.22.3]
          - atomicwrites [required: >=1.3.0, installed: 1.4.0]
          - numpy [required: Any, installed: 1.18.5]
          - Pillow [required: >=7.2.0, installed: 8.1.0]
      - ocrd-utils [required: ==2.22.3, installed: 2.22.3]
        - atomicwrites [required: >=1.3.0, installed: 1.4.0]
        - numpy [required: Any, installed: 1.18.5]
        - Pillow [required: >=7.2.0, installed: 8.1.0]
      - pyyaml [required: Any, installed: 5.4.1]
      - shapely [required: Any, installed: 1.7.1]
    - opencv-python-headless [required: Any, installed: 4.5.1.48]
      - numpy [required: >=1.13.3, installed: 1.18.5]
    - pyyaml [required: Any, installed: 5.4.1]
    - requests [required: Any, installed: 2.25.1]
      - certifi [required: >=2017.4.17, installed: 2020.12.5]
      - chardet [required: >=3.0.2,<5, installed: 4.0.0]
      - idna [required: >=2.5,<3, installed: 2.10]
      - urllib3 [required: >=1.21.1,<1.27, installed: 1.26.3]
  - scikit-learn [required: >=0.23.2, installed: 0.24.1]
    - joblib [required: >=0.11, installed: 1.0.0]
    - numpy [required: >=1.13.3, installed: 1.18.5]
    - scipy [required: >=0.19.1, installed: 1.5.4]
      - numpy [required: >=1.14.5, installed: 1.18.5]
    - threadpoolctl [required: >=2.0.0, installed: 2.1.0]
  - tensorflow-gpu [required: >=1.15,<2, installed: 1.15.5]
    - absl-py [required: >=0.7.0, installed: 0.11.0]
      - six [required: Any, installed: 1.15.0]
    - astor [required: >=0.6.0, installed: 0.8.1]
    - gast [required: ==0.2.2, installed: 0.2.2]
    - google-pasta [required: >=0.1.6, installed: 0.2.0]
      - six [required: Any, installed: 1.15.0]
    - grpcio [required: >=1.8.6, installed: 1.35.0]
      - six [required: >=1.5.2, installed: 1.15.0]
    - h5py [required: <=2.10.0, installed: 2.10.0]
      - numpy [required: >=1.7, installed: 1.18.5]
      - six [required: Any, installed: 1.15.0]
    - keras-applications [required: >=1.0.8, installed: 1.0.8]
      - h5py [required: Any, installed: 2.10.0]
        - numpy [required: >=1.7, installed: 1.18.5]
        - six [required: Any, installed: 1.15.0]
      - numpy [required: >=1.9.1, installed: 1.18.5]
    - keras-preprocessing [required: >=1.0.5, installed: 1.1.2]
      - numpy [required: >=1.9.1, installed: 1.18.5]
      - six [required: >=1.9.0, installed: 1.15.0]
    - numpy [required: >=1.16.0,<1.19.0, installed: 1.18.5]
    - opt-einsum [required: >=2.3.2, installed: 3.3.0]
      - numpy [required: >=1.7, installed: 1.18.5]
    - protobuf [required: >=3.6.1, installed: 3.14.0]
      - six [required: >=1.9, installed: 1.15.0]
    - six [required: >=1.10.0, installed: 1.15.0]
    - tensorboard [required: >=1.15.0,<1.16.0, installed: 1.15.0]
      - absl-py [required: >=0.4, installed: 0.11.0]
        - six [required: Any, installed: 1.15.0]
      - grpcio [required: >=1.6.3, installed: 1.35.0]
        - six [required: >=1.5.2, installed: 1.15.0]
      - markdown [required: >=2.6.8, installed: 3.3.3]
        - importlib-metadata [required: Any, installed: 3.4.0]
          - typing-extensions [required: >=3.6.4, installed: 3.7.4.3]
          - zipp [required: >=0.5, installed: 3.4.0]
      - numpy [required: >=1.12.0, installed: 1.18.5]
      - protobuf [required: >=3.6.0, installed: 3.14.0]
        - six [required: >=1.9, installed: 1.15.0]
      - setuptools [required: >=41.0.0, installed: 53.0.0]
      - six [required: >=1.10.0, installed: 1.15.0]
      - werkzeug [required: >=0.11.15, installed: 1.0.1]
      - wheel [required: >=0.26, installed: 0.36.2]
    - tensorflow-estimator [required: ==1.15.1, installed: 1.15.1]
    - termcolor [required: >=1.1.0, installed: 1.1.0]
    - wheel [required: >=0.26, installed: 0.36.2]
    - wrapt [required: >=1.11.1, installed: 1.12.1]
```
kba commented 3 years ago

File "/usr/local/bin/eynollah", line 11, in

Wait, your venv is not /usr/local, is it? Looks like you installed eynollah before without a virtualenv to /usr/local/bin/eynollah - can you move/remove that file? which eynollah should point to $VIRTUAL_ENV/bin/eynollah.

cneud commented 3 years ago

Argh, you are right!

I deactivated the venv and uninstalled eynollah.

Then I activated the venv again and installed again via pip. Now which eynollah returns the correct path to the venv, /home/cnd/tmp/dev/qurator/tools/venv-qurator/bin/eynollah, but I am not getting any output anymore (it immediately exits with no message).

cneud commented 3 years ago

Apparently I had an older version installed to /usr/local/bin/eynollah. Thanks to @kba's amazing debugging skills, we were eventually able to track this down, and now https://github.com/qurator-spk/eynollah/pull/18 works for me (without any need to install seaborn or tqdm and with a working -si parameter!).

Jim-Salmons commented 3 years ago

Here's an update on my recent experience installing eynollah natively on Windows 10:

If I have a chance today, I will try a clean install w/ tf 2+, keras 2.4.3, and a relaxed requirement for eynollah to accept this configuration. I suspect it won't work due to the major refactoring in tf 2+. If anyone has a better idea to suggest, please don't hesitate to advise me.

ITMT, I have updated my Windows dev box to the latest Docker using WSL2. I'm in the process of learning how to config PyCharm Pro to do remote/virtual debuggable coding from my Windows IDE working on a live Docker image. I want to get this going as it will let me work more easily with OCR-D and similar research projects while still having PyCharm under Windows which includes the Kite Pro coding assistance. Kite is super helpful due to my severely limited keyboarding abilities following a July spinal cord injury.

cneud commented 3 years ago

Hi @Jim-Salmons, thanks for sharing! I am in a bit of a hurry, but chances to get eynollah working will be much improved once we have completed the refactoring which should only take a few more weeks hopefully. Meanwhile, one must use a version of keras <2.4 (cf. https://github.com/qurator-spk/eynollah/pull/18) as newer versions will pull in Tensorflow 2 whereas the tool only works with Tensorflow version 1.15.x. For TF2, the code would need to be adapted and the models retrained.
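For anyone who wants to check an environment against the constraint above, here is a minimal sketch. It assumes you already have the installed versions as a name-to-version mapping (for example copied by hand from pip freeze). The helper names version_tuple and check_pins are hypothetical and not part of eynollah.

```python
def version_tuple(version):
    """Turn '1.15.5' into (1, 15, 5); anything after the leading digits of a piece is ignored."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_pins(installed):
    """Return a list of problems for a {package name: version string} mapping.

    The rules encode only what this thread states: keras must be < 2.4 and
    TensorFlow must be a 1.15.x release.
    """
    problems = []
    keras = installed.get("keras")
    if keras is not None and version_tuple(keras)[:2] >= (2, 4):
        problems.append("keras %s is >= 2.4 and will pull in TensorFlow 2" % keras)
    for name in ("tensorflow", "tensorflow-gpu"):
        tf = installed.get(name)
        if tf is not None and version_tuple(tf)[:2] != (1, 15):
            problems.append("%s %s is not a 1.15.x release" % (name, tf))
    return problems
```

An empty list means the mapping satisfies both rules; a missing package is not reported, since the sketch only judges versions it is given.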

Jim-Salmons commented 3 years ago

Hi Clemons @cneud - Thanks for the quick reply. Sounds like the best strategy is to wait for the next release. In the meantime I can get my sea-legs under me using PyCharm Pro under Windows on a Docker/WSL2 image. 🤪

mikegerber commented 3 years ago

File "/usr/local/bin/eynollah", line 11, in

Wait, your venv is not /usr/local, is it? Looks like you installed eynollah before without a virtualenv to /usr/local/bin/eynollah - can you move/remove that file? which eynollah should point to $VIRTUAL_ENV/bin/eynollah.

Careful, which eynollah can absolutely give $VIRTUAL_ENV/bin/eynollah while you're still calling /usr/local/bin/eynollah, because your shell might still be caching that eynollah is /usr/local/bin/eynollah. You need a rehash or a fresh terminal in that case. I have been bitten by this more than once...

(There are a few subtleties: if this was @cneud's problem, he had called the /usr/local eynollah before activating the new venv/installing eynollah, so that this whole confusion is possible... 🔍)
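To see whether more than one eynollah is installed at all, a small sketch that lists every match on PATH in lookup order may help. The function name find_on_path is hypothetical, and the executable check is POSIX-oriented (on Windows the launcher carries an .exe suffix, which this sketch does not handle).

```python
import os


def find_on_path(name, path=None):
    """Return every executable file called `name` on PATH, in lookup order."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        # Keep only regular files the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits
```

If find_on_path("eynollah") returns more than one entry, an older copy can shadow the venv one. Note that this reads PATH directly and cannot see the shell's own command cache; in bash, hash -r clears that cache, and in zsh, rehash does.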

SB2020-eye commented 3 years ago

Thanks to everyone weighing in with input on this.

I made a fresh conda environment (Windows 10 OS) and took another go at it. I needed to install pip to install eynollah. But after some fails, I figured out that pip installed python 3.9. So I installed python 3.6. pip install . worked after that -- but not until I manually made the changes referenced above (#18).

I used msys64 + mingw64 to install the models successfully.

Running eynollah -i C:/Users/Scott/Desktop/Python2/K/eyn_test/F073r.jpg -o C:/Users/Scott/Desktop/Python2/K/eyn_test/results -m C:/users/scott/desktop/python2/eynollah/models_eynollah -si C:/users/scott/desktop/python2/K/eyn_test/results I got the following

File "C:\ProgramData\Miniconda3\envs\eenv\Scripts\eynollah.exe\__main__.py", line 4, in <module>
  File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\sbb_newspapers_org_image\cli.py", line 2, in <module>
    from sbb_newspapers_org_image.eynollah import eynollah
  File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\sbb_newspapers_org_image\eynollah.py", line 31, in <module>
    from shapely import geometry
  File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\shapely\geometry\__init__.py", line 4, in <module>
    from .base import CAP_STYLE, JOIN_STYLE
    from shapely.coords import CoordinateSequence
  File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\shapely\coords.py", line 8, in <module>
    from shapely.geos import lgeos
  File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\shapely\geos.py", line 154, in <module>
    _lgeos = CDLL(os.path.join(sys.prefix, 'Library', 'bin', 'geos_c.dll'))
  File "c:\programdata\miniconda3\envs\eenv\lib\ctypes\__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: [WinError 126] The specified module could not be found

After some dead ends, I googled the error line plus "shapely," and found a suggestion at another repo to simply run conda install -c conda-forge shapely. After that, I no longer got that specific error. And eynollah --help works. (Yay!)
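For reference, the traceback above shows shapely loading geos_c.dll from Library\bin under the environment prefix. A quick diagnostic sketch, assuming that path layout (taken from the traceback line in shapely\geos.py) and using hypothetical helper names:

```python
import os
import sys


def geos_dll_path(prefix=None):
    """Build the geos_c.dll location that the traceback shows shapely using."""
    if prefix is None:
        prefix = sys.prefix
    return os.path.join(prefix, "Library", "bin", "geos_c.dll")


def geos_dll_present(prefix=None):
    """True if a file exists at that location."""
    return os.path.isfile(geos_dll_path(prefix))
```

Running geos_dll_present() inside the activated conda environment tells you whether the file is there. A present file is necessary but may not be sufficient, since WinError 126 can also be raised when a DLL that geos_c.dll itself depends on is missing.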

However, running the same command above, now I get:

The system cannot find the path specified.
'identify' is not recognized as an internal or external command,
operable program or batch file.
Traceback (most recent call last):
  File "c:\programdata\miniconda3\envs\eenv\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\programdata\miniconda3\envs\eenv\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Miniconda3\envs\eenv\Scripts\eynollah.exe\__main__.py", line 7, in <module>
  File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\sbb_newspapers_org_image\cli.py", line 102, in main
    headers_off,
  File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\sbb_newspapers_org_image\eynollah.py", line 2978, in run
    is_image_enhanced, img_org, img_res, num_col_classifier, num_column_is_classified = self.resize_and_enhance_image_with_column_classifier(is_image_enhanced)
  File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\sbb_newspapers_org_image\eynollah.py", line 419, in resize_and_enhance_image_with_column_classifier
    dpi = self.check_dpi()
  File "c:\programdata\miniconda3\envs\eenv\lib\site-packages\sbb_newspapers_org_image\eynollah.py", line 298, in check_dpi
    return int(float(dpi))
ValueError: could not convert string to float:

I've triple-checked my paths, and they're fine. And I've poked around to try to understand where the ValueError is coming from, but so far to no avail. Any suggestions?

vahidrezanezhad commented 3 years ago

I think this is caused by reading the image's DPI on Windows. I can temporarily add an exception to resolve your problem (this can affect the result).

vahidrezanezhad commented 3 years ago

I just updated eynollah. please check if it works or not.

vahidrezanezhad commented 3 years ago

Note that this can also happen on Linux if the image path is wrong. So make sure that the given image path is correct.

vahidrezanezhad commented 3 years ago

[Screenshot: dpi_value_error_with_false_image_directory]

SB2020-eye commented 3 years ago

Thanks, @vahidrezanezhad . I checked the image path carefully again (fearful I had gotten it wrong and missed it over and over again), but it is correct. I might as well make sure -- are .jpg files okay? Any other restrictions on input images?

vahidrezanezhad commented 3 years ago

Of course jpg files are OK (all kinds of images are valid). Did you pull the latest eynollah? This error should be due to reading the DPI on Windows. Please check with the latest version and give me feedback.

SB2020-eye commented 3 years ago

...working on a new install in a new conda environment now...

vahidrezanezhad commented 3 years ago

Just keep in mind that this is a temporary solution and it can affect the performance of the code.

SB2020-eye commented 3 years ago

Understood. :) When I run a command (even eynollah --help), it just gives me a new command line after 4-5 seconds (no output). EDIT: Nevermind! Sorry. I haven't done make models yet.

2ND EDIT: I spoke too soon (twice now). Models are in; doesn't make a difference (and I believe models shouldn't make a difference for eynollah --help anyways, I now realize). There is still no output -- just a new command line.

SB2020-eye commented 3 years ago

(tl;dr the new version isn't working for me)

kba commented 3 years ago

The DPI check is done using the identify CLI from ImageMagick. That seems unnecessary; it could be done with Pillow or OpenCV instead. But https://github.com/qurator-spk/eynollah/commit/37431d4840b0486a001789b14856e211d36ff1ab should have given a workaround. I don't understand why eynollah --help stopped working for you. What did you change?
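To illustrate the Pillow idea, here is a rough sketch, not what eynollah actually does. The names parse_dpi and read_dpi are hypothetical, and the fallback value is left to the caller rather than guessed. parse_dpi also shows why the traceback earlier in the thread ends in a ValueError: the identify call produced an empty string, and float("") cannot be converted.

```python
def parse_dpi(raw, default=None):
    """Convert a DPI value (string, number, or (x, y) pair) to int, or return `default`."""
    if isinstance(raw, (tuple, list)):
        # Pillow reports DPI as an (x, y) pair; use the horizontal value.
        raw = raw[0] if raw else None
    try:
        return int(float(raw))
    except (TypeError, ValueError, OverflowError):
        # None, "", "abc", NaN and infinity all end up here.
        return default


def read_dpi(image_path, default=None):
    """Read the DPI stored in an image file via Pillow, if the file carries one."""
    from PIL import Image  # Pillow; imported lazily so parse_dpi works without it

    with Image.open(image_path) as img:
        return parse_dpi(img.info.get("dpi"), default)
```

Not every file stores a resolution, so read_dpi can legitimately return the default; the caller has to decide what a sensible default is for its own processing.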

SB2020-eye commented 3 years ago

I started from scratch with a new Anaconda environment. One difference this time was that instead of having to go back and install python 3.6 to replace 3.9, I did this from the outset:

conda create --name env e2 python=3.6

Then:

activate e2
git clone https://github.com/qurator-spk/eynollah.git
cd eynollah
(I can't recall if I had to conda install pip at this point or not; but I think I did)
pip install .

At this point, it didn't work. (No output; fresh command line in 4-5 seconds. This includes running eynollah --help.) I used MSYS64/Mingw64 to cd into the eynollah folder and run make models. Same results.

I am glad to try it again. But before I do, I'll see if you see any red flags regarding what I did above.

Thanks!

kba commented 3 years ago

No, that setup looks reasonable. Can you check out https://github.com/qurator-spk/eynollah/tree/refactor-cntd and install that? Among other things, this adds an overrideable log level switch --log-level.

Then try running eynollah on some image with --log-level DEBUG and post the output here.

Feel free to send me a DM in gitter to debug this further.

SB2020-eye commented 3 years ago

@kba , thanks for responding. Writing out my previous post, I thought it well worth going ahead and setting things up exactly as I had before, without the "shortcut" of conda create --name env e2 python=3.6. It works now!

I will lay out my steps for installing (at least as of 2/17/2021) that evidently work for me, in case it helps anyone else:

(Windows 10 OS, Anaconda environment. For my example, I will call my conda environment "my_env".)

conda create --name my_env
conda activate my_env
git clone https://github.com/qurator-spk/eynollah.git
cd eynollah
conda install pip
pip install .
conda install python=3.6
conda install shapely

This gets you to the point of having eynollah running; eynollah --help should work.

(Nothing has changed for me to add the models to run eynollah. From the outset, I needed a way to run a make file in Windows 10. For me, I set up msys64 and used the mingw64 terminal to: 1. cd to the eynollah repository folder, and 2. run make models. I am not very versed at all in this non-Python side of things, or else I would say more; I just did lots of googling, plodded through, and eventually it worked.)

kba commented 3 years ago

:tada: good to hear it's working for you now and thanks for documenting the steps you needed. Good to close an issue with an actual solution.

About make models, you don't need to go through make just for that, all that target does is

wget 'https://qurator-data.de/eynollah/models_eynollah.tar.gz'
tar xf models_eynollah.tar.gz

i.e. download the tarball and extract it, nothing fancy.
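On a Windows box without wget or tar, the same two steps can be done from Python's standard library. A minimal sketch, using the URL quoted above; the function names download and extract are hypothetical:

```python
import tarfile
import urllib.request

# URL as quoted in this thread.
MODELS_URL = "https://qurator-data.de/eynollah/models_eynollah.tar.gz"


def download(url, target):
    """Save `url` to the local file `target` (the wget step)."""
    urllib.request.urlretrieve(url, target)
    return target


def extract(archive, dest="."):
    """Unpack a tar archive, compressed or not, into `dest` (the tar xf step)."""
    with tarfile.open(archive) as tar:
        tar.extractall(dest)
```

Usage would be extract(download(MODELS_URL, "models_eynollah.tar.gz")). As with tar itself, only unpack archives from a source you trust.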

Once the OCR-D bindings are in place and https://github.com/OCR-D/core/pull/668 is merged, you will be able to download the models with ocrd resmgr download ocrd-eynollah '*'.

SB2020-eye commented 3 years ago

"About make models, you don't need to go through make just for that" Lol. Shows you what I know! (But now I know that it's worth looking at the file -- even in a "foreign language" to me -- before downloading heavier-lifting stuff.)

Great to hear about the anticipated bindings!

Lastly -- just fyi (and the issue can definitely stay closed):

  1. eynollah definitely works (and works well!)
  2. when running, I still get
    The system cannot find the path specified.
    'identify' is not recognized as an internal or external command,
    operable program or batch file.

    (In case it's helpful, my most recent run from terminal -- command and full output -- can be found here. You'll see this message come up twice.)

kba commented 3 years ago

"About make models, you don't need to go through make just for that" Lol. Shows you what I know! (But now I know that it's worth looking at the file -- even in a "foreign language" to me -- before downloading heavier-lifting stuff.)

Always a good idea. Be bold :)

  1. when running, I still get
The system cannot find the path specified.
'identify' is not recognized as an internal or external command,
operable program or batch file.

(In case it's helpful, my most recent run from terminal -- command and full output -- can be found here. You'll see this message come up twice.)

I've replaced the identify call with OcrdExif in https://github.com/qurator-spk/eynollah/commit/8c603ae16d1074ec247c9956134cfc4f2b75481f, so there will be no need to have ImageMagick installed anymore once that's been merged.

Jim-Salmons commented 3 years ago

Just a quick placeholder note to be supplemented with a recipe detail to follow. I tried to do @SB2020-eye's recipe for getting eynollah running under Windows 10. I got close but no cigar. I incrementally addressed issues and eventually got this incredible library running on a page image from my project's interest in ground-truthing the 48 issues of Softalk magazine.

To update this issue, I wanted to ensure that I have a working/repeatable Windows installation process. With this goal of providing a repeatable install recipe, I also wanted to use @kba's latest refactor-cntd branch of the eynollah module to overcome the non-critical identify error.

The good news is that I do have a process with only one non-critical hiccup related to the 'si' cli parameter for saving extracted images from the target page. I'll detail the working recipe soon, but ITMT here's a quick question:

Q bkgnd: My current conda environment does not have tesseract OCR installed. I know I can easily get this into my current conda eynollah environment. The cli I have run is -

eynollah -i <path-to-target-img>\softalkv1n01sep1980_0007.jpg -o <path-to-target-img> -m <path-to-models_eynollah> -si <path-to-target-img>\imgs\

This runs and processes a PAGExml result (without the extracted saved images). The PAGExml file has the detailed coordinates for the text and image regions of the page, however as expected there is no OCR extracted text data in the TextLine/TextEquiv/Unicode elements that complement the Coords of those regions.

Here is a screenshot of the PAGExml result via PRImA's PAGE Viewer:

[Screenshot: softalk_eynollah_page]

Q: With tesseract installed (which version BTW, or other recommended OCR engine), what cli statement will generate this current result with those TextLine/TextEquiv/Unicode elements' data included in the PAGExml file?

Thanks @kba, @bertsky, @SB2020-eye for your ongoing interest and assistance. :-)

kba commented 3 years ago

one non-critical hiccup related to the 'si' cli parameter

There is a new switch --enable-plotting/-ep that must be set for all the intermediary images to be written out. This needs some more work and then an update to the README.

what cli statement will generate this current result with those TextLine/TextEquiv/Unicode elements' data included in the PAGExml file?

Once the OCR-D interface (ocrd-eynollah) is in place, you can use any of the OCR engines we wrapped to do the actual text recognition, though I am not sure how well they work on Windows. I don't think that any OCR engines support segmentation input in PAGE-XML natively at the moment.

If you don't want to wait for the OCR-D interface, you can take the output of eynollah and add it to an OCR-D workspace and then run any of the above-mentioned engines on it:

eynollah ... -o . -i image1.png # will write to image1.xml
ocrd workspace init
ocrd workspace add -G IMG -i IMG_1 -g page1 image1.png
ocrd workspace add -G SEG -i SEG_1 -g page1 image1.xml
ocrd-tesserocr-recognize -P segmentation_level none -P textequiv_level line
Jim-Salmons commented 3 years ago

Great, thanks @kba, I'll try your recommendation. I'll also post the working Windows install recipe soon. Although I am super impressed with Calamari via my exposure to it through DATeCH, I'll stick to Tesseract ATM due to easier and previous successful use natively on Windows. BTW, I assume that the 5.0 alpha build is OCR-D compatible?