qurator-spk / eynollah

Document Layout Analysis
Apache License 2.0
347 stars 29 forks source link

regression in OCR-D processor #106

Open bertsky opened 1 year ago

bertsky commented 1 year ago

Since the last update I am getting

09:12:50.405 INFO eynollah - INPUT FILE PHYS_0001 (1/3) 
09:12:50.972 INFO eynollah - Resizing and enhancing image...
09:12:50.972 INFO eynollah - Detected 300 DPI
09:13:13.284 INFO eynollah - Found 1 columns ([[1. 0. 0. 0. 0. 0.]])
09:13:13.292 INFO eynollah - Image was not enhanced.
09:13:13.304 INFO eynollah - Enhancing took 22.3s 
09:13:33.396 INFO eynollah - Textregion detection took 20.1s 
09:13:33.651 INFO eynollah - Graphics detection took 0.3s 
09:13:44.518 INFO eynollah - textline detection took 10.9s
09:13:47.943 INFO eynollah - slope_deskew: -0.12°
09:13:47.944 INFO eynollah - deskewing took 3.4s
09:13:48.005 INFO eynollah - detection of marginals took 0.1s
09:14:25.182 INFO eynollah - Job done in 94.2s
09:14:25.182 ERROR ocrd.processor.helpers.run_processor - Failure in processor 'ocrd-eynollah-segment'
Traceback (most recent call last):
  File "/build/core/ocrd/ocrd/processor/helpers.py", line 128, in run_processor
    processor.process()
  File "/build/eynollah/qurator/eynollah/processor.py", line 59, in process
    Eynollah(**eynollah_kwargs).run()
  File "/build/eynollah/qurator/eynollah/eynollah.py", line 3107, in run
    self.writer.write_pagexml(pcgts)
  File "/build/eynollah/qurator/eynollah/writer.py", line 138, in write_pagexml
    out_fname = os.path.join(self.dir_out, self.image_filename_stem) + ".xml"
  File "/usr/lib/python3.8/posixpath.py", line 76, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

Have not looked further yet.

bertsky commented 1 year ago

Clearly, this is related to https://github.com/qurator-spk/eynollah/pull/86/commits/c606391c312eceab9aa3ebff071bdf12a30b45cc (dir_in mode), but it's a huge changeset, and I don't understand it (also, the diff is suboptimal).

BTW, what was the reason for dir_in (and why not also put the loaded models into class members in the single-image mode, so most of the dir_in conditionals would have been unnecessary)? IMHO for efficiency it would be necessary to separate CPU-bound from GPU-bound computation and make them run quasi parallel (for example via queueing). The dir_in change does avoid reloading models, but still requires many transfers to/from GPU memory and waiting for pre- and postprocessing, which is CPU-bound. Also, it's not possible to write a OCR-D wrapper for this kind of API.

mikegerber commented 1 year ago

I have the same problem with 0.3.0:

TypeError: expected str, bytes or os.PathLike object, not NoneType
Traceback (most recent call last):
  File "/usr/local/share/pyenv/versions/3.7.17/bin/ocrd-eynollah-segment", line 8, in <module>
    sys.exit(main())
  File "/usr/local/share/pyenv/versions/3.7.17/lib/python3.7/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/share/pyenv/versions/3.7.17/lib/python3.7/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/share/pyenv/versions/3.7.17/lib/python3.7/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/share/pyenv/versions/3.7.17/lib/python3.7/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/share/pyenv/versions/3.7.17/lib/python3.7/site-packages/qurator/eynollah/ocrd_cli.py", line 8, in main
    return ocrd_cli_wrap_processor(EynollahProcessor, *args, **kwargs)
  File "/usr/local/share/pyenv/versions/3.7.17/lib/python3.7/site-packages/ocrd/decorators/__init__.py", line 126, in ocrd_cli_wrap_processor
    run_processor(processorClass, mets_url=mets, workspace=workspace, **kwargs)
  File "/usr/local/share/pyenv/versions/3.7.17/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 131, in run_processor
    raise err
  File "/usr/local/share/pyenv/versions/3.7.17/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 128, in run_processor
    processor.process()
  File "/usr/local/share/pyenv/versions/3.7.17/lib/python3.7/site-packages/qurator/eynollah/processor.py", line 59, in process
    Eynollah(**eynollah_kwargs).run()
  File "/usr/local/share/pyenv/versions/3.7.17/lib/python3.7/site-packages/qurator/eynollah/eynollah.py", line 3092, in run
    self.writer.write_pagexml(pcgts)
  File "/usr/local/share/pyenv/versions/3.7.17/lib/python3.7/site-packages/qurator/eynollah/writer.py", line 138, in write_pagexml
    out_fname = os.path.join(self.dir_out, self.image_filename_stem) + ".xml"
  File "/usr/local/share/pyenv/versions/3.7.17/lib/python3.7/posixpath.py", line 80, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

Reproducer:

#!/bin/sh
set -ex

test_id=`basename $0`
cd `mktemp -d /tmp/$test_id-XXXXX`

# Prepare processors
ocrd resmgr download ocrd-eynollah-segment default

# Prepare test workspace
wget https://qurator-data.de/examples/actevedef_718448162.first-page+binarization+segmentation.zip
unzip actevedef_718448162.first-page+binarization+segmentation.zip
cd actevedef_718448162.first-page+binarization+segmentation

# Run tests
ocrd-eynollah-segment -P models default -I OCR-D-IMG-BIN -O TEST-EYNOLLAH-SEG
bertsky commented 1 year ago

@mikegerber for a recent version usable in the OCR-D ecosystem, you need https://github.com/qurator-spk/eynollah/tree/706433c5049c63c6e16fee5f71d81a7e507abe06 which is part of https://github.com/qurator-spk/eynollah/pull/108

bertsky commented 1 year ago

Thanks but - I think - I don't currently have this problem - no conflicting qurator packages here. (And my recommendation to the team was to move away from the - useless - common qurator.* namespace.)

The most thorough test regarding the namespace inconsistency problem I know of is to install both sbb_binarize and eynollah in devel mode and then run --help on the OCR-D CLIs.

bertsky commented 1 year ago

@bertsky Is this particular problem even connected to the namespace problem? It seems unrelated.

@mikegerber no, but as as argued in #108 and above: you need the namespace problem fixed in the current ecosystem, while due to the present issue the master branch is dysfunctional since #86.

cneud commented 1 year ago

Hi, we are aware of the issue(s) with the OCR-D CLI. Please note that these all relate to the OCR-D CLI only, while the tool is otherwise perfectly usable - "completely unusable in certain circumstances" is a bit too harsh/inprecise, this is only the case when the "circumstances" are OCR-D.

I believe we had this discussion already in the past (even multiple times), but please take into account that the OCR-D CLI is not the top priority for our development here, as this is not an OCR-D project. In fact, we are currently hiring another position to work on this, so please be a bit patient.

We try to balance the project work plan, issues reported here and the availability of staff when planning the work and priorities, so please refrain from introducing new labels for urgency! But if it helps we can introduce an "OCR-D" label for all issues relating to that.

bertsky commented 1 year ago

I also don't see the point of urging this right now. We had a bit of a hustle when all this surfaced during the last ocrd_all release/repair sprint, but as explained above we did find a workable interim solution (ie. reverting #86 and #97 while cherry-picking #100).

cneud commented 8 months ago

I finally found some time to dive a little deeper into this issue, and now I must agree with @bertsky that https://github.com/qurator-spk/eynollah/commit/c606391c312eceab9aa3ebff071bdf12a30b45cc probably should not have been merged in the first place...alas, we now have it in main, so we will proceed to dissect the problematic changes from the larger changeset.