qurator-spk / dinglehopper

An OCR evaluation tool
Apache License 2.0

Regression with newest ocrd version #89

Closed mikegerber closed 4 months ago

mikegerber commented 10 months ago
---------------------------------------- Captured stderr call ----------------------------------------
12:34:41.863 ERROR ocrd.processor.helpers.run_processor - Failure in processor 'ocrd-dinglehopper'
Traceback (most recent call last):
  File "/home/b-mg106/.pyenv/versions/3.12.0/envs/tmp.dinglehopper.2023-10-23.issue-88-multimethod-dep/lib/python3.12/site-packages/ocrd/processor/helpers.py", line 131, in run_processor
    processor.process()
  File "/home/b-mg106/devel/dinglehopper/src/dinglehopper/ocrd_cli.py", line 41, in process
    gt_file = self.workspace.download_file(gt_file)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/b-mg106/.pyenv/versions/3.12.0/envs/tmp.dinglehopper.2023-10-23.issue-88-multimethod-dep/lib/python3.12/site-packages/ocrd/workspace.py", line 206, in download_file
    raise ValueError("OcrdFile {f} has neither 'url' nor 'local_filename', so cannot be downloaded")
ValueError: OcrdFile {f} has neither 'url' nor 'local_filename', so cannot be downloaded
----------------------------------------- Captured log call ------------------------------------------
ERROR    ocrd.processor.helpers.run_processor:helpers.py:133 Failure in processor 'ocrd-dinglehopper'
Traceback (most recent call last):
  File "/home/b-mg106/.pyenv/versions/3.12.0/envs/tmp.dinglehopper.2023-10-23.issue-88-multimethod-dep/lib/python3.12/site-packages/ocrd/processor/helpers.py", line 131, in run_processor
    processor.process()
  File "/home/b-mg106/devel/dinglehopper/src/dinglehopper/ocrd_cli.py", line 41, in process
    gt_file = self.workspace.download_file(gt_file)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/b-mg106/.pyenv/versions/3.12.0/envs/tmp.dinglehopper.2023-10-23.issue-88-multimethod-dep/lib/python3.12/site-packages/ocrd/workspace.py", line 206, in download_file
    raise ValueError("OcrdFile {f} has neither 'url' nor 'local_filename', so cannot be downloaded")
ValueError: OcrdFile {f} has neither 'url' nor 'local_filename', so cannot be downloaded
mikegerber commented 10 months ago
pytest -k integ_ocrd_cli

METS' fileSec looks like this:

  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-GT-PAGE">
      <mets:file MIMETYPE="application/xml" ID="OCR-D-GT-PAGE_00000024">
        <mets:FLocat xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="OCR-D-GT-PAGE/00000024.page.xml"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR-D-OCR-CALAMARI">
      <mets:file MIMETYPE="application/vnd.prima.page+xml" ID="OCR-D-OCR-CALAMARI_0001">
        <mets:FLocat xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="OCR-D-OCR-CALAMARI/OCR-D-OCR-CALAMARI_0001.xml"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR-D-OCR-TESS">
      <mets:file MIMETYPE="application/vnd.prima.page+xml" ID="OCR-D-OCR-TESS_0001">
        <mets:FLocat xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="OCR-D-OCR-TESS/OCR-D-OCR-TESS_0001.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>

This used to work.

Maybe it's because xlink:href isn't really a URL? Or is it?

Judging from ocrd_model's ocrd_file.py, mets:FLocat is also supposed to have LOCTYPE and OTHERLOCTYPE attributes.
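A quick way to check a METS file for this is to list the mets:file entries whose mets:FLocat has no LOCTYPE. This is only a sketch using the standard library, not how ocrd itself validates; the function name `flocats_missing_loctype` and the `SAMPLE` document (a trimmed copy of the fileSec above, wrapped in a root element) are made up for illustration.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

# Trimmed, hypothetical sample based on the fileSec quoted above.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                       xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-GT-PAGE">
      <mets:file MIMETYPE="application/xml" ID="OCR-D-GT-PAGE_00000024">
        <mets:FLocat xlink:href="OCR-D-GT-PAGE/00000024.page.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def flocats_missing_loctype(mets_xml: str) -> list[str]:
    """Return the IDs of mets:file elements whose FLocat lacks LOCTYPE."""
    root = ET.fromstring(mets_xml)
    missing = []
    for file_el in root.iter(f"{{{METS_NS}}}file"):
        for flocat in file_el.findall(f"{{{METS_NS}}}FLocat"):
            if "LOCTYPE" not in flocat.attrib:
                missing.append(file_el.get("ID"))
    return missing


print(flocats_missing_loctype(SAMPLE))  # → ['OCR-D-GT-PAGE_00000024']
```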

mikegerber commented 10 months ago

Our other "standard"/commonly used example files have the LOCTYPE, so I'm trying those. The embedded test data may just be invalid and may have been handled more gracefully in earlier ocrd versions.

https://qurator-data.de/examples/actevedef_718448162.first-page+binarization+segmentation.zip has LOCTYPE

https://qurator-data.de/examples/actevedef_718448162.zip has LOCTYPE

https://qurator-data.de/examples/actevedef_718448162.first-page.zip has LOCTYPE

mikegerber commented 10 months ago

Adding LOCTYPE/OTHERLOCTYPE to the test data fixes the tests.
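For reference, the same change can be scripted instead of edited by hand. The sketch below is not the actual commit; it assumes that relative paths should get LOCTYPE="OTHER" with OTHERLOCTYPE="FILE" and that anything containing "://" is a URL, and the helper name `add_missing_loctype` is hypothetical.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Keep the usual prefixes when serializing instead of ns0/ns1.
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def add_missing_loctype(mets_xml: str) -> str:
    """Add LOCTYPE (and OTHERLOCTYPE for local files) to FLocat elements lacking it."""
    root = ET.fromstring(mets_xml)
    for flocat in root.iter(f"{{{METS_NS}}}FLocat"):
        if "LOCTYPE" in flocat.attrib:
            continue  # already declared, leave untouched
        href = flocat.get(f"{{{XLINK_NS}}}href", "")
        if "://" in href:
            # Assumption: anything with a scheme is treated as a URL.
            flocat.set("LOCTYPE", "URL")
        else:
            # Assumption: everything else is a path relative to the workspace.
            flocat.set("LOCTYPE", "OTHER")
            flocat.set("OTHERLOCTYPE", "FILE")
    return ET.tostring(root, encoding="unicode")
```

Elements that already carry a LOCTYPE are skipped, so running it on the other example files should leave their FLocat attributes alone.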

I'll commit the fix but leave this open until I can discuss it with @kba, as I'm not sure whether it's a regression in core or something that could conveniently be handled by core.

mikegerber commented 4 months ago

This was probably encountered elsewhere too. Closing.