ulb-sachsen-anhalt / digital-derivans

Derive new digitals from existing ones
MIT License

Fail to render PAGE on line-level #29

Closed: M3ssman closed this issue 1 year ago

M3ssman commented 2 years ago

Description

Currently, Derivans is not able to properly handle PAGE OCR that carries text data only on line level, i.e. without any word-level elements.

Trying to do so yields the following Exception:

Exception in thread "main" java.lang.NullPointerException
    at de.ulb.digital.derivans.data.ocr.PAGEReader.toText(PAGEReader.java:104)
    at de.ulb.digital.derivans.data.ocr.PAGEReader.extractText(PAGEReader.java:81)
    at de.ulb.digital.derivans.data.ocr.PAGEReader.get(PAGEReader.java:55)
    at de.ulb.digital.derivans.DerivansPathResolver.enrichOCRFromFilesystem(DerivansPathResolver.java:150)
    at de.ulb.digital.derivans.Derivans.init(Derivans.java:124)
    at de.ulb.digital.derivans.Derivans.create(Derivans.java:169)
    at de.ulb.digital.derivans.App.main(App.java:38)
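
The trace suggests the reader dereferences word-level data that is absent in such files. The following is only a minimal sketch of a null-safe fallback, not the actual `PAGEReader` code: the class `PageLineFallback` and its methods are hypothetical names, and it assumes that a PAGE `TextLine` without `Word` children keeps its text in its own `TextEquiv/Unicode` element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of Derivans.
final class PageLineFallback {

    /** Text of the TextEquiv/Unicode that is a direct child of parent, or "" if there is none. */
    static String directText(Element parent) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && "TextEquiv".equals(n.getLocalName())) {
                NodeList unicode = ((Element) n).getElementsByTagNameNS("*", "Unicode");
                return unicode.getLength() > 0 ? unicode.item(0).getTextContent().trim() : "";
            }
        }
        return ""; // never null, so callers cannot hit a NullPointerException here
    }

    /** One string per TextLine: joined word texts if words exist, else the line's own text. */
    static List<String> lineTexts(String pageXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(pageXml.getBytes(StandardCharsets.UTF_8)));
            List<String> result = new ArrayList<>();
            NodeList lines = doc.getElementsByTagNameNS("*", "TextLine");
            for (int i = 0; i < lines.getLength(); i++) {
                Element line = (Element) lines.item(i);
                NodeList words = line.getElementsByTagNameNS("*", "Word");
                if (words.getLength() > 0) {
                    // Word level available: join the word texts.
                    List<String> tokens = new ArrayList<>();
                    for (int j = 0; j < words.getLength(); j++) {
                        tokens.add(directText((Element) words.item(j)));
                    }
                    result.add(String.join(" ", tokens));
                } else {
                    // Line level only: fall back to the line's own TextEquiv.
                    result.add(directText(line));
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Cannot read PAGE XML", e);
        }
    }
}
```

Returning an empty string instead of `null` for missing text keeps the calling code free of null checks; whether Derivans should skip such lines instead is a separate design decision.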
M3ssman commented 2 years ago

Because the advanced filters, which exclude textual content like ===, do not apply on line level, line-level input will very likely produce coarser transcriptions than word-level input.
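
To illustrate why such a filter loses its effect on line level, here is a small sketch. The rule used (keep a text unit only if it contains at least one letter or digit) is an assumption for demonstration purposes and not the actual Derivans filter; `NoiseFilterSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; the real filter rules in Derivans may differ.
final class NoiseFilterSketch {

    /** Assumed rule: a unit is meaningful if it contains at least one letter or digit. */
    static boolean isMeaningful(String unit) {
        return unit.codePoints().anyMatch(Character::isLetterOrDigit);
    }

    /** Drops units that fail the rule and joins the rest with single spaces. */
    static String filterUnits(List<String> units) {
        List<String> kept = new ArrayList<>();
        for (String unit : units) {
            if (isMeaningful(unit)) {
                kept.add(unit);
            }
        }
        return String.join(" ", kept);
    }
}
```

With word-level units, `filterUnits(List.of("Halle", "===", "1742"))` drops the `===` token. With a single line-level unit, `filterUnits(List.of("Halle === 1742"))` keeps the whole line, noise included, because the line as a whole contains letters and digits.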

M3ssman commented 2 years ago

@alexander-winkler If it runs fine now, please let me know and I'll close this issue.