qurator-spk / eynollah

Document Layout Analysis
Apache License 2.0

Flag for OCR-D processor to periodically save mets.xml file (a suggestion) #82

Closed sjscotti closed 1 year ago

sjscotti commented 2 years ago

Hi, eynollah sporadically crashes on one image out of a large batch when I run it as an OCR-D processor. The crash may come after many images have already been processed, which takes many hours. Because eynollah currently updates the mets.xml file with the segmentation files it created only when the processor completes, all results from that run are missing from mets.xml, so OCR cannot be performed on the successful segmentations.

The two alternatives seem to be: 1) debug why eynollah is crashing (or remove the image causing the crash) and rerun all the images, or 2) edit mets.xml by hand to add the entries for the segmentations that succeeded before the crash. Is there another approach for this case? If not, how about adding a flag to the OCR-D processor so that it periodically updates mets.xml with the successful segmentations so far? Thanks!
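To illustrate the periodic-save idea, here is a minimal sketch. It is not eynollah's or OCR-D's actual code: `process_with_checkpoints`, `save_atomically`, `segment_page`, and the `save_every` parameter are all hypothetical names, and a JSON file stands in for mets.xml. The point is only that results are written out every few pages, so a later crash loses at most the pages since the last save.

```python
import json
import os
import tempfile
from typing import Callable, Iterable, List, Tuple


def save_atomically(path: str, records: List[dict]) -> None:
    """Write records to a temp file, then rename it over `path`.

    The rename means a crash mid-write never leaves a half-written file.
    (Hypothetical stand-in for saving mets.xml.)
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(records, handle)
    os.replace(tmp_path, path)


def process_with_checkpoints(
    pages: Iterable[str],
    segment_page: Callable[[str], str],  # hypothetical per-page segmentation call
    checkpoint_path: str,
    save_every: int = 10,  # hypothetical flag: save after this many successes
) -> Tuple[List[dict], List[str]]:
    done: List[dict] = []
    failed: List[str] = []
    for page in pages:
        try:
            output = segment_page(page)
        except Exception:
            # An ordinary Python error: record the page and keep going.
            failed.append(page)
            continue
        done.append({"page": page, "output": output})
        if len(done) % save_every == 0:
            save_atomically(checkpoint_path, done)
    # Final save covers the pages processed since the last checkpoint.
    save_atomically(checkpoint_path, done)
    return done, failed
```

The `try`/`except` only helps when the failure is a regular Python exception. If the process dies outright (for example, it is killed for running out of memory), the last checkpoint on disk is what survives, which is why the save interval matters.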

cneud commented 2 years ago

Sorry for the late reply @sjscotti, but this seems like an issue that should be solved in the @OCR-D context, so adding @kba here.

This could also be relevant for our current benchmarking in OCR-D - do you have a rough idea how many pages were processed before Eynollah crashed?

cneud commented 1 year ago

IIUC this will be fixed by https://github.com/OCR-D/core/pull/966 when OCR-D is used through the Web API.

cneud commented 1 year ago

Closing here as https://github.com/OCR-D/core/pull/966 has now been merged.