hnesk / browse-ocrd

An extensible viewer for OCR-D mets.xml files
MIT License
20 stars 9 forks source link

ViewPage: ignore AlternativeImage if not retrievable #37

Closed bertsky closed 2 years ago

bertsky commented 2 years ago

Currently, ocrd_browser.view.page tries to add an image version for each AlternativeImage referenced in the page. But that can lead to an uncaught FileNotFoundError during https://github.com/hnesk/browse-ocrd/blob/73fc4e29855b5067dd6927277ac5c336e5aa45d3/ocrd_browser/view/page.py#L82 if the file happens to be missing.

Furthermore, even when I catch this, if the missing path is the last image, it resurfaces as preferred version during https://github.com/hnesk/browse-ocrd/blob/73fc4e29855b5067dd6927277ac5c336e5aa45d3/ocrd_browser/model/page.py#L141, because the selection mechanism for image_from_page is not by file path but by features.

IMHO this should be more robust: An image file reference (irrespective if it is derived or original) that is nowhere to be found in the filesystem should simply be rendered with an all-white canvas of the same size.

bertsky commented 2 years ago

To the latter: perhaps in core we should offer a filter based on the filename directly (instead of the image features). Something like Workspace.image_from_page(... filename=None ...) and similar for image_from_segment. Of course, both parameters might interact, so if both filename and feature_filter / feature_selector are given, then one would have to verify that the image with that filename also fulfills these features.

What do you think, @kba?

bertsky commented 2 years ago

Oh, it's not as easy though: Even with a @filename filter, we cannot just use that to configure our ImageVersionSelector, because obviously the path name changes for each page. @hnesk IIUC your approach was to use the selector only to retrieve its image features, and then pass that on to get_image. But what if the image with the right features does not exist and cannot be downloaded? For core, we would have to try and download each AlternativeImage to rule that out (which seems like a waste of resources)...

hnesk commented 2 years ago

To the latter: perhaps in core we should offer a filter based on the filename directly (instead of the image features).

I build the ImageVersionSelector around the notion of features, because Workspace.image_from_page expects it that way, if there would a filename filter, I would happily use that.

Oh, it's not as easy though: Even with a @filename filter, we cannot just use that to configure our ImageVersionSelector, because obviously the path name changes for each page.

No problem here: The ImageVersionSelector gets configured for each page individually with the current AlternativeImages and their filenames.

I think your fix in #38 fixes the problem already. Do you have an example workspace?

bertsky commented 2 years ago

if there would a filename filter, I would happily use that.

No problem here: The ImageVersionSelector gets configured for each page individually with the current AlternativeImages and their filenames.

Oh, in that case – we should go for it. ocrd_browser would not have to mess with the image feature mechanism. See https://github.com/bertsky/core/tree/workspace-altimg-retrieve-existing for the preliminaries in core. The next step would be to (wait for approval+merge in core) and then utilize it in get_image.

I think your fix in #38 fixes the problem already. Do you have an example workspace?

No, only the first half – see above. For an example, it suffices to take any valid PAGE-XML and add an AlternativeImage to it with no @comments and a non-existing @filename.

hnesk commented 2 years ago

Ok, I got it now! It really is necessary to distinguish the images by filename, so it would be great if your branch gets merged.

bertsky commented 2 years ago

Ok, so let's wait for https://github.com/OCR-D/core/pull/845