Closed bertsky closed 2 years ago
To the latter: perhaps in core we should offer a filter based on the filename directly (instead of the image features). Something like Workspace.image_from_page(... filename=None ...)
and similar for image_from_segment
. Of course, both parameters might interact, so if both filename
and feature_filter
/ feature_selector
are given, then one would have to verify that the image with that filename also fulfills these features.
What do you think, @kba?
Oh, it's not as easy though: Even with a @filename
filter, we cannot just use that to configure our ImageVersionSelector
, because obviously the path name changes for each page. @hnesk IIUC your approach was to use the selector only to retrieve its image features, and then pass that on to get_image
. But what if the image with the right features does not exist and cannot be downloaded? For core, we would have to try and download each AlternativeImage to rule that out (which seems like a waste of resources)...
To the latter: perhaps in core we should offer a filter based on the filename directly (instead of the image features).
I build the ImageVersionSelector around the notion of features, because Workspace.image_from_page expects it that way, if there would a filename filter, I would happily use that.
Oh, it's not as easy though: Even with a @filename filter, we cannot just use that to configure our ImageVersionSelector, because obviously the path name changes for each page.
No problem here: The ImageVersionSelector gets configured for each page individually with the current AlternativeImages and their filenames.
I think your fix in #38 fixes the problem already. Do you have an example workspace?
if there would a filename filter, I would happily use that.
No problem here: The ImageVersionSelector gets configured for each page individually with the current AlternativeImages and their filenames.
Oh, in that case – we should go for it. ocrd_browser would not have to mess with the image feature mechanism. See https://github.com/bertsky/core/tree/workspace-altimg-retrieve-existing for the preliminaries in core. The next step would be to (wait for approval+merge in core) and then utilize it in get_image
.
I think your fix in #38 fixes the problem already. Do you have an example workspace?
No, only the first half – see above. For an example, it suffices to take any valid PAGE-XML and add an AlternativeImage to it with no @comments
and a non-existing @filename
.
Ok, I got it now! It really is necessary to distinguish the images by filename, so it would be great if your branch gets merged.
Ok, so let's wait for https://github.com/OCR-D/core/pull/845
Currently,
ocrd_browser.view.page
tries to add an image version for eachAlternativeImage
referenced in the page. But that can lead to an uncaughtFileNotFoundError
during https://github.com/hnesk/browse-ocrd/blob/73fc4e29855b5067dd6927277ac5c336e5aa45d3/ocrd_browser/view/page.py#L82 if the file happens to be missing.Furthermore, even when I catch this, if the missing path is the last image, it resurfaces as preferred
version
during https://github.com/hnesk/browse-ocrd/blob/73fc4e29855b5067dd6927277ac5c336e5aa45d3/ocrd_browser/model/page.py#L141, because the selection mechanism forimage_from_page
is not by file path but by features.IMHO this should be more robust: An image file reference (irrespective if it is derived or original) that is nowhere to be found in the filesystem should simply be rendered with an all-white canvas of the same size.