qurator-spk / dinglehopper

An OCR evaluation tool
Apache License 2.0

Skip when there is no file matching the pageId #34

Closed mikegerber closed 3 years ago

mikegerber commented 3 years ago

ocrd-dinglehopper should issue a warning and skip a page if there is no matching GT or OCR file for a page.

Reported by @mnoelte in Gitter: https://gitter.im/OCR-D/Lobby?at=5f76f0750dbbcf3dfa50648f

bertsky commented 3 years ago

See here for a recipe. You can omit the fallback search via matching imageFilename for efficiency. Then use something like this instead of the typical loop around self.input_files...
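The linked recipe is not reproduced in this thread, so the following is only a minimal sketch of the general idea: group the files of each input fileGrp by pageId, then warn about and skip any page that lacks a GT or OCR counterpart. `FileEntry` and `pair_by_page_id` are hypothetical names used for illustration and are not part of the OCR-D API; in a real processor the entries would come from the workspace's METS.

```python
import logging
from collections import namedtuple

log = logging.getLogger("page_matching_sketch")

# Hypothetical stand-in for a METS file entry; NOT the OCR-D class.
FileEntry = namedtuple("FileEntry", ["file_id", "page_id"])


def pair_by_page_id(gt_files, ocr_files):
    """Yield (page_id, gt, ocr) for pages present in both groups.

    Pages that are missing in either group are logged and skipped.
    """
    # Materialise first, in case the inputs are one-shot iterators.
    gt_files = list(gt_files)
    ocr_files = list(ocr_files)
    gt_by_page = {f.page_id: f for f in gt_files}
    ocr_by_page = {f.page_id: f for f in ocr_files}

    # Keep the first-seen page order across both groups, without duplicates.
    page_ids = dict.fromkeys(f.page_id for f in gt_files + ocr_files)

    for page_id in page_ids:
        gt = gt_by_page.get(page_id)
        ocr = ocr_by_page.get(page_id)
        if gt is None or ocr is None:
            missing = "GT" if gt is None else "OCR"
            log.warning("No %s file for page %s, skipping", missing, page_id)
            continue
        yield page_id, gt, ocr


gt = [FileEntry("GT_0001", "P1"), FileEntry("GT_0002", "P2")]
ocr = [FileEntry("OCR_0002", "P2"), FileEntry("OCR_0003", "P3")]
pairs = list(pair_by_page_id(gt, ocr))
```

Here only page P2 is paired; P1 (no OCR file) and P3 (no GT file) each produce a warning and are skipped instead of aborting the run.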

mikegerber commented 3 years ago

Side note: files[0] in that code might fail now that find_files() returns an iterator.
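To illustrate the failure mode without depending on OCR-D itself, here is a small sketch using a hypothetical `find_files_sketch` helper that returns a generator, as an iterator-returning `find_files()` would. The helper and the dictionary layout are assumptions for illustration only.

```python
def find_files_sketch(entries, page_id):
    """Hypothetical stand-in for a lookup that yields matches lazily."""
    return (e for e in entries if e["pageId"] == page_id)


entries = [
    {"ID": "OCR_0001", "pageId": "P1"},
    {"ID": "OCR_0002", "pageId": "P2"},
]

# Indexing a generator is not supported and raises TypeError.
try:
    find_files_sketch(entries, "P1")[0]
    indexing_failed = False
except TypeError:
    indexing_failed = True

# Option 1: take the first match, with None as the "no file" marker.
first = next(find_files_sketch(entries, "P1"), None)
missing = next(find_files_sketch(entries, "P3"), None)

# Option 2: materialise the results into a list before indexing.
all_matches = list(find_files_sketch(entries, "P3"))
```

Using `next(..., None)` also gives a natural place to issue the warning and skip the page when no file matches.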

bertsky commented 3 years ago

> Side note: files[0] in that code might fail now that find_files() returns an iterator.

Hell yes, that broke all our multi-input-fileGrp processors!

Must replace with find_all_files ASAP

mikegerber commented 3 years ago

> Must replace with find_all_files ASAP

Is that an API function? https://ocr-d.de/core/search.html?q=find_all_files returns nothing

bertsky commented 3 years ago

> Is that an API function? https://ocr-d.de/core/search.html?q=find_all_files returns nothing

It is. @kba I guess the apidoc must be regenerated?

@mikegerber ocrd_models.ocrd_mets.OcrdMets.find_all_files

kba commented 3 years ago

> It is. @kba I guess the apidoc must be regenerated?

yep 😊 on it.