These are commented out as FIXME at the end of hocr-check; I'll put them here for discussion.
[ ] containment of paragraphs, columns, etc.
[ ] ocr-recognized vs. actual tags
[ ] warn about text outside ocr_ elements
[ ] check title= attribute format
[ ] check that only the right attributes are present on the right elements
[ ] check for unrecognized ocr_ elements
[ ] check for significant overlaps
[ ] check that image files are not repeated
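To make a few of these items concrete, here is a rough sketch of what the `title=` format check and the unrecognized-class check could look like. This is not what hocr-check currently does: the function names are made up for illustration, `KNOWN_OCR_CLASSES` is only a small subset of what hocr-spec defines, and the naive split on `;` and whitespace assumes property values (e.g. image file names) contain neither.

```python
import re

# Illustrative subset only (assumption); the full list belongs to hocr-spec.
KNOWN_OCR_CLASSES = {"ocr_page", "ocr_carea", "ocr_par", "ocr_line"}


def parse_title(title):
    """Split a title= value into {property_name: [values]}.

    Naive: assumes values contain no ';' and no embedded whitespace.
    """
    props = {}
    for chunk in title.split(";"):
        parts = chunk.split()
        if parts:
            props[parts[0]] = parts[1:]
    return props


def check_bbox(title):
    """Return a list of problems with the bbox property, if one is present."""
    props = parse_title(title)
    if "bbox" not in props:
        return []  # whether bbox is required here is left to the caller
    values = props["bbox"]
    if len(values) != 4 or not all(re.fullmatch(r"\d+", v) for v in values):
        return ["bbox must have four non-negative integers"]
    x0, y0, x1, y1 = map(int, values)
    if x0 > x1 or y0 > y1:
        return ["bbox corners out of order"]
    return []


def unrecognized_classes(class_attr):
    """Return ocr_ class names that are not in the known set, sorted."""
    return sorted(
        c for c in class_attr.split()
        if c.startswith("ocr_") and c not in KNOWN_OCR_CLASSES
    )


if __name__ == "__main__":
    print(check_bbox("bbox 100 0 50 200; ppageno 0"))
    print(unrecognized_classes("ocr_line ocr_bogus"))
```

Returning a list of problem strings rather than asserting would let the checker report every issue in a file instead of stopping at the first one.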
Keep this in sync with hocr-spec (maybe cross-reference it) and consider creating an XSD schema for use in ocr-fileformats (though such schemas tend to be inflexible).