kba / hocr-spec

The hOCR Embedded OCR Workflow and Output Format
http://kba.github.io/hocr-spec/1.2/

ascii-string in the properties grammar BNF #97

Open wollmers opened 7 years ago

wollmers commented 7 years ago

Version 1.2 of the spec has:

ascii-string     = +(%x01-FF - semicolon)  ; printable ascii without semicolon
delimited-string = doublequote ascii-string doublequote

delimited-string is mostly used in the title attribute for filenames or links.

The HTML 4.01 and XHTML specs define the value of the title attribute as CDATA, which depends on the character encoding.

XML has a better definition:

Char       ::=      #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]  /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

First, the name ascii in the current hOCR definition (%x01-FF) is misleading, because ASCII ends at x7F. The range seems to target bytewise parsing or 8-bit encodings rather than Unicode code points.
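To illustrate the difference, here is a small Python sketch. The helper names `fits_bytewise` and `fits_codepointwise` are made up for this comment and are not part of the spec; they apply the %x01-FF range once to encoded bytes and once to code points.

```python
def fits_bytewise(s: str, encoding: str) -> bool:
    """Check every encoded byte against the %x01-FF range."""
    return all(0x01 <= b <= 0xFF for b in s.encode(encoding))


def fits_codepointwise(s: str) -> bool:
    """Check every Unicode code point against the %x01-FF range."""
    return all(0x01 <= ord(c) <= 0xFF for c in s)


# "ü" is U+00FC: inside %x01-FF, but not ASCII (ASCII ends at 0x7F).
print("ü".isascii(), fits_codepointwise("ü"))

# "日" is U+65E5: its UTF-8 bytes all fall inside %x01-FF,
# but the code point itself does not.
print(fits_bytewise("日", "utf-8"), fits_codepointwise("日"))
```

Read bytewise, the range accepts any UTF-8 text without a NUL byte; read as code points, it stops at Latin-1. Neither reading matches ASCII.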

It seems it would be better defined as any character except semicolon and doublequote.
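A rough sketch of what that could look like, written as Python regular expressions. This assumes the XML Char production quoted above as the base set; the names `CURRENT_ASCII_STRING` and `PROPOSED_STRING` are only illustrative and do not come from the spec.

```python
import re

# Current grammar, read as code points: %x01-FF minus semicolon (x3B).
# Note that it still admits the doublequote (x22) and control characters.
CURRENT_ASCII_STRING = re.compile(r'[\x01-\x3A\x3C-\xFF]+')

# Possible replacement (an assumption, not spec text): the XML Char
# production minus doublequote (x22) and semicolon (x3B).
PROPOSED_STRING = re.compile(
    r'[\t\n\r\x20-\x21\x23-\x3A\x3C-\uD7FF'
    r'\uE000-\uFFFD\U00010000-\U0010FFFF]+'
)


def is_current_ascii_string(s: str) -> bool:
    return CURRENT_ASCII_STRING.fullmatch(s) is not None


def is_proposed_string(s: str) -> bool:
    return PROPOSED_STRING.fullmatch(s) is not None


# A filename with non-Latin-1 characters, as it might appear in a title attribute.
name = "Bücher/страница.png"
print(is_current_ascii_string(name), is_proposed_string(name))

# A string containing a doublequote, which would end a delimited-string early.
print(is_current_ascii_string('a"b'), is_proposed_string('a"b'))
```

With this definition, delimited-string can no longer contain its own delimiter, and filenames outside Latin-1 are accepted.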