kba / hocr-spec

The hOCR Embedded OCR Workflow and Output Format
http://kba.github.io/hocr-spec/1.2/
72 stars 20 forks source link

2.0: Replace title= props with data-ocr-* attributes #77

Open kba opened 7 years ago

kba commented 7 years ago

Reusing the title= attribute of HTML elements for OCR-specific values is bad practice. It's understandable since at the time of hOCR's initial development, there were few mechanisms to extend HTML, but in HTML5, there are quite a few.

In a (possible) next major revision of the standard, we could use data-ocr-* attributes for that purpose.

<span id="line1" class="ocr_line" title="bbox 0 0 100 100">...</span>

could be expressed as

<span id="line1" data-ocr-tag="line" data-ocr-bbox="[0,0,100,100]"> ... </span>

This is more verbose but it would make it much easier to specify behavior and work with the content, i.e. in Javascript, you could do:

var line = document.querySelector("#line1");
var bbox = JSON.parse(line.dataset.ocrBbox);
var width = ocrBbox[2] - ocrBbox[0];
zuphilip commented 7 years ago

I think the data-ocr-* attributes would be a good way to continue. But is there any reason to change the class as well? This is standard HTML and has very good support like document.getElementsByClassName("ocr_line").

kba commented 7 years ago

It would make it easier to map between formats (ALTO) and serializations, if the OCR application profile of the HTML would be uniform, i.e. you wouldn't force a naming convention on class, id or title.