kba / hocr-spec

The hOCR Embedded OCR Workflow and Output Format
http://kba.github.io/hocr-spec/1.2/

Major restructuring #84

Closed kba closed 7 years ago

kba commented 7 years ago

I went ahead and aggressively restructured and expanded the spec over the weekend. This is a big change and touches a lot of issues but since I was in the flow, I decided to just keep going.

Snapshot of current commit: https://rawgit.com/kba/hocr-spec/gen-defs/1.2/index.html

New top level structure

For the formal definition of elements/properties, I created a YAML file that contains info on relations, examples, grammar and categories. A Python script and templates then generate a definition list for each element/property, which is included in the spec.
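As a rough illustration of that generation step, here is a minimal sketch. The dictionary stands in for the parsed YAML file; its keys (`kind`, `categories`, `grammar`, `example`), the `layout` category, the template markup and the function names are all assumptions made for this example, not the schema or script actually used in the PR.

```python
from html import escape
from string import Template

# Hypothetical stand-in for the parsed YAML definitions. The real file's
# schema is not shown in this thread, so these keys are assumptions.
DEFINITIONS = {
    "bbox": {
        "kind": "property",
        "categories": ["layout"],
        "grammar": "bbox x0 y0 x1 y1",
        "example": "bbox 0 0 100 200",
    },
}

# Hypothetical template for one definition list.
DL_TEMPLATE = Template(
    '<dl class="${kind}-def">\n'
    "  <dt>Name</dt><dd><dfn>${name}</dfn></dd>\n"
    "  <dt>Categories</dt><dd>${categories}</dd>\n"
    "  <dt>Grammar</dt><dd><code>${grammar}</code></dd>\n"
    "  <dt>Example</dt><dd><code>${example}</code></dd>\n"
    "</dl>"
)


def render_definition(name, info):
    """Render one element/property definition as an HTML definition list."""
    return DL_TEMPLATE.substitute(
        kind=escape(info["kind"]),
        name=escape(name),
        categories=escape(", ".join(info["categories"])),
        grammar=escape(info["grammar"]),
        example=escape(info["example"]),
    )


def render_all(definitions):
    """Render every definition, sorted by name for a stable output order."""
    return "\n".join(
        render_definition(name, definitions[name]) for name in sorted(definitions)
    )
```

Keeping the data separate from the markup means a definition is written once and can be listed under several categories without duplicating its text.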

Still lots to do but it's in a state where I'd love to get feedback.

amitdo commented 7 years ago

As a general note, it looks great and very professional!

zuphilip commented 7 years ago

Some remarks:

amitdo commented 7 years ago

I suggest moving the 'Logical Elements' section further down. It is less significant than the other sections, and no OCR engine that we know of currently implements these elements.

amitdo commented 7 years ago

About the grouping of properties (like scan_res and x_scanner): my suggestion is to break up this grouping and instead add 'related properties' to some of the properties.

kba commented 7 years ago

The notes in section 2.2 are IMO more like examples.

How are they examples?

I am not sure if there are really properties that are "required" in a strong sense. It looks like bbox is currently the only property that is required everywhere. But if one uses poly, then bbox will not be used. Originally, bbox was just a "generally recommended" property.

Granted, bbox is not required for all elements, but it doesn't make sense to have an ocr_carea without bbox or poly. We could also link to 'bbox or poly' or similar.

I am a little skeptical that the classifications for the properties are useful.

Can you elaborate? Originally, the spec listed the properties under the category of elements. That led to duplication (e.g. ocr_separator being in floats and typesetting). Now, they are grouped in those categories but can be listed in other categories as well. The list is just everything I could think of, but could be reduced. It makes sense IMHO to be able to say: "ocr_line/ocrx_line can contain any inline properties"

Maybe, we should rather try to indicate the elements on which this property can be used?

You get these if you click on the dfn in the heading for a property. From the perspective of an hOCR processor, it makes more sense IMHO to iterate over the elements and parse the properties according to the element definition, rather than the other way around.
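A minimal sketch of that element-driven approach, using only the standard library. The `ELEMENT_PROPERTIES` table is a hypothetical excerpt written for this example (the spec's real element definitions are not reproduced here), and the sketch assumes that properties are stored in the `title` attribute as semicolon-separated `name value...` entries. It does not handle quoted values that themselves contain spaces or semicolons.

```python
from html.parser import HTMLParser

# Hypothetical excerpt of element definitions: which properties each
# element accepts. The real spec data may differ.
ELEMENT_PROPERTIES = {
    "ocr_page": {"bbox", "image", "ppageno"},
    "ocr_carea": {"bbox", "poly"},
    "ocr_line": {"bbox", "baseline"},
}


def parse_title(title):
    """Split 'bbox 0 0 10 10; ppageno 0' into {'bbox': [...], 'ppageno': [...]}."""
    props = {}
    for chunk in title.split(";"):
        parts = chunk.split()
        if parts:
            props[parts[0]] = parts[1:]
    return props


class HocrCollector(HTMLParser):
    """Collect (class, known properties, unknown property names) per element."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        for cls in (attrs.get("class") or "").split():
            allowed = ELEMENT_PROPERTIES.get(cls)
            if allowed is None:
                continue  # not an element this sketch knows about
            props = parse_title(attrs.get("title") or "")
            known = {k: v for k, v in props.items() if k in allowed}
            unknown = sorted(set(props) - allowed)
            self.elements.append((cls, known, unknown))
```

Looking up the element first and then filtering its properties keeps the per-element rules in one table, which is the direction argued for above.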

zuphilip commented 7 years ago

Section 2.2: The abstract description is followed by a specific example with ocr_page, bbox, ocrp_poly. However, the example does not yet show what is described above it. Maybe we can extend it to an example with a note, i.e.

An hOCR element (in the following: element) is any HTML tag with a class attribute that contains exactly one class name that starts with ocr_ or ocrx_. Non-OCR related HTML content must not use class names that begin with ocr_ or ocrx_.

Example: <span class="ocr_page">

Note: When referring to an HTML tag with class ocr_page, this spec uses the notation <ocr_page>
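The proposed wording can be read as a simple check on the class attribute. The following is a minimal sketch of that reading; the function names `hocr_class` and `notation` are hypothetical and exist only for this illustration.

```python
# Prefixes named in the proposed definition of an hOCR element.
HOCR_PREFIXES = ("ocr_", "ocrx_")


def hocr_class(class_attr):
    """Return the single hOCR class name, or None if there is not exactly one."""
    matches = [c for c in class_attr.split() if c.startswith(HOCR_PREFIXES)]
    return matches[0] if len(matches) == 1 else None


def notation(class_attr):
    """Return the spec-style shorthand, e.g. '<ocr_page>', or None."""
    name = hocr_class(class_attr)
    return f"<{name}>" if name else None
```

Under this reading, `class="highlight ocrx_word"` is an hOCR element because unrelated class names are allowed alongside it, while `class="ocr_line ocrx_line"` is not, because it carries two matching class names.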

amitdo commented 7 years ago

An hOCR element (in the following: element) is...

Is that proper English?

https://www.quora.com/What-is-a-more-modern-way-to-say-hereinafter-referred-to-as

kba commented 7 years ago

These issues already seem pretty detailed, and it's a big PR as it is. I'll merge this and create issues for the wording/notation/property classification, if that's okay with you.

kba commented 7 years ago

@amitdo @zuphilip I created issues for those remarks that have not yet been addressed. Feel free to create more if I forgot something.