kba / hocr-spec

The hOCR Embedded OCR Workflow and Output Format
http://kba.github.io/hocr-spec/1.2/

Clarification for 4.14. The textangle property #101

Open not-implemented opened 7 years ago

not-implemented commented 7 years ago

The angle in degrees by which textual content has been rotated relative to the rest of the page (if not present, the angle is assumed to be zero); rotations are counter-clockwise, so an angle of 90 degrees is vertical text running from bottom to top in Latin script; note that this is different from reading order, which should be indicated using standard HTML properties
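For illustration, here is a minimal Python sketch of reading this property from an hOCR `title` attribute, including the "assumed to be zero" default from the text above. The function names `parse_hocr_title` and `textangle` are hypothetical, and the parser is deliberately simplistic: it assumes properties are separated by `;` and does not handle quoted values that contain semicolons.

```python
def parse_hocr_title(title: str) -> dict[str, list[str]]:
    """Split an hOCR title attribute into {property name: [values]}.

    Simplifying assumption: no quoted value contains a ';'.
    """
    props: dict[str, list[str]] = {}
    for clause in title.split(";"):
        parts = clause.split()
        if parts:
            props[parts[0]] = parts[1:]
    return props


def textangle(title: str) -> float:
    """Return the textangle in degrees, or 0.0 when the property is absent."""
    values = parse_hocr_title(title).get("textangle")
    return float(values[0]) if values else 0.0


print(textangle("bbox 10 20 300 60; textangle 90"))  # 90.0
print(textangle("bbox 10 20 300 60"))                # 0.0
```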

It is still not clear to me whether "textual content" refers to the text in the original image or to the text bboxes in the OCR result, where the rotation direction would be the opposite. If the text on the original page is rotated anti-clockwise, the page (and therefore the OCR result/bboxes) has been rotated clockwise to straighten it.

I guess the textangle refers to the rotation on the original page, right? To be more specific: if the lines in the original image run upwards, is this value positive?
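One way to make the sign question concrete is the following sketch. It rests on two assumptions that this thread does not confirm: the angle describes the text as it appears in the image, and image coordinates have x pointing right and y pointing down. The helper name `baseline_direction` is hypothetical.

```python
import math


def baseline_direction(textangle_deg: float) -> tuple[float, float]:
    """Unit vector along the reading direction in image coordinates.

    Assumes x grows to the right and y grows downward, and that the angle
    is counter-clockwise as seen on screen (per the spec text quoted above).
    """
    theta = math.radians(textangle_deg)
    # y points down, so a counter-clockwise rotation on screen makes y decrease.
    dx = round(math.cos(theta), 12)
    dy = round(-math.sin(theta), 12)
    return (dx, dy)


print(baseline_direction(90))  # (0.0, -1.0): bottom to top, as in the spec's example
```

Under these assumptions, a small positive angle such as 5 degrees gives a negative y component, meaning the line runs upward from left to right in the image.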

By the way, in case it is of interest to anyone: I recently started a "Web based JavaScript GUI library for proofreading/editing hOCR": https://github.com/not-implemented/hocr-proofreader ... The most helpful feature for me in finding OCR errors is the switch between the original image and the hOCR text rendered at the same position. But it's still a prototype with a lot of work left to do ;-)

zuphilip commented 7 years ago

(Note: I can also only guess at the meaning.) I agree that there is the image before the OCR ("text in the original image") and the image after the first steps of the OCR process, on which the boxes are overlaid. If the latter image is derived by a rotation, then we can measure this angle in one direction or the other (however, technically the center of the rotation should also be known). This would also be my first idea for this textangle property.

However, here is an example hOCR from Tesseract where this property occurs (with the value of 90). I haven't yet identified the corresponding area in the images... It also seems that textangle can be a property of elements below the page level, which seems strange to me.

Your hocr-proofreader looks great 🌟! We tried something similar with the ocr-gt-tools and there is @kba's hocrjs.

amitdo commented 7 years ago

The layout analysis usually splits the page to 'blocks'.

The "textual content" refers to a 'block' that contains text.

amitdo commented 7 years ago

Tesseract can identify the rotation (0 / 90 / 180 / 270) of the page, and also the rotation of individual text blocks in the page. A text block can have a rotation that is different from that of other text blocks in the page.
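To illustrate per-block rotation, the sketch below walks a small hOCR fragment and records the textangle of each element. The fragment in `SAMPLE` is hand-written for this example and is not actual Tesseract output; the class name `AngleCollector` is hypothetical.

```python
from html.parser import HTMLParser

# Hand-written illustration: one horizontal block and one rotated block.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 1000 1400'>
  <div class='ocr_carea' title='bbox 50 50 900 300'>
    <span class='ocr_line' title='bbox 60 60 880 100'>horizontal line</span>
  </div>
  <div class='ocr_carea' title='bbox 920 50 980 700'>
    <span class='ocr_line' title='bbox 930 60 970 690; textangle 90'>rotated line</span>
  </div>
</div>
"""


class AngleCollector(HTMLParser):
    """Collect (class, textangle) pairs in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.angles: list[tuple[str | None, float]] = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        angle = 0.0  # an absent textangle is treated as zero
        for clause in (attr_map.get("title") or "").split(";"):
            parts = clause.split()
            if len(parts) == 2 and parts[0] == "textangle":
                angle = float(parts[1])
        self.angles.append((attr_map.get("class"), angle))


collector = AngleCollector()
collector.feed(SAMPLE)
rotated = [(cls, angle) for cls, angle in collector.angles if angle != 0.0]
print(rotated)  # [('ocr_line', 90.0)]
```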

not-implemented commented 7 years ago

@zuphilip / @amitdo ah okay ... good to know. I had also thought textangle only makes sense at page level, to correct skewed scans (and that's the only use case I need).

@zuphilip Cool, I googled for hOCR GUI editors some time ago and didn't find them. I see you had the same problems and similar ideas ;-) I still have to try out "ocr-gt-tools" ... or is there an online demo? Maybe we can combine the projects or at least the ideas ... hocr-proofreader is just a <500 line JS prototype for now, so I am open to anything ;-) Greetings from Munich!

zuphilip commented 7 years ago

I still have to try out "ocr-gt-tools" ... or is there an online demo?

No, there is no online demo, but with the Dockerfile it should be easy to get it running locally, see https://github.com/UB-Mannheim/ocr-gt-tools/blob/master/INSTALL.md#docker-quickstart

Maybe we can combine the projects or at least the ideas ...

I would love to collaborate on code and ideas.

amitdo commented 7 years ago

@not-implemented

https://help.github.com/articles/licensing-a-repository/

amitdo commented 7 years ago

It seems that currently there is no way to express page skew with the hOCR format.

You can get this info with the Tesseract API.

not-implemented commented 7 years ago

@amitdo Oh yes, I forgot the licence. I added the MIT licence now.

Okay, then I misunderstood the textangle option. But then we need a pageskew option or something like that in the spec ;-) (I currently convert OmniPage XML files to hOCR, and OmniPage XML also has a "skew" attribute, and of course this information is needed to display the results properly.)

amitdo commented 7 years ago

https://github.com/tesseract-ocr/tesseract/blob/7b5b16779ad4/ccmain/pageiterator.cpp#L509

zuphilip commented 7 years ago

I currently convert OmniPage XML files to hOCR...

Do you have this conversion as some script or XSLT file? We have some other transformations between different OCR file formats collected in ocr-fileformat...

not-implemented commented 7 years ago

Do you have this conversion as some script or XSLT file?

Yes, it's currently an XSLT implementation with some PHP code ... but it's still a work in progress. I started implementing the GUI first to get a feeling for which information is really needed to display the results properly (and that's why I opened this ticket about page skew ;-)).

I had a look at ocr-fileformat ... and yes, maybe it makes sense to integrate it there. Maybe I'm blindfolded, but where are your stylesheets? The "xslt" folder just has an "alto2.0__alto3.0.xsl".

zuphilip commented 7 years ago

maybe it makes sense to integrate it there

We would love to add it there, PRs are welcomed.

Maybe I'm blindfolded, but where are your stylesheets?

Actually, most of the stylesheets/scripts for validation and transformation are maintained externally and are only pulled in during the installation process. See also https://github.com/UB-Mannheim/ocr-fileformat#license (further ideas and links can be found in the issues).

not-implemented commented 7 years ago

We would love to add it there, PRs are welcomed.

I'll have a deeper look into the project :-)

Also into ocr-gt-tools ... I got the Docker container running, but when pasting a URL, there is a permission denied error (dragging a file into it gives a "Keine URL erkannt" ("No URL recognized") or similar error). I'll have a closer look at it when I have some more time.