Open not-implemented opened 7 years ago
(Note: I can also only guess at the meaning.) I agree that there is the image before the OCR ("text in the original image") and the image after the first steps of the OCR process, on which the boxes are overlaid. If the latter image is derived by a rotation, then we can measure this angle in one direction or the other (although, technically, the midpoint of the rotation should also be known). This would also be my first idea for this textangle property.
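To illustrate why the midpoint matters, here is a minimal sketch (the function name and the sample coordinates are made up for illustration, they are not from hOCR or Tesseract). It rotates the same point by the same angle around two different centers and ends up in two different places, so an angle alone is not enough to map a bbox back onto the original image.

```python
import math

def rotate_point(x, y, angle_deg, cx, cy):
    """Rotate (x, y) by angle_deg around the center (cx, cy).

    The rotation is counter-clockwise in a y-up frame; in image
    coordinates (y pointing down) the same formula looks clockwise.
    """
    a = math.radians(angle_deg)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(a) - dy * math.sin(a),
            cy + dx * math.sin(a) + dy * math.cos(a))

# Same point, same angle, two different (hypothetical) rotation centers.
around_origin = rotate_point(100, 0, 90, 0, 0)
around_center = rotate_point(100, 0, 90, 50, 50)

print(around_origin)  # roughly (0, 100)
print(around_center)  # roughly (100, 100)
```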
However, here is an example hOCR from Tesseract where this property occurs (with the value of 90). I haven't yet identified the corresponding area in the images... It also seems that textangle can be a property of different elements below the page level, which seems strange to me.
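To find out which elements carry the property, something like the following sketch could help. It assumes the usual hOCR convention that properties live in the title attribute as "name value; name value"; the sample markup is invented for illustration, and the naive split on ";" ignores quoted values that may themselves contain a semicolon.

```python
from html.parser import HTMLParser

def parse_title(title):
    """Split an hOCR title attribute into {property: [values...]}."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props

class TextangleFinder(HTMLParser):
    """Collect (class, textangle) for every element that has a textangle."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        props = parse_title(attrs.get("title") or "")
        if "textangle" in props:
            self.found.append((attrs.get("class"), float(props["textangle"][0])))

# Invented sample, not the Tesseract output linked above.
SAMPLE = '''
<div class="ocr_page" title="bbox 0 0 1000 1400">
  <span class="ocr_line" title="bbox 40 100 80 900; textangle 90">rotated</span>
  <span class="ocr_line" title="bbox 100 100 900 140">upright</span>
</div>
'''

finder = TextangleFinder()
finder.feed(SAMPLE)
print(finder.found)
```

Running this over a real hOCR file would list the element classes on which textangle actually appears, together with the bbox-bearing elements to look for in the image.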
Your hocr-proofreader looks great 🌟! We tried something similar with the ocr-gt-tools and there is @kba's hocrjs.
The layout analysis usually splits the page to 'blocks'.
The "textual content" refers to a 'block' that contains text.
Tesseract can identify the rotation (0 / 90 / 180 / 270) of the page, and also the rotation of individual text blocks in the page. A text block can have a rotation that differs from that of other text blocks in the page.
@zuphilip / @amitdo ah okay ... good to know. I also thought textangle only makes sense at page level, to rotate skewed scanned pages (and that's the only use case I need).
@zuphilip Cool, I did some googling about hocr-gui-editors some time ago and didn't find them. I see you had the same problems and similar ideas ;-) I still have to try out "ocr-gt-tools" ... or is there an online demo? Maybe we can combine the projects or at least the ideas ... hocr-proofreader is just a <500 line JS prototype for now, so I am open for anything ;-) Greets from Munich!
I still have to try out "ocr-gt-tools" ... or is there an online demo?
No, there is no online demo, but with the Dockerfile it should be easy to get it running locally, see https://github.com/UB-Mannheim/ocr-gt-tools/blob/master/INSTALL.md#docker-quickstart
Maybe we can combine the projects or at least the ideas ...
I would love to collaborate on code and ideas.
@not-implemented
It seems that currently there is no way to express page skew with the hOCR format.
You can get this info with the Tesseract API.
@amitdo Oh yes, I forgot the licence. I added the MIT licence now.
Okay, then I misunderstood the textangle option. But then we need a pageskew option or something like that in the spec ;-) (I currently convert OmniPage XML files to hOCR; OmniPage XML also has a "skew" attribute, and of course this information is needed to display the results properly)
I currently convert OmniPage XML files to hOCR...
Do you have this conversion as some script or XSLT file? We have some other transformations between different OCR file formats collected in ocr-fileformat...
Do you have this conversion as some script or XSLT file?
Yes, it's currently an XSLT implementation with some PHP code ... but it's still a work in progress. I then started implementing the GUI first, to get a feeling for which information is really needed to display the results properly (and that's why I opened this ticket about page skew ;-)).
I had a look at ocr-fileformat ... and yes, maybe it makes sense to integrate it there. Maybe I'm blindfolded, but where are your stylesheets? The "xslt" folder just has an "alto2.0__alto3.0.xsl".
maybe it makes sense to integrate it there
We would love to add it there, PRs are welcomed.
Maybe I'm blindfolded, but where are your stylesheets?
Actually, most of the stylesheets/scripts for validation and transformation are maintained externally and are only pulled in during the installation process. See also https://github.com/UB-Mannheim/ocr-fileformat#license (further ideas and links can be found in the issues).
We would love to add it there, PRs are welcomed.
I'll have a deeper look into the project :-)
Also into ocr-gt-tools ... I got the Docker container running, but when pasting a URL, there is a permission denied error (dragging a file into it gives a "Keine URL erkannt" ("No URL recognized") or similar error). I'll have a closer look at it when I have some more time.
It is still not clear to me whether "textual content" refers to the "text in the original image" or to "the text bboxes in the OCR result", where the rotation direction would be the opposite. If the text on the original page is rotated anti-clockwise, the page (and therefore the OCR result/bboxes) has been rotated clockwise to straighten it.
I guess textangle refers to the rotation on the original page, right? To be more specific: if the lines in the original image run upwards, is this value positive?
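To make the question concrete, here is a small sketch of the convention I have in mind. It is an assumption, not something confirmed by the spec in this thread: the angle is measured counter-clockwise on the original image, so a line that runs upwards to the right gets a positive value. The function name and the sample baselines are hypothetical.

```python
import math

def line_angle_ccw(x0, y0, x1, y1):
    """Angle in degrees of the baseline from (x0, y0) to (x1, y1).

    Coordinates are image coordinates (y grows downwards), so the y
    difference is negated. Under the assumed convention, positive means
    the line runs upwards to the right (counter-clockwise).
    """
    return math.degrees(math.atan2(-(y1 - y0), x1 - x0))

# Hypothetical baselines on a page that is 1000 px wide.
rising = line_angle_ccw(0, 100, 1000, 90)    # right end is higher on the page
falling = line_angle_ccw(0, 100, 1000, 110)  # right end is lower on the page

# To straighten the page, it would have to be rotated by the negated angle.
deskew = -rising

print(rising, falling, deskew)
```

If the value instead described the rotation that was applied to the OCR result, the sign would simply flip, which is exactly the ambiguity I am asking about.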
By the way: If it is interesting for someone: I currently started a "Web based JavaScript GUI library for proofreading/editing hOCR": https://github.com/not-implemented/hocr-proofreader ... the most helpful feature for me to find OCR errors, is the switch between the original image and the hOCR-text rendered at the same position. But it's still a prototype and a lot of work to do ;-)