ocropus / hocr-tools

Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

How to support vertical text? #54

Open dinosauria123 opened 8 years ago

dinosauria123 commented 8 years ago

Thank you for answering me every time.

I am trying to convert a Japanese text image to PDF with hocr-pdf. Japanese uses both vertical and horizontal writing styles.

hocr-pdf does not seem to support vertical text. It shows only the position of the last single letter of each word. How can vertical text be supported?

jp_vert.jpg jp_vert.jpg.json.txt jp_vert.hocr.txt jp_vert.pdf

dinosauria123 commented 8 years ago

Does it have to use the CSS3 text layout attributes?

For Japanese, that would be "vertical-rl".

stweil commented 8 years ago

Maybe. Do normal Japanese web pages also use that attribute for all vertical text? Then it is definitely needed in hOCR, too. But there might be more things missing in hocr-pdf.

What kind of output would you expect from hocr-line when it is applied to vertical text?

stweil commented 8 years ago

It would also be interesting to see what kind of PDF is generated for your test image by ABBYY Fine Reader and Tesseract's built-in PDF creator.

dinosauria123 commented 8 years ago

Thank you for your reply.

"Maybe. Do normal Japanese web pages also use that attribute for all vertical text?"

I think so. Most Japanese web pages use horizontal text, and vertical text is rare. But using the CSS3 text layout attributes is recommended for vertical text.

http://tategaki.github.io/commentaries/2016/01/04/commentary-know-how.html (In Japanese)

"It would also be interesting to see what kind of PDF is generated for your test image by ABBYY Fine Reader and Tesseract's built-in PDF creator."

I made the jp_vert.pdf in my previous post using my gcv2hocr and your hocr-pdf. The hOCR output from Tesseract does not seem to include the text direction. I passed the option -psm 5, but the result is the same.

out.hocr.txt (Tesseract output) jp_vert.pdf (Generated from Tesseract hocr)

I made another PDF, jp_vert2.pdf, using a web PDF conversion service. This web page can make a searchable vertical-writing PDF from Japanese text. https://shimeken.com/tex (In Japanese)

jp_vert2.pdf

"What kind of output would you expect from hocr-line when it is applied to vertical text?"

I want to know how to express text direction in an hOCR file; I could not find an example. If you can convert a PDF file to hOCR, please convert jp_vert2.pdf. The result may show how to express text direction in an hOCR file.

dinosauria123 commented 8 years ago

"The hOCR Embedded OCR Workflow and Output Format" says:

"OCR information and presentation information can be separated by putting the CSS info related to the CSS in an outer element with an ocr_ or ocrx_ class, and then overriding it for the presentation by nesting another SPAN with the actual presentation information inside that:

<span class="ocr_cinfo" style="ocr style"><span style="presentation style"> ... </span></span>

The CSS3 text layout attributes can be used when necessary. For example, CSS supports writing-mode, direction, glyph-orientation ISO15924-based script, text-indent, etc."

But I don't know how to write the direction in hOCR.

kba commented 8 years ago

To tell the web browser to render the lines vertically, you can declare the appropriate CSS in the header, e.g. insert

<style>
    .ocr_line { writing-mode: vertical-rl; }
</style>

into the <head></head> section of the hOCR file. This renders nicely for me with your examples.
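For anyone who wants to script this step, here is a minimal Python sketch that inserts such a style block before the closing head tag of an hOCR document held in a string. The function name `add_vertical_style` and the constant `VERTICAL_STYLE` are made up for illustration and are not part of hocr-tools. The sketch uses `vertical-rl`, the value mentioned earlier in this thread for Japanese.

```python
import re

# Hypothetical constant: the CSS rule discussed above, using vertical-rl.
VERTICAL_STYLE = (
    "<style>\n"
    "    .ocr_line { writing-mode: vertical-rl; }\n"
    "</style>\n"
)


def add_vertical_style(hocr: str, style: str = VERTICAL_STYLE) -> str:
    """Return a copy of `hocr` with `style` inserted just before </head>."""
    match = re.search(r"</head\s*>", hocr, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("no </head> tag found; is this an hOCR file?")
    return hocr[:match.start()] + style + hocr[match.start():]
```

This only changes how a browser displays the file. It does not change the bounding boxes or the reading order stored in the hOCR.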

However: If the OCR engine is not aware that lines are to be segmented vertically and from right to left, this will garble the text. Page segmentation mode 5 (Assume a single uniform block of vertically aligned text) for tesseract is probably the right choice.

Here's what your example looks like for me without the writing-mode:

image

And here's what it looks like with writing-mode: vertical-rl;

image

dinosauria123 commented 8 years ago

I have no experience writing Python code, so it is difficult for me to understand hocr-pdf.

But it seems that hocr-pdf ignores the vertical coordinates of the text defined in the hOCR file when it makes the PDF.

I cannot see box[1], box[3], or box_height in the hocr-pdf code...

Judging from the hocr-pdf output for Japanese vertical text, it can set the bottom-left and bottom-right coordinates of a text area but cannot set the top-left and top-right coordinates.

Is it difficult to set the top-left and top-right coordinates of the text area in hocr-pdf?
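To make the question concrete: an hOCR `bbox x0 y0 x1 y1` property carries all four coordinates, so the height is available alongside the width. The Python sketch below is not hocr-pdf's code. `BBox`, `parse_bbox`, and `looks_vertical` are hypothetical names, and the "taller than wide" test is only an assumed heuristic for spotting a vertical line.

```python
import re
from typing import NamedTuple, Optional


class BBox(NamedTuple):
    """hOCR bounding box: left, top, right, bottom in image pixels."""
    x0: int
    y0: int
    x1: int
    y1: int

    @property
    def width(self) -> int:
        return self.x1 - self.x0

    @property
    def height(self) -> int:
        return self.y1 - self.y0


BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_bbox(title: str) -> Optional[BBox]:
    """Extract the bbox from an hOCR title attribute, or None if absent."""
    m = BBOX_RE.search(title)
    if m is None:
        return None
    return BBox(*(int(v) for v in m.groups()))


def looks_vertical(box: BBox) -> bool:
    """Assumed heuristic: a line box taller than it is wide is vertical."""
    return box.height > box.width
```

For example, `parse_bbox("bbox 100 20 130 400; x_wconf 90")` gives a box 30 pixels wide and 380 pixels high, which this heuristic would treat as vertical.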

dinosauria123 commented 7 years ago

Hi, I opened this issue last year.

Tesseract (4.0 alpha) now marks Japanese vertical text with a lang='jpn_vert' attribute in its hOCR output. What do you think about supporting the lang='jpn_vert' attribute in hocr-pdf?

I have now built my web page of old Japanese camera advertisements using your hocr-pdf.
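As a rough idea of what such support would need first, the Python sketch below collects the ids of hOCR elements whose `lang` attribute ends in `_vert`. It is only an illustration using the standard library, not a patch for hocr-pdf. `VerticalLangFinder` and `find_vertical_elements` are hypothetical names, and since it is not stated here which element Tesseract puts the attribute on, the sketch checks every element.

```python
from html.parser import HTMLParser


class VerticalLangFinder(HTMLParser):
    """Collect ids of elements whose lang attribute ends in '_vert'."""

    def __init__(self):
        super().__init__()
        self.vertical_ids = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        lang = attr.get("lang") or ""
        if lang.endswith("_vert"):
            # The id may be None if the element has no id attribute.
            self.vertical_ids.append(attr.get("id"))


def find_vertical_elements(hocr):
    """Return the ids of all elements marked with a *_vert language."""
    finder = VerticalLangFinder()
    finder.feed(hocr)
    finder.close()
    return finder.vertical_ids
```

A PDF writer could then treat the listed elements as vertical and lay out their text top to bottom instead of left to right.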

http://japanese-ww2-camera-ad.tk/

(You can see the old Japanese camera ads by clicking the blue text)

The JSON output of Google Cloud Vision OCR does not seem to include the text orientation, so I could not determine the text orientation in my gcv2hocr.
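One possible workaround is to guess the orientation from the boxes themselves. The Python sketch below is an assumption-heavy heuristic, not something gcv2hocr does: it takes boxes as plain `(x0, y0, x1, y1)` tuples in reading order (converting the JSON into that form is left out) and checks whether consecutive box centers move more along y than along x. `guess_orientation` is a hypothetical name.

```python
def guess_orientation(boxes):
    """Guess 'vertical', 'horizontal', or 'unknown' from boxes in reading order.

    Each box is an (x0, y0, x1, y1) tuple. The guess compares how far the
    box centers travel along each axis between consecutive boxes.
    """
    dx_total = 0.0
    dy_total = 0.0
    for (ax0, ay0, ax1, ay1), (bx0, by0, bx1, by1) in zip(boxes, boxes[1:]):
        # Distance between the centers of two consecutive boxes, per axis.
        dx_total += abs((bx0 + bx1) - (ax0 + ax1)) / 2
        dy_total += abs((by0 + by1) - (ay0 + ay1)) / 2
    if dx_total == dy_total:
        return "unknown"
    return "vertical" if dy_total > dx_total else "horizontal"
```

Jumps between columns or lines add noise to both totals, so this is only likely to work on blocks with several characters per line.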

jp_vert.jpg jp_vert.jpg.pdf jp_vert.jpg.hocr.txt jp_vert.jpg.json.txt