add option to output <span class=ocr_word> elements to hocr

brobertson commented 5 years ago

This adds a '-w' switch to ocropus-hocr, which causes it to generate <span class=ocr_word> elements, each containing one word's text and validly nested within the appropriate element. It depends on the .llocs files generated by ocropus-rpred. If these are not available, or the switch is not turned on, it uses the old behaviour.

Note that the text output from ocropus-hocr with and without the -w switch may differ. In particular, initial and final spaces are stripped from lines when the -w switch is on, because they tend to generate poor bounding boxes.

amitdo commented 5 years ago

It depends on the .llocs files generated by ocropus-gpageseg

You mean ocropus-rpred.

brobertson commented 5 years ago

Thanks, I've edited the comments. I'll look into how to change this in the source code so that the correction is reflected in the pull request.

kba commented 5 years ago

Looks good, thanks. Will there be whitespace between the word spans? To do html2txt, for screenreaders etc.?

brobertson commented 5 years ago

Yes, there is whitespace between the elements. There is no whitespace between the final </span> and the </span> that closes the hocr_line.
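To make the whitespace behaviour concrete, here is a minimal sketch of how one line could be rendered: word spans separated by a single space, with nothing between the last word and the close of the line. The function name `hocr_line`, the `(text, bbox)` input shape and the `ocr_line` class name are assumptions for illustration, not the code in this pull request.

```python
from html import escape


def bbox_title(bbox):
    """Format a (x0, y0, x1, y1) box as an hOCR title attribute value."""
    return "bbox %d %d %d %d" % tuple(bbox)


def hocr_line(line_bbox, words):
    """Render one line; `words` is a list of (text, (x0, y0, x1, y1))."""
    spans = [
        "<span class='ocr_word' title='%s'>%s</span>"
        % (bbox_title(bbox), escape(text, quote=False))
        for text, bbox in words
    ]
    # One space between word spans, none after the last one.
    return "<span class='ocr_line' title='%s'>%s</span>" % (
        bbox_title(line_bbox),
        " ".join(spans),
    )
```

Because the only text between the word spans is a space, stripping the tags from such a line leaves the words separated by single spaces, which is what an html2txt step or a screen reader would need.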

Apropos formatting, is there a beautifier command that I can run the code through to conform to this project?

amitdo commented 5 years ago

I think kraken also has something like this feature.

kba commented 5 years ago

Yes, there is whitespace between the elements. There is no whitespace between the final </span> and the </span> that closes the hocr_line.

👍

Apropos formatting, is there a beautifier command that I can run the code through to conform to this project?

PEP8. We discussed beautifying the whole code base but decided against it at the time, because it would change every second line and make blaming harder.

amitdo commented 5 years ago

Do you take into account the fact that each 'loc' is just a single point that can be at the start, middle or end of a glyph?

brobertson commented 5 years ago

Amit -- Excellent question.

Because of this issue, the assigned break between words is the midpoint of the space between them. (This is noted in the code comments.) This ensures, or tries to, that no part of a cc of a glyph is cut off. It does mean that the word bounding box has some extra space on either side and that each word bbox is adjacent to the next. I feel this is a good compromise, given the data available, since it can be used for retraining, cropping images of words and so forth.
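The midpoint rule can be sketched as follows. This is one reading of the description above, assuming the break is placed halfway between the recorded positions of the glyphs on either side of the space; the function name `word_boxes` and the `(character, x)` input shape are hypothetical, not taken from the pull request.

```python
def word_boxes(chars, line_x0, line_x1):
    """Split a recognised line into (word, x_start, x_end) triples.

    `chars` is a list of (character, x) pairs, where x is one horizontal
    position somewhere inside the glyph (start, middle or end).
    """
    chars = list(chars)
    # Strip initial and final spaces: they have a glyph on one side only.
    while chars and chars[0][0] == " ":
        chars.pop(0)
    while chars and chars[-1][0] == " ":
        chars.pop()

    words = []
    text, start, prev_x, after_space = "", line_x0, None, False
    for c, x in chars:
        if c == " ":
            after_space = True
            continue
        if after_space:
            # Break halfway between the glyphs flanking the space(s).
            cut = (prev_x + x) / 2.0
            words.append((text, start, cut))
            text, start, after_space = "", cut, False
        text += c
        prev_x = x
    if text:
        words.append((text, start, line_x1))
    return words
```

Each word ends exactly where the next one begins, so the boxes are adjacent and carry some extra space on either side, as described above.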

I'll provide a visualization later today.

brobertson commented 5 years ago

This is a visualization of an example of the word breaking behaviour.

The breaks that occur inside words, such as on the first line, are OCR errors: that is, ocropus-rpred finds a space there, so this code dutifully enters a word break. Similarly, the marginal numbers, 655 and 665, are incorrect because of upstream errors. (I find that these marginal numbers sometimes get lost or clipped by gpageseg unless I really jam the column parameters.)

I'm processing a few thousand pages in the next couple of days, and I'll pass them through this process to ensure it doesn't throw errors and check the visualizations for good word breaks.

brobertson commented 5 years ago

This is the corresponding plaintext output, verifying my analysis of the errors above: aΑ ἄΕ φύει τ' ἄδηλα καὶ φανέντα κρύπτεται· κοὐκ ἔστ' ἄελπτον οὐδὲν, ἀλλ' ἀλίσκεται χώ δεινὸς ὅρκος χαἰ περισκελεῖς φρένες. Κἀγώ γὰρ, ὃς τὰ δείν' ἐκαρτέρουν τότε, 650 βαφῇ σιδηρος ῶς ἐθηλύνθην στόμα πρὸς τῆσδε τῆς γυναικός· οἰκτίρω δέ νιν χήραν παρ' ἐχθροῖς παῖδά τ' ὀρφανὸν λιπεῖν. Ἀλλ' εἴμι πρός τε λουτρὰ καὶ παρακτίους λειμῶνας, ὡς ἂν λύμαθ' ἀγνίσας ἐμὰ s μῆνιν βαρεῖαν ἐξαλύξωμαι θεᾶς· μολών τε χῶρον ἔνθ' ἀν ἀστιβῆ κίχω, κρύψω τόδ' ἔγχος τοὐμὸν, ἔχθισ.ον βελῶν, γαίας ὀρύξας ἔνθα μή τις ὅψεται· ἀλλ' αὐτὸ νὺξ Ἀιδης τε σῳζόντων κάτω. sn0 Σγὼ γὰρ ἐξ οὖ χειρὶ τοῦτ' ἐδεξάμην παρ' κτορος δώρημα δυσμενεστάτου, οὔπω τι κεδνὸν ἔσχον Aργείων πάρα· ἀλλ' ἔοτ' ἀληθὴς ἡ βροτῶν παροιμία· ἐχθρῶν ἄδωρα δῶρα κοὐκ δνήσιμα. Τοιγὰρ τὸ λοιπὸν εἰσόμεσθα μὲν θεοῖς εκειν, μαθησόμεσθα δ' Ἀτρείδας σέβειν. Ἀρχοντές εἰσιν, βσθ' ὑπεικτέον· τί μή ; Καὶ γὰρ τὰ δεινὰ καὶ τὰ καρτερώτατα τιμαῖς ὑπείκει· τοῦτο μὲν νιφοστιβεῖς 57ο χειμῶνες ἐκχωροῦσιν εὐκάρπῳ θέρει· ἐξίσταται δὲ νυκτὸς αἰανὴς κύκλος τῇ λευκοπώλῳ φέγγος ἡμέρᾳ φλέγειν· δεινῶν τ' ἄημα πνευμάτων ἐκοίμισε 34) 'ει LA, ποιεῖ Stoὸaeus ‖ 343 κοὐx LA, οx Stobaeus, Suidas ‖ 64 χαἰ Br., καὶ LA, Stcbaeus, Suidas ‖ 350 ἐκαρτέρουν τότε libri, γρ.

brobertson commented 5 years ago

For what it's worth, it's clear we could improve on this code to generate the 'true' bbox of the word by finding the smallest rectangle around all the ccs within the bbox provided by the routine offered in this pull request. If someone could recommend a library, preferably already imported by Ocropus, that does this or that would be best to modify to this purpose, I'd be happy to work on it for a future pull request.

amitdo commented 5 years ago

https://docs.scipy.org/doc/scipy/reference/ndimage.html

https://github.com/tmbdev/ocropy/blob/d3e5cc60b64d/ocrolib/morph.py
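As a rough illustration of the "true" bbox idea, the sketch below shrinks a word box to the ink inside it using plain NumPy. It is a simplification: it crops to the foreground pixels within the box rather than labelling connected components, so a component that straddles the box edge is clipped instead of kept whole. Labelling with scipy.ndimage.label and scipy.ndimage.find_objects would be the route to whole components. The function name `tight_bbox` and the convention that ink is nonzero are assumptions.

```python
import numpy as np


def tight_bbox(binary, x0, y0, x1, y1):
    """Shrink the box (x0, y0, x1, y1) to the ink it contains.

    `binary` is a 2-D array that is nonzero where there is ink.
    Returns a half-open (x0, y0, x1, y1) box, or None if the box is empty.
    """
    crop = np.asarray(binary)[y0:y1, x0:x1] != 0
    rows = np.flatnonzero(crop.any(axis=1))  # rows containing ink
    cols = np.flatnonzero(crop.any(axis=0))  # columns containing ink
    if rows.size == 0:
        return None
    return (
        x0 + int(cols[0]),
        y0 + int(rows[0]),
        x0 + int(cols[-1]) + 1,
        y0 + int(rows[-1]) + 1,
    )
```

Applied to each box produced by the midpoint rule, this would remove the extra space on either side of a word while leaving the word segmentation itself unchanged.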

zuphilip commented 5 years ago

Sorry, I am late to look at this PR... Actually, there is another PR #283 by @JKamlah to extend the hocr output which will include word boxes but also probabilities.

ocropus-archive / DUP-ocropy

add option to output <span class=ocr_word> elements to hocr #314