Open brobertson opened 6 years ago
It depends on the .llocs files generated by ocropus-gpageseg
You mean ocropus-rpred
Thanks, I've edited the comments. I'll look into how to change this in the source code so that the changes carry over to the pull request.
Looks good, thanks. Will there be whitespace between the word spans? To do html2txt, for screenreaders etc.?
Yes, there is whitespace between the word elements. There is no whitespace between the final word element and the closing tag of the hocr_line.
Apropos formatting, is there a beautifier command that I can run the code through to conform to this project?
I think kraken also has something like this feature.
Yes, there is whitespace between the word elements. There is no whitespace between the final word element and the closing tag of the hocr_line.
👍
Apropos formatting, is there a beautifier command that I can run the code through to conform to this project?
PEP8. We discussed beautifying the whole code base but decided against it at the time, because it would change every second line and make blaming harder.
Do you take into account the fact that each 'loc' is just one spot that can be at the start, middle, or end of a glyph?
Amit -- Excellent question.
Because of this issue, the assigned break between words is the midpoint of the space between them. (This is noted in the code comments.) This tries to ensure that no part of a glyph's connected component (cc) is cut off. It does mean that each word's bounding box has some extra space on either side and that each word bbox is adjacent to the next. I feel this is a good compromise, given the data available, since the boxes can be used for retraining, cropping images of words, and so forth.
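As a minimal sketch of the midpoint rule described above (not the code from this pull request): the function name `word_boxes`, its arguments, and the choice to measure the gap between the loc of the last glyph of one word and the loc of the first glyph of the next are all assumptions made for illustration. It takes the characters of a line and one x position per character, as the locs provide, and returns adjacent horizontal extents for each word.

```python
def word_boxes(chars, locs, line_x0, line_x1):
    """Return (word, x_start, x_end) triples for one text line.

    chars   -- characters of the recognized line (hypothetical input)
    locs    -- one x position per character; the position may fall
               anywhere within the glyph
    line_x0 -- left edge of the line's bounding box
    line_x1 -- right edge of the line's bounding box
    """
    # Group character indices into words, skipping spaces. Leading and
    # trailing spaces therefore never produce a word of their own.
    groups, current = [], []
    for i, ch in enumerate(chars):
        if ch == " ":
            if current:
                groups.append(current)
                current = []
        else:
            current.append(i)
    if current:
        groups.append(current)

    boxes = []
    left = line_x0
    for k, group in enumerate(groups):
        if k + 1 < len(groups):
            # Break halfway between the last loc of this word and the
            # first loc of the next word (assumed interpretation).
            right = (locs[group[-1]] + locs[groups[k + 1][0]]) / 2.0
        else:
            right = line_x1
        boxes.append(("".join(chars[i] for i in group), left, right))
        left = right  # the next word box starts where this one ends
    return boxes
```

Because each box starts where the previous one ends, the boxes tile the line without gaps, which is the "adjacent" behaviour mentioned above.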
I'll provide a visualization later today.
This is a visualization of an example of the word breaking behaviour.
The breaks that occur inside words, such as on the first line, are OCR errors: that is, ocropus-rpred finds a space there, so this code dutifully starts a new word. Similarly, the marginal numbers 655 and 665 are incorrect because of upstream errors. (I find that these marginal numbers sometimes get lost or clipped by gpageseg unless I really jam the column parameters.)
I'm processing a few thousand pages in the next couple of days, and I'll pass them through this process to ensure it doesn't throw errors and check the visualizations for good word breaks.
This is the corresponding plaintext output, verifying my analysis of the errors above: aΑ ἄΕ φύει τ' ἄδηλα καὶ φανέντα κρύπτεται· κοὐκ ἔστ' ἄελπτον οὐδὲν, ἀλλ' ἀλίσκεται χώ δεινὸς ὅρκος χαἰ περισκελεῖς φρένες. Κἀγώ γὰρ, ὃς τὰ δείν' ἐκαρτέρουν τότε, 650 βαφῇ σιδηρος ῶς ἐθηλύνθην στόμα πρὸς τῆσδε τῆς γυναικός· οἰκτίρω δέ νιν χήραν παρ' ἐχθροῖς παῖδά τ' ὀρφανὸν λιπεῖν. Ἀλλ' εἴμι πρός τε λουτρὰ καὶ παρακτίους λειμῶνας, ὡς ἂν λύμαθ' ἀγνίσας ἐμὰ s μῆνιν βαρεῖαν ἐξαλύξωμαι θεᾶς· μολών τε χῶρον ἔνθ' ἀν ἀστιβῆ κίχω, κρύψω τόδ' ἔγχος τοὐμὸν, ἔχθισ.ον βελῶν, γαίας ὀρύξας ἔνθα μή τις ὅψεται· ἀλλ' αὐτὸ νὺξ Ἀιδης τε σῳζόντων κάτω. sn0 Σγὼ γὰρ ἐξ οὖ χειρὶ τοῦτ' ἐδεξάμην παρ' κτορος δώρημα δυσμενεστάτου, οὔπω τι κεδνὸν ἔσχον Aργείων πάρα· ἀλλ' ἔοτ' ἀληθὴς ἡ βροτῶν παροιμία· ἐχθρῶν ἄδωρα δῶρα κοὐκ δνήσιμα. Τοιγὰρ τὸ λοιπὸν εἰσόμεσθα μὲν θεοῖς εκειν, μαθησόμεσθα δ' Ἀτρείδας σέβειν. Ἀρχοντές εἰσιν, βσθ' ὑπεικτέον· τί μή ; Καὶ γὰρ τὰ δεινὰ καὶ τὰ καρτερώτατα τιμαῖς ὑπείκει· τοῦτο μὲν νιφοστιβεῖς 57ο χειμῶνες ἐκχωροῦσιν εὐκάρπῳ θέρει· ἐξίσταται δὲ νυκτὸς αἰανὴς κύκλος τῇ λευκοπώλῳ φέγγος ἡμέρᾳ φλέγειν· δεινῶν τ' ἄημα πνευμάτων ἐκοίμισε 34) 'ει LA, ποιεῖ Stoὸaeus ‖ 343 κοὐx LA, οx Stobaeus, Suidas ‖ 64 χαἰ Br., καὶ LA, Stcbaeus, Suidas ‖ 350 ἐκαρτέρουν τότε libri, γρ.
For what it's worth, it's clear we could improve on this code to generate the 'true' bbox of the word by finding the smallest rectangle around all the ccs within the bbox provided by the routine offered in this pull request. If someone could recommend a library, preferably already imported by Ocropus, that does this or that would be best to modify to this purpose, I'd be happy to work on it for a future pull request.
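One possible starting point, sketched here with plain NumPy under stated assumptions: the function name `tighten_bbox` is hypothetical, and `binary` is assumed to be a 2-D array in which ink pixels are nonzero. It shrinks a given box to the smallest rectangle containing all ink inside it. Note that this is a simplification of the idea above: it looks at ink pixels rather than whole connected components, so a component that extends beyond the input box is clipped at the box edge.

```python
import numpy as np


def tighten_bbox(binary, x0, y0, x1, y1):
    """Shrink (x0, y0, x1, y1) to the ink it contains.

    binary is assumed to be a 2-D array with nonzero ink pixels.
    Coordinates are half-open: x1 and y1 are exclusive.
    """
    region = np.asarray(binary)[y0:y1, x0:x1]
    rows = np.flatnonzero(region.any(axis=1))  # rows containing ink
    cols = np.flatnonzero(region.any(axis=0))  # columns containing ink
    if rows.size == 0:
        # No ink inside the box: return it unchanged.
        return (x0, y0, x1, y1)
    return (
        x0 + int(cols[0]),
        y0 + int(rows[0]),
        x0 + int(cols[-1]) + 1,
        y0 + int(rows[-1]) + 1,
    )
```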
Sorry, I am late to look at this PR... Actually, there is another PR #283 by @JKamlah to extend the hocr output which will include word boxes but also probabilities.
This adds a '-w' switch to ocropus-hocr, which causes it to generate span elements containing each word's text, validly nested within the appropriate line element. It depends on the .llocs files generated by ocropus-rpred. If these are not available, or the switch is not turned on, it uses the old behaviour.
Note that the text output from ocropus-hocr with and without the -w switch might differ. In particular, initial and final spaces are stripped from lines when the -w switch is on, because they tend to generate poor bounding boxes.
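For illustration only, a sketch of what word-level line markup of this kind could look like. The helper name `hocr_line_markup` is hypothetical, and the class names `ocr_line` and `ocrx_word` and the `title='bbox ...'` property follow the hOCR specification; they are assumptions here, not confirmed to match the exact output of this pull request. Word elements are joined with a single space, and no whitespace follows the last one.

```python
from html import escape


def hocr_line_markup(words, line_bbox):
    """Build hOCR-style markup for one line.

    words     -- list of (text, (x0, y0, x1, y1)) pairs
    line_bbox -- (x0, y0, x1, y1) of the whole line
    """
    spans = [
        "<span class='ocrx_word' title='bbox {} {} {} {}'>{}</span>".format(
            x0, y0, x1, y1, escape(text)
        )
        for text, (x0, y0, x1, y1) in words
    ]
    # A single space separates word elements; nothing follows the last.
    return "<span class='ocr_line' title='bbox {} {} {} {}'>{}</span>".format(
        line_bbox[0], line_bbox[1], line_bbox[2], line_bbox[3], " ".join(spans)
    )
```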