foghawk closed 6 years ago
Hi foghawk. I was looking for this too. Do you know of any hOCR editor that already does this? Thanks!
@diegodlh I'm not aware of any GUI tool which both has this feature and outputs hOCR. I haven't looked closely at proprietary software.
I can't make any promises, but in my off time this week I've started trying to add textline (and paragraph) merge to (for a start) the GTK version, and I expect I'll attempt this afterward.
I'm a bit busy with other stuff at the moment, so pull requests would be greatly appreciated!
Logically the user should supply the text and bounding box.
What about word confidence? 0? 100? User-settable (what default)?
What about font properties? Family isn't too important, since it can be left empty and the user will supply it for export when necessary. But it seems unfair to ask the user to guess an appropriate size with no visual clues. Does gIR (or any of the libraries it uses) have the ability to deduce font size from dpi dimensions/font family/text, or at least dpi dimensions from font size/font family/text?
I'd say word confidence should be 100 since the user is likely to have high confidence in the word he entered ;)
For the font size, you should be able to convert the height in pixels of the bounding box to points, using the DPI stored in each HOCRPage.
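The conversion suggested above can be sketched in a few lines. This is only an illustration of the arithmetic (1 point = 1/72 inch), not gImageReader's code; the function names and the `(x0, y0, x1, y1)` bounding box layout are assumptions made for the example.

```python
from typing import Tuple


def pixels_to_points(height_px: float, dpi: float) -> float:
    """Convert a height in pixels to typographic points (1 pt = 1/72 inch)."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return height_px * 72.0 / dpi


def guess_font_size(bbox: Tuple[int, int, int, int], dpi: float) -> float:
    """First guess at a font size from a (x0, y0, x1, y1) bounding box.

    Hypothetical helper: it treats the box height as the font's em height,
    which is only a rough approximation.
    """
    _, y0, _, y1 = bbox
    return round(pixels_to_points(y1 - y0, dpi), 1)
```

For a box 50 pixels tall on a 300 DPI page, this gives 50 × 72 / 300 = 12 pt.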
Good point on the DPI. Unfortunately a) font size conventionally corresponds to the em height, conventionally measured from the bottom of the descenders to the top of the ascenders, which is a problem if the text that needs to be added is (say) all caps (with no descenders), and b) em-height may or may not correlate to the size of any actual glyphs in the font, so there's no guarantee that the visual bounding box will have any relation to the font's bounding box for the same text.
There's no solution for b) except for actually rendering the font and checking the visual size, which is why I asked. But point height would probably be a good enough guess, if not for a). Maybe ask the user to eyeball space for descenders even if none are present?
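One way to approximate "leaving space for descenders" without asking the user is to scale the box when the entered text has no descending letters. The sketch below is a heuristic under loud assumptions: the 20% descender fraction is a rough guess rather than a measured font metric, the descender letter set covers only basic lowercase Latin, and the text is assumed to contain capitals or ascenders. It does nothing about problem b).

```python
# Assumption: lowercase Latin letters that usually drop below the baseline.
DESCENDER_CHARS = set("gjpqy")

# Assumption: descenders occupy roughly 20% of the em height. Real fonts vary.
ASSUMED_DESCENDER_FRACTION = 0.2


def estimate_em_height_px(text: str, box_height_px: float,
                          descender_fraction: float = ASSUMED_DESCENDER_FRACTION) -> float:
    """Estimate the em height in pixels from a visual bounding box height.

    If the text contains a descending letter, the box is taken to span the
    full em height already. Otherwise the box is assumed to cover only the
    part above the baseline and is scaled up to make room for descenders.
    """
    if not 0 <= descender_fraction < 1:
        raise ValueError("descender_fraction must be in [0, 1)")
    if any(ch in DESCENDER_CHARS for ch in text):
        return float(box_height_px)
    return box_height_px / (1.0 - descender_fraction)
```

The result is still in pixels, so it would be converted to points with the page DPI afterwards.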
Hmm I don't see an easy solution besides rendering fonts and checking the metrics with QFontMetrics, but that's terribly inefficient. But I suppose having a first more or less good guess with the point height should be sufficient.
By the way, if you are adding the word to an existing text line, you could of course just use the font family and size of a sibling, if available.
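The sibling fallback could look roughly like this. The dictionary keys `font_family` and `font_size` are hypothetical stand-ins for whatever the real word items expose; they are not gImageReader's data model.

```python
from typing import Mapping, Sequence, Tuple


def font_from_siblings(siblings: Sequence[Mapping[str, object]],
                       fallback_size: float) -> Tuple[str, float]:
    """Return (family, size) from the first sibling word that has a font size.

    If no sibling carries a usable size, return an empty family and the
    caller's fallback size (for example, a guess derived from the box height).
    """
    for word in siblings:
        size = word.get("font_size")
        if size:
            return str(word.get("font_family", "")), float(size)
    return "", float(fallback_size)
```

An empty family matches the earlier observation that the family can be left blank and supplied at export time.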
I thought about it, but I think in practice that will be rare—tesseract's going to miss words in diagrams or on covers, probably not as part of regular text. (Unless this ends up standing in for a split-words option, which it probably shouldn't in the long run.)
I'll see how bad QFontMetrics is. I'm inclined not to worry about efficiency too much here (it's not exactly a CPU-bound operation) if it gives significantly more accurate results than a point-height guess.
As an idea, if the use case is words that tesseract missed because they were part of some graphic which confused it, it might be an idea to pass the region the user selects to tesseract, and then use the result to populate the attributes.
@manisandro: The idea of re-recognizing is a good one, definitely worth implementing. On the other hand in my opinion it does not replace the need to manually create nodes, e.g. to mitigate recurring misdetection of a page's structure. I vote for both.
@foghawk Are you working on this / planning to do so? Just for planning, I'd like to push out a new version soonish.
@manisandro No, sorry, real life's picked up and I'm not likely to get anything done in the immediate future. Go ahead as you please.
@foghawk Ok thanks for the notice!
Implemented for Qt
Done also for Gtk
I'm sorry to necrobump, but how exactly can one edit the hOCR source from within gImageReader? I can't do that from gIR 3.3.1 (GTK).
I can’t select a portion from the image (a rectangle box), nor add another element in the tree (top part of the output pane), nor the source text (bottom part of the output pane).
Tesseract misses some words and I’d like to fix that.
Thanks in advance! :smiley:
Update: You need to right-click on the parent Textline element in the top part of the output pane, then choose Add word, select a part of the image (create a rectangle) and enter the word.
Add word elements (and associated textlines and paragraphs) in hOCR mode, allowing user to specify bounding box, text, and properties, for cases where tesseract performs poorly (e.g., diagrams and full-color images).
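For reference, a manually added word boils down to one more `ocrx_word` span in the hOCR output. The sketch below builds such a span from user-supplied values. `make_hocr_word` is a hypothetical helper, not part of gImageReader; `bbox` and `x_wconf` are the usual hOCR word properties, while `x_font` and `x_fsize` follow the names Tesseract writes when font info is enabled, and whether gImageReader reads them in exactly this form is an assumption.

```python
from html import escape
from typing import Optional, Tuple


def make_hocr_word(word_id: str, text: str, bbox: Tuple[int, int, int, int],
                   confidence: int = 100, font_size: Optional[float] = None,
                   font_family: str = "") -> str:
    """Build an hOCR ocrx_word span from user-supplied values.

    Confidence defaults to 100, following the suggestion in the thread that
    a manually entered word deserves full confidence.
    """
    x0, y0, x1, y1 = bbox
    if x1 <= x0 or y1 <= y0:
        raise ValueError("bounding box must have positive width and height")
    props = [f"bbox {x0} {y0} {x1} {y1}", f"x_wconf {int(confidence)}"]
    if font_family:
        props.append(f"x_font {font_family}")
    if font_size is not None:
        props.append(f"x_fsize {font_size:g}")
    # Escape attribute values and text so the span stays well-formed markup.
    title = escape("; ".join(props), quote=True)
    return (f'<span class="ocrx_word" id="{escape(word_id, quote=True)}" '
            f'title="{title}">{escape(text)}</span>')
```

The font properties are optional here because, as discussed above, the family can be left empty and the size may only be a guess.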