manisandro / gImageReader

A Gtk/Qt front-end to tesseract-ocr.
GNU General Public License v3.0

Add words in hOCR editor #148

Closed: foghawk closed this issue 6 years ago

foghawk commented 7 years ago

Add word elements (and associated textlines and paragraphs) in hOCR mode, allowing user to specify bounding box, text, and properties, for cases where tesseract performs poorly (e.g., diagrams and full-color images).

diegodlh commented 7 years ago

Hi foghawk. I was also looking for this. Do you know of any hOCR editor that already does this? Thanks!

foghawk commented 7 years ago

@diegodlh I'm not aware of any GUI tool which both has this feature and outputs hOCR. I haven't looked closely at proprietary software.

I can't make any promises, but in my off time this week I've started trying to add textline (and paragraph) merging, to the GTK version for a start, and I expect I'll attempt this afterward.

manisandro commented 7 years ago

I'm a bit busy with other stuff at the moment, so pull requests would be greatly appreciated!

foghawk commented 6 years ago

Logically the user should supply the text and bounding box.

What about word confidence? 0? 100? User-settable (what default)?

What about font properties? Family isn't too important, since it can be left empty and the user will supply it for export when necessary. But it seems unfair to ask the user to guess an appropriate size with no visual clues. Does gIR (or any of the libraries it uses) have the ability to deduce font size from dpi dimensions/font family/text, or at least dpi dimensions from font size/font family/text?

manisandro commented 6 years ago

I'd say word confidence should be 100 since the user is likely to have high confidence in the word he entered ;)

For the font size, you should be able to convert the height in pixels of the bounding box to points, using the DPI stored in each HOCRPage.
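The pixel-to-point conversion suggested here can be sketched as follows. This is only an illustration in Python, not gImageReader code: the function name and the tuple layout of the bounding box are assumptions, and the only fixed fact used is that one point is 1/72 of an inch.

```python
def bbox_height_to_points(bbox, dpi):
    """Convert the height of a pixel bounding box to typographic points.

    bbox is assumed to be (x0, y0, x1, y1) in pixels, as in an hOCR
    'bbox' property; dpi is the page resolution in pixels per inch.
    """
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    x0, y0, x1, y1 = bbox
    height_px = y1 - y0
    # pixels / dpi gives inches; one inch is 72 points.
    return height_px * 72.0 / dpi


# A box 50 px tall on a 300 DPI page is 12 pt tall.
print(bbox_height_to_points((100, 200, 400, 250), 300))  # 12.0
```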

foghawk commented 6 years ago

Good point on the DPI. Unfortunately, a) font size conventionally corresponds to the em height, usually measured from the bottom of the descenders to the top of the ascenders, which is a problem if the text to be added is (say) all caps, with no descenders; and b) the em height may or may not correlate with the size of any actual glyphs in the font, so there's no guarantee that the visual bounding box will bear any relation to the font's bounding box for the same text.

There's no solution for b) except for actually rendering the font and checking the visual size, which is why I asked. But point height would probably be a good enough guess, if not for a). Maybe ask the user to eyeball space for descenders even if none are present?
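One rough way to handle a) without rendering the font is to scale the point height up when the text contains no descending letters. The sketch below is an assumption-heavy illustration, not what gImageReader does: the descender set covers only a few Latin letters, and the `descender_fraction` default of 0.2 is a made-up typical value that varies by font. It also does nothing for text without ascenders, and nothing for b).

```python
# Latin lowercase letters that usually descend below the baseline (assumed set).
DESCENDER_CHARS = set("gjpqy")


def estimate_font_size_pt(text, bbox_height_px, dpi, descender_fraction=0.2):
    """Guess a font size in points from the visual height of a word.

    descender_fraction is a hypothetical share of the em height taken up
    by descenders; real fonts differ, so treat the result as a first guess.
    """
    height_pt = bbox_height_px * 72.0 / dpi
    if any(ch in DESCENDER_CHARS for ch in text):
        # The box already spans descender to ascender: use it directly.
        return height_pt
    # No descenders: the box covers only the part above the baseline,
    # so scale up to make room for the missing descender space.
    return height_pt / (1.0 - descender_fraction)


print(round(estimate_font_size_pt("gap", 50, 300), 2))
print(round(estimate_font_size_pt("TITLE", 40, 300), 2))
```

With these assumed numbers, a 40 px all-caps word and a 50 px word with descenders on a 300 DPI page come out at roughly the same size, which is the effect of asking the user to "eyeball" descender space.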

manisandro commented 6 years ago

Hmm, I don't see an easy solution besides rendering the fonts and checking the metrics with QFontMetrics, but that's terribly inefficient. Still, I suppose a first, more or less good guess from the point height should be sufficient.

By the way, if you are adding the word to an existing text line, you could of course just use the font family and size of a sibling, if available.

foghawk commented 6 years ago

I thought about it, but I think in practice that will be rare—tesseract's going to miss words in diagrams or on covers, probably not as part of regular text. (Unless this ends up standing in for a split-words option, which it probably shouldn't in the long run.)

I'll see how bad QFontMetrics is. I'm inclined not to worry about efficiency too much here (it's not exactly a CPU-bound operation) if it gives significantly more accurate results than a point-height guess.

manisandro commented 6 years ago

One idea: if the use case is words that tesseract missed because they were part of some graphic that confused it, the region the user selects could be passed to tesseract, and the result used to populate the attributes.

SantosSi commented 6 years ago

@manisandro: The idea of re-recognizing is a good one, definitely worth implementing. On the other hand, in my opinion it does not replace the need to create nodes manually, e.g. to mitigate recurring misdetection of a page's structure. I vote for both.

manisandro commented 6 years ago

@foghawk Are you working on this / planning to do so? Just for planning, I'd like to push out a new version soonish.

foghawk commented 6 years ago

@manisandro No, sorry, real life's picked up and I'm not likely to get anything done in the immediate future. Go ahead as you please.

manisandro commented 6 years ago

@foghawk Ok thanks for the notice!

manisandro commented 6 years ago

Implemented for Qt

manisandro commented 6 years ago

Done also for Gtk

tukusejssirs commented 3 years ago

I’m sorry to necrobump, but how exactly can one edit the hOCR source from within gImageReader? I can’t do that in gIR 3.3.1 (GTK).

I can’t select a portion of the image (a rectangular box), add another element to the tree (top part of the output pane), or edit the source text (bottom part of the output pane).

Tesseract misses some words and I’d like to fix that.

Thanks in advance! :smiley:

Update: You need to right-click the parent Textline element in the top part of the output pane, choose Add word, select a part of the image (draw a rectangle), and enter the word.
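For readers who want to see what such a manually added word might look like in the hOCR source, here is a minimal Python sketch using only the standard library. It is not gImageReader's implementation: the helper name `add_word` is hypothetical, and the exact set of properties gImageReader writes is an assumption. The `bbox`, `x_wconf`, and `x_fsize` property names and the `ocrx_word` / `ocr_line` classes come from the hOCR format, and the confidence default of 100 follows the suggestion made earlier in this thread.

```python
import xml.etree.ElementTree as ET


def add_word(line, text, bbox, conf=100, font_size=None):
    """Append an ocrx_word span to an hOCR text line element.

    bbox is (x0, y0, x1, y1) in pixels. conf defaults to 100 because the
    user typed the word. font_size (in points) is optional.
    """
    props = ["bbox %d %d %d %d" % tuple(bbox), "x_wconf %d" % conf]
    if font_size is not None:
        props.append("x_fsize %d" % round(font_size))
    word = ET.SubElement(
        line, "span", {"class": "ocrx_word", "title": "; ".join(props)}
    )
    word.text = text
    return word


# A stand-in for a text line that tesseract detected but left incomplete.
line = ET.fromstring("<span class='ocr_line' title='bbox 90 190 800 260'/>")
add_word(line, "Figure", (100, 200, 400, 250), font_size=12)
print(ET.tostring(line, encoding="unicode"))
```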