manisandro / gImageReader

A Gtk/Qt front-end to tesseract-ocr.
GNU General Public License v3.0

Poorly combined characters in hOCR pane; font issue? #613

Closed by n8marti 1 year ago

n8marti commented 1 year ago

If I use a keyboard that outputs a base character followed by a combining accent (e.g. U+0069 U+0300, or i + ̀ ), the hOCR document pane does a poor job of rendering the decomposed sequence as a single glyph. In the following image the first "ì" is generated with the above sequence from one keyboard, while the second one is generated from a different keyboard, which outputs the single precomposed character U+00EC: [image]
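To make the difference between the two inputs concrete, here is a minimal Python sketch using only the standard `unicodedata` module. It shows that the two keyboards produce different code point sequences for the same visible letter, and that NFC normalization turns the decomposed sequence into the precomposed one. The helper name `describe` is made up for illustration, and nothing here is taken from gImageReader itself; it is only a way to inspect or pre-normalize text outside the application.

```python
import unicodedata

# The two inputs described above: base letter + combining accent,
# versus the single precomposed character.
decomposed = "i\u0300"   # U+0069 followed by U+0300
precomposed = "\u00ec"   # U+00EC


def describe(text):
    """Hypothetical helper: list each code point with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# NFC composes a base + combining mark pair when a precomposed form exists.
composed = unicodedata.normalize("NFC", decomposed)

print(describe(decomposed))
print(describe(precomposed))
print(composed == precomposed)
```

Normalizing to NFC only helps for sequences that have a precomposed equivalent in Unicode; sequences without one still depend on the font and the text shaper to position the accent correctly.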

I believe this must be an issue with the display font used in the hOCR window. Is there a way I can change that font to test this theory? This is a problem because even if I export the hOCR document as PDF using a font that is known to render these characters correctly in other apps (e.g. LibreOffice), they are still rendered incorrectly in the exported PDF.

I was getting similar poor results in gnome-terminal and gedit before changing the fonts used in both places to DejaVu Sans Mono from the default.

n8marti commented 1 year ago

Well, I think I confirmed this in a more general way. I changed the Desktop font in dconf editor from "Ubuntu 11" to a compatible font, "Charis SIL Compact 11". This does fix the problem. Below is the same text, now retyped with the changed Desktop font: [image]

Is there any way gImageReader could add an option to change the app display font? Maybe I need to contact Ubuntu and see if they can improve their font rendering.

manisandro commented 1 year ago

Doable in theory, but common practice is for applications to follow the system font settings. I'd rather not add a setting to override the font at the application level unless there are very strong reasons to do so.