manisandro / gImageReader

A Gtk/Qt front-end to tesseract-ocr.
GNU General Public License v3.0

Remove linebreaks within paragraphs on export to ODT / Apply font size in HTML #666

Open Moini opened 5 months ago

Moini commented 5 months ago

Hi :wave:

I've been wanting to make a PDF zoomable for my e-ink ereader, and I've found that in order to do that, I need text that wraps automatically (else I have to perpetually move the page around to read the text).

I would like to retain font sizes, as the ODT export does (the HTML export does not, for some odd reason, even though the data is in it...?), so I can differentiate titles from paragraphs. With HTML, reflowing works...

For the reflowing to work in ODT, recognized paragraphs must not contain any hard line breaks.

Could you please add an option to remove those from recognized paragraphs?

And add an option to insert/keep hard linebreaks within a paragraph when the length of the line is less than x percent of the paragraph width? Those are usually lines where it makes sense to have that hard break.
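To illustrate the threshold idea, here is a minimal sketch, not gImageReader code. It assumes the paragraph is available as a list of text lines and uses character counts as a stand-in for the real line width (a real implementation would more likely use the recognized bounding boxes). The function name `reflow_paragraph` and the parameter `min_fill` are made up for this example.

```python
def reflow_paragraph(lines, min_fill=0.6):
    """Join the lines of one paragraph, keeping a hard break after short lines.

    A line counts as "short" when its length is below min_fill times the
    longest line in the paragraph (character count as a proxy for width).
    """
    if not lines:
        return ""
    width = max(len(line) for line in lines)
    out = []
    current = ""
    for i, line in enumerate(lines):
        # Append the line to the running text, separated by a single space.
        current = f"{current} {line}" if current else line
        is_last = i == len(lines) - 1
        # Keep the hard break after a short line; always flush at the end.
        if is_last or len(line) < min_fill * width:
            out.append(current)
            current = ""
    return "\n".join(out)
```

With `min_fill=0.6`, a short heading-like line keeps its break, while full-width lines are merged into one reflowable run.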

And / or add an option to apply recognized styling to text in HTML? It's frustrating to have that sit in the title attribute, but not being used... Or am I misunderstanding something?

manisandro commented 5 months ago

Something like the strip line break functionality of the plain text mode should be doable.

Regarding applying styling to the HTML: as far as I know this is how hOCR HTML files are structured, but do feel free to research the format further.
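For anyone post-processing the hOCR output themselves: hOCR stores per-element properties in the `title` attribute as semicolon-separated `name value...` pairs. The sketch below turns a font size property into an inline CSS declaration. It assumes the engine wrote an `x_fsize` property and that its value is in points; whether that property is present depends on the OCR settings. The helper names are hypothetical.

```python
def parse_hocr_title(title):
    """Split an hOCR title attribute into {property_name: [values]}."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if fields:
            props[fields[0]] = fields[1:]
    return props


def font_size_style(title):
    """Return a CSS font-size declaration, or "" if no x_fsize is present."""
    values = parse_hocr_title(title).get("x_fsize")
    if not values:
        return ""
    # Assumption: x_fsize is a size in points.
    return f"font-size: {values[0]}pt;"
```

The returned string could be written into a `style` attribute on the corresponding element, so a browser would render headings and body text at different sizes.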

Moini commented 5 months ago

@manisandro Thanks, I see, it's a special kind of XHTML that is not meant to be viewed in a browser, but to be used for overlay PDFs with an image / text layer. I thought the save dialog meant ordinary HTML. What 'hOCR' in the dropdown meant wasn't clear to me, but it provided recognition of font sizes and paragraphs, according to the available settings, and that was what I had been looking for.

Being able to strip line breaks would help a lot!

lukruh commented 5 months ago

Right now I have a tiny script bound to a hotkey that removes single line breaks from text in the clipboard. I'd welcome a solution in this nice tool (at least for plain text). It could be as simple as replacing "-\n" with "" and then "\n" with " ". Maybe a regex could leave double line breaks ("\n\n") untouched?
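A minimal sketch of that replacement in Python, assuming paragraphs are separated by a blank line ("\n\n") and should stay separated. The function name `strip_line_breaks` is made up for this example; this is not how gImageReader implements its plain text option.

```python
import re


def strip_line_breaks(text):
    """Remove single line breaks but keep blank-line paragraph breaks."""
    # Join words hyphenated across a line break: "exam-\nple" -> "example".
    # The lookahead skips a hyphen that ends a paragraph ("-\n\n").
    text = re.sub(r"-\n(?!\n)", "", text)
    # Replace a newline with a space only when it has no newline on either
    # side, so "\n\n" is left untouched.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text
```

One caveat: the first substitution also merges genuinely hyphenated compounds that happen to break at the hyphen ("well-\nknown" becomes "wellknown"), so a real option would probably need to be switchable.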