manisandro / gImageReader

A Gtk/Qt front-end to tesseract-ocr.
GNU General Public License v3.0

How to split in single page #442

Open TeoColuccio opened 4 years ago

TeoColuccio commented 4 years ago

I have this page. Can I split this A3 scan into two A4 pages during export to PDF? [Screenshot: Schermata da 2020-04-06 16-16-41]

manisandro commented 4 years ago

No, this is currently not possible.

TeoColuccio commented 4 years ago

This would indeed be a very useful feature.

manisandro commented 4 years ago

It would probably belong with a feature for defining a manual page layout before recognition, which is much easier than attempting to fix up the layout afterwards.

TeoColuccio commented 4 years ago

OK, so I also posted the question on the tesseract forum: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/tesseract-ocr/TcbG4-vB8NM

For now, the workaround I found is to first split each page with PDF Arranger and then use gImageReader.
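For scans that are still image files, the same split-before-OCR idea can be scripted. The following is only a minimal sketch, not something gImageReader or PDF Arranger provides: the function names `half_page_boxes` and `split_scan` and the `_page1`/`_page2` file naming are made up for illustration, it assumes the third-party Pillow library is installed, and it assumes the fold sits exactly in the middle of the scan.

```python
from pathlib import Path
from typing import List, Tuple

# (left, upper, right, lower), the order Pillow's Image.crop expects
Box = Tuple[int, int, int, int]


def half_page_boxes(width: int, height: int) -> List[Box]:
    """Return crop boxes for the two halves of a double-page scan.

    A landscape scan (width >= height) is cut into left and right halves;
    a portrait scan is cut into top and bottom halves.
    """
    if width >= height:
        mid = width // 2
        return [(0, 0, mid, height), (mid, 0, width, height)]
    mid = height // 2
    return [(0, 0, width, mid), (0, mid, width, height)]


def split_scan(path: str) -> List[Path]:
    """Save each half of the scan next to the original and return the new paths."""
    from PIL import Image  # third-party dependency: pip install Pillow

    src = Path(path)
    out_paths = []
    with Image.open(src) as img:
        # img.size is (width, height)
        for i, box in enumerate(half_page_boxes(*img.size), start=1):
            out = src.with_name(f"{src.stem}_page{i}{src.suffix}")  # hypothetical naming scheme
            img.crop(box).save(out)
            out_paths.append(out)
    return out_paths
```

Calling `split_scan("scan_a3.png")` (a hypothetical file name) would write `scan_a3_page1.png` and `scan_a3_page2.png`, which can then be opened in gImageReader as separate pages. If the fold is off-center, the boxes would need to be adjusted by hand.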

Jossi2 commented 4 years ago

Prior to OCR with gImageReader, I preprocess all scans using ScanTailor Advanced. It has all the functionality asked for here (and much more). I recommend this workflow highly, as OCR results and visual looks of the resulting documents are much, much better. You will find it here: https://github.com/4lex4/scantailor-advanced

TeoColuccio commented 4 years ago

OK, I'll try it. Until now I have simply used PDF Arranger, which does the job as well.

Golddouble commented 2 years ago

Quoting Jossi2: "Prior to OCR with gImageReader, I preprocess all scans using ScanTailor Advanced. It has all the functionality asked for here (and much more). I recommend this workflow highly, as OCR results and visual looks of the resulting documents are much, much better."

Question: In its output filter, ScanTailor has the section "mode->mixed", which creates a "picture zones layer" (tab "picture zones").

If I understand correctly, ScanTailor then includes two layers when it creates the TIFF output file: a) a picture layer and b) a text layer.

Do you use this function of ScanTailor?

It looks like ScanTailor separates text from pictures better than gImageReader/tesseract does in books that contain a mixture of both.

My guess is that when gImageReader is given such a TIFF generated by ScanTailor, it applies OCR only to the "text layer", so the result would be better. Is that the case? Does gImageReader support the text layer and picture layer output in ScanTailor TIFF files?

I would be very interested in an answer. Thank you.

Jossi2 commented 2 years ago

Sorry, but this is a function of ScanTailor Advanced that I have never used so far, and I'm not sure how it works at all. For my purposes, a layer consisting of the (processed) scan image of the whole page, plus the text layer produced by OCR, has always been sufficient, as I have no need to extract individual pictures from the finished PDF later on.