Open TeoColuccio opened 4 years ago
No, this is currently not possible.
This would indeed be a very useful feature.
It would probably belong with defining a manual page layout before recognition, which is much easier than attempting to fix up the layout afterwards.
OK, so I also posted this on the Tesseract forum: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/tesseract-ocr/TcbG4-vB8NM
For now, a workaround I found is to first split each page with PDF Arranger and then use gImageReader.
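For anyone who would rather script the split step of this workaround than do it by hand, here is a minimal sketch of the arithmetic involved. It only computes the crop rectangles for the two halves of a landscape scan; it is not part of gImageReader or PDF Arranger, and the names `split_spread`, `mm_to_px` and the `gutter` parameter are hypothetical, chosen for illustration. It assumes the scan is a landscape A3 page (420 × 297 mm) with the fold in the exact centre, so each half comes out at roughly A4 portrait size (210 × 297 mm).

```python
from typing import List, Tuple

# A crop box as (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


def mm_to_px(mm: float, dpi: int) -> int:
    """Convert a length in millimetres to pixels at the given scan resolution."""
    return round(mm / 25.4 * dpi)


def split_spread(width: int, height: int, gutter: int = 0) -> List[Box]:
    """Return crop boxes for the left and right halves of a two-page scan.

    `gutter` (hypothetical parameter) is the number of pixels trimmed on
    each side of the centre line, e.g. to drop a binding shadow.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if gutter < 0 or 2 * gutter >= width:
        raise ValueError("gutter must be non-negative and smaller than half the width")

    middle = width // 2
    left_page = (0, 0, middle - gutter, height)
    right_page = (middle + gutter, 0, width, height)
    return [left_page, right_page]


if __name__ == "__main__":
    dpi = 300  # assumed scan resolution
    a3_width, a3_height = mm_to_px(420, dpi), mm_to_px(297, dpi)
    for box in split_spread(a3_width, a3_height):
        print(box)
```

The resulting boxes can be handed to whatever image tool you use for cropping; each half can then be loaded into gImageReader as a separate page.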
Prior to OCR with gImageReader, I preprocess all scans using ScanTailor Advanced. It has all the functionality asked for here (and much more). I highly recommend this workflow, as the OCR results and the visual appearance of the resulting documents are much, much better. You will find it here: https://github.com/4lex4/scantailor-advanced
OK, I'll try it. Until now I have simply used PDF Arranger, which does its job as well.
Question: in its output filter, ScanTailor has the setting "mode->mixed", which creates a "picture zones layer" (tab "picture zones").
If I understand correctly, ScanTailor then includes two layers when it creates the TIFF output file: a) a picture layer and b) a text layer.
Do you use this function of ScanTailor?
It looks like ScanTailor separates text from pictures better than gImageReader/Tesseract does in books that mix pictures and text.
I guess that when gImageReader is given such a TIFF generated by ScanTailor, it applies OCR only to the "text layer", so the result would be better. Is that the case? Does gImageReader support the text layer and picture layer output in ScanTailor TIFF files?
I would be very interested in an answer. Thank you.
Sorry, but this is a function of ScanTailor Advanced I have never used so far, and I'm not sure how it works at all. For my purposes, a layer consisting of the (processed) scan image of the whole page plus the text layer produced by OCR has always been sufficient, as I have no need to extract individual pictures from the finished PDF later on.
I have this page. Can I split this A3 scan into two A4 pages during the export to PDF?