trufanov-nok / scantailor-universal

ScanTailor Universal - a fork based on Enhanced+Featured+Master versions of ST
http://scantailor.org

feature request: integrate pdfsandwich #98

Open test2a opened 2 years ago

test2a commented 2 years ago

in continuation of my previous issue, my current workflow is this:

can we integrate pdfsandwich into scantailor and get the best of both worlds?

Piolie commented 2 years ago

You can feed the tiffs directly to tesseract to get an OCRed PDF. See the tesseract docs.
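As a rough sketch of that route: tesseract can take a plain text file listing image paths (one per line) in place of a single image, and then writes one multi-page searchable PDF. The function names, the temporary `pages.txt` list and the default language `eng` below are illustrative assumptions, not anything ScanTailor provides. File names containing newlines are not handled.

```shell
# Sketch: OCR a folder of ScanTailor output pages into one searchable PDF
# with a single tesseract call. Function names are hypothetical.

# write_page_list DIR LIST
# Writes the .tif files found directly in DIR to LIST, one path per line,
# sorted by name so that pages keep their order.
write_page_list() {
  local dir="$1" list="$2"
  find "$dir" -maxdepth 1 -type f -name '*.tif' | LC_ALL=C sort > "$list"
}

# ocr_pages_to_pdf DIR OUTBASE [LANG]
# Produces OUTBASE.pdf; tesseract appends the ".pdf" extension itself.
ocr_pages_to_pdf() {
  local dir="$1" outbase="$2" lang="${3:-eng}"
  local tmpdir list status
  tmpdir=$(mktemp -d) || return 1
  list="$tmpdir/pages.txt"
  write_page_list "$dir" "$list"
  if [ ! -s "$list" ]; then
    echo "no .tif files found in $dir" >&2
    rm -rf "$tmpdir"
    return 1
  fi
  # The list file stands in for the image argument.
  tesseract "$list" "$outbase" -l "$lang" pdf
  status=$?
  rm -rf "$tmpdir"
  return "$status"
}
```

Something like `ocr_pages_to_pdf ./out book` would then be expected to leave `book.pdf` in the current directory, assuming tesseract and the `eng` language data are installed.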

trufanov-nok commented 2 years ago

I've been working on such a solution for a long time (a year), but based on the DjVu format instead of PDF. I'm a DjVu fan. Technically this should be one more processing step after Output, with page encoding preview, separate text layer/illustration encoding, and OCR (Tesseract engine). Once this feature is done, published and fine-tuned, then maybe someone (it would be great if not me) could adapt the process and replace the DjVu encoder with a PDF encoder (jbig2enc?), or could write a JB2 to JBIG2 converter (which would be more interesting) and assemble a PDF from the DjVu without re-encoding the text layer.

test2a commented 2 years ago

@Piolie @trufanov-nok i understand i can use tesseract to do this, but the point of pdfsandwich is that it is already a pre-packaged tool that does all this and more, from running tesseract for ocr to compressing pdfs.

my argument is simple: take what is already present instead of reinventing the wheel. i do not know, something like

scantailor | tesseract | pdfsandwich

this should do the trick. what i am saying is: take the exported files, pipe the tiff files to tesseract and then run pdfsandwich on the resulting pdf files.

Piolie commented 2 years ago

I'm not a pdfsandwich (or tesseract) user; however, its webpage states that:

Essentially, pdfsandwich is a wrapper script which calls the following binaries: unpaper, convert, gs, hocr2pdf (for tesseract prior to version 3.03), and tesseract.

Given the current state of affairs, I genuinely don't understand the point of all this: ST already does the post-processing (and does it well), so there's no need for unpaper or convert, and the newest versions of tesseract handle the whole OCR-to-PDF process, so there's no need for gs or hocr2pdf either. Isn't scantailor | tesseract what you're looking for? (honest question).

Personally, I think it's better to keep STU development efforts focused on post-processing, especially given there's currently only one person working on it across all forks (although it made me very happy to see DJVU encoding get some love).

I agree with you @trufanov-nok, that it would be nice to have a PDF encoder that respects text & picture segmentation, applies OCR only to text layer, etc. One can dream...

zdenop commented 2 years ago

I would suggest avoiding JBIG2 if you care about accuracy, or providing an alternative option for those who cannot use it. See https://en.wikipedia.org/wiki/JBIG2#Disadvantages

veikk0 commented 2 years ago

@zdenop You're mixing up JBIG2's lossless mode with the lossy mode.

Lossless mode works flawlessly and is the most efficient bitonal compression available for PDF files. Lossy mode is the one with issues, which is also stated in the WP article section you linked.

test2a commented 2 years ago

@Piolie i gave your suggestion a thought and i ended up with this bash script. it "does" most of what pdfsandwich did for me although there is no compression currently. my point is, why isn't scantailor giving a pdf output ? https://github.com/trufanov-nok/scantailor-universal/issues/97

can't we ship with qpdf and tesseract and offer this as an export option? how else is the user expected to use the tif files? sure, it works if you need individual pages, but if you are doing a book scan, don't you want the user to be able to export to a format that can be consumed directly? or uploaded to archive.org, for example. tesseract can also produce text output, so there is that as well. https://tesseract-ocr.github.io/tessdoc/FAQ.html#what-output-formats-can-tesseract-produce

for i in *.tif; do tesseract "$i" "${i%.tif}" pdf; done && qpdf --empty --pages *.pdf -- out.pdf

i'm simply asking, for an average scantailor user, how do you use the tif files?
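A slightly more defensive take on the one-liner above, as a sketch only: it quotes file names, passes qpdf an explicit ordered list of the per-page PDFs (a bare `*.pdf` glob would also match an out.pdf left over from an earlier run), and asks tesseract for a plain-text sidecar next to each PDF. The function names are hypothetical, and combining the `pdf` and `txt` config names in one call assumes a tesseract install that ships both configs.

```shell
# Sketch: per-page OCR to PDF plus text, then merge the PDFs with qpdf.
# Function names are hypothetical.

# page_base FILE: print FILE without a trailing .tif or .tiff
page_base() {
  local f="$1"
  f="${f%.tiff}"
  f="${f%.tif}"
  printf '%s\n' "$f"
}

# ocr_and_merge OUT.pdf PAGE.tif...
ocr_and_merge() {
  local out="$1" page base
  shift
  if [ "$#" -eq 0 ]; then
    echo "no pages given" >&2
    return 1
  fi
  # The loop list is expanded once up front, so rotating the positional
  # parameters inside the loop is safe: each page is replaced by its PDF.
  for page in "$@"; do
    base=$(page_base "$page")
    # Writes "$base.pdf" and "$base.txt"
    tesseract "$page" "$base" pdf txt || return 1
    set -- "$@" "$base.pdf"
    shift
  done
  # "$@" now holds only the per-page PDFs, in page order.
  qpdf --empty --pages "$@" -- "$out"
}
```

Usage would be along the lines of `ocr_and_merge out.pdf *.tif`, run from the ScanTailor output folder.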

Piolie commented 2 years ago

"i'm simply asking, for an average scantailor user, how do you use the tif files?"

You can do OCR and convert to DJVU or PDF using many free or commercial tools: DjVuLibre, minidjvu, DjVu Small Mod, tesseract, ABBYY, etc. The average user will probably convert to PDF using tesseract. If you want to control OCR quality, compression settings and so on, well, you are not the average user ;).