A user-friendly example for a scanned multipage PDF needed

FilipDominec commented 1 year ago

Thank you for this interesting project, which seems to fit my needs exactly, but so far I could not make it work. In the README.md, there is an example command line, but its use is far from straightforward.

Running just recode_pdf --from-pdf scan.pdf --out-pdf TEST.pdf without any hOCR file throws a confusing AttributeError: 'NoneType' object has no attribute 'seek'. Actually I tried to reinstall with three different versions and came here to report a bug.

Then I found another line in the README.md stating that "It is not possible to recode/compress a PDF without hOCR files". This is a crucial piece of information, but it is somewhat hidden. It is also not easy to find out how to generate such a file.

A Google search suggested that I can use tesseract scan.tif scan hocr to generate an hOCR file from a TIF. This would help for a single TIF file, but apparently tesseract does not accept the PDF format.
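One possible workaround for the missing PDF support is to render each page to an image first and run tesseract on every page. The sketch below is an illustration only, not part of archive-pdf-tools: it assumes poppler's pdftoppm and tesseract are installed, and the function names and the "page" file prefix are made up for this example.

```python
import shutil
import subprocess
from pathlib import Path


def rasterize_cmd(pdf, prefix, dpi=300):
    """Command that renders every PDF page to a PNG with poppler's pdftoppm."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(image, out_base):
    """Command that asks tesseract to write hOCR output for one page image."""
    return ["tesseract", str(image), str(out_base), "hocr"]


def page_number(image):
    """Extract the page number from a name such as page-07.png."""
    return int(Path(image).stem.rsplit("-", 1)[1])


def pdf_to_page_hocr(pdf, workdir, dpi=300):
    """Render each page, OCR it, and return the per-page hOCR paths in page order."""
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} is not installed or not on PATH")
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf, workdir / "page", dpi), check=True)
    hocr_files = []
    # Sort numerically so the order does not depend on zero-padding.
    for image in sorted(workdir.glob("page-*.png"), key=page_number):
        out_base = image.with_suffix("")
        subprocess.run(ocr_cmd(image, out_base), check=True)
        # Assumption: recent tesseract versions name the output <out_base>.hocr.
        hocr_files.append(out_base.with_suffix(".hocr"))
    return hocr_files
```

Calling pdf_to_page_hocr("scan.pdf", "work") would leave one hOCR file per page in the work directory. This thread does not cover how to merge those into the single hOCR file that recode_pdf expects, so that step is left out here.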

I suggest that:

1. The README should contain a minimum working example for an ordinary computer-savvy user who followed the Installation instructions and just wants to try recoding a scanned PDF file.
2. The scripts should check for the hOCR file and, if it is missing, print a sensible message about it (and possibly how to generate it).
3. If possible, such an hOCR file could even be auto-generated on the fly whenever it is not provided by the user.
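The second suggestion could look roughly like the following. This is a minimal sketch and not the actual recode_pdf code: the function name, the wording of the message, and the exit code are all assumptions made for illustration.

```python
import os
import sys

# Hypothetical hint text; the tesseract invocation is the one quoted in this thread.
HINT = (
    "recode_pdf needs an hOCR file. For a single image one can be made with: "
    "tesseract scan.tif scan hocr"
)


def check_hocr_path(hocr_path):
    """Return a readable error message if the hOCR file is unusable, else None."""
    if hocr_path is None:
        return "No hOCR file given. " + HINT
    if not os.path.isfile(hocr_path):
        return f"hOCR file not found: {hocr_path}. " + HINT
    return None


def main(argv):
    """Validate the hOCR argument before any PDF work starts."""
    hocr_path = argv[1] if len(argv) > 1 else None
    problem = check_hocr_path(hocr_path)
    if problem:
        print(problem, file=sys.stderr)
        return 2
    return 0
```

Checking up front means the user sees the hint instead of an AttributeError raised deep inside the compression code.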

MerlijnWajer commented 1 year ago

Thank you for the report.

I agree with all three points. I have long planned to build more user-friendly tooling around this technology but haven't gotten to it yet. This is integrated with the Archive.org stack, where I also wrote the entire OCR module (which is FOSS); I'd have to port that somehow and then tie it in to the PDF compression too. A PDF can be compressed without hOCR, but not yet with the current tooling.

Regarding your suggestions:

1. Agreed.
2. Agreed.
3. A mostly empty hOCR could be made and there is a tool to do this, but I am not sure if the compression would be close to what you would like. That is, the OCR process helps with the quality of the compression.

There are a few more things to say on this:

It is easy to make a "stub" hOCR file, but the compression might suffer. I did work on a tool called pdfcomp that just recompresses a given PDF. It is mentioned in this issue: https://github.com/internetarchive/archive-pdf-tools/issues/51 and this issue: https://github.com/ocrmypdf/OCRmyPDF/issues/541#issuecomment-1570593674 - it works but could use some more testing.
You can actually make an hOCR file directly from a PDF that has a text layer using this tool: https://github.com/internetarchive/archive-hocr-tools/blob/master/bin/pdf-to-hocr - but I'm assuming that your scan doesn't have one.
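To make the idea of a "stub" hOCR file concrete, here is a minimal sketch that writes one empty ocr_page element per page. It is not the tool mentioned above, and whether recode_pdf accepts exactly this layout is an assumption that would need testing. The function name and the "stub" ocr-system value are invented for this example, and page sizes are taken to be in pixels.

```python
def stub_hocr(page_sizes):
    """Build an hOCR document with one empty ocr_page per (width, height) pair."""
    pages = []
    for index, (width, height) in enumerate(page_sizes):
        # hOCR stores page geometry in the title attribute as a bbox property.
        title = f"bbox 0 0 {width} {height}; ppageno {index}"
        pages.append(
            f'  <div class="ocr_page" id="page_{index + 1}" title="{title}"></div>'
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        " <head>\n"
        '  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n'
        '  <meta name="ocr-system" content="stub" />\n'
        '  <meta name="ocr-capabilities" content="ocr_page" />\n'
        " </head>\n"
        " <body>\n" + "\n".join(pages) + "\n </body>\n"
        "</html>\n"
    )
```

For example, stub_hocr([(2480, 3508), (2480, 3508)]) describes two A4 pages scanned at 300 dpi with no recognized text. As noted above, compression quality may suffer without real OCR data.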

FilipDominec commented 1 year ago

I processed my scanned document with ocrmypdf - it generates nice searchable text overlay.

But an attempt at retrieving the hOCR file fails:

$ pdf-to-hocr -f scan_searchable.pdf
Traceback (most recent call last):
  File "/home/dominecf/.local/bin/pdf-to-hocr", line 429, in <module>
    process_files(args.infile, args.json_metadata_file)
  File "/home/dominecf/.local/bin/pdf-to-hocr", line 388, in process_files
    metadata = json.load(open(json_metadata_file))
TypeError: expected str, bytes or os.PathLike object, not NoneType
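The traceback shows open() being called with None because no JSON metadata file was passed. One way a script can turn this into a readable message is to mark the argument as required, as in the sketch below. This is not the real pdf-to-hocr code: the -f flag comes from the command above, while the --json-metadata-file spelling is only inferred from args.json_metadata_file in the traceback, and the program name is made up.

```python
import argparse
import json


def build_parser():
    """Argument parser that rejects a missing metadata file up front."""
    parser = argparse.ArgumentParser(prog="pdf-to-hocr-sketch")
    parser.add_argument("-f", "--infile", required=True,
                        help="PDF that already has a text layer")
    # Assumed flag name, derived from args.json_metadata_file in the traceback.
    parser.add_argument("--json-metadata-file", required=True,
                        help="JSON metadata describing the PDF")
    return parser


def load_metadata(path):
    """Load the JSON metadata, failing with a clear message when no path is given."""
    if path is None:
        raise ValueError("a JSON metadata file is required; none was given")
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

With required=True, argparse prints a usage line naming the missing argument and exits, so the user never reaches the TypeError.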

MerlijnWajer commented 1 year ago

Okay, the tooling is really underdocumented. :-(

You first need to make a JSON file from the PDF file so that pdf-to-hocr understands what it is dealing with. The pdfcomp tool that I mentioned does this: https://github.com/internetarchive/archive-pdf-tools/blob/master/bin/pdfcomp

So perhaps you could just try to call pdfcomp on the PDF and see if it does anything sensible? It was made to be plugged into projects like ocrmypdf.

pdfcomp isn't yet a 'first class' citizen of this project, but I think with a small amount of work it can be made quite usable.

internetarchive / archive-pdf-tools

A user-friendly example for a scanned multipage PDF needed #67