internetarchive / archive-pdf-tools

Fast PDF generation and compression. Deals with millions of pages daily.
https://archive-pdf-tools.readthedocs.io/en/latest/
GNU Affero General Public License v3.0

A user-friendly example for a scanned multipage PDF needed #67

Open FilipDominec opened 1 year ago

FilipDominec commented 1 year ago

Thank you for this interesting project, which seems to fit my needs exactly, but so far I could not make it work. In the README.md, there is an example command line, but its use is far from straightforward.

Running just recode_pdf --from-pdf scan.pdf --out-pdf TEST.pdf without any hOCR file throws a confusing AttributeError: 'NoneType' object has no attribute 'seek'. I tried reinstalling three different versions before coming here to report it as a bug.

Then I found another line in the README.md stating that "It is not possible to recode/compress a PDF without hOCR files". This is a crucial piece of information, but it is somewhat hidden. It is also not easy to find out how to generate this required file.

A Google search suggested that I can use tesseract scan.tif scan hocr to generate an hOCR file from a TIF. This would help for a single TIF file, but apparently tesseract does not accept the PDF format.
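One possible workaround for the PDF limitation, sketched here under assumptions rather than taken from the project: render each PDF page to a TIFF with Poppler's pdftoppm, then run the tesseract command above once per page. The function names, the "page" prefix and the 300 dpi default are illustrative, and the glob assumes pdftoppm names its output page-N.tif.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, prefix, dpi=300):
    """Build the pdftoppm (Poppler) call that renders each PDF page to a TIFF."""
    return ["pdftoppm", "-r", str(dpi), "-tiff", str(pdf_path), str(prefix)]


def tesseract_command(image_path):
    """Build the per-page call: tesseract <image> <output base> hocr."""
    image_path = Path(image_path)
    return ["tesseract", str(image_path), str(image_path.with_suffix("")), "hocr"]


def pdf_to_page_hocr(pdf_path, workdir):
    """Render all pages, OCR each one, and return the per-page .hocr paths."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftoppm_command(pdf_path, workdir / "page"), check=True)
    hocr_files = []
    # Assumption: pdftoppm writes page-1.tif, page-2.tif, ... (zero-padded
    # when the document has ten or more pages), so sorting keeps page order.
    for image in sorted(workdir.glob("page-*.tif")):
        subprocess.run(tesseract_command(image), check=True)
        hocr_files.append(image.with_suffix(".hocr"))
    return hocr_files
```

This yields one hOCR file per page; combining them into whatever input recode_pdf expects is a separate step not covered here.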

I suggest that

  1. README should contain a minimal working example for an ordinary computer-savvy user, who followed the Installation instructions and just wants to try recoding a scanned PDF file.
  2. The scripts should check for the hOCR file - and if it is missing, print out a sensible message about it (and possibly how to generate it).
  3. If possible, such an hOCR file could even be auto-generated on the fly whenever not provided by the user.
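
The check proposed in point 2 could look roughly like the sketch below. It is not the project's code: check_hocr_argument and main are hypothetical names, and the way the path is read from argv is only for illustration.

```python
import os
import sys


def check_hocr_argument(hocr_path):
    """Return an error message if the hOCR input is unusable, else None."""
    if hocr_path is None:
        return ("No hOCR file given. Recoding a PDF currently requires one; "
                "for a single image, `tesseract scan.tif scan hocr` produces it.")
    if not os.path.isfile(hocr_path):
        return f"hOCR file not found: {hocr_path}"
    return None


def main(argv):
    # Hypothetical interface: the hOCR path is the only positional argument.
    hocr_path = argv[1] if len(argv) > 1 else None
    error = check_hocr_argument(hocr_path)
    if error:
        print(error, file=sys.stderr)
        return 2
    return 0
```

Failing early like this would replace the AttributeError about 'seek' with a message that names the missing input.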

MerlijnWajer commented 1 year ago

Thank you for the report.

I agree with all three points. I had long planned to make more user-friendly tooling around this technology but haven't gotten to that point yet. This is integrated with the Archive.org stack, where I also wrote the entire OCR module (which is FOSS), which I'd have to somehow port and then tie in to the PDF compression too. The PDF can be compressed without hOCR, but not yet with the current tooling.

Regarding your suggestions:

  1. Agreed
  2. Agreed
  3. A mostly empty hOCR could be made and there is a tool to do this, but I am not sure if the compression would be close to what you would like. That is, the OCR process helps with the quality of the compression.

There are a few more things to say on this:

FilipDominec commented 1 year ago

I processed my scanned document with ocrmypdf - it generates a nice searchable text overlay.

But an attempt at retrieving an hOCR file fails:

$ pdf-to-hocr -f scan_searchable.pdf
Traceback (most recent call last):
  File "/home/dominecf/.local/bin/pdf-to-hocr", line 429, in <module>
    process_files(args.infile, args.json_metadata_file)
  File "/home/dominecf/.local/bin/pdf-to-hocr", line 388, in process_files
    metadata = json.load(open(json_metadata_file))
TypeError: expected str, bytes or os.PathLike object, not NoneType

MerlijnWajer commented 1 year ago

Okay, the tooling is really underdocumented. :-(

You first need to make a JSON file from the PDF file so that pdf-to-hocr understands what it is dealing with. The pdfcomp tool that I mentioned does this: https://github.com/internetarchive/archive-pdf-tools/blob/master/bin/pdfcomp
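
Until that is documented, a guard along these lines would at least turn the TypeError in the traceback above into a readable message. This is a sketch, not the actual pdf-to-hocr code: load_json_metadata is a hypothetical helper, and only the json_metadata_file name is taken from the traceback.

```python
import json


def load_json_metadata(json_metadata_file):
    """Load the JSON metadata, failing with a clear message if none was given."""
    if json_metadata_file is None:
        raise SystemExit(
            "pdf-to-hocr needs a JSON metadata file describing the PDF, "
            "but none was given."
        )
    with open(json_metadata_file) as handle:
        return json.load(handle)
```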

So perhaps you could just try to call pdfcomp on the PDF and see if it does anything sensible? It was made to be plugged into projects like ocrmypdf.

pdfcomp isn't yet a 'first class' citizen of this project, but I think with a small amount of work it can be made quite usable.