Open FilipDominec opened 1 year ago
Thank you for the report.
I agree with all three points. I had long planned to build more user-friendly tooling around this technology but haven't gotten to it yet. This is integrated with the Archive.org stack, where I also wrote the entire OCR module (which is FOSS); I'd have to port that somehow and then tie it in to the PDF compression too. The PDF can be compressed without hOCR, but not yet with the current tooling.
Regarding your suggestions:
There are a few more things to say on this:
It is easy to make a "stub" hOCR file, but the compression might suffer. I did work on a tool called pdfcomp just to recompress a given PDF, mentioned in this issue: https://github.com/internetarchive/archive-pdf-tools/issues/51 and this issue: https://github.com/ocrmypdf/OCRmyPDF/issues/541#issuecomment-1570593674 - it works but could see some more testing.
You can actually make an hOCR file directly from a PDF that has a text layer using this tool: https://github.com/internetarchive/archive-hocr-tools/blob/master/bin/pdf-to-hocr - but I'm assuming that your scan doesn't have a text layer.
I processed my scanned document with ocrmypdf - it generates a nice searchable text overlay.
But an attempt at retrieving an hOCR file fails:
$ pdf-to-hocr -f scan_searchable.pdf
Traceback (most recent call last):
  File "/home/dominecf/.local/bin/pdf-to-hocr", line 429, in <module>
    process_files(args.infile, args.json_metadata_file)
  File "/home/dominecf/.local/bin/pdf-to-hocr", line 388, in process_files
    metadata = json.load(open(json_metadata_file))
TypeError: expected str, bytes or os.PathLike object, not NoneType
Okay, the tooling is really underdocumented. :-(
You first need to make a JSON file from the PDF file so that pdf-to-hocr understands what it is dealing with. The pdfcomp tool that I mentioned does this: https://github.com/internetarchive/archive-pdf-tools/blob/master/bin/pdfcomp
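For anyone hitting the same traceback: it appears because the metadata path is None by the time json.load(open(...)) runs. Below is a minimal Python sketch, not taken from pdf-to-hocr itself, of a guard that would turn this into a readable message. The function name load_metadata and the wording of the message are assumptions made purely for illustration.

```python
import json
from typing import Optional


def load_metadata(json_metadata_file: Optional[str]) -> dict:
    """Hypothetical guard; this helper is NOT part of pdf-to-hocr."""
    if json_metadata_file is None:
        # Fail early with a hint instead of a TypeError raised inside open().
        raise SystemExit(
            "No JSON metadata file was given; pdf-to-hocr needs one "
            "describing the PDF before it can produce hOCR."
        )
    with open(json_metadata_file) as fh:
        return json.load(fh)
```

With a guard like this, running the tool without the metadata file would end with a one-line explanation rather than a traceback.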
So perhaps you could just try to call pdfcomp on the PDF and see if it does anything sensible? It was made to be plugged into projects like ocrmypdf.
pdfcomp isn't yet a 'first class' citizen of this project, but I think with a small amount of work it can be made quite usable.
Thank you for this interesting project, which seems to fit my needs exactly, but so far I could not make it work. In the README.md, there is an example command line, but its use is far from straightforward.
Running just
recode_pdf --from-pdf scan.pdf --out-pdf TEST.pdf
without any hOCR file throws a confusing AttributeError: 'NoneType' object has no attribute 'seek'.
Actually, I tried to reinstall with three different versions and came here to report a bug. Then I found another line in the README.md saying that "It is not possible to recode/compress a PDF without hOCR files". This is a crucial piece of information, but it is somewhat hidden. It is also not easy to find out how to generate such a necessary file.
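As a stopgap, a small wrapper could refuse to run when no hOCR file is available, instead of letting the AttributeError surface. This is only a sketch under assumptions: build_recode_cmd is a hypothetical helper, and the --hocr-file flag is assumed from the README example and should be checked against recode_pdf --help before relying on it.

```python
import os
from typing import List, Optional


def build_recode_cmd(in_pdf: str, out_pdf: str, hocr: Optional[str]) -> List[str]:
    """Hypothetical helper: build a recode_pdf command, insisting on hOCR."""
    if hocr is None or not os.path.isfile(hocr):
        # The README states that recoding without hOCR is not possible,
        # so say that explicitly rather than failing later.
        raise ValueError("recode_pdf needs an hOCR file; none was given or found")
    # --from-pdf and --out-pdf appear in this report; --hocr-file is assumed.
    return ["recode_pdf", "--from-pdf", in_pdf, "--hocr-file", hocr, "--out-pdf", out_pdf]
```

The returned list can be passed to subprocess.run once the flags have been verified for the installed version.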
A Google search suggested that I can use
tesseract scan.tif scan hocr
to generate an hOCR file from a TIF. This would help for a single TIF file, but apparently tesseract does not accept PDF format. I suggest that