internetarchive / archive-pdf-tools

Fast PDF generation and compression. Deals with millions of pages daily.
https://archive-pdf-tools.readthedocs.io/en/latest/
GNU Affero General Public License v3.0

Use (not yet released) pdf->hocr conversion to improve compression for existing PDFs #27

Open MerlijnWajer opened 2 years ago

MerlijnWajer commented 2 years ago

If we know where the PDF contains text, we could apply our usual higher-quality hOCR-based compression there.
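To illustrate what "knowing where the PDF contains text" might look like in hOCR terms, here is a minimal sketch that turns one word box from a PDF text layer into an hOCR `ocrx_word` span. It assumes the box is given in PDF points with a bottom-left origin and that the page image was rendered at a known DPI. The function names `pdf_box_to_hocr_bbox` and `hocr_word` are hypothetical and not part of archive-pdf-tools.

```python
from html import escape
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def pdf_box_to_hocr_bbox(box_pt: Box, page_height_pt: float, dpi: int) -> Tuple[int, int, int, int]:
    """Convert a PDF-space box (points, origin bottom-left) to an
    hOCR-style bbox (pixels, origin top-left)."""
    x0, y0, x1, y1 = box_pt
    left = round(x0 * dpi / 72)                       # 72 points per inch
    right = round(x1 * dpi / 72)
    top = round((page_height_pt - y1) * dpi / 72)     # flip the y axis
    bottom = round((page_height_pt - y0) * dpi / 72)
    return (left, top, right, bottom)


def hocr_word(text: str, bbox: Tuple[int, int, int, int]) -> str:
    """Render a single hOCR word element for the given pixel bbox."""
    coords = " ".join(str(v) for v in bbox)
    return f"<span class='ocrx_word' title='bbox {coords}'>{escape(text)}</span>"


# Example: a word one inch from the left and bottom edges of a US Letter page.
bbox = pdf_box_to_hocr_bbox((72, 72, 144, 90), page_height_pt=792, dpi=300)
print(hocr_word("Hello", bbox))
```

A real conversion would also need to group words into lines and paragraphs and to account for page rotation and crop boxes, which this sketch ignores.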

rmast commented 2 years ago

If you could recognize the font and it's a freely available font, then you could replace the invisible text with visible text in that font and remove the jb2. One of the challenges would be the font spacing needed to get characters at exact spots. A normal PDF stores complete words in a font, with coordinates only per word.

Adobe Acrobat Pro even uses the scanned text as a font, but its output also contains many font sizes.

MerlijnWajer commented 2 years ago

I think recognizing the font will be much harder. Tesseract used to be able to do that, but with the LSTM switch they lost that functionality. This particular ticket is about taking the text layers in a PDF (likely glyphless if they come from Tesseract, OCRmyPDF or this repo) and extracting word/line coordinates, so that I can apply the hOCR mask generation code that I wrote earlier to it. The idea behind that mask generation code is that it knows about specific words/lines and their locations, so it can try both a normal mask and a mask of the inverted image, combine that with some heuristics about noise in the mask, and hopefully figure out whether the text is inverted in the image itself (where the text is overlaid).
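The "try both masks and compare" idea can be sketched roughly as follows. This is not the mask generation code from this repo. It is a simplified stand-in that assumes a grayscale image given as a list of rows, a fixed threshold, and a made-up score (ink coverage plus the fraction of isolated pixels) where lower is better. All function names here are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), x1/y1 exclusive
Mask = List[List[bool]]


def crop(gray: List[List[int]], box: Box) -> List[List[int]]:
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in gray[y0:y1]]


def threshold(region: List[List[int]], level: int, invert: bool) -> Mask:
    """Binary mask where True marks pixels assumed to be text."""
    if invert:
        return [[px >= level for px in row] for row in region]
    return [[px < level for px in row] for row in region]


def coverage(mask: Mask) -> float:
    """Fraction of pixels in the box that are marked as text."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total if total else 0.0


def isolated_fraction(mask: Mask) -> float:
    """Fraction of set pixels with no set 4-neighbour (a crude noise measure)."""
    h = len(mask)
    set_px = isolated = 0
    for y in range(h):
        w = len(mask[y])
        for x in range(w):
            if not mask[y][x]:
                continue
            set_px += 1
            has_neighbour = (
                (y > 0 and mask[y - 1][x])
                or (y + 1 < h and mask[y + 1][x])
                or (x > 0 and mask[y][x - 1])
                or (x + 1 < w and mask[y][x + 1])
            )
            if not has_neighbour:
                isolated += 1
    return isolated / set_px if set_px else 1.0


def pick_mask(gray: List[List[int]], box: Box, level: int = 128) -> Tuple[Mask, bool]:
    """Build a normal and an inverted mask for one word box and keep the
    one with the lower score. Returns (mask, inverted)."""
    region = crop(gray, box)
    candidates = []
    for invert in (False, True):
        mask = threshold(region, level, invert)
        score = coverage(mask) + isolated_fraction(mask)
        candidates.append((score, invert, mask))
    _, invert, mask = min(candidates, key=lambda c: c[0])
    return mask, invert
```

The assumption behind the score is that, inside a word box, text pixels are usually the minority and form connected strokes, so the mask that is sparser and less speckled is more likely to be the right polarity. Real scans would need adaptive thresholding and better noise heuristics than this.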

Getting the exact spots for anything that is digitised will be hard since print is usually not exact, I think.