pmaupin / pdfrw

pdfrw is a pure Python library that reads and writes PDFs

invisible text layer #239

Open lbr991 opened 1 year ago

lbr991 commented 1 year ago

I'm trying to convert a non-searchable PDF into a searchable one by superimposing an invisible text layer. I need the bounding boxes of the invisible text to align exactly with the original bounding boxes (extracted by Textract). I also don't want to specify a font and font_size, because then the bboxes wouldn't align perfectly.

Is this possible with pdfrw?

sl2c commented 1 year ago

Everything is possible with pdfrw, but there's no out-of-the-box solution. However, it can be done in a straightforward way using pdfrwx; see e.g. https://github.com/sl2c/pdfrwx/blob/master/hocreditor.py, an example class that inserts an OCR layer, specified in hOCR format, as invisible text in a PDF.
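
For illustration, here is a rough sketch of the underlying idea. It is not the code from hocreditor.py and it does not use any pdfrw or pdfrwx API. It only builds the PDF content-stream operators for one word. A text-showing operator still needs some font to be selected, so instead of choosing the size by hand the sketch derives it from the box height and then stretches the text horizontally (`Tz`) to the box width; `3 Tr` makes the text invisible. The function name `invisible_word_ops`, the font resource name `/F1`, and the assumption that `/F1` is Courier (every glyph 0.6 em wide) are made up for this example. It also assumes Textract-style boxes, normalized to 0..1 with the origin at the top-left corner, and puts the baseline on the bottom edge of the box, which ignores descenders.

```python
def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_word_ops(text, left, top, width, height, page_w, page_h,
                       font_name="/F1"):
    """Return content-stream operators that place `text` invisibly in a box.

    left/top/width/height: box normalized to 0..1, origin at top-left
    (assumed Textract-style). page_w/page_h: page size in PDF points.
    font_name: hypothetical resource name, assumed to be Courier.
    """
    if not text:
        raise ValueError("text must not be empty")

    # Convert to PDF user space: points, origin at the bottom-left corner.
    x = left * page_w
    y = page_h - (top + height) * page_h
    box_w = width * page_w

    # Derive the font size from the box height instead of choosing it by hand.
    font_size = height * page_h

    # Courier is monospaced: each glyph advances 0.6 em.
    natural_w = 0.6 * font_size * len(text)

    # Horizontal scaling (percent) that stretches the text to the box width.
    h_scale = 100.0 * box_w / natural_w

    return (
        f"BT 3 Tr {font_name} {font_size:.2f} Tf {h_scale:.2f} Tz "
        f"1 0 0 1 {x:.2f} {y:.2f} Tm ({escape_pdf_string(text)}) Tj ET"
    )


if __name__ == "__main__":
    # One word on a US Letter page (612 x 792 points).
    print(invisible_word_ops("Hi", 0.1, 0.2, 0.1, 0.05, 612, 792))
```

The resulting string would still have to be appended to the page's content stream, and a Courier font registered under `/F1` in the page resources; neither step is shown here. Text outside ASCII would also need proper encoding handling.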