ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

--redo-ocr doesn't remove previous ocr-text layer made by ocrmypdf #897

Open Mark-Joy opened 2 years ago

Mark-Joy commented 2 years ago

Describe the bug

As the title says, --redo-ocr doesn't remove the OCR text layer that a previous ocrmypdf run added.

To Reproduce

Run OCR once, then run it again on the output with --redo-ocr:

ocrmypdf "in.pdf" "out.pdf" --output-type=pdf --pdf-renderer=hocr --lang=eng --tesseract-oem=1
ocrmypdf "out.pdf" "out-redo-ocr.pdf" --output-type=pdf --pdf-renderer=hocr --lang=eng --tesseract-oem=1 --redo-ocr

files.zip

Expected behavior

The OCR text layer previously added by ocrmypdf should be removed when the --redo-ocr option is used.

System

Installation

pip install ocrmypdf


Mark-Joy commented 2 years ago

It seems that since 390fdf8 the OCR text is packed into a Form XObject. A simple solution is to give that object a recognizable name so that it is easy to detect and remove. Something like:

text_xobj_name = Name('/ocrmypdf-' + str(uuid.uuid4()))
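
To make the detect-and-remove half of that idea concrete, here is a minimal sketch in plain Python. It is not ocrmypdf's code: a plain dict stands in for a page's /Resources /XObject dictionary, and the constant OCR_XOBJ_PREFIX and both helper functions are hypothetical names introduced only for illustration.

```python
import uuid

# Hypothetical marker prefix, taken from the naming suggestion above.
OCR_XOBJ_PREFIX = "/ocrmypdf-"


def make_text_xobj_name() -> str:
    """Return a unique XObject name that carries the marker prefix."""
    return OCR_XOBJ_PREFIX + str(uuid.uuid4())


def strip_ocr_xobjects(xobjects: dict) -> dict:
    """Return a copy of the mapping without marker-prefixed entries."""
    return {
        name: obj
        for name, obj in xobjects.items()
        if not name.startswith(OCR_XOBJ_PREFIX)
    }


# Stand-in for one page's /Resources /XObject dictionary (assumption:
# a real implementation would operate on the PDF objects instead).
page_xobjects = {
    "/Im0": "scanned image",
    make_text_xobj_name(): "old OCR text layer",
}
cleaned = strip_ocr_xobjects(page_xobjects)
```

In a real PDF, dropping the resource entry alone would presumably not be enough: the page content stream would also need the matching Do operator that paints the XObject removed.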

mikeweinberg commented 2 years ago

I ran into this problem with ocrmypdf 14.0.1 while comparing Tesseract 4 and Tesseract 5: I took a PDF that had already been OCRed with Tesseract 4 and redid the OCR with Tesseract 5.