--redo-ocr doesn't remove previous ocr-text layer made by ocrmypdf #897

Mark-Joy commented 2 years ago

Describe the bug

As the title says, --redo-ocr does not remove the OCR text layer left by a previous ocrmypdf run.

To Reproduce

ocrmypdf "in.pdf" "out.pdf" --output-type=pdf --pdf-renderer=hocr --lang=eng --tesseract-oem=1
ocrmypdf "out.pdf" "out-redo-ocr.pdf" --output-type=pdf --pdf-renderer=hocr --lang=eng --tesseract-oem=1 --redo-ocr

files.zip

Expected behavior

The OCR text layer previously made by ocrmypdf should be removed when the --redo-ocr option is used.

System (please complete the following information):

OS: Android 9
Python version: 3.10
OCRmyPDF version: 13.2.0

Installation

pip install ocrmypdf

Mark-Joy commented 2 years ago

It seems that since commit 390fdf8, the OCR text is packed in a Form XObject. A simple solution is to give that object a recognizable name so that it is easy to detect and remove. Something like:

text_xobj_name = Name('/ocrmypdf-' + str(uuid.uuid4()))
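
A minimal sketch of how such a name could be used for detection and removal. It is an illustration only, not OCRmyPDF's actual code: the helper names and the prefix constant are hypothetical, and a plain dict stands in for a page's /Resources /XObject dictionary. A real fix would also have to remove the Do operator in the page content stream that paints the XObject.

```python
import uuid

# Hypothetical marker prefix, taken from the '/ocrmypdf-' + UUID suggestion above.
OCR_XOBJECT_PREFIX = "/ocrmypdf-"


def new_ocr_xobject_name() -> str:
    """Return a unique, recognizable resource name for an OCR text XObject."""
    return OCR_XOBJECT_PREFIX + str(uuid.uuid4())


def is_ocr_xobject(name) -> bool:
    """True if the resource name carries the OCR marker prefix."""
    return str(name).startswith(OCR_XOBJECT_PREFIX)


def strip_ocr_xobjects(xobjects: dict) -> list:
    """Delete marked entries from a name -> object mapping.

    Returns the list of removed names so the caller can also drop
    the matching references from the page content stream.
    """
    removed = [name for name in xobjects if is_ocr_xobject(name)]
    for name in removed:
        del xobjects[name]
    return removed
```

With this, --redo-ocr would only need to look for names that start with the prefix instead of guessing which Form XObject holds the old text layer.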

mikeweinberg commented 2 years ago

I ran into this problem with ocrmypdf 14.0.1 when I tried to compare the performance of Tesseract 5 and Tesseract 4 by redoing OCR with version 5 on a PDF that had previously been OCRed with version 4.
