[Feature]: Option to remove OCR

user1823 commented 1 week ago

Describe the proposed feature

Sometimes, I want to remove the OCR layer from a PDF. However, there is no good way of doing that yet.

Running gs -o out.pdf -sDEVICE=pdfwrite -dFILTERTEXT in.pdf works, but it sometimes increases the file size (a behavior the GS docs also mention). That is undesirable: I am only removing information, so I would expect the file to get smaller.
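For anyone who wants to measure the size change on their own files, here is a minimal Python sketch that wraps the Ghostscript command above. The function names are hypothetical, and it assumes gs is on the PATH. Note that -dFILTERTEXT drops all text from the file, not only OCR text.

```python
import subprocess
from pathlib import Path


def build_gs_command(src: str, dst: str) -> list[str]:
    """Return the Ghostscript invocation quoted in this issue as an argv list."""
    return ["gs", "-o", dst, "-sDEVICE=pdfwrite", "-dFILTERTEXT", src]


def size_change_percent(before: int, after: int) -> float:
    """Percentage change in size; a positive value means the output grew."""
    if before <= 0:
        raise ValueError("before must be a positive size in bytes")
    return (after - before) / before * 100.0


def strip_text_and_report(src: str, dst: str) -> float:
    """Run Ghostscript on src, write dst, and return the size change in percent."""
    subprocess.run(build_gs_command(src, dst), check=True)
    return size_change_percent(Path(src).stat().st_size, Path(dst).stat().st_size)
```

Calling strip_text_and_report("in.pdf", "out.pdf") returns a positive number when the output is larger than the input, which is the case described above.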

I believe that OCRmyPDF already has a way of identifying the OCR text (which is necessary for --redo-ocr). So, implementing a feature to remove OCR should be simple and would fill a gap that is currently left by open-source PDF tools.

I have read the following documented limitation, but it should not be a reason to reject this feature. We can simply document a similar limitation for the new feature too.

In some cases, existing OCR cannot be detected or replaced. Files produced by OCRmyPDF v2.2 or earlier, for example, are internally represented as having visible text with an opaque image drawn on top. This situation cannot be detected.

jbarlow83 commented 1 week ago

If the OCR in your file is the type OCRmyPDF can detect, then the odd combination --redo-ocr --tesseract-timeout 0 just might do it, since it removes the existing OCR to some extent and adds nothing back.

user1823 commented 1 week ago

This didn't work. I used OCRmyPDF to OCR a file and then used it again to try to remove the OCR. However, the OCR layer was not removed.

I used these commands:

ocrmypdf --output-type pdf --max-image-mpixels 1000 --tesseract-downsample-above 3508 in.pdf ocr.pdf

ocrmypdf --output-type pdf --redo-ocr --tesseract-timeout 0 --optimize 0 ocr.pdf un_ocr.pdf

In case you are wondering, removing --optimize 0 didn't help either.
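One way to check whether a text layer survived is to extract the text from the output and see if anything comes back. This is a rough Python sketch, assuming poppler's pdftotext is installed; the function names are hypothetical. It assumes the input is a pure scan, so any extracted text must come from the OCR layer.

```python
import subprocess


def extract_text(pdf_path: str) -> str:
    """Extract text with poppler's pdftotext; '-' sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
        encoding="utf-8",
        errors="replace",
    )
    return result.stdout


def has_text_layer(extracted: str) -> bool:
    """True if the extracted text contains anything besides whitespace.

    pdftotext separates pages with form feeds, which strip() treats as
    whitespace, so a text-free PDF yields False.
    """
    return bool(extracted.strip())
```

With the files above, has_text_layer(extract_text("un_ocr.pdf")) returning True would mean some text is still present in the output.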

ocrmypdf / OCRmyPDF

[Feature]: Option to remove OCR #1435
