Open user1823 opened 1 week ago
If the type of OCR you have is the type OCRmyPDF can detect, then the odd combination --redo-ocr --tesseract-timeout 0
just might do it, since it removes the OCR to some extent and adds nothing back.
This didn't work. I used OCRmyPDF to OCR a file and then ran it again to try to remove the OCR. However, the OCR layer was not removed.
I used these commands:
ocrmypdf --output-type pdf --max-image-mpixels 1000 --tesseract-downsample-above 3508 in.pdf ocr.pdf
ocrmypdf --output-type pdf --redo-ocr --tesseract-timeout 0 --optimize 0 ocr.pdf un_ocr.pdf
In case you are wondering, removing --optimize 0
didn't help either.
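For anyone who wants to reproduce this, here is a rough sketch of one way to check whether a text layer is still there. It assumes pdftotext from poppler-utils is installed, which is a separate tool and not part of OCRmyPDF. The function names text_chars and report_layer are only illustrative.

```shell
# Count the non-whitespace characters that pdftotext can extract from a PDF.
# Assumption: pdftotext (poppler-utils) is on PATH. If it is missing, the
# count comes out as 0, so check that the tool is installed first.
text_chars() {
    pdftotext "$1" - 2>/dev/null | tr -d '[:space:]' | wc -c | tr -d ' '
}

# Turn a character count into a short human-readable verdict.
report_layer() {
    if [ "$1" -gt 0 ]; then
        echo "text layer still present ($1 chars)"
    else
        echo "no extractable text"
    fi
}
```

Running report_layer "$(text_chars un_ocr.pdf)" after the second command should tell you whether any extractable text survived. Note that this only measures what pdftotext can extract, so it cannot tell OCR text apart from text that was in the PDF to begin with.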
Describe the proposed feature
Sometimes, I want to remove the OCR layer from a PDF. However, there is no good way of doing that yet.
Running
gs -o out.pdf -sDEVICE=pdfwrite -dFILTERTEXT in.pdf
works, but it sometimes increases the file size (which is also mentioned in the GS docs). That is not what I want: I am only removing information, so I would expect the file size to go down.

I believe that OCRmyPDF already has a way of identifying the OCR text (which is necessary for --redo-ocr). So, implementing a feature to remove OCR should be simple and would fill a gap that is currently left by open-source PDF tools.
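To make the file size problem with the Ghostscript workaround easy to see, here is a small sketch that wraps the gs command above and prints how much the file grew or shrank. The helper names size_of, size_delta and strip_text are made up for illustration. The only Ghostscript options used are the ones from the command above plus -q to silence the banner.

```shell
# Print a file's size in bytes.
size_of() {
    wc -c < "$1" | tr -d ' '
}

# Print (size of $2) minus (size of $1); a negative number means $2 is smaller.
size_delta() {
    echo $(( $(size_of "$2") - $(size_of "$1") ))
}

# Strip all text with Ghostscript, then report the size change.
# Note: -dFILTERTEXT drops every text object, not only OCR text.
strip_text() {
    gs -q -o "$2" -sDEVICE=pdfwrite -dFILTERTEXT "$1" || return 1
    echo "size change: $(size_delta "$1" "$2") bytes"
}
```

For example, strip_text in.pdf out.pdf writes out.pdf and prints the difference in bytes. A positive number is the case I am complaining about.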
I have read the following documented limitation. However, it should not be a reason to reject the feature requested above. We can simply document a similar limitation for the new feature too.