--unpaper-args and --clean-final

ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

http://ocrmypdf.readthedocs.io/

Mozilla Public License 2.0


--unpaper-args and --clean-final #392

Open sojusnik opened 5 years ago

sojusnik commented 5 years ago

In the release notes it is stated that

The argument --clean-final now implies --clean.

In the documentation however the following commands are shown as an example:

ocrmypdf --clean --clean-final --unpaper-args '--layout double' input.pdf output.pdf
ocrmypdf --clean --clean-final --unpaper-args '--layout double --no-noisefilter' input.pdf output.pdf

So are those examples outdated, or do I misinterpret the statement in the release notes, since in those examples --clean could be omitted?

Additionally, in another thread you wrote

I am thinking that I will change the behavior so that --unpaper-args implies --clean and --clean-final.

Do you still plan to implement that?

jbarlow83 commented 5 years ago

--clean-final does imply --clean now. It was useless without.

--unpaper-args implies --clean but not --clean-final. As such the awkward parameters in the docs are correct for the moment.

--unpaper-args might have some practical uses with only --clean, if you disable mask repositioning and are only trying to improve the OCR quality, but I'm concerned that's a fairly narrow use case. It so happens to be the use case of --clean on its own.
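For readers trying to keep the implications straight, here is a minimal sketch of the rules as stated in this comment: --clean-final implies --clean, and --unpaper-args implies --clean but not --clean-final. The `CleanOptions` class and `resolve_implied` function are hypothetical names used for illustration only; this is not OCRmyPDF's actual option-handling code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CleanOptions:
    """Hypothetical container for the three flags discussed in this thread."""
    clean: bool = False
    clean_final: bool = False
    unpaper_args: Optional[str] = None


def resolve_implied(opts: CleanOptions) -> CleanOptions:
    """Return a copy of opts with the implications described above applied."""
    clean = opts.clean
    # --clean-final implies --clean.
    if opts.clean_final:
        clean = True
    # --unpaper-args implies --clean, but NOT --clean-final.
    if opts.unpaper_args is not None:
        clean = True
    return CleanOptions(
        clean=clean,
        clean_final=opts.clean_final,
        unpaper_args=opts.unpaper_args,
    )
```

Under these rules, passing only --unpaper-args leaves `clean_final` false, which is why the documentation examples still spell out --clean-final explicitly.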

jbarlow83 commented 5 years ago

This is a really messy corner of the interface and I'm not too happy with how it currently works, I've decided. I'll have to see if I can come up with something better. Open to suggestions.

githubnavigator commented 5 years ago

I'm thrilled with the new features for unpaper, so I'm glad to see you're still moving ahead on this functionality. My main usage for this is to split two-up scanned pages into 2 separate pages, removing the black edges and centering/deskewing the text onto two new sheets of paper. (I had a problem with that, but I'll post it under a new case.) If there's a way to bake that functionality into ocrmypdf without needing to use unpaper-args, that would be convenient, as I've been struggling with the args.

jbarlow83 commented 5 years ago

Currently thinking:

A new argument --unpaper will do what --clean --clean-final does now (clean the presentation and OCR images)
--clean, --clean-final, and --unpaper-args will all be removed
A new interface will allow plugins, making it possible to write a short script that has the effect of --unpaper-args. This avoids the messy/error prone nested arguments on the command line and is a lot more powerful. For example Pillow or OpenCV functions could also be used on images. Replicating the existing functionality of --unpaper-args would be a motivating example.
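To illustrate why a short script can be less error-prone than nested quoting on the command line, here is a rough sketch of a helper that assembles an unpaper invocation as an argument list. It is not the proposed plugin interface, which is not specified in this thread; `build_unpaper_command` is a hypothetical name, and the sketch assumes only that unpaper takes its options followed by an input file and an output file.

```python
import shlex
from typing import List


def build_unpaper_command(
    input_image: str, output_image: str, extra_args: str = ""
) -> List[str]:
    """Build an argv list for unpaper (hypothetical helper).

    shlex.split tokenizes the option string the same way a POSIX shell
    would, so "--layout double" becomes two separate arguments and no
    second layer of quoting is needed.
    """
    return ["unpaper", *shlex.split(extra_args), input_image, output_image]
```

The resulting list could be passed to `subprocess.run(cmd, check=True)`; because it is a list rather than a single string, no shell is involved and the options cannot be mis-quoted.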

sojusnik commented 5 years ago

Sounds reasonable! I'm wondering how user-friendly the approach with the plugins will be: Does someone have to know how to code to use the unpaper parameters, as we have now with --unpaper-args, or will it be a more or less simple external file with those parameters in it?

jbarlow83 commented 5 years ago

It would need to be an external Python script/module that calls unpaper.

githubnavigator commented 5 years ago

Oh. Well, if possible, if you could please leave the "--unpaper-args" option in there, I'd really appreciate it. A separate Python script would significantly increase the complexity for me.

I was able to achieve my desired goals using these commands:

mutool poster -x 2 -y 1 in.pdf out.pdf
ocrmypdf --force-ocr --unpaper-args '--layout single' --deskew --clean-final in.pdf out.pdf

sojusnik commented 5 years ago

Do I understand correctly that with the Python script approach it would finally be possible to split pages with unpaper?

jbarlow83 commented 5 years ago

@githubnavigator The idea would be that all of that functionality could be packaged in a short script: ocrmypdf-poster in.pdf out.pdf.

Generally I want to make it easier for people to customize things and attract more contributions.

I am both working on making ocrmypdf usable as a library and making the library extensible. That should make it easier to implement page splitting, but I don't think it would be doable with plugins alone. As it currently looks, the way to do that would be to use components of ocrmypdf as a library in a custom command line driver.
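As a rough idea of what a short ocrmypdf-poster script could look like, the sketch below simply chains the two commands posted earlier in this thread through `subprocess`. It is a plain wrapper around the existing command-line tools, not the library-based driver described above. The function names are hypothetical, and the sketch assumes the second command should read the output of the first via a temporary intermediate file, which the original commands do not spell out.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List


def poster_commands(input_pdf: str, split_pdf: str, output_pdf: str) -> List[List[str]]:
    """Return the two commands from this thread as argv lists."""
    return [
        # Step 1: split each two-up page into a left and a right half.
        ["mutool", "poster", "-x", "2", "-y", "1", input_pdf, split_pdf],
        # Step 2: clean, deskew, and OCR the split pages.
        # "--layout single" is passed as ONE argument to --unpaper-args.
        [
            "ocrmypdf", "--force-ocr",
            "--unpaper-args", "--layout single",
            "--deskew", "--clean-final",
            split_pdf, output_pdf,
        ],
    ]


def main(argv: List[str]) -> int:
    """Run both steps; argv is [program, in.pdf, out.pdf]."""
    if len(argv) != 3:
        print("usage: ocrmypdf-poster in.pdf out.pdf", file=sys.stderr)
        return 2
    with tempfile.TemporaryDirectory() as tmp:
        # Assumption: the intermediate file lives in a temporary directory.
        split_pdf = str(Path(tmp) / "split.pdf")
        for cmd in poster_commands(argv[1], split_pdf, argv[2]):
            subprocess.run(cmd, check=True)  # raises if a step fails
    return 0
```

An entry point would call `sys.exit(main(sys.argv))`; both mutool and ocrmypdf need to be on the PATH for the commands to run.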