ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0
13.59k stars 994 forks source link

[Bug/Feature]: a way to disable Ghostscript requirement & broken plugin_manager option #1379

Open nikitar opened 4 weeks ago

nikitar commented 4 weeks ago

What were you trying to do?

I made a plugin that overrides rasterize_pdf_page and generate_pdfa, and it works great. However, when I try to remove ghostscript from the system, ocrmypdf tells me No such file or directory: 'gs' when validating hooks.

  File "/opt/assemble/atticus/current/jobs/image-to-pdf/venv/lib/python3.11/site-packages/ocrmypdf/subprocess/__init__.py", line 159, in get_version
    proc = run(
           ^^^^
  File "/opt/assemble/atticus/current/jobs/image-to-pdf/venv/lib/python3.11/site-packages/ocrmypdf/subprocess/__init__.py", line 63, in run
    proc = subprocess_run(args, env=env, check=check, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.11/subprocess.py", line 1955, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'gs'

Seeing that ghostscript.py is included into default plugins, I tried using plugin_manager option with builtins=False instead: plugin_manager=get_plugin_manager(plugins=[my_plugin_path], builtins=False). Now it fails on options = create_options( step, because plugin_manager object is not a number, string or path. Am I correct that plugin_manager option is no longer functional?

        cmdline.append(f"--{cmd_style_arg}")
        if isinstance(val, int | float):
            cmdline.append(str(val))
        elif isinstance(val, str):
            cmdline.append(val)
        elif isinstance(val, Path):
            cmdline.append(str(val))
        else:
            raise TypeError(f"{arg}: {val} ({type(val)})")
  File "/opt/assemble/atticus/current/jobs/image-to-pdf/venv/lib/python3.11/site-packages/ocrmypdf/api.py", line 368, in ocr
    options = create_options(
              ^^^^^^^^^^^^^^^
  File "/opt/assemble/atticus/current/jobs/image-to-pdf/venv/lib/python3.11/site-packages/ocrmypdf/api.py", line 200, in create_options
    cmdline, deferred = _kwargs_to_cmdline(
                        ^^^^^^^^^^^^^^^^^^^
  File "/opt/assemble/atticus/current/jobs/image-to-pdf/venv/lib/python3.11/site-packages/ocrmypdf/api.py", line 179, in _kwargs_to_cmdline
    raise TypeError(f"{arg}: {val} ({type(val)})")
TypeError: plugin_manager: <ocrmypdf._plugin_manager.OcrmypdfPluginManager object at 0xffffa083cd90> (<class 'ocrmypdf._plugin_manager.OcrmypdfPluginManager'>)

Sorry about the 2-in-1 issue, they're quite connected in this case.

Where are you installing/running from?

PyPI (pip, poetry, pipx, etc.)

OCRmyPDF version

15.4.3

What operating system are you working on?

Linux

Operating system details and version

Ubuntu 20.04

Simple sanity checks

Relevant log output

No response

nikitar commented 4 weeks ago

It seems that the simplest fix would be to add plugin_manager to the defer_kwargs list here, though I'm not sure that's exactly the purpose of defer_kwargs. I could make a pr, if that is indeed the solution.

nikitar commented 4 weeks ago

Or alternatively, a disable_plugins: Iterable[str] | None = None, option.

jbarlow83 commented 4 weeks ago

It should work without source changes if you do

plugin_manager = get_plugin_manager(plugins=['ocrmypdf.builtin_plugins.concurrency', ... all builtins except ocrmypdf.builtin_plugins.ghostscript..., 'yourcustomghostscriptreplacer'], builtins=False)

yourcustomghostscriptreplacer.py would need to @hookimpl every API that builtin_plugins/ghostscript.py does.

To convince yourself that this works, clone the source and rewrite builtins_plugins/ghostscript.py to use your PDF renderer instead of Ghostscript.

The reason fora this is that the builtin ghostscript plugin hooks check_options and tests for the existence of the gs binary. So you need to disable all default plugins to prevent this check_options hook from being installed, and install all ordinary plugins manually.

nikitar commented 3 weeks ago

How do you get around the TypeError: plugin_manager: <ocrmypdf._plugin_manager.OcrmypdfPluginManager object at 0xffff8a341fd0> (<class 'ocrmypdf._plugin_manager.OcrmypdfPluginManager'>) error? As I mentioned in the original post, it seems that plugin_manager option cannot be used at all right now, due to the check requiring most options to be a number, string or path.

nikitar commented 3 weeks ago

Also, am I correct that use_threads=False does not affect rasterize_pdf_page? (That's the impression I get from local tests, and from looking at the code, just wanted to confirm) And there's no way to easily re-use existing Executor logic for this?

Context: using pdfium for rendering, which is explicitly not thread-safe. Trying to see if there's a better approach than having a global pdfium_lock = threading.Lock().

jbarlow83 commented 3 weeks ago

use_threads affects which executor and which type of worker (thread or process). (For various reasons, it's never made sense to get rid of either type.) Then the worker calls rasterize_pdf_page. So you would need a threading lock (regardless of worker type; in the case of process the lock is just never contested).

If a plugin can't run under some configuration (including the setting of use_threads) it should hook check_options and raise an exception to say "plugin can't do that".

Long term I am considering converting ocrmypdf to rust, although I can't promise any kind of timeline, but moving to rust would mean being able to include libraries like pdfium with safe concurrency.

nikitar commented 3 weeks ago

Thank you. With regards to the plugin_manager option triggering a TypeError, should I make a pr to fix it?


On use_threads, I may be wrong, but page_context.plugin_manager.hook.rasterize_pdf_page is always called from the main process, no? E.g. if I print os.getpid() and multiprocessing.current_process().name inside the hook, I always get the same process number and MainProcess as process name, regardless of use_threads=False.

I thought the comment about "in the case of process the lock is just never contested" above suggested otherwise, though I may be wrong.

nikitar commented 3 weeks ago

It seems something overrides use_threads. I.e. use_threads=False when calling ocrmypdf.ocr becomes use_threads=True inside plugin's check_options. Stack trace also suggests it, since it begins in threading.py. (Doing print(''.join(traceback.format_stack())) inside rasterize_pdf_page)

I'm aware use_threads gets overridden in info.py, but those conditions don't seem to apply. (len(pages) is 10+, and available_cpu_count() is 10) I also tried specifying the jobs option, since it seems to be the thing that initialises max_workers in the link above, still no luck.

I'll try to run it in a debugger when I get a moment, just wanted to post an update.

jbarlow83 commented 3 weeks ago

The plugin manager error isn't an error. It's misuse of my admittedly underdocumented spec. You can just call ocrmypdf.ocr(plugins=['your plugin']). For what you're doing you probably don't need to construct a plugin manager - the main reason that's there is for testing (dependency injection). See tests/conftest.py check_ocrmypdf() for test usage.

It is true that in some cases, the directive to use_threads is overridden in certain cases (info.py), but that's a local decision.

nikitar commented 1 week ago

use_threads

Had a deeper look, use_threads=False is currently ignored when using Python api, ocrmypdf.ocr(... use_threads=False, ...). That's because _kwargs_to_cmdline method treats all boolean arguments as if false is the default:

        if isinstance(val, bool):
            if val:
                cmdline.append(f"--{cmd_style_arg}")
            continue

But for use_threads, true is the default:

    jobcontrol.add_argument(
        '--use-threads', action='store_true', default=True, help=argparse.SUPPRESS
    )

Am I correct that there is currently no way to set use_threads to false in the Python API? (As opposed to the command line API)


disabling the ghostscript check

The plugin manager error isn't an error. It's misuse of my admittedly underdocumented spec. You can just call ocrmypdf.ocr(plugins=['your plugin']).

In that case, am I correct that there is no way to remove the ghostscript plugin, and hence no way to remove the ghostscript requirement? Even if you're not using ghostscript for anything?

nikitar commented 1 week ago

(Happy to make a pr for both of those, if you're open to changing the logic)

jbarlow83 commented 1 week ago

You can use --no-use-threads to override that setting.

For ghostscript, compare how the test suite replaces ghostscript with test stubs. It is definitely replaceable.