Open nikitar opened 4 weeks ago
It seems that the simplest fix would be to add plugin_manager to the defer_kwargs list here, though I'm not sure that's exactly the purpose of defer_kwargs. I could make a PR, if that is indeed the solution.
Or alternatively, a disable_plugins: Iterable[str] | None = None, option.
It should work without source changes if you do
plugin_manager = get_plugin_manager(plugins=['ocrmypdf.builtin_plugins.concurrency', ... all builtins except ocrmypdf.builtin_plugins.ghostscript..., 'yourcustomghostscriptreplacer'], builtins=False)
yourcustomghostscriptreplacer.py would need to @hookimpl every API that builtin_plugins/ghostscript.py does.
To convince yourself that this works, clone the source and rewrite builtin_plugins/ghostscript.py to use your PDF renderer instead of Ghostscript.
The reason for this is that the builtin ghostscript plugin hooks check_options and tests for the existence of the gs binary. So you need to disable all default plugins to prevent this check_options hook from being installed, and install all ordinary plugins manually.
How do you get around the TypeError: plugin_manager: <ocrmypdf._plugin_manager.OcrmypdfPluginManager object at 0xffff8a341fd0> (<class 'ocrmypdf._plugin_manager.OcrmypdfPluginManager'>) error? As I mentioned in the original post, it seems that the plugin_manager option cannot be used at all right now, due to the check requiring most options to be a number, string or path.
Also, am I correct that use_threads=False does not affect rasterize_pdf_page? (That's the impression I get from local tests, and from looking at the code, just wanted to confirm) And there's no way to easily re-use existing Executor logic for this?
Context: using pdfium for rendering, which is explicitly not thread-safe. Trying to see if there's a better approach than having a global pdfium_lock = threading.Lock().
use_threads affects which executor is used and which type of worker it runs (thread or process). (For various reasons, it's never made sense to get rid of either type.) The worker then calls rasterize_pdf_page. So you would need a threading lock regardless of worker type; in the case of processes the lock is just never contested.
If a plugin can't run under some configuration (including the setting of use_threads) it should hook check_options and raise an exception to say "plugin can't do that".
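Roughly, such a check could look like the sketch below. It is plain Python with no OCRmyPDF imports: in a real plugin the function would be registered with @hookimpl, and the exception class here is a hypothetical placeholder rather than one of OCRmyPDF's own exception types. The options object is assumed to expose a use_threads attribute.

```python
from types import SimpleNamespace


class UnsupportedConfigurationError(Exception):
    """Hypothetical error type for 'plugin can't do that'."""


def check_options(options) -> None:
    """Refuse configurations this (hypothetical) plugin cannot support."""
    # Assumption: this plugin only works with process workers.
    if getattr(options, "use_threads", False):
        raise UnsupportedConfigurationError(
            "this renderer plugin cannot run with use_threads=True"
        )


# Usage: a namespace standing in for the parsed options object.
check_options(SimpleNamespace(use_threads=False))  # passes silently
```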
Long term, I am considering converting ocrmypdf to Rust. I can't promise any kind of timeline, but moving to Rust would mean being able to include libraries like pdfium with safe concurrency.
Thank you. With regard to the plugin_manager option triggering a TypeError, should I make a PR to fix it?
On use_threads, I may be wrong, but page_context.plugin_manager.hook.rasterize_pdf_page is always called from the main process, no? E.g. if I print os.getpid() and multiprocessing.current_process().name inside the hook, I always get the same process number and MainProcess as the process name, even with use_threads=False.
I thought the comment about "in the case of process the lock is just never contested" above suggested otherwise, though I may be wrong.
It seems something overrides use_threads. I.e. use_threads=False when calling ocrmypdf.ocr becomes use_threads=True inside the plugin's check_options. The stack trace also suggests it, since it begins in threading.py. (Doing print(''.join(traceback.format_stack())) inside rasterize_pdf_page.)
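For anyone wanting to repeat this check, the diagnostics can be collected in one helper, sketched below with the standard library only. The helper names are hypothetical; in a plugin, describe_execution_context() would be called from inside the rasterize_pdf_page hook.

```python
import multiprocessing
import os
import threading
import traceback


def describe_execution_context() -> dict:
    """Report which process and thread the caller is running in."""
    current = threading.current_thread()
    return {
        "pid": os.getpid(),
        "process_name": multiprocessing.current_process().name,
        "thread_name": current.name,
        "is_main_thread": current is threading.main_thread(),
        # Full call stack as text; for a worker thread it starts in threading.py.
        "stack": "".join(traceback.format_stack()),
    }


def run_in_thread(name: str = "worker-demo") -> dict:
    """Call the helper from a named thread to show what a thread worker reports."""
    result = {}
    thread = threading.Thread(
        target=lambda: result.update(describe_execution_context()), name=name
    )
    thread.start()
    thread.join()
    return result
```

An unchanged pid together with is_main_thread being False points at a thread worker; a different pid and a process name other than MainProcess would point at a process worker.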
I'm aware use_threads gets overridden in info.py, but those conditions don't seem to apply (len(pages) is 10+, and available_cpu_count() is 10). I also tried specifying the jobs option, since it seems to be the thing that initialises max_workers in the link above; still no luck.
I'll try to run it in a debugger when I get a moment, just wanted to post an update.
The plugin manager error isn't an error. It's misuse of my admittedly underdocumented spec. You can just call ocrmypdf.ocr(plugins=['your plugin']). For what you're doing you probably don't need to construct a plugin manager - the main reason that's there is for testing (dependency injection). See tests/conftest.py check_ocrmypdf() for test usage.
It is true that the directive to use_threads is overridden in certain cases (info.py), but that's a local decision.
Had a deeper look: use_threads=False is currently ignored when using the Python API, ocrmypdf.ocr(... use_threads=False, ...). That's because the _kwargs_to_cmdline method treats all boolean arguments as if false is the default:
    if isinstance(val, bool):
        if val:
            cmdline.append(f"--{cmd_style_arg}")
        continue
But for use_threads, true is the default:
    jobcontrol.add_argument(
        '--use-threads', action='store_true', default=True, help=argparse.SUPPRESS
    )
Am I correct that there is currently no way to set use_threads to false in the Python API? (As opposed to the command line API)
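The interaction can be reproduced in isolation. The sketch below is a simplified stand-in, not the actual OCRmyPDF code: kwargs_to_cmdline and build_parser are hypothetical names that mirror only the two snippets quoted above, and the jobs argument is included just to show a non-boolean value passing through.

```python
import argparse


def kwargs_to_cmdline(**kwargs) -> list:
    """Simplified stand-in for the keyword-to-argv conversion quoted above."""
    cmdline = []
    for arg, val in kwargs.items():
        if val is None:
            continue
        cmd_style_arg = arg.replace("_", "-")
        if isinstance(val, bool):
            if val:
                cmdline.append(f"--{cmd_style_arg}")
            continue  # a False value emits nothing at all
        cmdline.extend([f"--{cmd_style_arg}", str(val)])
    return cmdline


def build_parser() -> argparse.ArgumentParser:
    """Parser with a store_true flag whose default is already True."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--use-threads", action="store_true", default=True)
    parser.add_argument("--jobs", type=int)
    return parser


# use_threads=False produces no flag, so the parser falls back to default=True.
argv = kwargs_to_cmdline(use_threads=False, jobs=4)
args = build_parser().parse_args(argv)
print(argv, args.use_threads)
```

Because a store_true flag can only ever set the value to True, a default of True leaves nothing for an omitted flag to change.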
> The plugin manager error isn't an error. It's misuse of my admittedly underdocumented spec. You can just call ocrmypdf.ocr(plugins=['your plugin']).
In that case, am I correct that there is no way to remove the ghostscript plugin, and hence no way to remove the ghostscript requirement? Even if you're not using ghostscript for anything?
(Happy to make a PR for both of those, if you're open to changing the logic)
You can use --no-use-threads to override that setting.
For ghostscript, compare how the test suite replaces ghostscript with test stubs. It is definitely replaceable.
What were you trying to do?
I made a plugin that overrides rasterize_pdf_page and generate_pdfa, and it works great. However, when I try to remove ghostscript from the system, ocrmypdf tells me
No such file or directory: 'gs'
when validating hooks.
Seeing that ghostscript.py is included in the default plugins, I tried using the plugin_manager option with builtins=False instead: plugin_manager=get_plugin_manager(plugins=[my_plugin_path], builtins=False). Now it fails on the options = create_options( step, because the plugin_manager object is not a number, string or path. Am I correct that the plugin_manager option is no longer functional?
Sorry about the 2-in-1 issue, they're quite connected in this case.
Where are you installing/running from?
PyPI (pip, poetry, pipx, etc.)
OCRmyPDF version
15.4.3
What operating system are you working on?
Linux
Operating system details and version
Ubuntu 20.04