ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

[Running in Spyder IDE] OCR section stalls out, raises OSError: [Errno 9], or output pdf isn't rotated. #800

Closed SpencerRP closed 2 years ago

SpencerRP commented 3 years ago

Describe the bug When running ocrmypdf on a variety of files, I've hit the following errors in order of prevalence:

This has happened across multiple computers, even ones with fresh installations of anaconda, tesseract, ghostscript, and ocrmypdf. I've always run it in Spyder, using ocrmypdf as a module.

To Reproduce

Expected behavior The OCR portion of ocrmypdf will finish without error and generate files which are properly rotated.

System (please complete the following information):

Installation How did you install OCRmyPDF? Did you install it from your operating system's package manager, or using pip?

I've installed it purely through pip. Initially I used pip install git+github.com/jbarlow83/ocrmypdf, but after these errors started I switched to using pip install ocrmypdf.

Additional context

The script used is:


import os  # needed for os.path.join below
import ocrmypdf
from wand.image import Image as Img
from pathlib import Path
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

path = r"C:\Users\username\Documents"
infile = "ExampleFile.pdf"
outfile = "ExampleFile_ocr.pdf"
lang = "eng"

ocrmypdf.ocr(input_file = os.path.join(path, infile), output_file = os.path.join(path, outfile), language = lang, output_type='pdf', rotate_pages=True)

(Original code also uses wand, PIL, and pytesseract directly, to re-ocr the output of ocrmypdf as jpg blobs and extract text for translation if lang != eng)

jbarlow83 commented 3 years ago

Does spyder have anything to do with this, or does the issue also occur when you run from a standard Python process or ocrmypdf from the command prompt?

Could you check if you can reproduce this issue with any of the test files in tests/resources, assuming you cannot share a reproducing file?

It will be easier if I can look at the actual file. If you are concerned about sharing the file publicly, you can encrypt it with my public key as described here: https://github.com/jbarlow83/OCRmyPDF/wiki

Do your PDFs contain very large images (hundreds of megapixels, page dimensions about 20", or high DPI above 600x600, anything of that sort of thing)?

SpencerRP commented 3 years ago

It looks like it has to do with spyder - it works on both my own files and your test files (listed below) when I use either the API in the sample code provided inside an interactive python process or a direct command in a standard python process on both anaconda prompt and command prompt. I've only hit issues when using the API in a script run from spyder.

I tested it in spyder with the skew, cardinal, and c02-22 test files. Skew worked cleanly, cardinal stalled while scanning (new to me), c02-22 stalled during OCR on the first page (issue I'm currently seeing with my own files).

My files are generally just scans of receipts or forms taken by a phone or personal printer/scanner, so I think they mostly aren't in the hundreds of megapixels or high DPI. However, I don't have any professional software so I haven't been able to check. If you need me to, I can try converting one to a jpg and see what the DPI for that is. I checked a few of the files with adobe reader and they're around 10-13".

jbarlow83 commented 3 years ago

It seems that Spyder is a nonstandard Python environment where multiprocessing may not work. https://stackoverflow.com/questions/60366361/pythons-multiprocessing-doesnt-work-in-spyder-ide

You could try --use-threads, or ocrmypdf --plugin ocrmypdf.extra_plugins.semfree. Both options are less performant.
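For API users, the counterpart of `--use-threads` appears to be the `use_threads` keyword of `ocrmypdf.ocr()`. The following is a minimal sketch under that assumption; `build_ocr_kwargs` and `run_ocr` are hypothetical helper names, and the file names are placeholders.

```python
from pathlib import Path


def build_ocr_kwargs(input_file, output_file, language="eng", use_threads=True):
    """Collect keyword arguments for ocrmypdf.ocr() (hypothetical helper)."""
    return {
        "input_file": Path(input_file),
        "output_file": Path(output_file),
        "language": language,
        "rotate_pages": True,
        # Assumption: use_threads=True makes ocrmypdf run its workers as
        # threads instead of multiprocessing child processes.
        "use_threads": use_threads,
    }


def run_ocr(**kwargs):
    """Call ocrmypdf with the collected arguments."""
    import ocrmypdf  # imported lazily so the helper above works without it

    return ocrmypdf.ocr(**kwargs)


if __name__ == "__main__":
    kwargs = build_ocr_kwargs("ExampleFile.pdf", "ExampleFile_ocr.pdf")
    print(kwargs)
    # run_ocr(**kwargs)  # uncomment once the paths point at real files
```

Keeping the argument construction separate from the call makes it easy to flip `use_threads` depending on the environment without touching the rest of the script.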

SpencerRP commented 3 years ago

Do you mean performant in the sense that it will create worse output or in the sense that it will be slower? The latter is a-okay to me.

Please feel free to close this. Consider going through your open issues, as I saw some I think could be closed. If you need help for something feel free to hmu. I’d love to learn a bit more about this subject.

Thanks for your help and for your great program!!


jbarlow83 commented 3 years ago

By performant I mean possibly not as fast.

If you can confirm this works and if you're aware of a reliable way to test if we are running in a Spyder IDE I can add a shim for that. We do this automatically for other weird environments.

If you think some other open issues can be closed please comment on those.

SpencerRP commented 3 years ago

Use threads works. The plugin method gets the same bad file descriptor error. Here's the line in case I'm using it wrong: ocrmypdf.ocr(input_file = pdf, output_file = filename, language = '+'.join(list(set([tess_lang, 'eng']))), plugin="ocrmypdf.extra_plugins.semfree", rotate_pages=True, deskew=True, force_ocr = True)

For detecting the spyder IDE there's this thread: https://stackoverflow.com/questions/17728395/detect-where-python-code-is-running-e-g-in-spyder-interpreter-vs-idle-vs-cm. To summarize, look for "SPYDER" in any variable in os.environ and hope that the user didn't create something with it.

That was initially opened in 2013, so I've created a new question in case there are better answers. https://stackoverflow.com/questions/68354334/reliably-detect-spyder-ide
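The environment-variable heuristic summarized above could look roughly like this. It is only a sketch: `looks_like_spyder` is a hypothetical name, and the check gives a false positive if the user happens to define an unrelated variable containing "SPYDER".

```python
import os


def looks_like_spyder(environ=None):
    """Heuristic: is any environment variable name marked with SPYDER?

    `environ` defaults to os.environ; passing a dict makes the check testable.
    """
    env = os.environ if environ is None else environ
    return any("SPYDER" in name.upper() for name in env)


if __name__ == "__main__":
    # A caller could use the result to fall back to threads.
    print("use threads:", looks_like_spyder())
```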

athompson673 commented 3 years ago

Smells to me like stdout redirection gone wrong... Spyder uses IPython which is similar to Jupyter where a "kernel" process evaluates python inputs, and forwards results back to the GUI. To do this Spyder redirects STDOUT which may be problematic when using "spawn" as a new instance of stdout might get opened rather than copying the redirected file handle. A full traceback would be helpful here to find which file handle is being the problem child.
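One way to probe this hypothesis is to check whether the standard streams in the current environment are backed by a real file descriptor. The sketch below uses only the standard library; `describe_stream` is a hypothetical helper, and it does not prove which handle ocrmypdf trips over.

```python
import io
import os
import sys


def describe_stream(stream):
    """Report whether a stream exposes a usable OS-level file descriptor."""
    info = {"type": type(stream).__name__, "fd": None, "usable": False}
    try:
        fd = stream.fileno()
    except (AttributeError, io.UnsupportedOperation, OSError, ValueError):
        # Replaced or in-memory streams often have no descriptor at all.
        return info
    info["fd"] = fd
    try:
        os.fstat(fd)  # raises OSError (EBADF) if the descriptor is invalid
    except OSError:
        return info
    info["usable"] = True
    return info


if __name__ == "__main__":
    for name in ("stdin", "stdout", "stderr"):
        print(name, describe_stream(getattr(sys, name)))
```

Running this once in a plain command prompt and once inside the IDE console gives a quick side-by-side comparison of the two environments.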

athompson673 commented 3 years ago

If you'd rather just monkeypatch...

from here

def is_IPython():
    try:
        get_ipython()
    except NameError:
        return False
    return True

jbarlow83 commented 3 years ago

ocrmypdf is very strict about stdout. We write nothing to it unless specifically requested by the user. Any amount of chatter on stdout is a test failure.

To add a few more details in case someone happens to be knowledgeable, ocrmypdf child processes discard all of their log handlers and set up a queue handler (semaphore based, multiple producer single consumer IPC queue). A separate thread in the main process gathers all child process messages and handles them, in the default case forwarding to sys.stderr.
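The general pattern described here (workers push log records onto a queue, one consumer thread handles them) is available in the standard library as `QueueHandler` and `QueueListener`. The sketch below is not ocrmypdf's actual code: it uses an in-process `queue.Queue` for brevity where a multiprocessing queue would sit between real child processes, and `ListHandler` stands in for the default handler that writes to `sys.stderr`.

```python
import logging
import logging.handlers
import queue


class ListHandler(logging.Handler):
    """Stand-in sink; a StreamHandler(sys.stderr) would be the usual choice."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def make_worker_logger(log_queue, name="worker"):
    """Drop existing handlers and log only through the queue."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.DEBUG)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    return logger


def collect_demo():
    log_queue = queue.Queue()
    sink = ListHandler()
    # The listener runs a background thread: the single consumer.
    listener = logging.handlers.QueueListener(log_queue, sink)
    listener.start()
    make_worker_logger(log_queue).info("page 1 done")
    listener.stop()  # processes everything queued so far, then joins
    return sink.messages


if __name__ == "__main__":
    print(collect_demo())
```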

There's also the mess in src/leptonica.py - depending on which leptonica is installed, we might have to redirect and un-redirect stderr before each leptonica.

I can't reproduce this on IPython + Windows 10 (VM with 4 cores assigned), so I'm treating it as a Spyder-specific issue at the moment. (I haven't tried to reproduce on Spyder.) I'll probably add a warning that we don't play nice with Spyder.

athompson673 commented 3 years ago

I can't reproduce this either in spyder directly... Of note, I couldn't run without the if __name__ == "__main__": perhaps there's something I missed in the install process?

jbarlow83 commented 3 years ago

@SpencerRP Can you check if an ifmain guard will fix the issue?

athompson673 commented 3 years ago

BTW: A missing if-main guard results in the very typical "new process before finished bootstrapping" error (RuntimeError). I couldn't get the OSError: Bad file descriptor.
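For reference, a guarded script might be laid out as below. This is a sketch rather than a required structure: `main` and the usage string are illustrative, and the two positional arguments to `ocrmypdf.ocr()` are assumed to be the input and output paths.

```python
import sys


def main(argv):
    """Entry point; returns a process-style exit code."""
    if len(argv) != 3:
        print("usage: script.py INPUT.pdf OUTPUT.pdf", file=sys.stderr)
        return 2
    import ocrmypdf  # deferred so importing this module has no side effects

    ocrmypdf.ocr(argv[1], argv[2], rotate_pages=True)
    return 0


# With the "spawn" start method (the default on Windows), each worker
# process re-imports this file. The guard keeps those workers from
# running main() again, which is what triggers the bootstrapping error.
if __name__ == "__main__":
    main(sys.argv)
```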

SpencerRP commented 3 years ago

Currently I'm hitting the stalling out issue once again rather than the OSError, and am uncertain of how to provide a traceback of that. The ifmain guard didn't resolve this issue in my stub. In case it matters, the original script where I use ocrmypdf calls it inside a function, which is called by main, which is called inside a guard.

@athompson673 Sorry for a scrubbish question, but will these provide the full traceback you're looking for?

print(traceback.format_exc(e.__traceback__))

print(sys.exc_info()[2])

If there's any environment information I haven't provided that may help, please let me know.
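On the question of capturing a traceback as text: `traceback.format_exc()` takes no traceback argument (its first parameter is an optional frame limit) and formats the exception currently being handled, while printing `sys.exc_info()[2]` only shows the repr of the traceback object. A small sketch, where `risky` is a made-up function that raises the same kind of error:

```python
import traceback


def risky():
    """Stand-in for the failing call; raises the error seen in this issue."""
    raise OSError(9, "Bad file descriptor")


def capture_traceback(func):
    """Run func and return the full formatted traceback, or None on success."""
    try:
        func()
    except Exception:
        # Must be called inside the except block to see the active exception.
        return traceback.format_exc()
    return None


if __name__ == "__main__":
    print(capture_traceback(risky))
```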

athompson673 commented 3 years ago

@SpencerRP Just don't catch any exceptions via try: except: it should print full trace by default... if it's just stalled, kill it with ctrl-c and see where it was waiting... If you can come up with a pdf and a minimal script that causes this problem that you can share publicly that could help as well... I have not been able to produce any errors just working from the test pdf's (aside from those that have encoding errors intentionally)

for example if I omit the if __name__ == "__main__": clause, the traceback I get is:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "c:\users\wdagutilityaccount\appdata\local\programs\python\python39\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "c:\users\wdagutilityaccount\appdata\local\programs\python\python39\lib\multiprocessing\spawn.py", line 125, in _main
    prepare(preparation_data)
  File "c:\users\wdagutilityaccount\appdata\local\programs\python\python39\lib\multiprocessing\spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "c:\users\wdagutilityaccount\appdata\local\programs\python\python39\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "c:\users\wdagutilityaccount\appdata\local\programs\python\python39\lib\runpy.py", line 268, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "c:\users\wdagutilityaccount\appdata\local\programs\python\python39\lib\runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "c:\users\wdagutilityaccount\appdata\local\programs\python\python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\WDAGUtilityAccount\.spyder-py3\temp.py", line 41, in <module>
    ocrmypdf.ocr(input_file = resources / file,
  File "c:\users\wdagutilityaccount\appdata\local\programs\python\python39\lib\site-packages\ocrmypdf\api.py", line 340, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
  File "c:\users\wdagutilityaccount\appdata\local\programs\python\python39\lib\site-packages\ocrmypdf\_sync.py", line 374, in run_pipeline
    exec_concurrent(context, executor)
  File "c:\users\wdagutilityaccount\appdata\local\programs\python\python39\lib\site-packages\ocrmypdf\_sync.py", line 271, in exec_concurrent
    executor(
  File "c:\users\wdagutilityaccount\appdata\local\programs\python\python39\lib\site-packages\ocrmypdf\_concurrent.py", line 82, in __call__
    self._execute(
  File "c:\users\wdagutilityaccount\appdata\local\programs\python\python39\lib\site-packages\ocrmypdf\builtin_plugins\concurrency.py", line 125, in _execute
    pool = pool_class(
  File "c:\users\wdagutilityaccount\appdata\local\programs\python\python39\lib\multiprocessing\context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "c:\users\wdagutilityaccount\appdata\local\programs\python\python39\lib\multiprocessing\pool.py", line 212, in __init__
    self._repopulate_pool()
  File "c:\users\wdagutilityaccount\appdata\local\programs\python\python39\lib\multiprocessing\pool.py", line 303, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "c:\users\wdagutilityaccount\appdata\local\programs\python\python39\lib\multiprocessing\pool.py", line 326, in _repopulate_pool_static
    w.start()
  File "c:\users\wdagutilityaccount\appdata\local\programs\python\python39\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "c:\users\wdagutilityaccount\appdata\local\programs\python\python39\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "c:\users\wdagutilityaccount\appdata\local\programs\python\python39\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "c:\users\wdagutilityaccount\appdata\local\programs\python\python39\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "c:\users\wdagutilityaccount\appdata\local\programs\python\python39\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

SpencerRP commented 3 years ago

@athompson673 So I got it to give me the bad file error, finally. Here's the traceback:

OCR:   0%|          | 0.0/1.0 [00:11<?, ?page/s]
Traceback (most recent call last):

  File "<ipython-input-1-5da87ef33d0a>", line 1, in <module>
    runfile('[redacted path]\Python_Scripts/ocrmypdf_stub.py', wdir='[redacted path]\Python_Scripts')

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "[redacted]\Python_Scripts/ocrmypdf_stub.py", line 40, in <module>
    ocrmypdf.ocr(input_file = pdf, output_file = filename, language = '+'.join(list(set([tess_lang, 'eng']))), use_threads=False, rotate_pages=True, deskew=True, force_ocr = True)

  File "C:\Users\[user]\AppData\Roaming\Python\Python36\site-packages\ocrmypdf\api.py", line 340, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)

  File "C:\Users\[user]\AppData\Roaming\Python\Python36\site-packages\ocrmypdf\_sync.py", line 374, in run_pipeline
    exec_concurrent(context, executor)

  File "C:\Users\[user]\AppData\Roaming\Python\Python36\site-packages\ocrmypdf\_sync.py", line 284, in exec_concurrent
    task_finished=update_page,

  File "C:\Users\[user]\AppData\Roaming\Python\Python36\site-packages\ocrmypdf\_concurrent.py", line 89, in __call__
    task_finished=task_finished,

  File "C:\Users\[user]\AppData\Roaming\Python\Python36\site-packages\ocrmypdf\builtin_plugins\concurrency.py", line 132, in _execute
    for result in results:

  File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 735, in next
    raise value

OSError: [Errno 9] Bad file descriptor

I used two different computers:

On A it was stalling silently both with and without the guard, but now with the guard it gives the bad file descriptor error. Dropping the guard still leads to stalling. Ctrl+C/kernel interrupt doesn't impact that - it just keeps running.

On B with the guard it works completely fine; without the guard it gives a runtime error and then stalls (Ctrl+C/clicking kernel interrupt does nothing). I believe I didn't get the bad file descriptor on B, only the stalling while not using the guard.
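When a stall ignores Ctrl+C, the standard library's `faulthandler` can dump the stack of every thread after a timeout, which may show where the main process is waiting. The sketch below is a suggestion, not something tried in this thread: the helper names are hypothetical, the dump goes to a real file because console streams in an IDE may lack a file descriptor, and it covers only the process that arms it, not worker processes.

```python
import faulthandler


def arm_stall_dump(path, timeout_s):
    """Write all thread stacks to `path` if still running after timeout_s."""
    log = open(path, "w")
    faulthandler.dump_traceback_later(timeout_s, file=log)
    return log


def disarm_stall_dump(log):
    """Cancel the pending dump and close the log file."""
    faulthandler.cancel_dump_traceback_later()
    log.close()


# Typical use around a call that may hang:
#     log = arm_stall_dump("stall.log", 600)
#     try:
#         ...  # the ocrmypdf.ocr(...) call
#     finally:
#         disarm_stall_dump(log)
```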

So this is potentially a version issue? I can upgrade computer A and see if it still gives an error, but I'll need to talk to my coworker about it.

Here's the code and the PDFs from the test data that replicate the above. I don't think the settings for the ocrmypdf call matter, but they're what I've been using.


import os, ocrmypdf

tess_lang = "eng"
path = r"Path\To\File"  # raw string so the backslashes are not treated as escapes
file = "skew.pdf" #works
file = "cardinal.pdf" #breaks at scanning contents section/it's been 20 minutes with no progress past first page
file = "c02-22.pdf" #Breaks at OCR section on first page - logs say 0.5 and then it stalls for 10+ minutes. Sometimes breaks by saying [Errno9] Bad File Descriptor instead.
pdf = os.path.join(path, file)

if __name__ == "__main__":
    ocrmypdf.ocr(input_file = pdf, output_file = pdf, language = '+'.join(list(set([tess_lang, 'eng']))), rotate_pages=True, deskew=True, force_ocr = True)

athompson673 commented 3 years ago

I downgraded my test environment to Python 3.6.5, and started getting some problems, though not exactly OSError... the first time through I got a memory error, then a DLL load error (probably related to the sandbox only having 4 GB by default). The second time I closed some things, and it ran fine ¯_(ツ)_/¯. I'm wondering now if we've run into a bit of dependency hell, and updating everything might be the best option. Unfortunately nothing jumps out at me from the traceback that could indicate what's going on...

jbarlow83 commented 3 years ago

Python 3.6 has some issues with multiprocessing that were resolved in later versions. I'd say there's no point in chasing the issue in 3.6, but if it continues to be reproducible on 3.8 or 3.9 then there is some concern.

SpencerRP commented 3 years ago

I agree, it doesn't seem worth chasing the issue. I'll look into updating my systems and reopen this if I hit issues once more. Does the shim still seem worthwhile?