jwilk-archive / ocrodjvu

OCR for DjVu
GNU General Public License v2.0

Multiple jobs do not work with Tesseract 4 #31

Open ashipunov opened 5 years ago

ashipunov commented 5 years ago

Today I installed Tesseract 4 from an Ubuntu PPA (ppa:alex-p/tesseract-ocr, 4.0.0+git3515-9bcfa90c-1ppa1~xenial1). Tesseract itself works normally, and ocrodjvu also works fine with the default "-j 1". However, when I specified "-j 4", ocrodjvu hung, and when I interrupted it, I got the following:

$ time ocrodjvu --in-place -l rus+lat -j 4 nikolaev1970_diat_posjet.djvu 
Processing 'nikolaev1970_diat_posjet.djvu':
- Page #1
- Page #2
- Page #3
- Page #4
- Page #5
- Page #6
^Ctesseract: Tesseract Open Source OCR Engine v4.0.0-288-g9bcf with Leptonica
tesseract: Page 1
tesseract: Detected 105 diacritics
tesseract: Tesseract Open Source OCR Engine v4.0.0-288-g9bcf with Leptonica
tesseract: Page 1
tesseract: Tesseract Open Source OCR Engine v4.0.0-288-g9bcf with Leptonica
tesseract: Page 1
Exception while processing page 3:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 434, in page_thread
    result = self.process_page(page)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 407, in process_page
    result = self._engine.recognize(pfile, language=self._options.language, details=self._options.details, uax29=self._options.uax29)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 296, in recognize
    return f(image, language, details=details, uax29=uax29)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 262, in recognize_hocr
    _wait_for_worker(worker)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 69, in _wait_for_worker
    worker.wait()
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/ipc.py", line 121, in wait
    raise CalledProcessInterrupted(-return_code, self.__command)
CalledProcessInterrupted: Command 'tesseract' was interrupted by signal SIGINT
Exception while processing page 4:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 434, in page_thread
    result = self.process_page(page)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 407, in process_page
    result = self._engine.recognize(pfile, language=self._options.language, details=self._options.details, uax29=self._options.uax29)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 296, in recognize
    return f(image, language, details=details, uax29=uax29)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 262, in recognize_hocr
    _wait_for_worker(worker)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 69, in _wait_for_worker
    worker.wait()
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/ipc.py", line 121, in wait
    raise CalledProcessInterrupted(-return_code, self.__command)
CalledProcessInterrupted: Command 'tesseract' was interrupted by signal SIGINT
tesseract: Tesseract Open Source OCR Engine v4.0.0-288-g9bcf with Leptonica
tesseract: Page 1
Interrupted by user.
Exception while processing page 5:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 434, in page_thread
    result = self.process_page(page)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 407, in process_page
    result = self._engine.recognize(pfile, language=self._options.language, details=self._options.details, uax29=self._options.uax29)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 296, in recognize
    return f(image, language, details=details, uax29=uax29)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 262, in recognize_hocr
    _wait_for_worker(worker)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 69, in _wait_for_worker
    worker.wait()
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/ipc.py", line 121, in wait
    raise CalledProcessInterrupted(-return_code, self.__command)
CalledProcessInterrupted: Command 'tesseract' was interrupted by signal SIGINT
Exception while processing page 6:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 434, in page_thread
    result = self.process_page(page)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 407, in process_page
    result = self._engine.recognize(pfile, language=self._options.language, details=self._options.details, uax29=self._options.uax29)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 296, in recognize
    return f(image, language, details=details, uax29=uax29)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 262, in recognize_hocr
    _wait_for_worker(worker)
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 69, in _wait_for_worker
    worker.wait()
  File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/ipc.py", line 121, in wait
    raise CalledProcessInterrupted(-return_code, self.__command)
CalledProcessInterrupted: Command 'tesseract' was interrupted by signal SIGINT
Intermediate files were left in the '/tmp/ocrodjvu.3_ZmXE' directory.

real    30m20.909s
user    118m6.372s
sys 0m10.420s

I know that there are issues with multi-threading so I used recommendations from

https://github.com/tesseract-ocr/tesseract/issues/898

and from

https://appliedmachinelearning.blog/2018/06/30/performing-ocr-by-running-parallel-instances-of-tesseract-4-0-python/

to set the environment as 'OMP_THREAD_LIMIT=1 tesseract'. However, all my attempts failed, namely (a) renaming the executable and replacing it with a wrapper script, (b) writing a script that contains the alias, and finally (c) changing your code to pass this environment variable.

My system info output:

$ inxi
CPU~Dual core Intel Core i7-2620M (-HT-MCP-) speed/max~799/3400 MHz Kernel~4.4.0-141-generic x86_64 Up~4:18 Mem~1518.7/7865.9MB HDD~2000.4GB(30.6% used) Procs~197 Client~Shell inxi~2.2.35 

Ocrodjvu version:

$ ocrodjvu --version
ocrodjvu 0.10.4
+ Python 2.7.12
+ subprocess32
+ python-djvulibre 0.7
+ lxml 3.5.0
+ html5lib-python 0.999
+ PyICU 1.9.2
  + ICU 55.1
    + Unicode 7.0

In the end, I reverted everything to Tesseract 3, and now it works. This means, for example, that I cannot OCR books in Armenian and Quechua, as these languages are for some reason not available in Tesseract 3.

Please help.

jwilk commented 5 years ago
$ time ocrodjvu --in-place -l rus+lat -j 4 nikolaev1970_diat_posjet.djvu 

Is this reproducible for you? That is, when you run this command again, does it hang too?

Unfortunately, I wasn't able to reproduce the bug here:

$ time ocrodjvu --in-place -l rus+lat -j 4 nikolaev1970_diat_posjet.djvu
Processing 'nikolaev1970_diat_posjet.djvu':
- Page #1
- Page #2
- Page #3
...
- Page #12

real    1m57.306s
user    2m55.692s
sys     2m29.692s

This was on Ubuntu 16.04 (xenial), same Tesseract and ocrodjvu versions as yours, and higher-end hardware (3 cores of Intel Xeon Gold 6140).

https://github.com/tesseract-ocr/tesseract/issues/898

This is weird. Unless something else is going on, excessive usage of threads shouldn't make things "infinitely slow".

But setting OMP_THREAD_LIMIT is a good idea anyway; I'll try to make ocrodjvu set this automatically for the next release.

In the meantime, you can set this manually:

$ time OMP_THREAD_LIMIT=1 ocrodjvu --in-place -l rus+lat -j 4 nikolaev1970_diat_posjet.djvu
Processing 'nikolaev1970_diat_posjet.djvu':
- Page #1
- Page #2
- Page #4
...
- Page #12

real    0m37.206s
user    1m45.108s
sys     0m2.484s
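
The manual workaround above amounts to putting OMP_THREAD_LIMIT=1 into the environment that every tesseract child process inherits. Here is a minimal Python 3 sketch of doing the same from the parent process. The helper names `env_with_thread_limit` and `run_limited` are hypothetical, and this is not the code ocrodjvu itself uses.

```python
import os
import subprocess
import sys


def env_with_thread_limit(limit=1, base=None):
    """Return a copy of the environment with OMP_THREAD_LIMIT set."""
    env = dict(os.environ if base is None else base)
    env["OMP_THREAD_LIMIT"] = str(limit)
    return env


def run_limited(argv, limit=1):
    """Run argv with the OpenMP thread limit applied; return its stdout."""
    proc = subprocess.run(
        argv,
        env=env_with_thread_limit(limit),
        stdout=subprocess.PIPE,
        check=True,
    )
    return proc.stdout


if __name__ == "__main__":
    # Stand-in child: a Python one-liner that echoes the variable back.
    # In real use, argv would be a tesseract command line instead.
    child = [sys.executable, "-c",
             "import os; print(os.environ['OMP_THREAD_LIMIT'])"]
    print(run_limited(child).decode().strip())
```

Because the variable is set on a copy of the environment, only the child sees it and the parent's own environment is unchanged.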
ashipunov commented 5 years ago

Dear Jakub,

Thank you very much. I will try to install Tesseract 4 again and then report my results.

By the way (this is unrelated), I did try to find instructions on how to install ocrodjvu under Windows (I was asked to help) but did not find any. Is it possible to install it there?

With best wishes,

Alexey Shipunov

ashipunov commented 5 years ago

Sorry, here I am again with the same issue. First, without OMP_THREAD_LIMIT=1 the situation is the same:

$ inxi
CPU~Dual core Intel Core i7-2620M (-HT-MCP-) speed/max~804/3400 MHz Kernel~4.4.0-141-generic x86_64
$ ocrodjvu --version
ocrodjvu 0.10.4
+ Python 2.7.12
+ subprocess32
+ python-djvulibre 0.7
+ lxml 3.5.0
+ html5lib-python 0.999
+ PyICU 1.9.2
  + ICU 55.1
    + Unicode 7.0
$ tesseract --version
tesseract 4.0.0-297-gec8f
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
$ # now:
$ time ocrodjvu --in-place -l rus+lat -j 4 nikolaev1970_diat_posjet.djvu
Processing 'nikolaev1970_diat_posjet.djvu':
- Page #1
- Page #2
- Page #3
- Page #4
- Page #5
- Page #6
^C
...
CalledProcessInterrupted: Command 'tesseract' was interrupted by signal SIGINT
Interrupted by user.
Intermediate files were left in the '/tmp/ocrodjvu.k9gIME' directory.
real    14m34.142s
user    58m3.444s
sys 0m4.336s

However:

$ time OMP_THREAD_LIMIT=1 ocrodjvu --in-place -l rus+lat -j 4 nikolaev1970_diat_posjet.djvu
Processing 'nikolaev1970_diat_posjet.djvu':
- Page #1
- Page #3
...
- Page #12
real    1m29.354s
user    5m24.756s
sys 0m3.328s

Finally, everything works with "-j 1" as it should (but, to my surprise, it is only two times slower than with "-j 4").

Wild idea: does it reflect the difference between Intel Xeon and Intel Core i7? If so, I should try a different machine.
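
One way to read the `time` figures above is to divide CPU time (user + sys) by wall-clock time, which gives the average number of hardware threads kept busy. The sketch below does that arithmetic for the two runs in this comment. The helper names `parse_time` and `busy_cores` are hypothetical, and the parser handles only the `NmS.SSSs` format shown in this thread.

```python
import re

_TIME_RE = re.compile(r"^(?:(\d+)m)?(\d+(?:\.\d+)?)s$")


def parse_time(text):
    """Convert a bash `time` value such as '14m34.142s' to seconds."""
    match = _TIME_RE.match(text.strip())
    if match is None:
        raise ValueError("unrecognized time value: %r" % text)
    minutes = int(match.group(1) or 0)
    return 60 * minutes + float(match.group(2))


def busy_cores(real, user, sys_):
    """Average number of hardware threads kept busy: (user + sys) / real."""
    return (parse_time(user) + parse_time(sys_)) / parse_time(real)


if __name__ == "__main__":
    # Figures copied from the two runs reported above.
    print("no limit: %.2f" % busy_cores("14m34.142s", "58m3.444s", "0m4.336s"))
    print("limit=1:  %.2f" % busy_cores("1m29.354s", "5m24.756s", "0m3.328s"))
```

Both ratios come out close to four, so the machine is saturated in both cases. The runs differ in total CPU time: about 58 minutes without finishing, versus about 5.5 minutes for the complete document.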

ashipunov commented 5 years ago

So here is another machine (sorry, I do not have Xeons, but this is a quad-core i5 with eight hardware threads):

$  inxi
CPU~Quad core Intel Core i5-8250U (-HT-MCP-) speed/max~938/3400 MHz Kernel~4.15.0-43-generic x86_64
$ time ocrodjvu --in-place -l rus+lat -j 8 nikolaev1970_diat_posjet.djvu 
Processing 'nikolaev1970_diat_posjet.djvu':
- Page #1
- Page #2
- Page #3
...
- Page #10
^C
CalledProcessInterrupted: Command 'tesseract' was interrupted by signal SIGINT
Interrupted by user.
Intermediate files were left in the '/tmp/ocrodjvu.0DU1iF' directory.
real    14m43.246s
user    117m31.202s
sys 0m1.366s
$ # but:
$ time OMP_THREAD_LIMIT=1 ocrodjvu --in-place -l rus+lat -j 8 nikolaev1970_diat_posjet.djvu
Processing 'nikolaev1970_diat_posjet.djvu':
- Page #1
- Page #2
- Page #3 
...
- Page #12
real    0m32.009s
user    3m30.699s
sys 0m3.298s

This is really weird! I believe that this is a bug, probably associated with Intel i* processors.

jwilk commented 5 years ago

Color me baffled. :-/

I've hacked up a script to dump some information about the hanging processes: examine-hangs. Hopefully it'll shed some light on what's going on, but I'm not overly optimistic…

I'd like you to do the following:

  1. Disable ptrace restrictions that would prevent GDB from working:

    # sysctl kernel.yama.ptrace_scope=0
  2. Install GDB and a bunch of debug packages:

    # apt-get install gdb djvulibre-dbg libc6-dbg libgcc1-dbg libgomp1-dbg libstdc++6-5-dbg python-djvu-dbg python-lxml-dbg python2.7-dbg
  3. Run ocrodjvu with -j 4 (without OMP_THREAD_LIMIT) until it hangs.

  4. Run examine-hangs. (It's going to produce a copious amount of output on stdout, so it's best to redirect it to a file.)

Send me the file with the output by email, or zip it and attach here.
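
For readers who want to collect similar data by hand, the sketch below shows one way to ask GDB for the backtraces of all threads in a running process without an interactive session. It is not the examine-hangs script, and the function names are hypothetical. It assumes `gdb` is on PATH and that ptrace is permitted as in step 1. The timeout is there because attaching to a busy process can itself stall.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a non-interactive GDB command that dumps all thread backtraces."""
    return ["gdb", "-batch", "-p", str(int(pid)), "-ex", "thread apply all bt"]


def dump_backtraces(pid, timeout=60):
    """Run GDB against pid; return its output, or None if GDB itself stalls."""
    try:
        proc = subprocess.run(
            gdb_backtrace_command(pid),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising, so no GDB is left behind.
        return None
    return proc.stdout.decode(errors="replace")
```

A caller would typically loop over the PIDs of interest and write each result to a file, treating `None` as "GDB did not return in time".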

jwilk commented 5 years ago

But setting OMP_THREAD_LIMIT is a good idea anyway; I'll try to make ocrodjvu set this automatically

This was implemented in 0.11.

ashipunov commented 5 years ago

Hi,

Thanks, I will try but I cannot guarantee that I will do it very soon. Thanks anyway!

Alexey

ashipunov commented 5 years ago

I'd like you to do the following: ...

Done! I ran the script ten times, just in case. Results are attached.

Alexey

jwilk commented 5 years ago

Results are attached.

I didn't get any attachments. (I guess GitHub discarded them?)

ashipunov commented 5 years ago

I am trying to send the attachment to your personal email, and I am also posting the results of the last run below:

===================================================================================================

6507 /usr/bin/python /usr/local/bin/ocrodjvu --in-place -l rus+lat -j 8 nikolaev1970_diat_posjet.djvu

ocrodjvu(6507)-+-tesseract(6530)-+-{tesseract}(6545)
               |                 |-{tesseract}(6546)
               |                 `-{tesseract}(6547)
               |-tesseract(6531)-+-{tesseract}(6542)
               |                 |-{tesseract}(6543)
               |                 `-{tesseract}(6544)
               |-tesseract(6532)-+-{tesseract}(6559)
               |                 |-{tesseract}(6560)
               |                 `-{tesseract}(6561)
               |-tesseract(6533)-+-{tesseract}(6556)
               |                 |-{tesseract}(6557)
               |                 `-{tesseract}(6558)
               |-tesseract(6534)-+-{tesseract}(6549)
               |                 |-{tesseract}(6550)
               |                 `-{tesseract}(6551)
               |-tesseract(6535)-+-{tesseract}(6553)
               |                 |-{tesseract}(6554)
               |                 `-{tesseract}(6555)
               |-tesseract(6552)-+-{tesseract}(6562)
               |                 |-{tesseract}(6563)
               |                 `-{tesseract}(6564)
               |-tesseract(6585)-+-{tesseract}(6586)
               |                 |-{tesseract}(6587)
               |                 `-{tesseract}(6588)
               |-{ocrodjvu}(6509)
               |-{ocrodjvu}(6511)
               |-{ocrodjvu}(6512)
               |-{ocrodjvu}(6513)
               |-{ocrodjvu}(6517)
               |-{ocrodjvu}(6519)
               |-{ocrodjvu}(6520)
               |-{ocrodjvu}(6522)
               `-{ocrodjvu}(6524)

 PID   LWP S  STARTED     ELAPSED     TIME %CPU   RSZ    VSZ COMMAND
6589     - - 14:32:52       01:34 00:00:00  0.0  5460  24844 /bin/bash

===================================================================================================
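
The process tree above lists three extra `{tesseract}` threads under each of the eight tesseract processes: four threads per process, 32 in total. The same count can be taken directly from `/proc` on Linux, as in the sketch below. The function names are hypothetical and this is not part of examine-hangs.

```python
import os


def thread_ids(pid):
    """Return the sorted thread (LWP) ids of a process, read from /proc (Linux only)."""
    return sorted(int(name) for name in os.listdir("/proc/%d/task" % pid))


def total_threads(pids):
    """Sum the thread counts of several processes."""
    return sum(len(thread_ids(pid)) for pid in pids)


if __name__ == "__main__":
    # Demonstrate on the current process; pass tesseract PIDs in real use.
    print(len(thread_ids(os.getpid())))
```

Comparing `total_threads` for all tesseract PIDs with the number of hardware threads gives a quick indication of how heavily the CPU is oversubscribed.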

AS

jwilk commented 5 years ago

Yikes, there was a bug in the examination script that broke it almost completely. :-( I've fixed that in 0ca41df8d80359134cf3536185c49fde1361f541. Could you try again with the updated script?

ashipunov commented 5 years ago

Sure. Five minutes.

AS

ashipunov commented 5 years ago

Now examine_hangs.sh itself hangs ;) but it did output something large. I am attaching a ZIP because the output is bulky.

AS

ashipunov commented 5 years ago

Dear Jakub,

Did you receive my output files?

Alexey

jwilk commented 5 years ago

Did you receive my output files?

Yes, thanks. I'll post their summary later on.

ashipunov commented 5 years ago

Thanks!

AS

jwilk commented 5 years ago

Here's the summary of the report I received from @ashipunov:

ashipunov commented 5 years ago

So, to summarize the summary, it is still unclear why this happens... I believe that my hypothesis about a processor-related issue might be plausible.

On Fri, 1 Mar 2019 at 12:02, Jakub Wilk notifications@github.com wrote:

Here's the summary:

There's the ocrodjvu process running (with 10 threads), and 8 tesseract processes (with 4 threads each).

After almost 6 minutes, only the first page is done. All the tesseract processes seem to be consuming CPU:

 PID   LWP S  STARTED     ELAPSED     TIME %CPU   RSZ    VSZ COMMAND
7493     - - 15:41:18       05:57 00:00:00  0.1 63360 1395120 /usr/bin/python /usr/local/bin/ocrodjvu --in-place -l rus+lat -j 8 nikolaev1970_diat_posjet.djvu
7516     - - 15:41:18       05:57 00:03:28 58.5 54180 120080 tesseract /tmp/ocrodjvu.a6lLsl/000002.tif /tmp/ocrodjvu.RCQ4D3/tmp -l rus+lat /tmp/ocrodjvu.RCQ4D3/tessconf
7517     - - 15:41:18       05:57 00:04:55 82.8 72088 137560 tesseract /tmp/ocrodjvu.a6lLsl/000005.tif /tmp/ocrodjvu.VawxBw/tmp -l rus+lat /tmp/ocrodjvu.VawxBw/tessconf
7518     - - 15:41:18       05:57 00:06:25  107 59708 124680 tesseract /tmp/ocrodjvu.a6lLsl/000007.tif /tmp/ocrodjvu.KOh_L6/tmp -l rus+lat /tmp/ocrodjvu.KOh_L6/tessconf
7519     - - 15:41:18       05:57 00:06:26  108 73656 138964 tesseract /tmp/ocrodjvu.a6lLsl/000003.tif /tmp/ocrodjvu.e5v4Qh/tmp -l rus+lat /tmp/ocrodjvu.e5v4Qh/tessconf
7520     - - 15:41:18       05:57 00:06:24  107 70500 135832 tesseract /tmp/ocrodjvu.a6lLsl/000006.tif /tmp/ocrodjvu.1GJ9Kc/tmp -l rus+lat /tmp/ocrodjvu.1GJ9Kc/tessconf
7521     - - 15:41:18       05:57 00:06:24  107 73040 138592 tesseract /tmp/ocrodjvu.a6lLsl/000004.tif /tmp/ocrodjvu.VgIIXx/tmp -l rus+lat /tmp/ocrodjvu.VgIIXx/tessconf
7560     - - 15:41:30       05:45 00:06:10  107 70360 135840 tesseract /tmp/ocrodjvu.a6lLsl/000008.tif /tmp/ocrodjvu.96qHYI/tmp -l rus+lat /tmp/ocrodjvu.96qHYI/tessconf
7581     - - 15:42:32       04:43 00:05:09  109 74156 139876 tesseract /tmp/ocrodjvu.a6lLsl/000009.tif /tmp/ocrodjvu.4XLOur/tmp -l rus+lat /tmp/ocrodjvu.4XLOur/tessconf

Backtraces from ocrodjvu threads look fine:

  • the main thread:

    Waiting for the GIL
    File "/usr/lib/python2.7/threading.py", line 340, in wait
      waiter.acquire()
    File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 512, in _process
      condition.wait()
    File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 549, in process
      self._process(*args, **kwargs)
    File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 567, in main
      context.process(options.path, options.pages)
    File "/usr/local/bin/ocrodjvu", line 7, in <module>
      _.main(sys.argv)
  • internal python-djvulibre thread:

    #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    #1  0x00007fdb8cf085f6 in DJVU::GMonitor::wait (this=this@entry=0x1e44af0) at GThreads.cpp:576
    #2  0x00007fdb8cf47ec0 in ddjvu_message_wait (ctx=0x1e44ae0) at ddjvuapi.cpp:733
    #3  0x00007fdb8c3a7319 in pyx_pf_4djvu_6decodeContext_message_distributor (pyx_self=, pyx_v_kwargs={'sentinel': <object at remote 0x7fdb8f2e4120>}, __pyx_v_self=) at build/temp.linux-x86_64-2.7/src/decode.c:15397
    #4  pyx_pw_4djvu_6decode_1_Context_message_distributor (__pyx_self=, pyx_args=, __pyx_kwds=) at build/temp.linux-x86_64-2.7/src/decode.c:15312
    #5  0x00000000004a587e in PyObject_Call () at ../Objects/abstract.c:2546
    #6  0x00000000004c5f3d in PyEval_CallObjectWithKeywords () at ../Python/ceval.c:4219
    #7  0x0000000000589662 in t_bootstrap () at ../Modules/threadmodule.c:620
    #8  0x00007fdb8efd46ba in start_thread (arg=0x7fdb89105700) at pthread_create.c:333
    #9  0x00007fdb8ed0a41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

  • 8 worker threads:

    File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 67, in _wait_for_worker
      stderr = worker.stderr.readlines()
    File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 262, in recognize_hocr
      _wait_for_worker(worker)
    File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 296, in recognize
      return f(image, language, details=details, uax29=uax29)
    File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 407, in process_page
      result = self._engine.recognize(pfile, language=self._options.language, details=self._options.details, uax29=self._options.uax29)
    File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 434, in page_thread
      result = self.process_page(page)
    File "/usr/lib/python2.7/threading.py", line 754, in run
      self.__target(*self.__args, **self.__kwargs)
    File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
      self.run()
    File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
      self.__bootstrap_inner()

    There are no backtraces for tesseract processes, because apparently GDB hangs on them. :-(
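
The worker-thread backtraces above all stop in `worker.stderr.readlines()`. That call returns only when the child closes its stderr, so a worker blocked there is simply waiting for its tesseract process to finish. The Python 3 sketch below illustrates the behaviour with a stand-in child that writes one line and then sleeps. The `CHILD` script and the `read_all_stderr` helper are hypothetical and are not ocrodjvu code.

```python
import subprocess
import sys
import time

# Stand-in for a slow tesseract run: print a progress line, then keep working.
CHILD = (
    "import sys, time\n"
    "sys.stderr.write('Page 1\\n')\n"
    "sys.stderr.flush()\n"
    "time.sleep(0.5)\n"
)


def read_all_stderr(argv):
    """Start argv, read its stderr to EOF, and return (lines, seconds_blocked)."""
    start = time.monotonic()
    worker = subprocess.Popen(argv, stderr=subprocess.PIPE)
    lines = worker.stderr.readlines()  # returns only once the child closes stderr
    worker.wait()
    return lines, time.monotonic() - start


if __name__ == "__main__":
    lines, blocked = read_all_stderr([sys.executable, "-c", CHILD])
    print(lines, "blocked for %.1f s" % blocked)
```

The line is written immediately, yet the parent gets it only after the child exits. This is consistent with the ocrodjvu threads looking fine while the tesseract processes keep consuming CPU.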
