Open ashipunov opened 5 years ago
$ time ocrodjvu --in-place -l rus+lat -j 4 nikolaev1970_diat_posjet.djvu
Is this reproducible for you? That is, when you run this command again, does it hang too?
Unfortunately, I wasn't able to reproduce the bug here:
$ time ocrodjvu --in-place -l rus+lat -j 4 nikolaev1970_diat_posjet.djvu
Processing 'nikolaev1970_diat_posjet.djvu':
- Page #1
- Page #2
- Page #3
...
- Page #12
real 1m57.306s
user 2m55.692s
sys 2m29.692s
This was on Ubuntu 16.04 (xenial), same Tesseract and ocrodjvu versions as yours, and higher-end hardware (3 cores of Intel Xeon Gold 6140).
This is weird. Unless something else is going on, excessive usage of threads shouldn't make things "infinitely slow".
But setting OMP_THREAD_LIMIT is a good idea anyway; I'll try to make ocrodjvu set this automatically for the next release.
In the meantime, you can set this manually:
$ time OMP_THREAD_LIMIT=1 ocrodjvu --in-place -l rus+lat -j 4 nikolaev1970_diat_posjet.djvu
Processing 'nikolaev1970_diat_posjet.djvu':
- Page #1
- Page #2
- Page #4
...
- Page #12
real 0m37.206s
user 1m45.108s
sys 0m2.484s
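For anyone who launches tesseract from their own script, the same workaround can be applied per child process. The following is only a minimal Python 3 sketch, not how ocrodjvu itself does it: the helper names `single_threaded_env` and `run_tesseract` are made up for illustration, and the tesseract arguments simply mirror the `tesseract IMAGE OUTPUT_BASE -l LANGS` form.

```python
import os
import subprocess


def single_threaded_env(base_env=None):
    """Return a copy of the environment with OpenMP limited to one thread.

    Hypothetical helper: an OMP_THREAD_LIMIT value that is already set
    is left alone, so an explicit user choice still wins.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.setdefault("OMP_THREAD_LIMIT", "1")
    return env


def run_tesseract(image_path, output_base, languages="rus+lat"):
    """Run one tesseract process with the restricted environment.

    Hypothetical helper; returns the exit status of the child process.
    """
    cmd = ["tesseract", image_path, output_base, "-l", languages]
    return subprocess.call(cmd, env=single_threaded_env())
```

Passing the variable through `env=` affects only the child process, so the calling program's own environment stays untouched.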
Dear Jakub,
Thank you very much. I will try to install Tesseract 4 again and then report my results.
By the way (this is unrelated), I tried to find instructions on how to install ocrodjvu under Windows (I was asked to help) but did not find any. Is it possible to install it there?
With best wishes,
Alexey Shipunov
Sorry, here I am again with the same issue. First, without OMP_THREAD_LIMIT=1 the situation is the same:
$ inxi
CPU~Dual core Intel Core i7-2620M (-HT-MCP-) speed/max~804/3400 MHz Kernel~4.4.0-141-generic x86_64
$ ocrodjvu --version
ocrodjvu 0.10.4
+ Python 2.7.12
+ subprocess32
+ python-djvulibre 0.7
+ lxml 3.5.0
+ html5lib-python 0.999
+ PyICU 1.9.2
+ ICU 55.1
+ Unicode 7.0
$ tesseract --version
tesseract 4.0.0-297-gec8f
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
$ # now:
$ time ocrodjvu --in-place -l rus+lat -j 4 nikolaev1970_diat_posjet.djvu
Processing 'nikolaev1970_diat_posjet.djvu':
- Page #1
- Page #2
- Page #3
- Page #4
- Page #5
- Page #6
^C
...
CalledProcessInterrupted: Command 'tesseract' was interrupted by signal SIGINT
Interrupted by user.
Intermediate files were left in the '/tmp/ocrodjvu.k9gIME' directory.
real 14m34.142s
user 58m3.444s
sys 0m4.336s
However:
$ time OMP_THREAD_LIMIT=1 ocrodjvu --in-place -l rus+lat -j 4 nikolaev1970_diat_posjet.djvu
Processing 'nikolaev1970_diat_posjet.djvu':
- Page #1
- Page #3
...
- Page #12
real 1m29.354s
user 5m24.756s
sys 0m3.328s
Finally, everything works with "-j 1" as it should (but, to my surprise, it is only two times slower than with "-j 4").
Wild idea: does it reflect the difference between Intel Xeon and Intel Core i7? If so, I should try a different machine.
So here is another machine (sorry, I do not have Xeons, but this is an i5 with eight logical cores):
$ inxi
CPU~Quad core Intel Core i5-8250U (-HT-MCP-) speed/max~938/3400 MHz Kernel~4.15.0-43-generic x86_64
$ time ocrodjvu --in-place -l rus+lat -j 8 nikolaev1970_diat_posjet.djvu
Processing 'nikolaev1970_diat_posjet.djvu':
- Page #1
- Page #2
- Page #3
...
- Page #10
^C
CalledProcessInterrupted: Command 'tesseract' was interrupted by signal SIGINT
Interrupted by user.
Intermediate files were left in the '/tmp/ocrodjvu.0DU1iF' directory.
real 14m43.246s
user 117m31.202s
sys 0m1.366s
$ # but:
$ time OMP_THREAD_LIMIT=1 ocrodjvu --in-place -l rus+lat -j 8 nikolaev1970_diat_posjet.djvu
Processing 'nikolaev1970_diat_posjet.djvu':
- Page #1
- Page #2
- Page #3
...
- Page #12
real 0m32.009s
user 3m30.699s
sys 0m3.298s
This is really weird! I believe that this is a bug, probably associated with Intel i* processors.
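As a rough sanity check on the timings above, dividing the `user` time by the `real` time estimates how many CPUs were kept busy on average. The Python 3 sketch below does that arithmetic; the helper names `parse_time` and `busy_cpus` are made up for illustration, and the figures are copied from the four runs in this comment.

```python
import re


def parse_time(value):
    """Convert a value printed by bash's `time`, e.g. '14m43.246s', to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError("unrecognized time value: {!r}".format(value))
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def busy_cpus(real, user):
    """Average number of CPUs kept busy in user mode during the run."""
    return parse_time(user) / parse_time(real)


# (label, real, user) copied from the runs above
RUNS = [
    ("i7 -j 4, no limit (interrupted)", "14m34.142s", "58m3.444s"),
    ("i7 -j 4, OMP_THREAD_LIMIT=1", "1m29.354s", "5m24.756s"),
    ("i5 -j 8, no limit (interrupted)", "14m43.246s", "117m31.202s"),
    ("i5 -j 8, OMP_THREAD_LIMIT=1", "0m32.009s", "3m30.699s"),
]

for label, real, user in RUNS:
    print("{}: {:.2f} CPUs busy".format(label, busy_cpus(real, user)))
```

The two interrupted runs come out at roughly 4 and 8 busy CPUs, which matches the number of hardware threads on each machine. That suggests the processes were spinning at full load rather than sleeping, although it does not say why.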
Color me baffled. :-/
I've hacked up a script to dump some information about the hanging processes: examine-hangs (https://github.com/jwilk/ocrodjvu/blob/master/private/examine-hangs). Hopefully it'll shed some light on what's going on, but I'm not overly optimistic…
I'd like you to do the following:
Disable ptrace restrictions that would prevent GDB from working:
# sysctl kernel.yama.ptrace_scope=0
Install GDB and a bunch of debug packages:
# apt-get install gdb djvulibre-dbg libc6-dbg libgcc1-dbg libgomp1-dbg libstdc++6-5-dbg python-djvu-dbg python-lxml-dbg python2.7-dbg
Run ocrodjvu with -j 4 (without OMP_THREAD_LIMIT) until it hangs.
Run examine-hangs. (It's going to produce copious amount of output on stdout, so it's best to redirect it to a file.)
Send me the file with the output by email, or zip it and attach here.
But setting OMP_THREAD_LIMIT is a good idea anyway; I'll try to make ocrodjvu set this automatically
This was implemented in 0.11.
Hi,
Thanks, I will try but I cannot guarantee that I will do it very soon. Thanks anyway!
Alexey
I'd like you to do the following: ...
Done! I ran the script ten times, just in case. Results are attached.
Alexey
Results are attached.
I didn't get any attachments. (I guess GitHub discarded them?)
I am trying to send the attachment to your personal email, and I am also pasting the results of the last run below:
===================================================================================================
6507 /usr/bin/python /usr/local/bin/ocrodjvu --in-place -l rus+lat -j 8 nikolaev1970_diat_posjet.djvu
ocrodjvu(6507)-+-tesseract(6530)-+-{tesseract}(6545)
               |                 |-{tesseract}(6546)
               |                 `-{tesseract}(6547)
               |-tesseract(6531)-+-{tesseract}(6542)
               |                 |-{tesseract}(6543)
               |                 `-{tesseract}(6544)
               |-tesseract(6532)-+-{tesseract}(6559)
               |                 |-{tesseract}(6560)
               |                 `-{tesseract}(6561)
               |-tesseract(6533)-+-{tesseract}(6556)
               |                 |-{tesseract}(6557)
               |                 `-{tesseract}(6558)
               |-tesseract(6534)-+-{tesseract}(6549)
               |                 |-{tesseract}(6550)
               |                 `-{tesseract}(6551)
               |-tesseract(6535)-+-{tesseract}(6553)
               |                 |-{tesseract}(6554)
               |                 `-{tesseract}(6555)
               |-tesseract(6552)-+-{tesseract}(6562)
               |                 |-{tesseract}(6563)
               |                 `-{tesseract}(6564)
               |-tesseract(6585)-+-{tesseract}(6586)
               |                 |-{tesseract}(6587)
               |                 `-{tesseract}(6588)
               |-{ocrodjvu}(6509)
               |-{ocrodjvu}(6511)
               |-{ocrodjvu}(6512)
               |-{ocrodjvu}(6513)
               |-{ocrodjvu}(6517)
               |-{ocrodjvu}(6519)
               |-{ocrodjvu}(6520)
               |-{ocrodjvu}(6522)
               `-{ocrodjvu}(6524)
PID LWP S STARTED ELAPSED TIME %CPU RSZ VSZ COMMAND
6589 - - 14:32:52 01:34 00:00:00 0.0 5460 24844 /bin/bash
===================================================================================================
AS
Yikes, there was a bug in the examination script that broke it almost completely. :-( I've fixed this in 0ca41df8d80359134cf3536185c49fde1361f541. Could you try again with the updated script?
Sure. Five minutes.
AS
Now examine_hangs.sh itself hangs ;) but it did output something large. I am attaching a ZIP because the output is bulky.
AS
Dear Jakub,
Did you receive my output files?
Alexey
Did you receive my output files?
Yes, thanks. I'll post their summary later on.
Thanks!
AS
Here's the summary of the report I received from @ashipunov:
There's the ocrodjvu process running (with 10 threads), and 8 tesseract processes (with 4 threads each).
After almost 6 minutes, only the first page is done. All the tesseract processes seem to be consuming CPU:
PID LWP S STARTED ELAPSED TIME %CPU RSZ VSZ COMMAND
7493 - - 15:41:18 05:57 00:00:00 0.1 63360 1395120 /usr/bin/python /usr/local/bin/ocrodjvu --in-place -l rus+lat -j 8 nikolaev1970_diat_posjet.djvu
7516 - - 15:41:18 05:57 00:03:28 58.5 54180 120080 tesseract /tmp/ocrodjvu.a6lLsl/000002.tif /tmp/ocrodjvu.RCQ4D3/tmp -l rus+lat /tmp/ocrodjvu.RCQ4D3/tessconf
7517 - - 15:41:18 05:57 00:04:55 82.8 72088 137560 tesseract /tmp/ocrodjvu.a6lLsl/000005.tif /tmp/ocrodjvu.VawxBw/tmp -l rus+lat /tmp/ocrodjvu.VawxBw/tessconf
7518 - - 15:41:18 05:57 00:06:25 107 59708 124680 tesseract /tmp/ocrodjvu.a6lLsl/000007.tif /tmp/ocrodjvu.KOh_L6/tmp -l rus+lat /tmp/ocrodjvu.KOh_L6/tessconf
7519 - - 15:41:18 05:57 00:06:26 108 73656 138964 tesseract /tmp/ocrodjvu.a6lLsl/000003.tif /tmp/ocrodjvu.e5v4Qh/tmp -l rus+lat /tmp/ocrodjvu.e5v4Qh/tessconf
7520 - - 15:41:18 05:57 00:06:24 107 70500 135832 tesseract /tmp/ocrodjvu.a6lLsl/000006.tif /tmp/ocrodjvu.1GJ9Kc/tmp -l rus+lat /tmp/ocrodjvu.1GJ9Kc/tessconf
7521 - - 15:41:18 05:57 00:06:24 107 73040 138592 tesseract /tmp/ocrodjvu.a6lLsl/000004.tif /tmp/ocrodjvu.VgIIXx/tmp -l rus+lat /tmp/ocrodjvu.VgIIXx/tessconf
7560 - - 15:41:30 05:45 00:06:10 107 70360 135840 tesseract /tmp/ocrodjvu.a6lLsl/000008.tif /tmp/ocrodjvu.96qHYI/tmp -l rus+lat /tmp/ocrodjvu.96qHYI/tessconf
7581 - - 15:42:32 04:43 00:05:09 109 74156 139876 tesseract /tmp/ocrodjvu.a6lLsl/000009.tif /tmp/ocrodjvu.4XLOur/tmp -l rus+lat /tmp/ocrodjvu.4XLOur/tessconf
Backtraces from ocrodjvu threads look fine:
the main thread:
Waiting for the GIL
File "/usr/lib/python2.7/threading.py", line 340, in wait
waiter.acquire()
File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 512, in _process
condition.wait()
File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 549, in process
self._process(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 567, in main
context.process(options.path, options.pages)
File "/usr/local/bin/ocrodjvu", line 7, in <module>
_.main(sys.argv)
internal python-djvulibre thread:
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fdb8cf085f6 in DJVU::GMonitor::wait (this=this@entry=0x1e44af0) at GThreads.cpp:576
#2 0x00007fdb8cf47ec0 in ddjvu_message_wait (ctx=0x1e44ae0) at ddjvuapi.cpp:733
#3 0x00007fdb8c3a7319 in __pyx_pf_4djvu_6decode__Context_message_distributor (__pyx_self=<optimized out>, __pyx_v_kwargs={'sentinel': <object at remote 0x7fdb8f2e4120>}, __pyx_v_self=<optimized out>) at build/temp.linux-x86_64-2.7/src/decode.c:15397
#4 __pyx_pw_4djvu_6decode_1_Context_message_distributor (__pyx_self=<optimized out>, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>) at build/temp.linux-x86_64-2.7/src/decode.c:15312
#5 0x00000000004a587e in PyObject_Call () at ../Objects/abstract.c:2546
#6 0x00000000004c5f3d in PyEval_CallObjectWithKeywords () at ../Python/ceval.c:4219
#7 0x0000000000589662 in t_bootstrap () at ../Modules/threadmodule.c:620
#8 0x00007fdb8efd46ba in start_thread (arg=0x7fdb89105700) at pthread_create.c:333
#9 0x00007fdb8ed0a41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
8 worker threads:
File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 67, in _wait_for_worker
stderr = worker.stderr.readlines()
File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 262, in recognize_hocr
_wait_for_worker(worker)
File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/engines/tesseract.py", line 296, in recognize
return f(image, language, details=details, uax29=uax29)
File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 407, in process_page
result = self._engine.recognize(pfile, language=self._options.language, details=self._options.details, uax29=self._options.uax29)
File "/usr/local/lib/python2.7/dist-packages/ocrodjvu/cli/ocrodjvu.py", line 434, in page_thread
result = self.process_page(page)
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
self.__bootstrap_inner()
There are no backtraces for tesseract processes, because apparently GDB hangs on them. :-(
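To put the thread counts from this summary in perspective, here is a small Python 3 sketch of the oversubscription arithmetic. It assumes the machine exposes 8 logical CPUs, as on the i5 used for the `-j 8` run, and the function name `oversubscription` is made up for illustration. Whether oversubscription alone accounts for the hang remains unclear.

```python
import os


def oversubscription(jobs, threads_per_job, logical_cpus=None):
    """Ratio of runnable threads to logical CPUs.

    Hypothetical helper: a value of 1.0 means one thread per logical CPU,
    larger values mean threads have to compete for CPU time.
    """
    if logical_cpus is None:
        # os.cpu_count() may return None on unusual platforms
        logical_cpus = os.cpu_count() or 1
    return (jobs * threads_per_job) / float(logical_cpus)


# 8 tesseract processes with 4 threads each, on 8 logical CPUs (assumed)
print(oversubscription(jobs=8, threads_per_job=4, logical_cpus=8))

# The same run with OMP_THREAD_LIMIT=1, i.e. one compute thread per process
print(oversubscription(jobs=8, threads_per_job=1, logical_cpus=8))
```

With the default settings there are four times as many tesseract threads as logical CPUs; limiting each process to a single thread brings the ratio back to one.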
So, to summarize the summary, it is still unclear why this happens... I believe that my hypothesis about a processor-related issue might be plausible.
Today I installed Tesseract 4 from an Ubuntu PPA (ppa:alex-p/tesseract-ocr, 4.0.0+git3515-9bcfa90c-1ppa1~xenial1). Tesseract itself works normally, and ocrodjvu also works OK with the default "j=1". However, when I specified "j=4", ocrodjvu hangs, and when I interrupt it, I get the following:
I know that there are issues with multi-threading so I used recommendations from
https://github.com/tesseract-ocr/tesseract/issues/898
and from
https://appliedmachinelearning.blog/2018/06/30/performing-ocr-by-running-parallel-instances-of-tesseract-4-0-python/
to set the environment as 'OMP_THREAD_LIMIT=1 tesseract'. However, all my attempts failed, namely (a) renaming the executable and replacing it with a script, (b) making a script that contains the alias, and finally (c) changing your code to allow this environment variable.
My system info output:
Ocrodjvu version:
In the end, I reverted everything to Tesseract 3, and now it works. This means, for example, that I cannot OCR books in Armenian and Quechua as these languages for some reason are not in Tesseract 3.
Please help.