Closed sallyhill closed 5 years ago
A couple thousand other images went through without problems; only these images triggered the error.
Any solution to that? Having similar issues.
@sallyhill, @psinger Does this happen with --oem 1 or --oem 0?
@amitdo
I just tried it, and it works with both options.
Any idea what's going on?
Seems like a bug in combining the two OCR engines.
Any way to track this down further?
You can use GDB to see the function call chain.
Frankly, I only use --oem 1 (or 3 with best/fast traineddata), so I'm not so motivated to invest time on this issue. Sorry.
:+1:
I get the reported assertion with the second image (all other images work for me) and will have a look.
@stweil,
Same assert was reported in:
Also see PR #1286
@zdenop Please label as bug.
Have you found any solution for this? My PDF contains both Arabic and English. I'm facing the same issue.
contains_unichar_id(unichar_id):Error:Assert failed:in file c:\projects\github\tesseract-ocr\src\ccutil\unicharset.h, line 511
Exception in thread "main" java.lang.Error: Invalid memory access
    at com.sun.jna.Native.invokePointer(Native Method)
    at com.sun.jna.Function.invokePointer(Function.java:470)
    at com.sun.jna.Function.invoke(Function.java:404)
    at com.sun.jna.Function.invoke(Function.java:315)
    at com.sun.jna.Library$Handler.invoke(Library.java:212)
    at com.sun.proxy.$Proxy1.TessBaseAPIGetUTF8Text(Unknown Source)
    at net.sourceforge.tess4j.Tesseract.getOCRText(Tesseract.java:433)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:288)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:209)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:193)
Yeah, I made a patch for it that removes this assert. It kind of works, but it doesn't really solve the issue.
Thanks, syzer. Where can I get the patch? Please share it. Could you also guide me on preparing trained data? Regards
Please see https://github.com/tesseract-ocr/tesseract/pull/1286 for the patch.
It has not been merged yet.
If you try it please provide feedback.
Please publish a standard jar file so that we can explore it. Could you also guide me on creating a traineddata file?
thanks
Hi. I have the same issue, using Tesseract Open Source OCR Engine vv4.0.0-beta.1.20180608 with Leptonica for Windows. How can I get this patch?
I can reproduce this and since I haven't seen a stack trace for this yet I will post the one I have:
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 511
Thread 1 "tesseract" received signal SIGSEGV, Segmentation fault.
ERRCODE::error (this=this@entry=0x7ffff7d774c8 <_ZL13ASSERT_FAILED>, caller=caller@entry=0x7ffff7874630 "contains_unichar_id(unichar_id)", acti
format=format@entry=0x7ffff7871e41 "in file %s, line %d") at errcode.cpp:86
86 if (!*p)
(gdb) bt
#0 ERRCODE::error (this=this@entry=0x7ffff7d774c8 <_ZL13ASSERT_FAILED>, caller=caller@entry=0x7ffff7874630 "contains_unichar_id(unichar_id)", action=action@entry=ABORT,
format=format@entry=0x7ffff7871e41 "in file %s, line %d") at errcode.cpp:86
#1 0x00007ffff77c5ef4 in UNICHARSET::get_isdigit (unichar_id=297, this=0x5555559ac990) at ../../src/ccutil/unicharset.h:511
#2 tesseract::Dict::char_for_dawg (dawg=0x555556c3f2d0, ch=297, this=0x555555dfb120) at dict.h:435
#3 tesseract::Dict::def_letter_is_okay(void*, int, bool) const () at dict.cpp:413
#4 0x00007ffff77c624e in tesseract::Dict::valid_word(WERD_CHOICE const&, bool) const () at ../../src/ccstruct/ratngs.h:314
#5 0x00007ffff76c437b in tesseract::Tesseract::recog_word(WERD_RES*) () at tfacepp.cpp:69
#6 0x00007ffff76c1ed3 in tesseract::Tesseract::tess_segment_pass_n (this=this@entry=0x7ffff7fd2010, pass_n=pass_n@entry=1, word=word@entry=0x55555ad33a20) at tessbox.cpp:48
#7 0x00007ffff7674b8e in tesseract::Tesseract::match_word_pass_n(int, WERD_RES*, ROW*, BLOCK*) () at control.cpp:1644
#8 0x00007ffff7674d89 in tesseract::Tesseract::classify_word_pass1 (this=0x7ffff7fd2010, word_data=..., in_word=0x55555acd0780, out_words=<optimized out>)
at control.cpp:1450
#9 0x00007ffff7676114 in tesseract::Tesseract::RetryWithLanguage(tesseract::WordData const&, void (tesseract::Tesseract::*)(tesseract::WordData const&, WERD_RES**, tesseract::PointerVector<WERD_RES>*), bool, WERD_RES**, tesseract::PointerVector<WERD_RES>*) () at control.cpp:923
#10 0x00007ffff7676944 in tesseract::Tesseract::classify_word_and_language(int, PAGE_RES_IT*, tesseract::WordData*) () at ../../src/ccutil/genericvector.h:716
#11 0x00007ffff767a189 in tesseract::Tesseract::RecogAllWordsPassN(int, ETEXT_DESC*, PAGE_RES_IT*, GenericVector<tesseract::WordData>*) () at control.cpp:276
#12 0x00007ffff767ba43 in tesseract::Tesseract::recog_all_words(PAGE_RES*, ETEXT_DESC*, TBOX const*, char const*, int) () at control.cpp:369
#13 0x00007ffff7663c6e in tesseract::TessBaseAPI::Recognize(ETEXT_DESC*) () at baseapi.cpp:907
#14 0x00007ffff7664002 in tesseract::TessBaseAPI::ProcessPage (this=this@entry=0x5555557592c0 <main::api>, pix=0x55555598a720, page_index=page_index@entry=0,
filename=filename@entry=0x7fffffffe5fa "0003.jpg", retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0, renderer=0x555555983800)
at baseapi.cpp:1217
#15 0x00007ffff7666fe9 in tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) () at baseapi.cpp:1169
#16 0x00007ffff766711e in tesseract::TessBaseAPI::ProcessPages (this=this@entry=0x5555557592c0 <main::api>, filename=filename@entry=0x7fffffffe5fa "0003.jpg",
retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0, renderer=<optimized out>) at baseapi.cpp:1070
#17 0x0000555555556c73 in main () at ../../src/ccutil/genericvector.h:716
#18 0x00007ffff67ff06b in __libc_start_main () from /usr/lib/libc.so.6
#19 0x000055555555729a in _start () at tesseractmain.cpp:602
Looks like the unicode point being provided to get_isdigit is not a valid digit and hits the assertion. Not sure how and why we end up there though.
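One possible reading of the backtrace, sketched below, is that the value passed to get_isdigit (297 in frame #1) is an id that lies outside the character set being queried, so the check fails before any digit test happens. The class here is a toy model written for illustration, not Tesseract's UNICHARSET, and the set sizes are made up; only the id 297 and the assertion text come from the trace.

```python
class ToyUnicharset:
    """Toy stand-in for a character set; not Tesseract's UNICHARSET."""

    def __init__(self, unichars):
        self.unichars = list(unichars)

    def contains_unichar_id(self, unichar_id):
        # An id is valid only if it indexes into this particular set.
        return 0 <= unichar_id < len(self.unichars)

    def get_isdigit(self, unichar_id):
        # Same shape as the failing check: validate the id, then look it up.
        if not self.contains_unichar_id(unichar_id):
            raise AssertionError("contains_unichar_id(unichar_id)")
        return self.unichars[unichar_id].isdigit()


def lookup_fails(charset, unichar_id):
    """Return True if the lookup trips the assertion."""
    try:
        charset.get_isdigit(unichar_id)
    except AssertionError:
        return True
    return False


# Hypothetical sizes: 297 is taken from the backtrace, 112 and 400 are not.
small_set = ToyUnicharset(chr(0x30 + i) for i in range(112))
large_set = ToyUnicharset(chr(0x30 + i) for i in range(400))

print(lookup_fails(small_set, 297))
print(lookup_fails(large_set, 297))
```

In this model the same id is fine for the larger set and fatal for the smaller one, which is the kind of mismatch that could arise if an id produced with one character set were looked up in another.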
Please check the version of traineddata file that you are using.
Also try with traineddata from tessdata_fast and tessdata_best. Do you get the same error?
On Thu, Aug 09, 2018 at 07:34:40AM -0700, Shreeshrii wrote:
> Please check the version of traineddata file that you are using.
I used a roughly two-week-old version of the models from the tesseract-data GitHub repo.
> Also try with traineddata from tessdata_fast and tessdata_best. Do you get the same error?
Sadly I don't have access to the installation at the moment because I am off work and will be going on holiday tomorrow. I will make a note in my calendar to check this after I am back.
Cheers,
Silvan
The issue only occurs with models from tessdata (starting with commit d87b3c) and OCR engine mode 2.
> The issue only occurs with models from tessdata (starting with commit d87b3c) and OCR engine mode 2.
That commit is 'Updated LSTM Models to integerized tessdata_best'.
The earlier commit by Ray, on Nov 29, 2016, was 'Added LSTM models+lang models to 101 langs'.
However, after that the format of traineddata files has changed to include the recoder. If I remember correctly, those LSTM models do not work/produce accurate recognition results with current code.
2017-07-14 (dc8745e) Ray Smith: Move LSTM unicharset and recoder to traineddata with version string part1. Backwards compatible - maybe.
@stweil This is in continuation to the comment above.
Traineddata files now have two separate unicharsets, one for legacy and the other for lstm.
It is possible that both these unicharsets were the same in the models from Nov 29, 2016. In that case the error will not manifest.
Even now, it is possible that certain language traineddata files have the same unicharset for both legacy and LSTM; those languages will not show the error either.
I expect that the error occurs in languages which use the recoder/unicharcompressor and where the two unicharsets are different.
This is my guess, I haven't verified it in the files.
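A guess like this could be checked by comparing the two unicharsets of a traineddata file once they have been unpacked to text. The sketch below assumes the plain-text unicharset layout (first line is the entry count, each following line starts with the unichar) and uses made-up sample contents; the function names are hypothetical helpers, not part of Tesseract.

```python
def read_unichars(text):
    """Return the unichar column of a unicharset file given as text.

    Assumes the first line holds the number of entries and every
    following line starts with the unichar, followed by its properties.
    """
    lines = text.splitlines()
    count = int(lines[0].split()[0])
    return [line.split(" ", 1)[0] for line in lines[1:1 + count]]


def unicharsets_differ(legacy_text, lstm_text):
    """True if the two unicharsets list different unichars or a different order."""
    return read_unichars(legacy_text) != read_unichars(lstm_text)


# Made-up sample data, only to show the shape of the comparison.
legacy_sample = "3\nNULL 0 Common 0\na 3 Latin 1\nb 3 Latin 2\n"
lstm_sample = "4\nNULL 0 Common 0\na 3 Latin 1\nb 3 Latin 2\n1 8 Common 3\n"

print(unicharsets_differ(legacy_sample, lstm_sample))
print(unicharsets_differ(legacy_sample, legacy_sample))
```

The input files would have to come from unpacking the traineddata first, for example with combine_tessdata -u; the exact names of the unpacked component files are something to verify locally rather than take from this sketch.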
On Thu, Aug 9, 2018 at 4:34 PM Shreeshrii notifications@github.com wrote:
> Also try with traineddata from tessdata_fast and tessdata_best. Do you get the same error?
I tried both and can confirm that there is no error if I use the models from tessdata_fast or tessdata_best (as others have observed as well).
I consider this to be one of the most important bugs which I'd like to get fixed for 4.0.0, even if it only occurs with models from https://github.com/tesseract-ocr/traineddata when both old and new OCR engine are used (which is still the default). Several possible solutions exist:
--oem 3 would no longer be "based on what is available", but "best which is available". Drawback: People would still get the error when running with --oem 2.
> "best which is available"
Should be: best if available, else legacy if available, else exit with an error "not a valid traineddata"
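The selection order proposed here can be written down as a small decision function. This is only a sketch of the rule as stated in the comment, assuming "best" refers to the LSTM engine; the function name and its boolean parameters are hypothetical and do not correspond to Tesseract's actual initialization code.

```python
def choose_engine(has_lstm, has_legacy):
    """Pick an engine for the default mode from what the traineddata contains.

    Assumption: "best" means the LSTM engine. Combining both engines is
    deliberately never chosen here.
    """
    if has_lstm:
        return "lstm"
    if has_legacy:
        return "legacy"
    raise ValueError("not a valid traineddata")


print(choose_engine(has_lstm=True, has_legacy=True))
print(choose_engine(has_lstm=False, has_legacy=True))
```

With this rule the default mode never runs both engines together, so the failing combination would only be reachable when it is requested explicitly.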
It will be helpful if @jbreiden can check whether this error also happens with Google's version of tesseract.
See discussion #1849 with some ideas for workaround solutions.
@stweil, since we want to release 4.0.0 in the next 2-3 weeks and we still don't have a fix for this issue, I think we need to move to plan B (make a workaround).
We don't. I found a fix today. See pull request #1954.
Thanks!
I assume it also solves the other similar reports, right? https://github.com/tesseract-ocr/tesseract/issues/1205#issuecomment-364169774
Yes, I assume so. @sallyhill, @psinger please test the new code.
Unfortunately, this issue still persists with releases containing the above bugfix (4.0.0 on Arch Linux).
➜ ~/projects/tesseract git:(master) tesseract --version
tesseract 4.0.0
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.2) : libpng 1.6.37 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.2
Found AVX2
Found AVX
Found SSE
(gdb) bt
#0 0x00007effa32860fb in ERRCODE::error(char const*, TessErrorLogCode, char const*, ...) const ()
from /usr/lib/libtesseract.so.4
#1 0x00007effa31f2a84 in tesseract::Dict::case_ok(WERD_CHOICE const&, UNICHARSET const&) const ()
from /usr/lib/libtesseract.so.4
#2 0x00007effa31fec28 in tesseract::Dict::AcceptableResult(WERD_RES*) const () from /usr/lib/libtesseract.so.4
#3 0x00007effa30cc734 in tesseract::Tesseract::match_word_pass_n(int, WERD_RES*, ROW*, BLOCK*) ()
from /usr/lib/libtesseract.so.4
#4 0x00007effa30cc7fa in tesseract::Tesseract::classify_word_pass1(tesseract::WordData const&, WERD_RES**, tesseract::PointerVector<WERD_RES>*) () from /usr/lib/libtesseract.so.4
#5 0x00007effa30ce0c7 in tesseract::Tesseract::RetryWithLanguage(tesseract::WordData const&, void (tesseract::Tesseract::*)(tesseract::WordData const&, WERD_RES**, tesseract::PointerVector<WERD_RES>*), bool, WERD_RES**, tesseract::PointerVector<WERD_RES>*) () from /usr/lib/libtesseract.so.4
#6 0x00007effa30ce7f1 in tesseract::Tesseract::classify_word_and_language(int, PAGE_RES_IT*, tesseract::WordData*)
() from /usr/lib/libtesseract.so.4
#7 0x00007effa30d1240 in tesseract::Tesseract::RecogAllWordsPassN(int, ETEXT_DESC*, PAGE_RES_IT*, GenericVector<tesseract::WordData>*) () from /usr/lib/libtesseract.so.4
#8 0x00007effa30d2f84 in tesseract::Tesseract::recog_all_words(PAGE_RES*, ETEXT_DESC*, TBOX const*, char const*, int) () from /usr/lib/libtesseract.so.4
#9 0x00007effa30bc6b3 in tesseract::TessBaseAPI::Recognize(ETEXT_DESC*) () from /usr/lib/libtesseract.so.4
#10 0x00007effa30bca2b in tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) () from /usr/lib/libtesseract.so.4
#11 0x00007effa30bd6f5 in tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) () from /usr/lib/libtesseract.so.4
#12 0x00007effa30bd8af in tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) () from /usr/lib/libtesseract.so.4
#13 0x000055bb5496cc96 in main ()
The bad news is that I cannot share the file causing it.
Try using --oem 1 as a workaround.
@ingwinlu, it would help to have a reproducible test case. Perhaps you can find a shareable image, or you can send me your image via e-mail.
I get the error: "Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica contains_unichar_id(unichar_id):Error:Assert failed:in file ../ccutil/unicharset.h, line 513" when running it on the following test image. The problem persists even when running with --oem 1.
Your Tesseract version is very old. Use the latest code when dealing with issues.
I have the latest version.
you wrote:
I get the error: "Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica contains_unichar_id(unichar_id):Error:Assert failed:in file ../ccutil/unicharset.h, line 513"
Yes, that is the error I am getting. I could not find any instructions for installing Tesseract on Red Hat, so I used the instructions given in this post: https://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg15794.html
If you get that error, you are not using the latest code/version, and it is not a Tesseract issue.
I uninstalled Tesseract and reinstalled it using the instructions given here: https://github.com/tesseract-ocr/tesseract/wiki The problem still persists. I notice that tesseract-lang is only version 4.00, which does not match version 4.1.0 of Tesseract itself. Could this be what is causing the issue, and if so, how do I get the most recent version of tesseract-lang?
I am getting the same error even when I use no config. Is this issue still closed?
Please post tesseract version, which traineddata you used and the image giving error.
Environment
Current Behavior:
text to string of these images throws a TesseractError that prints: (-6, 'contains_unichar_id(unichar_id):Error:Assert failed:in file ../ccutil/unicharset.h, line 513') on the attached files
Expected Behavior:
No error.
Suggested Fix:
I am not sure. Right now I'm just running pytesseract.image_to_string in a try block
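A try block like the one described can also retry with a different engine mode before giving up. The sketch below is one way to structure that, not a fix for the underlying assert. The helper name and its parameters are hypothetical. It assumes the OCR callable accepts a config keyword, as pytesseract.image_to_string does, and that the error it raises is a RuntimeError subclass, which I believe holds for pytesseract's TesseractError. Whether --oem 1 actually avoids the assert depends on the Tesseract version and traineddata in use.

```python
def ocr_with_fallback(image, ocr, configs=("", "--oem 1"), errors=(RuntimeError,)):
    """Try each config in turn and return the first successful result.

    `ocr` is any callable taking (image, config=...). The default configs
    try the caller's normal settings first, then the LSTM-only engine mode.
    """
    last_error = None
    for config in configs:
        try:
            return ocr(image, config=config)
        except errors as exc:
            # Remember the failure and move on to the next config.
            last_error = exc
    # Every config failed: surface the last error to the caller.
    raise last_error


# Stand-in for the real OCR call so the sketch runs without Tesseract installed.
calls = []

def fake_ocr(image, config=""):
    calls.append(config)
    if config != "--oem 1":
        raise RuntimeError("contains_unichar_id(unichar_id):Error:Assert failed")
    return "recognized text"


print(ocr_with_fallback("page.png", fake_ocr))
```

With pytesseract installed, the call would look like ocr_with_fallback(img, pytesseract.image_to_string); logging the config that finally succeeded makes it easier to report which images need the fallback.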