tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

some images translated to text using Tesseract 4 throw an error regarding "contains_unichar_id" #1205

Closed: sallyhill closed this issue 5 years ago

sallyhill commented 6 years ago

Environment

Current Behavior:

Running image_to_string on the attached files throws a TesseractError that prints: (-6, 'contains_unichar_id(unichar_id):Error:Assert failed:in file ../ccutil/unicharset.h, line 513')

Expected Behavior:

No error.

Suggested Fix:

I am not sure. Right now I'm just running pytesseract.image_to_string in a try block
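The try-block approach can be sketched roughly as follows. This is only an illustration: `ocr_batch` and its parameters are hypothetical names, and the OCR function is passed in as a callable so the sketch does not depend on any particular binding. It assumes the binding reports failures as a `RuntimeError`, which to my knowledge is what pytesseract's `TesseractError` derives from.

```python
from typing import Callable, Dict, Tuple


def ocr_batch(
    ocr: Callable[[object], str],
    images: Dict[str, object],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run `ocr` on every image, collecting failures instead of stopping.

    `ocr` is any callable that takes an image and returns text,
    for example pytesseract.image_to_string (assumption: it raises a
    RuntimeError subclass on failure; widen the except clause if not).
    """
    texts: Dict[str, str] = {}
    failures: Dict[str, str] = {}
    for name, image in images.items():
        try:
            texts[name] = ocr(image)
        except RuntimeError as exc:
            # Keep the message so the failing files can be reported later.
            failures[name] = str(exc)
    return texts, failures
```

With pytesseract this would be called as `ocr_batch(pytesseract.image_to_string, images)`. It only hides the assertion; the failing file names in `failures` are what a bug report would need.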

sallyhill commented 6 years ago

A couple thousand other images were processed without problems; only these images triggered the error.

psinger commented 6 years ago

Any solution to that? Having similar issues.

amitdo commented 6 years ago

@sallyhill , @psinger Does this happen with:

?

psinger commented 6 years ago

@amitdo

I just tried it, and it works with both options.

Any idea what's going on?

amitdo commented 6 years ago

Seems like a bug in combining the two OCR engines.

psinger commented 6 years ago

Any way to track this down further?

amitdo commented 6 years ago

You can use GDB to see the function call chain.
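For anyone who wants to try this, here is a small sketch that builds a non-interactive GDB invocation printing a backtrace after the crash. The helper name `gdb_backtrace_command` and the default `--oem 2` are assumptions made for illustration (mode 2 is the combined-engine mode discussed in this thread); `-batch`, `-ex` and `--args` are standard GDB options.

```python
import shlex
from typing import List


def gdb_backtrace_command(image_path: str, oem: int = 2) -> List[str]:
    """Build a gdb command that runs tesseract and prints a backtrace.

    The tesseract part is `tesseract IMAGE stdout --oem N`; adjust the
    arguments to match the invocation that fails for you.
    """
    return [
        "gdb", "-batch",          # exit when the commands are done
        "-ex", "run",             # start the program
        "-ex", "bt",              # print the call chain after it stops
        "--args", "tesseract", image_path, "stdout", "--oem", str(oem),
    ]


if __name__ == "__main__":
    # Print a copy-pastable shell command line.
    print(shlex.join(gdb_backtrace_command("0003.jpg")))
```

The backtrace is far more readable if Tesseract was built with debug symbols; without them, frames show only addresses and mangled function names.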

Frankly, I only use --oem 1 (or 3 with best/fast traineddata), so I'm not so motivated to invest time on this issue. Sorry.

syzer commented 6 years ago

:+1:

stweil commented 6 years ago

I get the reported assertion with the second image (all other images work for me) and will have a look.

amitdo commented 6 years ago

@stweil,

Same assert was reported in:

#1154 #1177 #1181 #1222 #1223 #1232 #1237 #1307

Also see PR #1286

Shreeshrii commented 6 years ago

@stweil https://github.com/tesseract-ocr/tesseract/issues/1423

Shreeshrii commented 6 years ago

New report https://github.com/tesseract-ocr/tesseract/issues/1601

Shreeshrii commented 6 years ago

@zdenop Please label as bug.

ghost commented 6 years ago

Have you found any solution for this? My PDF contains both Arabic and English, and I'm facing the same issue.

contains_unichar_id(unichar_id):Error:Assert failed:in file c:\projects\github\tesseract-ocr\src\ccutil\unicharset.h, line 511
Exception in thread "main" java.lang.Error: Invalid memory access
    at com.sun.jna.Native.invokePointer(Native Method)
    at com.sun.jna.Function.invokePointer(Function.java:470)
    at com.sun.jna.Function.invoke(Function.java:404)
    at com.sun.jna.Function.invoke(Function.java:315)
    at com.sun.jna.Library$Handler.invoke(Library.java:212)
    at com.sun.proxy.$Proxy1.TessBaseAPIGetUTF8Text(Unknown Source)
    at net.sourceforge.tess4j.Tesseract.getOCRText(Tesseract.java:433)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:288)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:209)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:193)

syzer commented 6 years ago

Yeah, I made a patch for it that removes this assert. It works okay-ish, but it doesn't really solve the issue.

ghost commented 6 years ago

Thanks, syzer. Where can I get the patch? Please share it. Could you also guide me on preparing trained data? Regards

Shreeshrii commented 6 years ago

Please see https://github.com/tesseract-ocr/tesseract/pull/1286 for the patch.

It has not been merged yet.

If you try it please provide feedback.

ghost commented 6 years ago

Please publish a standard jar file so that we can try it out. And could you please guide me on creating a traineddata file?

thanks

danablanc commented 6 years ago

Hi. I have the same issue, using Tesseract Open Source OCR Engine vv4.0.0-beta.1.20180608 with Leptonica for Windows. How can I get this patch?

Shugyousha commented 6 years ago

I can reproduce this and since I haven't seen a stack trace for this yet I will post the one I have:

contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 511

Thread 1 "tesseract" received signal SIGSEGV, Segmentation fault.
ERRCODE::error (this=this@entry=0x7ffff7d774c8 <_ZL13ASSERT_FAILED>, caller=caller@entry=0x7ffff7874630 "contains_unichar_id(unichar_id)", acti
    format=format@entry=0x7ffff7871e41 "in file %s, line %d") at errcode.cpp:86
86            if (!*p)
(gdb) bt
#0  ERRCODE::error (this=this@entry=0x7ffff7d774c8 <_ZL13ASSERT_FAILED>, caller=caller@entry=0x7ffff7874630 "contains_unichar_id(unichar_id)", action=action@entry=ABORT,
    format=format@entry=0x7ffff7871e41 "in file %s, line %d") at errcode.cpp:86
#1  0x00007ffff77c5ef4 in UNICHARSET::get_isdigit (unichar_id=297, this=0x5555559ac990) at ../../src/ccutil/unicharset.h:511
#2  tesseract::Dict::char_for_dawg (dawg=0x555556c3f2d0, ch=297, this=0x555555dfb120) at dict.h:435
#3  tesseract::Dict::def_letter_is_okay(void*, int, bool) const () at dict.cpp:413
#4  0x00007ffff77c624e in tesseract::Dict::valid_word(WERD_CHOICE const&, bool) const () at ../../src/ccstruct/ratngs.h:314
#5  0x00007ffff76c437b in tesseract::Tesseract::recog_word(WERD_RES*) () at tfacepp.cpp:69
#6  0x00007ffff76c1ed3 in tesseract::Tesseract::tess_segment_pass_n (this=this@entry=0x7ffff7fd2010, pass_n=pass_n@entry=1, word=word@entry=0x55555ad33a20) at tessbox.cpp:48
#7  0x00007ffff7674b8e in tesseract::Tesseract::match_word_pass_n(int, WERD_RES*, ROW*, BLOCK*) () at control.cpp:1644
#8  0x00007ffff7674d89 in tesseract::Tesseract::classify_word_pass1 (this=0x7ffff7fd2010, word_data=..., in_word=0x55555acd0780, out_words=<optimized out>)
    at control.cpp:1450
#9  0x00007ffff7676114 in tesseract::Tesseract::RetryWithLanguage(tesseract::WordData const&, void (tesseract::Tesseract::*)(tesseract::WordData const&, WERD_RES**, tesseract::PointerVector<WERD_RES>*), bool, WERD_RES**, tesseract::PointerVector<WERD_RES>*) () at control.cpp:923
#10 0x00007ffff7676944 in tesseract::Tesseract::classify_word_and_language(int, PAGE_RES_IT*, tesseract::WordData*) () at ../../src/ccutil/genericvector.h:716
#11 0x00007ffff767a189 in tesseract::Tesseract::RecogAllWordsPassN(int, ETEXT_DESC*, PAGE_RES_IT*, GenericVector<tesseract::WordData>*) () at control.cpp:276
#12 0x00007ffff767ba43 in tesseract::Tesseract::recog_all_words(PAGE_RES*, ETEXT_DESC*, TBOX const*, char const*, int) () at control.cpp:369
#13 0x00007ffff7663c6e in tesseract::TessBaseAPI::Recognize(ETEXT_DESC*) () at baseapi.cpp:907
#14 0x00007ffff7664002 in tesseract::TessBaseAPI::ProcessPage (this=this@entry=0x5555557592c0 <main::api>, pix=0x55555598a720, page_index=page_index@entry=0,
    filename=filename@entry=0x7fffffffe5fa "0003.jpg", retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0, renderer=0x555555983800)
    at baseapi.cpp:1217
#15 0x00007ffff7666fe9 in tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) () at baseapi.cpp:1169
#16 0x00007ffff766711e in tesseract::TessBaseAPI::ProcessPages (this=this@entry=0x5555557592c0 <main::api>, filename=filename@entry=0x7fffffffe5fa "0003.jpg",
    retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0, renderer=<optimized out>) at baseapi.cpp:1070
#17 0x0000555555556c73 in main () at ../../src/ccutil/genericvector.h:716
#18 0x00007ffff67ff06b in __libc_start_main () from /usr/lib/libc.so.6
#19 0x000055555555729a in _start () at tesseractmain.cpp:602

Looks like the unicode point being provided to get_isdigit is not a valid digit and hits the assertion. Not sure how and why we end up there though.

Shreeshrii commented 6 years ago

Please check the version of traineddata file that you are using.

Also try with traineddata from tessdata_fast and tessdata_best. Do you get the same error?
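One way to run that comparison is to point the command line at each model directory in turn with `--tessdata-dir`. The sketch below only builds the commands; the function name and the directory paths are placeholders, not paths from this thread.

```python
from typing import Dict, Iterable, List


def commands_per_tessdata(
    image_path: str,
    tessdata_dirs: Iterable[str],
    lang: str = "eng",
) -> Dict[str, List[str]]:
    """Return one tesseract command per traineddata directory.

    Each command writes the recognized text to stdout, so the outputs
    (or crashes) can be compared directory by directory.
    """
    return {
        directory: [
            "tesseract", image_path, "stdout",
            "--tessdata-dir", directory,
            "-l", lang,
        ]
        for directory in tessdata_dirs
    }


if __name__ == "__main__":
    # Hypothetical checkout locations of the three model repositories.
    dirs = ["/opt/tessdata", "/opt/tessdata_fast", "/opt/tessdata_best"]
    for directory, cmd in commands_per_tessdata("0003.jpg", dirs).items():
        print(directory, "->", " ".join(cmd))
```

Each command can then be passed to `subprocess.run`; a non-zero return code for one directory but not the others points at the traineddata rather than the image.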

Shugyousha commented 6 years ago

On Thu, Aug 09, 2018 at 07:34:40AM -0700, Shreeshrii wrote:

Please check the version of traineddata file that you are using.

I used an about 2 week old version of the models in the tesseract-data github repo.

Also try with traineddata from tessdata_fast and tessdata_best. Do you get the same error?

Sadly I don't have access to the installation at the moment because I am off work and will be going on holiday tomorrow. I will make a note in my calendar to check this after I am back.

Cheers,

Silvan

stweil commented 6 years ago

The issue only occurs with models from tessdata (starting with commit d87b3c) and OCR engine mode 2.

Shreeshrii commented 6 years ago

The issue only occurs with models from tessdata (starting with commit d87b3c) and OCR engine mode 2.

That commit is 'Updated LSTM Models to integerized tessdata_best'.

The earlier commit by Ray, on Nov 29, 2016, was 'Added LSTM models+lang models to 101 langs'.

However, after that the format of traineddata files has changed to include the recoder. If I remember correctly, those LSTM models do not work/produce accurate recognition results with current code.

2017-07-14 (dc8745e) Ray Smith: Move LSTM unicharset and recoder to traineddata with version string part1. Backwards compatible - maybe.

Shreeshrii commented 6 years ago

@stweil This is in continuation to the comment above.

Traineddata files now have two separate unicharsets, one for legacy and the other for lstm.

It is possible that both these unicharsets were the same in the models from Nov 29, 2016. In that case the error will not manifest.

Even now, it is possible that certain language traineddata files have the same unicharset for both legacy and LSTM; those languages will not show the error either.

I expect that the error comes in languages which use the recoder/unicharcompressor and where the two unicharsets are different.

This is my guess, I haven't verified it in the files.
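If that guess is right, the failure would look roughly like the toy model below. This is not Tesseract's code: `ToyUnicharset` is a made-up class, the set sizes are invented, and only the ID 297 is borrowed from the backtrace above. It shows how an ID that is valid in a larger character set fails a containment check when looked up in a smaller one.

```python
class ToyUnicharset:
    """Toy stand-in for a character set indexed by integer IDs."""

    def __init__(self, unichars):
        self._unichars = list(unichars)

    def contains_unichar_id(self, unichar_id: int) -> bool:
        # An ID is valid only if it indexes into this particular set.
        return 0 <= unichar_id < len(self._unichars)

    def get_isdigit(self, unichar_id: int) -> bool:
        if not self.contains_unichar_id(unichar_id):
            # Mirrors the shape of the reported assertion message.
            raise ValueError("contains_unichar_id(unichar_id): assert failed")
        return self._unichars[unichar_id].isdigit()


# Invented sizes: the "LSTM" set is larger than the "legacy" set.
lstm_set = ToyUnicharset(chr(0x4E00 + i) for i in range(300))
legacy_set = ToyUnicharset(chr(0x30 + i) for i in range(100))

unichar_id = 297                      # produced against the larger set
print(lstm_set.contains_unichar_id(unichar_id))    # valid here
print(legacy_set.contains_unichar_id(unichar_id))  # out of range here
```

Under this reading, removing the assertion would only silence the check; the ID would still be interpreted against the wrong set.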

Shugyousha commented 6 years ago

On Thu, Aug 9, 2018 at 4:34 PM Shreeshrii notifications@github.com wrote:

Also try with traineddata from tessdata_fast and tessdata_best. Do you get the same error?

I tried both and can confirm that there is no error if I use the models from tessdata_fast or tessdata_best (as others have observed as well).

stweil commented 6 years ago

I consider this to be one of the most important bugs which I'd like to get fixed for 4.0.0, even if it only occurs with models from https://github.com/tesseract-ocr/traineddata when both the old and the new OCR engines are used (which is still the default). Several possible solutions exist:

  1. Fix it. That's my favourite solution, but I still could not solve it. It would help to have a very short and simple text which triggers the problem (or if someone else finds the correct fix). Removing the assertion is not the correct fix!
  2. Avoid it. That would require changing the default: --oem 3 would no longer be "based on what is available", but "best which is available". Drawback: People would still get the error when running with --oem 2.

amitdo commented 6 years ago

"best which is available"

Should be: best if available, else legacy if available, else exit with an error "not a valid traineddata"
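The proposed fallback order can be written down as a tiny decision function. This is a sketch of the proposal, not of Tesseract's actual initialization code; `choose_engine` and its boolean parameters are hypothetical, while the numeric values follow the `--oem` convention (0 = legacy only, 1 = LSTM only).

```python
OEM_LEGACY_ONLY = 0
OEM_LSTM_ONLY = 1


def choose_engine(has_lstm: bool, has_legacy: bool) -> int:
    """Pick an engine mode: LSTM if present, else legacy, else fail."""
    if has_lstm:
        return OEM_LSTM_ONLY
    if has_legacy:
        return OEM_LEGACY_ONLY
    raise ValueError("not a valid traineddata")
```

Note that the combined mode (`--oem 2`) is never selected automatically in this scheme, which is what would sidestep the assertion for default runs.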

Shreeshrii commented 6 years ago

It will be helpful if @jbreiden can check whether this error also happens with Google's version of tesseract.

stweil commented 5 years ago

See discussion #1849 with some ideas for workaround solutions.

amitdo commented 5 years ago

@stweil, since we want to release 4.0.0 in the next 2-3 weeks and we still don't have a fix for this issue, I think we need to move to plan B (make a workaround).

stweil commented 5 years ago

We don't. I found a fix today. See pull request #1954.

amitdo commented 5 years ago

Thanks!

I assume it also solves the other similar reports, right? https://github.com/tesseract-ocr/tesseract/issues/1205#issuecomment-364169774

stweil commented 5 years ago

Yes, I assume so. @sallyhill, @psinger please test the new code.

ingwinlu commented 5 years ago

Unfortunately, this issue still persists with releases containing the above bugfix (4.0.0 on Arch Linux).

➜  ~/projects/tesseract git:(master) tesseract --version
tesseract 4.0.0
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.2) : libpng 1.6.37 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.2
 Found AVX2
 Found AVX
 Found SSE
(gdb) bt
#0  0x00007effa32860fb in ERRCODE::error(char const*, TessErrorLogCode, char const*, ...) const ()
   from /usr/lib/libtesseract.so.4
#1  0x00007effa31f2a84 in tesseract::Dict::case_ok(WERD_CHOICE const&, UNICHARSET const&) const ()
   from /usr/lib/libtesseract.so.4
#2  0x00007effa31fec28 in tesseract::Dict::AcceptableResult(WERD_RES*) const () from /usr/lib/libtesseract.so.4
#3  0x00007effa30cc734 in tesseract::Tesseract::match_word_pass_n(int, WERD_RES*, ROW*, BLOCK*) ()
   from /usr/lib/libtesseract.so.4
#4  0x00007effa30cc7fa in tesseract::Tesseract::classify_word_pass1(tesseract::WordData const&, WERD_RES**, tesseract::PointerVector<WERD_RES>*) () from /usr/lib/libtesseract.so.4
#5  0x00007effa30ce0c7 in tesseract::Tesseract::RetryWithLanguage(tesseract::WordData const&, void (tesseract::Tesseract::*)(tesseract::WordData const&, WERD_RES**, tesseract::PointerVector<WERD_RES>*), bool, WERD_RES**, tesseract::PointerVector<WERD_RES>*) () from /usr/lib/libtesseract.so.4
#6  0x00007effa30ce7f1 in tesseract::Tesseract::classify_word_and_language(int, PAGE_RES_IT*, tesseract::WordData*)
    () from /usr/lib/libtesseract.so.4
#7  0x00007effa30d1240 in tesseract::Tesseract::RecogAllWordsPassN(int, ETEXT_DESC*, PAGE_RES_IT*, GenericVector<tesseract::WordData>*) () from /usr/lib/libtesseract.so.4
#8  0x00007effa30d2f84 in tesseract::Tesseract::recog_all_words(PAGE_RES*, ETEXT_DESC*, TBOX const*, char const*, int) () from /usr/lib/libtesseract.so.4
#9  0x00007effa30bc6b3 in tesseract::TessBaseAPI::Recognize(ETEXT_DESC*) () from /usr/lib/libtesseract.so.4
#10 0x00007effa30bca2b in tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) () from /usr/lib/libtesseract.so.4
#11 0x00007effa30bd6f5 in tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) () from /usr/lib/libtesseract.so.4
#12 0x00007effa30bd8af in tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) () from /usr/lib/libtesseract.so.4
#13 0x000055bb5496cc96 in main ()

The bad news is that I cannot share the file causing it.

amitdo commented 5 years ago

Try using --oem 1 as a workaround.

stweil commented 5 years ago

@ingwinlu, it would help to have a reproducible test case. Perhaps you can find a shareable image, or you can send me your image via e-mail.

buerge3 commented 4 years ago

I get the error: "Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica contains_unichar_id(unichar_id):Error:Assert failed:in file ../ccutil/unicharset.h, line 513" when running it on the following test image: (image: filter)

The problem persists even when running with --oem 1.

zdenop commented 4 years ago

Your Tesseract version is very old. Use the latest code when dealing with an issue.

buerge3 commented 4 years ago

I have the latest version. (image)

zdenop commented 4 years ago

you wrote:

I get the error: "Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica contains_unichar_id(unichar_id):Error:Assert failed:in file ../ccutil/unicharset.h, line 513"

buerge3 commented 4 years ago

yes, that is the error i am getting. I could not find any instructions for installing Tesseract on RedHat, so I used the instructions given by this guy's blog: https://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg15794.html

zdenop commented 4 years ago

If you get that error, you are not using the latest code/version. And it is not a Tesseract issue.

buerge3 commented 4 years ago

I uninstalled Tesseract and reinstalled it using the instructions given here: https://github.com/tesseract-ocr/tesseract/wiki

The problem still persists. I notice that tesseract-lang is only version 4.00, which does not match version 4.1.0 of Tesseract itself. Could this be what is causing the issue, and if so, how do I get the most recent version of tesseract-lang?

Hemant2022 commented 3 years ago

I am getting the same error even when I use no config. Is this issue still closed?

Shreeshrii commented 3 years ago

Please post tesseract version, which traineddata you used and the image giving error.
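To make such a report easier to put together, the version details can be pulled out of the `tesseract --version` text. The parser below is a minimal sketch based on the output format quoted earlier in this thread; `parse_version_output` is a hypothetical helper, and other builds may format their banner differently.

```python
import re
from typing import Dict


def parse_version_output(text: str) -> Dict[str, str]:
    """Extract tesseract and leptonica versions from `tesseract --version`."""
    info: Dict[str, str] = {}
    tess = re.search(r"^tesseract\s+(\S+)", text, re.MULTILINE)
    if tess:
        info["tesseract"] = tess.group(1)
    lept = re.search(r"leptonica-(\S+)", text)
    if lept:
        info["leptonica"] = lept.group(1)
    return info


if __name__ == "__main__":
    sample = "tesseract 4.0.0\n leptonica-1.78.0\n"
    print(parse_version_output(sample))
```

The input text would typically come from `subprocess.run(["tesseract", "--version"], capture_output=True, text=True)`. Some builds may print the banner to stderr, so concatenating stdout and stderr before parsing is a reasonable precaution. The traineddata source and the failing image still have to be added to the report by hand.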