koreader / koreader

An ebook reader application supporting PDF, DjVu, EPUB, FB2 and many more formats, running on Cervantes, Kindle, Kobo, PocketBook and Android devices
http://koreader.rocks/
GNU Affero General Public License v3.0

App crashes when highlighting a word in an Arabic PDF #12478

Open mhmadaladin opened 1 week ago

mhmadaladin commented 1 week ago

Issue

When I try to highlight any word while the Force OCR option is "on" and Arabic is the chosen language, KOReader crashes immediately.

Note 1: I'm using the 3.04 Tesseract Arabic files, and I changed the language in the default.custom.lua file, as described in the manual, to replace Chinese with Arabic.

Note 2: A friend of mine uses the exact same Tesseract files and has no problems, or only rare crashes. His device is a jailbroken Kindle Oasis (9th generation).

Steps to reproduce

  1. Open KOReader through the KUAL launcher
  2. Choose any PDF book (it mainly happens with Arabic PDFs)
  3. Turn on the Force OCR option and choose Arabic as the language
  4. Try to highlight any word or paragraph

crash.log (if applicable)

crash.log

NiLuJe commented 1 week ago

Do you actually have a proper set of tesseract data installed for your language?

I'm unfamiliar with tesseract and its error messages, but what's in the logs vaguely smells like PEBCAK, at least in part ;).

mhmadaladin commented 1 week ago

Yes, I installed the Arabic Tesseract trained data with all complementary files from https://github.com/tesseract-ocr/tessdata/tree/3.04.00 as mentioned in the user guide, and it worked normally on another device, as mentioned before.

The log may look odd because I tried many books to make sure the problem is not specific to one file.

I think this part of the log is the main problem, if anyone can help:

    Error: LSTM requested, but not present!! Loading tesseract.
    mgr->GetComponent(TESSDATA_INTTEMP, &fp):Error:Assert failed:in file build/arm-kindlepw2-linux-gnueabi/thirdparty/tesseract/source/src/classify/adaptmatch.cpp, line 539
    lipc-wait-event exited normally with status: 0
    Aborted

benoit-pierre commented 1 week ago

The 3.04.00 data is too old for the updated Tesseract we have been using since 07/2024. The info message in KOReader is up to date:

No OCR results or no language data.

KOReader has a build-in OCR engine for recognizing words in scanned PDF and DjVu documents. In order to use OCR in scanned pages, you need to install tesseract trained data for your document language.

You can download language data files for Tesseract version 5.3.4 from https://tesseract-ocr.github.io/tessdoc/Data-Files

Copy the language data files (e.g., eng.traineddata for English and spa.traineddata for Spanish) into koreader/data/tessdata

@offset-torque: that part of the user guide needs to be updated.
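For anyone who wants to check their setup from a computer before copying files to the device, here is a minimal sketch (not part of KOReader) that reports which language codes have no matching `<code>.traineddata` file in a tessdata folder. The function name and the example path are hypothetical, and it only checks that the file exists, not that the data matches the Tesseract version in use.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, lang_codes):
    """Return the language codes that have no <code>.traineddata file in tessdata_dir."""
    folder = Path(tessdata_dir)
    return [
        code
        for code in lang_codes
        if not (folder / f"{code}.traineddata").is_file()
    ]
```

For example, `missing_traineddata("koreader/data/tessdata", ["eng", "ara"])` returns the codes that still need a data file, where the path stands in for wherever the device is mounted.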

mergen3107 commented 1 week ago

I missed that too! I should update my ocr files too :D

offset-torque commented 1 week ago

If the person who is responsible for the OCR integration in KOReader corrects the current guide section with up-to-date information, I can include it in the upcoming update. If this is not you, please ping the relevant dev.

NiLuJe commented 1 week ago

It's basically what was quoted above (i.e., the current in-app help text). I have no idea who wrote what's currently in the guide for that section (and it doesn't matter all that much, nothing much has changed in practice, we just need to stop pointing to outdated links, basically).

mergen3107 commented 1 week ago

I think it was me 👀 On mobile now, can’t edit stuff much, sorry

offset-torque commented 1 week ago

The outdated info I linked is on our wiki page (under the title "Dictionary support"), so that's out of my jurisdiction.

We have two options:

  1. Someone updates the wiki page so this link directs the user to the correct information
  2. Someone updates the OCR section in the user guide properly (which has not been touched for at least 3 years)

Considering that more users will follow the user guide than the wiki page, I suggest the second option. In short, until one of our devs spends a little effort revising this tiny section, the user guide will stay like this.

mhmadaladin commented 1 week ago

Thank you so much, there is no crash now. Unfortunately, most words are not recognized, and a message appears saying there is no OCR Tesseract data. I think this has to do with the quality of the Arabic trained data, but any help is appreciated.

benoit-pierre commented 1 week ago

Have you tried the 3 variants (tessdata, tessdata-best, tessdata-fast)?

mhmadaladin commented 1 week ago

No, only one. I'll try the other two when I return from work. Thank you for your support.

poire-z commented 1 week ago

I must say I always had a bad experience highlight/lookup'ing in scanned PDF (including in mine made of book page photos made at public libraries, just concatenating the JPG - or after conversion to B&W TIF, in a PDF). Thought I would update my tessdata, using https://github.com/tesseract-ocr/tessdata_fast/blob/main/eng.traineddata and see, but still no.

I.e., in our sample.pdf, with Forced OCR set to on, long-pressing on that "population" word, I get a highlight where the black bar is and a "No OCR results or no language data" message. [image]

I don't know if OCR is already at play there, or if it is some other part of the code that should pick up segments of text in the bitmap before giving it to OCR.

(I remember that 6 years ago, I opened #3688, mentioning it may have something to do with the book dpi.)

mhmadaladin commented 1 week ago

Thanks for the dpi tip. I didn't imagine English OCR would also have problems. I hope the OCR settings will be improved in the next update.

mergen3107 commented 1 week ago

Yes, the wrong OCR coordinates are still a mystery to me. I couldn't figure it out.

NiLuJe commented 1 week ago

> 1. Someone updates the wiki page so this link directs the user to the correct information

Done.

offset-torque commented 1 week ago

I will expand the user guide section according to the updated wiki.

NiLuJe commented 1 week ago

I've just now quickly reworded the following section that mentioned deprecated defaults.lua stuff, that might need to be reworded to align with however the defaults stuff is explained elsewhere in the guide ;).

(Unfortunately, you can't add new entries to arrays in the Advanced settings UI, so we can't quite entirely get rid of the manual edit nonsense).

Steven630 commented 1 week ago

#12481 I can't get Forced OCR to work even with the correct data copied to that folder.

mhmadaladin commented 1 week ago

I don't know about the Android version, but it worked on my Kindle, although the results were not very good. Until the Force OCR problem is resolved, you can OCR your PDF and read the OCRed version directly in KOReader. After a lot of trials, PDF24 gives fairly good results and is free: https://tools.pdf24.org/en/ocr-pdf

Steven630 commented 1 week ago

Thank you for your reply. I tried to OCR my file and made a PDF with a text layer with ABBYY. Strangely enough, even though the result file is smaller and I can select words on my computer, Koreader just won't let me highlight or look up words in the dictionary. Long pressing had no effect whatsoever. I wonder if there is a format requirement for pdfs with text layers to work in Koreader. I really want to highlight notes in this book.

mhmadaladin commented 1 week ago

That's strange, it works for me. Maybe you left the Force OCR option on; you must switch it off for the original OCR to work. Or maybe, like you said, it differs according to the PDF version.

Frenzie commented 1 week ago

Force OCR means "ignore the embedded text layer because it's beyond atrocious". You presumably rarely want to turn that on at all, and certainly not by default.

mhmadaladin commented 1 week ago

Yes, that's correct if the file originally has a text layer. Most English books I read have an EPUB version or a PDF with a text layer, so this option is rarely needed. In other languages (like Arabic, for me), most books I need are scanned PDFs with no text layer, so I either OCR the file first using another app or website, which is time-consuming, or read it directly with the Force OCR option on, which would be a much better option if it worked properly.

Frenzie commented 1 week ago

If there's no text layer OCR is always performed. Force OCR only refers to ignoring the text layer (i.e., forcing OCR even though it's not necessary).

Is there a test document perhaps? But it's always possible OCR just doesn't do a great job.
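The rule described above can be summed up in a few lines. This is only an illustrative sketch of the decision, not KOReader's actual code; the function and parameter names are hypothetical.

```python
def should_run_ocr(has_text_layer, force_ocr):
    """Return True when text must come from OCR rather than the embedded text layer."""
    # No text layer: OCR is the only source of text, whatever the option says.
    # Text layer present: OCR runs only if the user chose to ignore that layer.
    return force_ocr or not has_text_layer
```

So for a scanned PDF without a text layer, the option makes no difference, while for a PDF with a usable text layer, turning Force OCR on swaps that layer for OCR output.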

mhmadaladin commented 1 week ago

Thank you, that's a great discovery 😅. Unfortunately, most words are not recognized, which is why I thought no OCR was performed automatically. Mostly, a message appears saying that no OCR data is available. Here's an example of a scanned Arabic PDF: الحياة_الخالدة_لهنرييتالاكس،_ريبيكا_سكلوت.pdf

Frenzie commented 1 week ago

The recent Tesseract upgrade has changed its behavior a bit in a way we don't (hopefully not can't) deal with yet. The older version used to nearly always return some nonsense while the newer version is better about saying it doesn't know. The interface still goes by the logic that no results means you need to set it up.
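To illustrate the distinction, here is a rough sketch of how the two cases could be told apart: an empty OCR result when the language data is installed versus an empty result because the data file is missing. This is not how KOReader is implemented; the function name, the messages, and the file-existence check are all assumptions made for the example.

```python
from pathlib import Path


def ocr_failure_message(ocr_text, tessdata_dir, lang):
    """Pick a message for an OCR attempt; return None when some text was recognized."""
    if ocr_text and ocr_text.strip():
        return None
    # Empty result: check whether the language data is there before blaming the setup.
    if not (Path(tessdata_dir) / f"{lang}.traineddata").is_file():
        return f"No language data for '{lang}' in {tessdata_dir}."
    return "Language data is installed, but OCR did not recognize any text here."
```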

benoit-pierre commented 1 week ago

I can confirm that most words don't seem to be recognized, even with the best model. The issue with Arabic support seems to be known (Cf. https://github.com/tesseract-ocr/tesseract/issues/2047).

There are other models available here, plus this one mentioned in the issue above. I don't get better results on that test document, but maybe you'll have more luck on other documents? When testing them, make sure to rename the file to ara.traineddata.
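When trying several models in a row, the rename step is easy to get wrong. Here is a small sketch, assuming the model has been downloaded to a computer and the device's tessdata folder is reachable as a normal directory. The function name is hypothetical, and keeping a `.bak` copy of the previous model is just a convenience chosen for this example.

```python
import shutil
from pathlib import Path


def install_model(model_path, tessdata_dir, lang="ara"):
    """Copy a downloaded model into tessdata_dir as <lang>.traineddata and return the new path."""
    target_dir = Path(tessdata_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{lang}.traineddata"
    # Keep any model that is already installed so it can be restored later.
    if target.exists():
        shutil.copyfile(target, target.with_name(target.name + ".bak"))
    shutil.copyfile(model_path, target)
    return target
```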

mhmadaladin commented 1 week ago

> The recent Tesseract upgrade has changed its behavior a bit in a way we don't (hopefully not can't) deal with yet. The older version used to nearly always return some nonsense while the newer version is better about saying it doesn't know. The interface still goes by the logic that no results means you need to set it up.

Yes, the latest update is slightly better. I hope it can be fixed soon.

mhmadaladin commented 1 week ago

Thank you 🙏. Yes, the Arabic Tesseract support is so messed up. Sorry, but how do I download the trained data files from the first link? I can't see any option to download.

benoit-pierre commented 1 week ago

Click on a file, and then on the "Raw" or download (📥) button.

mhmadaladin commented 1 week ago

Yes, I found them 😅. I didn't see them at first among the other files. Thank you again for your support.

benoit-pierre commented 5 days ago

Can you try after changing ocr_type from 3 to -1 in frontend/document/koptinterface.lua?

--- i/frontend/document/koptinterface.lua
+++ w/frontend/document/koptinterface.lua
@@ -24,7 +24,7 @@ local KoptInterface = {
     -- in `$TESSDATA_PREFIX/` on more recent versions).
     tessocr_data = not os.getenv('TESSDATA_PREFIX') and DataStorage:getDataDir().."/data/tessdata" or nil,
     ocr_lang = "eng",
-    ocr_type = 3, -- default 0, for more accuracy use 3
+    ocr_type = -1, -- default 0, for more accuracy use 3
     last_context_size = nil,
     default_context_size = 1024*1024,
 }