Open mhmadaladin opened 1 week ago
Do you actually have a proper set of tesseract data installed for your language?
I'm unfamiliar with tesseract and its error messages, but what's in the logs vaguely smells like PEBCAK, at least in part ;).
Yes, I installed the Arabic Tesseract trained data with all complementary files from https://github.com/tesseract-ocr/tessdata/tree/3.04.00 as mentioned in the user guide, and it worked normally on another device, as mentioned before.
The log may seem odd because I tried many books to make sure the problem is not specific to one file.
I think this part of the log is the main problem, if anyone can help:
Error: LSTM requested, but not present!! Loading tesseract.
mgr->GetComponent(TESSDATA_INTTEMP, &fp):Error:Assert failed:in file build/arm-kindlepw2-linux-gnueabi/thirdparty/tesseract/source/src/classify/adaptmatch.cpp, line 539
lipc-wait-event exited normally with status: 0
Aborted
That's too old for the updated tesseract we use starting with 07/2024. The info message in KOReader is up-to-date:
No OCR results or no language data.
KOReader has a built-in OCR engine for recognizing words in scanned PDF and DjVu documents. In order to use OCR in scanned pages, you need to install tesseract trained data for your document language.
You can download language data files for Tesseract version 5.3.4 from https://tesseract-ocr.github.io/tessdoc/Data-Files
Copy the language data files (e.g., eng.traineddata for English and spa.traineddata for Spanish) into koreader/data/tessdata
@offset-torque: that part of the user guide needs to be updated.
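The copy step described in the help text above can be sketched as follows. This is only an illustration, not part of KOReader: the function name `install_traineddata` is hypothetical, and the caller is assumed to know where the `koreader` directory lives on the device or its mounted storage.

```python
import shutil
from pathlib import Path


def install_traineddata(src_files, koreader_dir):
    """Copy *.traineddata files into <koreader_dir>/data/tessdata.

    Hypothetical helper: koreader_dir is wherever KOReader is installed
    (for example the mounted device storage); it is not detected here.
    """
    tessdata_dir = Path(koreader_dir) / "data" / "tessdata"
    tessdata_dir.mkdir(parents=True, exist_ok=True)
    installed = []
    for src in map(Path, src_files):
        # Refuse anything that is not a Tesseract language data file.
        if src.suffix != ".traineddata":
            raise ValueError(f"not a traineddata file: {src.name}")
        shutil.copy2(src, tessdata_dir / src.name)
        installed.append(src.name)
    return sorted(installed)
```

Calling it with the downloaded `ara.traineddata` and the path to the `koreader` folder would place the file where the help text says it belongs.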
I missed that too! I should update my ocr files too :D
If the person who is responsible for the OCR integration in KOReader corrects the current guide section with up-to-date information, I can include it in the upcoming update. If this is not you, please ping the relevant dev.
It's basically what was quoted above (i.e., the current in-app help text). I have no idea who wrote what's currently in the guide for that section (and it doesn't matter all that much, nothing much has changed in practice, we just need to stop pointing to outdated links, basically).
I think it was me 👀 On mobile now, can’t edit stuff much, sorry
Outdated info I linked is in our wiki page (under the title of "Dictionary support") so that's out of my jurisdiction.
We have two options:
Considering that more users will follow the user guide than the wiki page, I suggest the second option. In short, until one of our devs spends a little effort to revise this tiny section, the user guide will stay like this.
Thank you so much, now there is no crash, but unfortunately most words are not recognized; a message appears saying there is no OCR tesseract data. I think this has to do with the quality of the Arabic trained data, but any help is appreciated.
Have you tried the 3 variants (`tessdata`, `tessdata-best`, `tessdata-fast`)?
No, only one. I'll try the other two when I return from work. Thank you for your support.
I must say I have always had a bad experience highlighting and looking up words in scanned PDFs (including my own, made of book page photos taken at public libraries, just concatenating the JPGs, or after conversion to B&W TIF, into a PDF). I thought I would update my tessdata, using https://github.com/tesseract-ocr/tessdata_fast/blob/main/eng.traineddata, and see, but still no.
I.e., in our sample.pdf, setting `Forced OCR: on` and long-pressing on that "population" word, I get a highlight where the black bar is and a "No OCR results or no language data" message.
I don't know if OCR is already at play there, or if it is some other part of the code that should pick up segments of text in the bitmap before giving it to OCR.
(I remember that 6 years ago, I opened #3688, mentioning it may have something to do with the book dpi.)
Thanks for the dpi tip, I didn't imagine English OCR also having problems. I hope there will be improvements to the OCR settings in the next update.
Yes, the wrong OCR coordinates are still a mystery to me. I couldn't figure it out.
- Someone updates the wiki page so this link directs the user to the correct information
Done.
I will expand the user guide section according to the updated wiki.
I've just now quickly reworded the following section that mentioned deprecated `defaults.lua` stuff, that might need to be reworded to align with however the defaults stuff is explained elsewhere in the guide ;).
(Unfortunately, you can't add new entries to arrays in the Advanced settings UI, so we can't quite entirely get rid of the manual edit nonsense).
12481 I can't get Forced OCR to work even with the correct data copied to that folder.
I don't know about the Android version, but it worked on my Kindle, although the results were not very good. Until the Force OCR problem is resolved, you can OCR your PDF and read the OCRed version directly in KOReader. After a lot of trials, PDF24 gives fairly good results and is free: https://tools.pdf24.org/en/ocr-pdf
Thank you for your reply. I tried to OCR my file and made a PDF with a text layer with ABBYY. Strangely enough, even though the result file is smaller and I can select words on my computer, Koreader just won't let me highlight or look up words in the dictionary. Long pressing had no effect whatsoever. I wonder if there is a format requirement for pdfs with text layers to work in Koreader. I really want to highlight notes in this book.
That's strange, it works for me. Maybe you kept the Force OCR option on; you must switch it off for the original text layer to be used. Or maybe, like you said, it differs according to the PDF version.
Force OCR means "ignore the embedded text layer because it's beyond atrocious". You presumably rarely want to turn that on at all, and certainly not by default.
Yes, that's correct if the file has a text layer originally. Most English books I read have an EPUB version or a PDF with a text layer, so there is rarely any need for this option. In other languages (like Arabic for me), most books I need are scanned PDFs with no text layer, so I either OCR the file first using another app or website, which is time consuming, or read it directly with the Force OCR option on, which would be a much better option if it worked properly.
If there's no text layer OCR is always performed. Force OCR only refers to ignoring the text layer (i.e., forcing OCR even though it's not necessary).
Is there a test document perhaps? But it's always possible OCR just doesn't do a great job.
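The rule described above can be written as a one-line decision. This is a minimal sketch of the behavior as explained in this thread, not KOReader's actual code; `should_run_ocr` and its parameters are hypothetical names.

```python
def should_run_ocr(has_text_layer: bool, force_ocr: bool) -> bool:
    """Return True when OCR would be used for a page.

    Sketch of the rule stated in the thread: without a text layer OCR is
    always performed; Force OCR only matters when a text layer exists.
    """
    return force_ocr or not has_text_layer


# A scanned page with no text layer is OCRed regardless of the setting.
print(should_run_ocr(has_text_layer=False, force_ocr=False))
# A page with a usable text layer is only OCRed when Force OCR is on.
print(should_run_ocr(has_text_layer=True, force_ocr=False))
```

In other words, for scanned Arabic PDFs without a text layer, turning Force OCR on should change nothing.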
Thank you, that's a great discovery 😅, but unfortunately most words are not recognized, which is why I thought no OCR was performed automatically. Mostly a message appears saying no OCR data is available. Here's an example of a scanned Arabic PDF: الحياة_الخالدة_لهنرييتالاكس،_ريبيكا_سكلوت.pdf
The recent Tesseract upgrade has changed its behavior a bit in a way we don't (hopefully not can't) deal with yet. The older version used to nearly always return some nonsense while the newer version is better about saying it doesn't know. The interface still goes by the logic that no results means you need to set it up.
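To make the distinction concrete, here is a hedged sketch of how the two cases could be told apart. It is not how KOReader currently behaves, and `explain_empty_ocr` is a hypothetical helper: it only checks whether the language file exists, so an empty result with the data in place is reported as the engine declining to answer rather than as a setup problem.

```python
from pathlib import Path


def explain_empty_ocr(tessdata_dir, lang):
    """Classify an empty OCR result (hypothetical helper, not KOReader code).

    Returns "no language data" when <lang>.traineddata is missing from
    tessdata_dir, and "no OCR result" when the data is installed but the
    engine returned nothing for the selected area.
    """
    data_file = Path(tessdata_dir) / f"{lang}.traineddata"
    if not data_file.is_file():
        return "no language data"
    return "no OCR result"
```

With such a split, the "you need to install trained data" hint would only appear in the first case.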
I can confirm that most words don't seem to be recognized, even with the best model. The issue with Arabic support seems to be known (Cf. https://github.com/tesseract-ocr/tesseract/issues/2047).
There are other models available here, plus this one mentioned in the issue above. I don't get better results on that test document, but maybe you'll have more luck on other documents? When testing them, make sure to rename the file to `ara.traineddata`.
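Swapping in an alternative model under the expected name can be sketched like this. The helper `install_model_as` is hypothetical, and keeping a `.bak` copy of the previous file is an assumption added for convenience, not something the thread prescribes.

```python
import shutil
from pathlib import Path


def install_model_as(src, tessdata_dir, lang="ara"):
    """Copy an alternative model into tessdata_dir as <lang>.traineddata.

    Hypothetical helper. If a model with that name is already installed,
    it is first copied to <lang>.traineddata.bak so it can be restored.
    """
    target = Path(tessdata_dir) / f"{lang}.traineddata"
    if target.exists():
        shutil.copy2(target, target.with_name(target.name + ".bak"))
    shutil.copy2(src, target)
    return target
```

To go back to the previous model, copy the `.bak` file over `ara.traineddata` again.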
Yes the last update is slightly better, I hope it could be fixed soon.
Thank you 🙏. Yes, the Arabic tesseract support is so messed up. Sorry, but how do I download the trained data files in the first link? I can't see any option to download.
Click on a file, and then on the "Raw" or download (📥) button.
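The "Raw" button corresponds to a URL that can also be derived from the file's page address. A small sketch, assuming the usual GitHub layout `/owner/repo/blob/ref/path`; `github_raw_url` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit


def github_raw_url(blob_url):
    """Turn a github.com file page URL into its "Raw" download URL.

    Assumes the path has the form /owner/repo/blob/ref/path/to/file and
    swaps the "blob" segment for "raw".
    """
    parts = urlsplit(blob_url)
    segments = parts.path.split("/")
    if parts.netloc != "github.com" or len(segments) < 6 or segments[3] != "blob":
        raise ValueError(f"not a GitHub file page URL: {blob_url}")
    segments[3] = "raw"
    return urlunsplit(parts._replace(path="/".join(segments)))


print(github_raw_url(
    "https://github.com/tesseract-ocr/tessdata_fast/blob/main/eng.traineddata"
))
```

The resulting address can be opened in a browser or passed to any download tool.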
Yes, I found them 😅. I didn't see them at first among the other files. Thank you again for your support.
Can you try after changing `ocr_type` from 3 to -1 in `frontend/document/koptinterface.lua`?
--- i/frontend/document/koptinterface.lua
+++ w/frontend/document/koptinterface.lua
@@ -24,7 +24,7 @@ local KoptInterface = {
     -- in `$TESSDATA_PREFIX/` on more recent versions).
     tessocr_data = not os.getenv('TESSDATA_PREFIX') and DataStorage:getDataDir().."/data/tessdata" or nil,
     ocr_lang = "eng",
-    ocr_type = 3, -- default 0, for more accuracy use 3
+    ocr_type = -1, -- default 0, for more accuracy use 3
     last_context_size = nil,
     default_context_size = 1024*1024,
 }
Issue
When I try to highlight any word while the Force OCR option is "on" and the Arabic language is selected, KOReader crashes immediately.
Note 1: I'm using the 3.04 tesseract Arabic files and changed the language in the default.custom.lua file, as described in the manual, to replace Chinese with Arabic.
Note 2: A friend of mine is using the exact same tesseract files and has no problems, or only rare crashes. His device is a jailbroken Kindle Oasis 9th generation.
Steps to reproduce
1- Open KOReader through the KUAL launcher
2- Choose any PDF book (mainly happens with Arabic PDFs)
3- Choose the Force OCR option & Arabic language
4- Try to highlight any word or paragraph
crash.log (if applicable)
`crash.log` is a file that is automatically created when KOReader crashes. It can normally be found in the KOReader directory:
- `/mnt/private/koreader` for Cervantes
- `koreader/` directory for Kindle
- `.adds/koreader/` directory for Kobo
- `applications/koreader/` directory for Pocketbook

Android logs are kept in memory. Please go to [Menu] → Help → Bug Report to save these logs to a file.
Please try to include the relevant sections in your issue description. You can upload the whole `crash.log` file (zipped if necessary) on GitHub by dragging and dropping it onto this textbox.
If your issue doesn't directly concern a Lua crash, we'll quite likely need you to reproduce the issue with verbose debug logging enabled before providing the logs to us. To do so, go to `Top menu → Hamburger menu → Help → Report a bug` and tap `Enable verbose logging`. Restart as requested, then repeat the steps for your issue.
If you instead opt to inline it, please do so behind a spoiler tag:
crash.log
```crash.log