amitdo opened this issue 7 years ago
I'm talking about the parameter tessedit_char_blacklist
you can give to Tesseract in the command line either with -c parameter=value
or with a config file that will contain the parameter.
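For concreteness, the two equivalent ways to pass the parameter look like this (a quick sketch; `page.png`, `out`, and the blacklist contents are placeholder examples, and the actual `tesseract` invocations are shown as comments since they require an installed Tesseract and an input image):

```shell
# Pass the parameter directly on the command line, e.g. to blacklist digits:
#   tesseract page.png out -c tessedit_char_blacklist=0123456789

# Or put the parameter in a plain-text config file (one "name value" pair per
# line) and name that file as the last argument:
printf 'tessedit_char_blacklist 0123456789\n' > blacklist_digits.config
#   tesseract page.png out blacklist_digits.config
cat blacklist_digits.config
```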
This is my last answer. This place is not a support forum.
I tried to train Tesseract to recognize Rashi script. Here are the results:
https://gitlab.com/pninim.org/tessdata_heb_rashi
It was my first time training Tesseract, so I might have made mistakes. I have documented the process, so if you see anything that can be improved - please let me know. It looks like the recognition is comparable to that of ABBYY FineReader (at least in my sample test). Any feedback is appreciated!
@AvtechScientific Thank you for taking the effort to train Hebrew Rashi script.
@amitdo and others can check it further.
Please share the test data and results that show how well your new traineddata does in recognizing Rashi script - something along the lines of https://github.com/tesseract-ocr/tessdata_contrib/blob/main/khmLimon.md Thanks!
Just a few naive comments from a cursory look at heb.wordlist:
@EastEriq - thank you for your feedback!
(1) / (2) heb.wordlist
was generated automatically from Sefaria's MongoDB dump, so there might be quite a lot of inconsistencies... The question is: how problematic is it for real-world recognition? That's why I asked people for feedback on all kinds of test documents (clean typed docs, scans of modern docs, scans of old books, etc.) and on how the current model performs compared to FineReader...
(3) Indeed, like gershayim, it should probably be treated as an alphabetic character to avoid word splitting... Do I treat it as punctuation somewhere?
(4) I don't remember seeing books in Rashi script with nikkud, so it is probably not a widespread practical case. (Regarding taamim - see 1./2.)
(5) Do you mean two separate files, heb.wordlist and ara.wordlist?
@AvtechScientific thank you for trying it out. I just downloaded the traineddata file and found that there are still some issues with blurrier text.
For context, I used this file, and for comparison this is what I got:

```
כזזשך כחעו יזוחי יזלהים ישל. ארץ ממולדחי נחשבחי כנודר קונם - בורח מחיוח נזורדי״ל ליו הפ במורדי אור וישג ה ז טליהם את אונם - זרים רדפוני חנם - והייתי כאורח נטה ללו פה אטשטרדם חלפחי זם הספד כע״ט מוצל משרפה - נזעשה ידי חומן שר וגדול׳ בישרחזל רועה צאנם - שר נולדחי בייזי
```
I tried setting the DPI higher, but this is as good as it got.
Looking forward to seeing the project progress.
@benyamindsmith thank you for your feedback. Issues like these are expected. What would be interesting is to figure out how this compares to FineReader or other OCR software... Do you have it, to digitize the same input?
@Shreeshrii - thank you. PR created.
I want to OCR Ladino texts. The material I have is scanned PDFs of Hebrew books (printed in the 18th and 19th centuries). The density and size are mostly incorrect. There are four variant forms of bet, gimel, zayin, and pe, which signify ve, dj/ch, j, and fe. I have prepared some 300 lines of ground-truth data and ran the training. I have the following problems: I used the code points FB31, FB32, FB36, and FB44 for these special characters (Hebrew letters with dagesh), but I don't see these codes in the unicharset that Tesseract prepares. As a result, in the output file the dagesh moves to another letter. I would also like to add the new font to the trained Hebrew model.
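A likely cause (my guess, not confirmed in this thread): these precomposed letter-plus-dagesh presentation forms are Unicode composition exclusions, so any NFC or NFD normalization step in the training pipeline decomposes them into base letter + combining dagesh (U+05BC), which would explain why the precomposed code points never appear in the unicharset. A quick check in Python:

```python
import unicodedata

# The four precomposed "letter with dagesh" presentation forms from the comment.
for ch in ("\uFB31", "\uFB32", "\uFB36", "\uFB44"):
    nfc = unicodedata.normalize("NFC", ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    # Even NFC decomposes these (composition exclusion), e.g. into
    # the base letter followed by U+05BC HEBREW POINT DAGESH OR MAPIQ.
    print("  NFC ->", " ".join(f"U+{ord(c):04X}" for c in nfc))
```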
@amitdo What is the current state of this issue? I am working on digitizing a number of Talmud-related and other Hebrew books. What can I do to move this forward and improve support for diacritics and Rashi script?
This issue was directed to Ray from Google, who trained all the models in the 3 tessdata repos. He has not participated in the project in recent years.
The process is documented in the tessdoc repo. It's not easy to train the LSTM engine, and I don't have time to help. You can try to ask in our forum.
The status: still unsolved.
> @amitdo What is the current state of this issue? I am working on digitizing a number of Talmud-related and other Hebrew books. What can I do to move this forward and improve support for diacritics and Rashi script?
I trained tesseract for Rashi script some time ago and documented the training process:
https://gitlab.com/pninim.org/tessdata_heb_rashi
There you can also find the link to the Rashi font collection.
Here I'm going to raise some issues related to Tesseract's Hebrew support.
Dear participants interested in Arabic support: I suggest raising Arabic issues in a separate 'issue', even if there are similar issues for both Arabic/Persian and Hebrew.
Let's start with the nikud issue.
Nikud: diacritical signs used in Hebrew writing.
Hebrew has two writing forms: with nikud and without it. Modern Hebrew is written (mostly) without nikud.
Children's books are written with nikud. Poetry is also usually written with nikud, as are Hebrew dictionaries. The Hebrew Bible uses nikud; it also uses te'amim (cantillation marks).
There are some mixed forms:
1) Most of the body text is written without nikud, but nikud is used in a few places.
   1a) Some paragraphs/sentences use nikud, for example when quoting the Bible or a poem.
   1b) One or a few words in some paragraphs use nikud. This form is used, for example, for foreign names of people and places (like cities). Without nikud many words would be ambiguous. A native Hebrew speaker usually resolves the ambiguity from context; when context is not enough, nikud can be used to disambiguate.
2) Most (or at least a large percentage) of the words in the text are written with nikud, but for those words the nikud is only partial.
The following part is relevant to both (1b) and (2) above. When adding nikud to a word, it may be added in full or partial form; sometimes a single nikud sign is enough to make the word unambiguous.
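As an aside (my illustration, not from the thread): nikud and te'amim are combining marks in Unicode, so stripping them programmatically is straightforward, which is handy for, say, deriving a nikud-free wordlist from pointed text:

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Remove nikud and te'amim (all combining marks) from Hebrew text."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# "shalom" with nikud: shin + shin-dot + qamats, lamed, vav + holam, final mem
pointed = "\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD"
print(strip_marks(pointed))  # bare consonants only: שלום
```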
Ray, If you only use the web for building the langdata, you won't find many good sources for Hebrew with nikud.
Here is an excellent source which has Hebrew both with nikud (mostly poetry) and without nikud (most of the prose): http://benyehuda.org/ Project Ben-Yehuda, named after Eliezer Ben-Yehuda, is like the famous Project Gutenberg, but just for Hebrew. Note that some parts are copyrighted. For other parts the copyright has expired under Israeli law but may still be in force in the US. For your use case, building a corpus, I don't think copyright matters, but IANAL.
Do you use the Hebrew Bible as a source (like the one from Wikisource)? I'm not sure it is a good idea to use it for modern Hebrew.
More information will follow later.