tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Hebrew issues #82

Open amitdo opened 7 years ago

amitdo commented 7 years ago

Here I'm going to raise some issues related to Tesseract's Hebrew support.

Dear participants who have an interest in Arabic support: I suggest raising Arabic issues in a separate 'issue', even if there are similar issues for both Arabic/Persian and Hebrew.

Let's start with the nikud issue.

Hebrew has two writing forms: with nikud and without nikud.

Nikud: diacritical signs used in Hebrew writing.

Modern Hebrew is written (mostly) without nikud.

Children's books are written with nikud. Poetry is also usually written with nikud. Hebrew dictionaries use nikud as well. The Hebrew Bible uses nikud; it also uses te'amim (cantillation marks).

There are some mixed forms:

1. Most of the body text is written without nikud, but nikud is used in a few places.
   - 1a) Some paragraphs/sentences use nikud, for example when quoting the Bible or a poem.
   - 1b) One or a few words in some paragraphs use nikud. This form is used, for example, for foreign names of people and places (like cities). Without nikud many words are ambiguous; a native Hebrew speaker will usually resolve the ambiguity from context, and where context is not enough, nikud can resolve it.
2. Most (or at least a large percentage) of the words in the text are written with nikud, but for those words the nikud is only partial.

The following part is relevant to both (1b) and (2) above. When adding nikud to a word, it might be in 'full' or 'partial' form. Sometimes adding just one nikud sign is enough to make the word unambiguous.

Ray, if you only use the web for building the langdata, you won't find many good sources for Hebrew with nikud.

Here is an excellent source which has both Hebrew with nikud (mostly poetry) and without nikud (most of the prose): http://benyehuda.org/ Project Ben-Yehuda, named after Eliezer Ben-Yehuda, is like the famous Project Gutenberg, but just for Hebrew. Note that some parts are copyrighted. In other parts the copyright has expired under Israeli law but might still be in force in the US. For your use case, building a corpus, I don't think the copyright matters, but IANAL.

Do you use the Hebrew Bible as a source (like the one from Wikisource)? I'm not sure it is a good idea to use it for modern Hebrew.

More information will follow later.

amitdo commented 3 years ago

I'm talking about the parameter tessedit_char_blacklist, which you can give to Tesseract on the command line either with -c parameter=value or with a config file that contains the parameter.
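For example, both forms below are equivalent. This is a minimal sketch with hypothetical file names; it assumes the `heb` traineddata is installed, and the tesseract invocations are shown as comments:

```shell
# Form 1: pass the parameter directly on the command line
# tesseract input.png output -l heb -c tessedit_char_blacklist=0123456789

# Form 2: put the parameter in a config file and pass the file name
# as a trailing argument
printf 'tessedit_char_blacklist 0123456789\n' > no_digits
# tesseract input.png output -l heb no_digits
```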

This is my last answer. This place is not a support forum.

AvtechScientific commented 2 years ago

I tried to train Tesseract to recognize Rashi script. Here are the results:

https://gitlab.com/pninim.org/tessdata_heb_rashi

It was my first time training Tesseract, so I might have made mistakes. I have documented the process, so if you see anything that can be improved - please let me know. It looks like the recognition is comparable to that of ABBYY FineReader (at least in my sample test). Any feedback is appreciated!

Shreeshrii commented 2 years ago

@AvtechScientific Thank you for taking the effort to train Hebrew Rashi script.

@amitdo and others can check it further.

Please share the test data and results that show how well your new traineddata does in recognizing Rashi script - something on the lines of https://github.com/tesseract-ocr/tessdata_contrib/blob/main/khmLimon.md Thanks!
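For reporting results, a simple number to include is the character error rate. The sketch below is a hypothetical way to compute it (edit distance over ground-truth length); it is not the exact methodology of the khmLimon.md report linked above:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(ground_truth: str, ocr_output: str) -> float:
    """Character error rate: edits needed, divided by ground-truth length."""
    return edit_distance(ground_truth, ocr_output) / max(len(ground_truth), 1)
```

Running `cer` over each ground-truth/OCR line pair and averaging gives a single figure that makes comparisons with FineReader or other engines concrete.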

EastEriq commented 2 years ago

Just a few naive comments from a cursory look at heb.wordlist:

  1. There are many words beginning with geresh or gershayim. Otherwise, it is probably sensible and correct to treat geresh and gershayim as alphabetic characters, to avoid word splitting. There are a few inconsistencies though, like sometimes a double geresh instead of gershayim.
  2. There are several words ending with sof pasuq, which should be punctuation.
  3. I'm in doubt whether maqaf should be considered punctuation.
  4. What is your position w.r.t. including nikkud? You have words that even include taamim.
  5. Would it be worth making different wordlists for Hebrew versus Aramaic, considering the corpora this would be used for?
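Checks like (1) and (2) can be automated. A hypothetical sketch (the code points are from the Unicode Hebrew block; the function name is my own, not part of any Tesseract tooling):

```python
# Unicode code points for the characters discussed above.
GERESH = "\u05F3"      # ׳ HEBREW PUNCTUATION GERESH
GERSHAYIM = "\u05F4"   # ״ HEBREW PUNCTUATION GERSHAYIM
SOF_PASUQ = "\u05C3"   # ׃ HEBREW PUNCTUATION SOF PASUQ

def find_suspicious(words):
    """Flag wordlist entries matching the inconsistencies noted above:
    leading geresh/gershayim, trailing sof pasuq, or a double geresh
    used where gershayim would be expected."""
    suspicious = []
    for w in words:
        if w.startswith((GERESH, GERSHAYIM)):
            suspicious.append((w, "leading geresh/gershayim"))
        elif w.endswith(SOF_PASUQ):
            suspicious.append((w, "trailing sof pasuq"))
        elif GERESH * 2 in w:
            suspicious.append((w, "double geresh instead of gershayim"))
    return suspicious
```

Running this over each line of heb.wordlist would give a quick inventory of how many entries are affected before deciding whether they matter for real-world recognition.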
AvtechScientific commented 2 years ago

@EastEriq - thank you for your feedback!

(1) / (2) heb.wordlist was generated automatically from Sefaria's MongoDB dump, so there might be quite a lot of inconsistencies... The question is: how problematic is this for real-world recognition? That's why I asked for feedback across all kinds of test documents (clean typed docs, scans of modern docs, scans of old books, etc.) and on how the current model performs compared to FineReader...

(3) Indeed, like gershayim it should probably be treated as an alphabetic character, to avoid word splitting... Do I treat it as punctuation somewhere?

(4) I don't remember seeing books in Rashi script with nikkud, so it is probably not a widespread practical case. (Regarding taamim, see 1/2 above.)

(5) Do you mean two separate files - heb.wordlist and ara.wordlist?

benyamindsmith commented 2 years ago

@AvtechScientific thank you for trying it out. I just downloaded the traineddata file and found that there are still some issues with blurrier text.

For context, I used this file, and for comparison this is what I have:

The text

[image: scan of the source text]

The output

```
כזזשך כחעו יזוחי יזלהים ישל. ארץ ממולדחי נחשבחי כנודר קונם - בורח מחיוח נזורדי״ל ליו הפ במורדי אור וישג ה ז טליהם את אונם - זרים רדפוני חנם - והייתי כאורח נטה ללו פה אטשטרדם חלפחי זם הספד כע״ט מוצל משרפה - נזעשה ידי חומן שר וגדול׳ בישרחזל רועה צאנם - שר נולדחי בייזי
```

I tried setting the DPI higher, but this is as good as I could get it.

Looking forward to seeing the project progress.

AvtechScientific commented 2 years ago

@benyamindsmith thank you for your feedback. Issues like these are expected. What would be interesting is to figure out how this compares to FineReader or other OCR software... Do you have it available to digitize the same input?

AvtechScientific commented 2 years ago

@Shreeshrii - thank you. PR created.

ghost commented 2 years ago

I want to OCR Ladino texts. The material I have is scanned PDFs from Hebrew books printed in the 18th and 19th centuries. The density and size are mostly incorrect. There are four variations of bet, gimel, zayin, and pe, which signify ve, dj/ch, j, and fe. I have prepared some 300 lines of ground-truth data and ran the training. I have the following problem: I used the code points FB31, FB32, FB36, and FB44 for these special characters (Hebrew letters with dagesh), but I don't see these codes in the unicharset that Tesseract prepares. As a result, in the output file the dagesh moves to another letter. I would also like to add the new font on top of the trained Hebrew.

[sample image attached]
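One likely explanation for the missing code points (my assumption, not confirmed in this thread): U+FB31 etc. are Unicode Alphabetic Presentation Forms that canonically decompose to base letter + dagesh, and they are composition exclusions, so any normalization step in the training pipeline will replace them with the decomposed pair. This can be checked with Python's `unicodedata`:

```python
import unicodedata

# The four "letters with dagesh" mentioned above, from the Alphabetic
# Presentation Forms block (an illustrative check, not Tesseract's own code).
for ch in ["\uFB31", "\uFB32", "\uFB36", "\uFB44"]:  # bet, gimel, zayin, pe
    nfc = unicodedata.normalize("NFC", ch)
    # Each presentation form is a Unicode composition exclusion, so even
    # NFC replaces it with base letter + dagesh (U+05BC) rather than
    # keeping the single precomposed code point.
    codes = " ".join(f"U+{ord(c):04X}" for c in nfc)
    print(f"U+{ord(ch):04X} -> {codes}")
```

If that is the cause, encoding the ground truth as base letter + U+05BC (so the unicharset contains the decomposed pair) may behave more predictably than relying on the presentation-form code points.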

Maxwell175 commented 1 year ago

@amitdo What is the current state of this issue? I am working on digitizing a number of Talmud and Hebrew related books. What can I do to move this forward and improve support for diacritics and Rashi script?

amitdo commented 1 year ago

This issue was directed at Ray from Google, who trained all the models in the three tessdata repos. He has not participated in the project in recent years.

The process is documented in the tessdoc repo. It's not easy to train the LSTM engine, and I don't have time to help. You can try to ask in our forum.

amitdo commented 1 year ago

The status: still unsolved.

AvtechScientific commented 1 year ago

> @amitdo What is the current state of this issue. I am working on digitizing a number of Talmud and Hebrew related books. What can i do to move this forward and improve support for diacritics as Rashi script?

I trained tesseract for Rashi script some time ago and documented the training process:

https://gitlab.com/pninim.org/tessdata_heb_rashi

There you can also find the link to the Rashi font collection.