tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Hebrew issues #82

Open amitdo opened 6 years ago

amitdo commented 6 years ago

Here, I'm going to raise some issues related to Tesseract's Hebrew support.

Dear participants interested in Arabic support: I suggest raising Arabic issues in a separate 'issue', even if there are similar issues for both Arabic/Persian and Hebrew.

Let's start with the nikud issue.

Hebrew has two writing forms:

* Hebrew with nikud
* Hebrew without nikud

Nikud - diacritical signs used in Hebrew writing.

Modern Hebrew is written (mostly) without nikud.

Children's books are written with nikud. Poetry is also usually written with nikud. Hebrew dictionaries also use nikud. The Hebrew Bible uses nikud; it also uses te'amim (cantillation marks).

There are some mixed forms:

1) In this form, most of the body text is written without nikud, but nikud is used in a few places.
1a) Some paragraphs/sentences use nikud, for example when quoting the Bible or a poem.
1b) One or a few words in some paragraphs use nikud. This form is used, for example, for foreign names of people and places (like cities). Without nikud many words would be ambiguous. Usually a native Hebrew speaker will use context to resolve the ambiguity; when ambiguity remains, nikud can be used to resolve it.
2) In this form, most (or at least a large percentage) of the words in the text are written with nikud, but for those words the nikud is only partial.

The following part is relevant to both (1b) and (2) above. When adding nikud to a word, it might be in 'full' or 'partial' form. Sometimes adding just one nikud sign is enough to make the word unambiguous.

Ray, if you only use the web for building the langdata, you won't find many good sources for Hebrew with nikud.

Here is an excellent source which has both Hebrew with nikud (mostly poetry) and without nikud (most of the prose): http://benyehuda.org/ Project Ben-Yehuda, named after Eliezer Ben-Yehuda, is like the famous Project Gutenberg, but just for Hebrew. Note that some parts are copyrighted. In some other parts the copyright has expired under Israeli law, but they might still be copyrighted in the US. For your use case, building a corpus, I don't think the copyright matters, but IANAL.

Do you use the Hebrew Bible as a source (like the one from Wikisource)? I'm not sure it is a good idea to use it for modern Hebrew.

More information will follow later.

amitdo commented 6 years ago

https://github.com/tesseract-ocr/tesseract/issues/648#issuecomment-318141576

theraysmith commented:

Here are some examples of test data with diacritics:

Truth: שֶׁהוּא נָס מִפָּנָיו, אָמַר לוֹ מֹשֶה: כָּל הַיּוֹם הָיִיתִי אוֹמֵר לְךָ בְּשֵׁם הַקֹּדֶשׁ וְלֹא הָיִיתָ
OCR: שָהוּא נס מִפָּנִיו, אָמַר לו משָה: כָּל הַיום הָיִיתי אוּמר לֶך בָּשַם הקדש ולא הָיִיתָ
Confs: 0.84 0.56 0.64 0.93 0.96 0.77 0.88 0.76 0.63 0.64 0.54 0.45 0.91 0.88 0.58
Diff: שָהוּא נָס מִפָּנִיו, אָמַר לוֹ מֹשָה: כָּל הַיּוֹם הָיִיתִי אוּמֵר לֶ ךָ בָּשַם הַקֹּדֶשׁ וְלֹא הָיִיתָ
Recall Errors = 12 Precision Errors = 2

Truth: ותופחים בבטנים ובשירים שארכם כארך סיגריה,
OCR: וְתופחים בַּבַּטָנִיס וּבשירים שאַרכֶּם כַארְף סִיגְרִיה,
Confs: 0.71 0.71 0.91 0.8 0.56 0.56
Diff: וְתופחים בַּבַּטָנִיס וּבשירים שאַרכֶּם כַארְף סִיגְרִיה,
Recall Errors = 6 Precision Errors = 1

In all these cases, Tesseract gets a poor result. In case 1, the diacritics are in the truth text, and Tesseract gets them badly wrong. In case 2, the diacritics are NOT in the truth text, and Tesseract suggests some anyway. I don't think that both of these truth texts can be "correct" in the sense that one has the diacritics and the other does not. Which way should it be and why?

amitdo commented 6 years ago

(1) I didn't find mistakes in the letters themselves.

(2) I found two mistakes in the letters themselves. The first: Samekh [ס] instead of Mem-sofit [ם].

The issues with training Tesseract to recognize nikud are:

amitdo commented 6 years ago

(2) The second mistake in the letters themselves: Pe-sofit [ף] instead of Kaf-sofit [ך].

The letters it wrongly chose are indeed very similar to the correct letters.

The two incorrect words in (2) are not true dictionary words.

amitdo commented 6 years ago

Another issue with Hebrew - Tesseract's Dictionary.

Hebrew uses both prefixes and suffixes with base words.

http://hspell.ivrix.org.il/ (AGPL)

http://hspell.ivrix.org.il/WHATSNEW

Vocabulary: 468,508 words (when built with "--enable-fatverb") based on 24495 base words: 12908 nouns, 3889 adjectives, 5261 verb stems, and 2437 other words.

So 468,508 words are produced from 24,495 base words plus their suffix forms.

They don't mention the prefix forms. I think that from 24,495 base words + prefix and suffix forms you will get at least 9 million words.

The Hebrew wordlist contains 152,000 words. I believe that this list will not cover enough Hebrew words. The result: LSTM + dictionary might not be better than the raw LSTM alone. This is my assumption and it needs to be verified.

Hspell's dictionary does not include nikud.
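To make the prefix arithmetic above concrete, here is a rough Python sketch of the combinatorics only, not hspell's actual morphology: a handful of common one-letter prefixes (and vav + prefix combinations) applied to a couple of hypothetical base words.

```python
# Rough sketch of the combinatorics only -- not hspell's actual affix rules.
PREFIXES = ["ו", "ה", "ב", "ל", "מ", "ש", "כ"]  # and, the, in, to, from, that, as
COMBOS = [""] + PREFIXES + ["ו" + p for p in PREFIXES if p != "ו"]

def surface_forms(base_words):
    """Naively prepend every prefix combination to every base word."""
    return {c + w for w in base_words for c in COMBOS}

base = ["כביש", "פגישה"]            # hypothetical base words: road, meeting
print(len(surface_forms(base)))     # 2 base words -> 28 naive surface forms
```

Real Hebrew allows longer prefix chains and suffix inflections on top of this, which is how a 24,495-word base can plausibly explode into millions of surface forms.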

amitdo commented 6 years ago

@nyh, @dankenigsberg [hspell authors]

Sorry to bother you.

Can you please read the comment above this one, and answer these questions:

amitdo commented 6 years ago

Returning to your question.

In all these cases, Tesseract gets a poor result. In case 1, the diacritics are in the truth text, and Tesseract gets them badly wrong. In case 2, the diacritics are NOT in the truth text, and Tesseract suggests some anyway. I don't think that both of these truth texts can be "correct" in the sense that one has the diacritics and the other does not. Which way should it be and why?

Both (1) and (2) are not so good because of the issues with nikud.

In (1) the OCR'ed text has a lot of nikud mistakes. If you omit (or try to completely ignore) the nikud in the OCR'ed text, the text is almost perfect in its 'without nikud' form. When you omit the nikud, for some words you'll have to add vav [ו] or yud [י] letters instead of the nikud. The right way to write הַקֹּדֶשׁ in the 'without nikud' form is הקודש. Here you add a vav instead of the omitted holam-haser sign.

In (2) the network tries to be 'too smart' and adds nikud signs which do not appear in the ground truth. For OCR this 'feature' is not something that you want.

As a note, this feature can be useful for another, separate application: converting (kind of translating) text from the 'without nikud' form to the 'with nikud' form. But it will be useful only if it has good accuracy. For training, you'll use pairs of text lines: (1) the 'without nikud' input and (2) the desired 'with nikud' output. Something like that was done a few years ago with an HMM by two Israeli students: https://www.cs.bgu.ac.il/~elhadad/hocr/. A funny thing is that they trained an old version of Tesseract to read Hebrew with nikud and then used the OCR'ed output of a scanned book written with nikud as part of training the HMM 'nikud translator'.

amitdo commented 6 years ago

To summarize the above: (for OCR) I think (1) is preferable to (2), but neither is good.

So, unless you can make the nikud recognition much better, IMO a reasonable solution might be to drop the nikud signs.

amitdo commented 6 years ago

heb.wordlist contains these words:

אַרטיקל- קאַװע־שטיבל בלאַט: נאַװיגאציע, אַר־עס־עס באַניצער װאָס קאַטעגאָריע אינהאַלט באַהאַלטן נאָך אַלע צוזאַמענארבעט אַרטיקל רעדאַקטירן דאָס אַריבערשליסונגען אָקטאָבער אַנאָנימע נאָר באַנוצערס אַלץ האָט [בעאַרבעטן] זאָל קאָנטאַקט אַהער ראָבאָטן װי װען װעגן װעט (װערסיעס) מעדיעװיקי װערסיעס װעלכע װערסיע

All of them are Yiddish words, not Hebrew.

If you omit these words (you should), only 12 words with nikud will be left in the heb.wordlist file.

Shreeshrii commented 6 years ago

Ray, I got better results by segregating Devanagari with Vedic accents from regular Devanagari for training. See https://github.com/tesseract-ocr/tessdata/issues/61#issuecomment-316744125

You can also consider having two separate traineddata files for Hebrew, one with nikud and one without, each with corresponding wordlists, if that gives better accuracy for each. It won't work for mixed texts, though.

amitdo commented 6 years ago

The Hebrew word list and training text should not contain the Yiddish digraphs:

05F0 װ HEBREW LIGATURE YIDDISH DOUBLE VAV
05F1 ױ HEBREW LIGATURE YIDDISH VAV YOD
05F2 ײ HEBREW LIGATURE YIDDISH DOUBLE YOD
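A minimal Python sketch of filtering such entries out of a wordlist. Caveat: this only catches words containing the explicit ligatures above, not Yiddish words spelled with ordinary Hebrew letters; the file names are placeholders.

```python
# Drop wordlist entries containing a Yiddish ligature (file names are placeholders).
YIDDISH_LIGATURES = {"\u05F0", "\u05F1", "\u05F2"}  # װ ױ ײ

def is_hebrew_candidate(word: str) -> bool:
    return not any(ch in YIDDISH_LIGATURES for ch in word)

with open("heb.wordlist", encoding="utf-8") as src, \
     open("heb.filtered.wordlist", "w", encoding="utf-8") as dst:
    for line in src:
        if is_hebrew_candidate(line.strip()):
            dst.write(line)
```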

amitdo commented 6 years ago

Ray, how many fonts do you use for Hebrew training?

amitdo commented 6 years ago

Here are two examples from Project Ben-Yehuda:

Poem: http://benyehuda.org/bialik/bia060.html

Prose: http://benyehuda.org/yalag/yalag_086.html

amitdo commented 6 years ago

@theraysmith If you have further questions about this subject, I'll be happy to answer them.

theraysmith commented 6 years ago

I have questions. Prefixes and suffixes: do they just append to the base word without changing it (like cat->cats), or do they potentially change the word slightly (like lady->ladies)? If they never do the latter, it might be possible to fix the problem by allowing no-space word concatenation, like Chinese, Japanese, and Thai.

My previous post was missing the images (two example images were attached there). The nikuds were in both images, but the ground truth was wrong, as it didn't contain them.

In my opinion, Tesseract should output exactly the nikuds that are in the image, no more, no less. Is that reasonable? It makes the training complicated because it means that words can appear either way. I can see the appeal of just discarding all the nikud, but it doesn't seem the right thing to do.

Yiddish. You list a bunch of Yiddish words, and then in a separate post Yiddish-specific characters. Do all those Yiddish words contain one or more of those characters? If not, how can I separate Yiddish from Hebrew? Those Yiddish characters are not in the unicharset for the latest Hebrew model. The unicharset only has 67 characters.

If you omit these words (you should), only 12 words with nikud will be left in the heb.wordlist file. You need good sources for the training text. I suspect you don't have good sources.

That gives me a new idea for filtering the wordlists. I think I can solve the problem of the nikuds. I have all of the web at my disposal, so it is just a matter of filtering correctly, provided there aren't changes of font to deal with. (See below.)

I take it that the unicodes you refer to as nikud are 5b0-5c4, and that the cantillation marks 591-5af are used so rarely as to be totally ignored? The unicharset for the best model that I just pushed only has 5b0, 5b4, 5b6, 5b7, 5b8, 5bc + 27 base letters. I could force others to be included, or just broaden the filter to capture more of the corpus. I notice that 5b9 is the most frequent dropped character. Please suggest which others should be included.
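One way to decide empirically which points to force in is to count nikud/cantillation frequencies over whatever corpus feeds the filter. A minimal sketch, assuming a plain-text corpus file ("corpus.txt" is a placeholder):

```python
# Count Hebrew point and cantillation frequencies in a corpus.
from collections import Counter

counts = Counter()
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(ch for ch in line if 0x0591 <= ord(ch) <= 0x05C7)

for ch, n in counts.most_common():
    print(f"U+{ord(ch):04X} {n}")
```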

Fonts: too many to count (attached: hebrewfonts.txt). Problem: I have noticed that there is an older style of font, in which the letters are very rounded instead of rather square. Tesseract is very inaccurate on this style, as there are few if any fonts that look like that. Question: do any of the attached fonts use this older style? If so, which? (I can boost the frequency to get the accuracy up.) If not, are there any publicly available?

amitdo commented 6 years ago

I have noticed that there is an older style of font, in which the letters are very rounded instead of rather square.

Something like these samples: http://fontbit.co.il/search.asp?tag=3&style=13 ? That's the style used for handwriting in Hebrew. It's different from the printed style.

amitdo commented 6 years ago

https://fonts.google.com/?subset=hebrew

A list of Hebrew fonts from the Open Siddur Project http://opensiddur.org/tools/fonts/

amitdo commented 6 years ago

Prefixes and suffixes: do they just append to the base word without changing it (like cat->cats), or do they potentially change the word slightly (like lady->ladies)? If they never do the latter, it might be possible to fix the problem by allowing no-space word concatenation, like Chinese, Japanese, and Thai.

Both forms exist.

A road - kvish כביש - plural: kvishim כבישים
A meeting - pgisha פגישה - plural: pgishot פגישות

amitdo commented 6 years ago

I suggest using these unicodes for heb.traineddata (Hebrew, not including the additional Yiddish unicodes):

Hebrew Alef-Bet (Alphabet)

05D0-05EA 22 letters + 5 final forms = 27

Numerals

0-9 0030-0039

Nikud

If you want to support nikud, you should include: 05B0-05BC, 05C1, 05C2

Unique Hebrew punctuation marks

05BE ־ HEBREW PUNCTUATION MAQAF
05F3 ׳ HEBREW PUNCTUATION GERESH
05F4 ״ HEBREW PUNCTUATION GERSHAYIM

Common marks

Other common marks - the ones that are already in heb.traineddata (as 'Common').
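A small sketch assembling the ranges listed above into a single character set (the 'Common' marks are left out, since they depend on the existing traineddata):

```python
# Assemble the suggested Hebrew character set from the ranges above.
def crange(start, end):
    return "".join(chr(c) for c in range(start, end + 1))

letters      = crange(0x05D0, 0x05EA)                   # 22 letters + 5 final forms = 27
digits       = crange(0x0030, 0x0039)                   # 0-9
nikud        = crange(0x05B0, 0x05BC) + "\u05C1\u05C2"  # 15 nikud signs
hebrew_punct = "\u05BE\u05F3\u05F4"                      # maqaf, geresh, gershayim

charset = letters + digits + nikud + hebrew_punct
print(len(charset))  # 27 + 10 + 15 + 3 = 55
```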

Links

http://unicode.org/charts/PDF/U0590.pdf
https://en.wikipedia.org/wiki/Hebrew_alphabet
https://en.wikipedia.org/wiki/Hebrew_punctuation
https://en.wikipedia.org/wiki/Niqqud

amitdo commented 6 years ago

In my opinion, Tesseract should output exactly the nikuds that are in the image, no more, no less. Is that reasonable?

Yes.

The ideal is that Tesseract will do a good job with these texts:
1) Text without nikud.
2) Text with minor use of nikud.
3) Text with nikud in all/most words.

It makes the training complicated because it means that words can appear either way.

The question is whether it can achieve high accuracy on the three kinds of text above.

I can see the appeal of just discarding all the nikud, but it doesn't seem the right thing to do.

I think you may want to consider and try several approaches for training:

(GT here is Ground Truth)

1) The GT, in all text lines, does not include nikud signs. This model will be used for most Hebrew texts, which either do not use nikud at all, or in which nikud appears in only 0.1% to 1% of the words.
2) The GT has nikud in all/most words. This model will be used on Hebrew texts which are very likely to have nikud: poetry and texts aimed at children.
3) Half of the text lines in the GT have nikud and the other half does not.
4) Like 3, but any letter+nikud sign(s) combination in the GT will be normalized to a form without nikud as a first step in training.
5) Like 3, but during OCR the user will have an option to blacklist all nikud signs.

amitdo commented 6 years ago

My comments about the Hebrew wordlist were based on the file in the langdata repo.

amitdo commented 6 years ago

@theraysmith,

Please read my new comments, starting with https://github.com/tesseract-ocr/langdata/issues/82#issuecomment-320266441

Talking about the files in best/heb.traineddata:

amitdo commented 6 years ago

best/heb.traineddata has only 6 nikud signs:

5b0 HEBREW POINT SHEVA
5b4 HEBREW POINT HIRIQ
5b6 HEBREW POINT SEGOL
5b7 HEBREW POINT PATAH
5b8 HEBREW POINT QAMATS
5bc HEBREW POINT DAGESH OR MAPIQ

9 nikud signs are missing.

Also missing are the 3 unique Hebrew punctuation marks I mentioned earlier.
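A quick check of the arithmetic: which of the proposed nikud code points (U+05B0-U+05BC, U+05C1, U+05C2) are absent from best/heb.traineddata, given the six listed above as present.

```python
# Set difference between the proposed nikud signs and those in the traineddata.
proposed = set(range(0x05B0, 0x05BD)) | {0x05C1, 0x05C2}
present = {0x05B0, 0x05B4, 0x05B6, 0x05B7, 0x05B8, 0x05BC}
missing = sorted(proposed - present)
print(len(missing), [f"U+{cp:04X}" for cp in missing])  # 9 missing signs
```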

theraysmith commented 6 years ago

OK, I have added desired/forbidden characters for heb and yid. I assume that, apart from the 3 unique characters that you listed (for each), the list of nikuds should be the same?

EastEriq commented 4 years ago

IMHO you're forgetting an important script variant here, which is Rashi (https://en.wikipedia.org/wiki/Rashi_script). I think it would be very useful to have support for it, as there is a huge body of mostly rabbinical literature traditionally typeset in it. While most of the characters are similar in shape to their square-script counterparts, there are notable differences and different sets of false friends. For example, א is easily misrecognised as ח or ת in a Rashi typeface. Also, the vocabulary and orthography of such texts may differ a little from that of modern Hebrew.

Looong ago I tried to have a basic go at it; I attach my old files, by now certainly no longer relevant, for reference.

heb-rashi.zip

I think it should be easy to find good ground truth for training, for example some Wikisource text for which the original image is known. If anybody happens to have one at hand, I might have a new look at it...

An advanced challenge would be how to deal with images which have both script variants coexisting, which is rather common.

amitdo commented 4 years ago

I didn't 'forget' it, just preferred not to mention it in this issue.

Mixing Rashi with a modern general purpose Hebrew traineddata is probably not a good idea.

EastEriq commented 4 years ago

Maybe posting in https://github.com/tesseract-ocr/tesseract/issues/1543 would have been more proper, but that is closed. I thought I could abuse this issue for any todo wish.

I agree on a separate, different-purpose training set.

EastEriq commented 4 years ago

I had a cursory look at training for 4.x, and was intimidated.

Passing by, I found this paper: Auto-ML Deep Learning for Rashi Scripts OCR. For future reference: they discuss specificities of the Rashi script, implement a scheme which includes an LSTM layer, train it on a corpus from the Responsa project, and evaluate its performance.

AvtechScientific commented 3 years ago

I think Rashi script conversion should be an integral part of ordinary Hebrew conversion. Rationale behind this:

  1. most, if not all, texts that have Rashi script in them also have regular square script in them. So both must be processed simultaneously;
  2. it's just another font of the same language, not a separate language;
  3. that's what ABBYY FineReader does actually.

If you need some material for training, here are two scans of the responsa

"Sheilat Yaavetz":

http://aleph.nli.org.il:80/F/?func=direct&doc_number=001282387&local_base=NNL01
https://vilnacollections.yivo.org/?ca=/item.php\~id=pub-000008428%7C%7Ccol=v

that were converted and manually edited into text:

Sheilat Yavetz (text version).

Yavetz on Brakhot/Seder Zraim (scan):

https://dlib.rsl.ru/viewer/01006624106#?page=285

and its manually edited text: Yavetz on Brakhot/Seder Zraim (text).

Zekher Yehosef (scan):

https://digitalassets.yivo.org/Books/000007666pt1.pdf
https://digitalassets.yivo.org/Books/000007666pt2.pdf

Zekher Yehosef (text).

Ateret Zkenim (scan): https://digitalassets.yivo.org/Books/000055504OPT.pdf

Ateret Zkenim (text)

Yad David (scan):

https://hebrewbooks.org/15248

Yad David (text).

Meshivat Nefesh (scan):

https://rosetta.nli.org.il/delivery/DeliveryManagerServlet?dps_pid=IE94883214
https://vilnacollections.yivo.org/?ca=((item.php!id__pub-000008293\*col__v

Meshivat Nefesh (text)

Midrash Talpiot (scan):

http://digipres.cjh.org:1801/delivery/DeliveryManagerServlet?dps_pid=IE3499603

Midrash Talpiot (text)

This should be enough for training tesseract, but if you want us to manually edit/proof-read more books with Rashi Script - just let us know at:

https://pninim.org/en/contact/

Thanks!

EastEriq commented 3 years ago

My comments on what you write, or at least on what I understand of it:

I think Rashi script conversion should be an integral part of ordinary Hebrew conversion.

yes, but no.

To name the elephant in the room: here I think we are talking about printed rabbinical literature of the past five centuries. Vocabulary differs somewhat from modern Hebrew. Orthography sometimes does too (much ktiv haser vs. ktiv male, plus other variants). There is a whole apparatus of accepted abbreviations which you don't find in modern texts. Aramaic is also found interspersed. (And I'm leaving out other variants like Weibertaischt, vernacular side translations, which are properly even different languages.)

Rationale behind this:

1. most, if not all, texts that have Rashi script in them also have regular square script in them. So both must be processed simultaneously;

The text area should be properly segmented to recognize which regions are to be analysed in the one or the other script. Methinks the process could be similar to what Tesseract does with multiple languages, but I'm completely ignorant.

2. it's just another font of the same language, not a separate language;

Trying to recognize one with the training set of the other produces pitiful results anyway. There are notoriously different sets of false-friend letters in the one script or the other, and glyphs in one resemble a different letter in the other too closely.

3. that's what ABBYY FineReader does actually.

I so much wish there were open-source software competing with the achievements of the proprietary ones...

If you need some material for training, ...

The problem I see with these wonderful references is that they miss the fundamental point: they are not sets of images of a single line of text, coupled with the corresponding line of ground truth characters, and only them. The work needed to reduce them to that format is huge... IIUC we are talking here of hundreds of thousands of lines needed for a decent recognition, and it's beyond my capabilities to think of an automatic way to generate such a dataset from the input material without deep human intervention.

The traditional alternative used to be to take such texts and generate training images by rendering them in Rashi fonts. The problem is, there are only a couple of suitable free fonts out there, and they are barely representative of the whole typographical corpus.

Having said that, it would really be wonderful if tesseract could cope with the texts we have in mind. Making them machine readable and freely accessible would mean so much...

AvtechScientific commented 3 years ago

... they are not sets of images of a single line of text, coupled with the corresponding line of ground truth characters, and only them. The work needed to reduce them to that format is huge... IIUC we are talking here of hundreds of thousands of lines needed for a decent recognition...

Let's imagine that we might be able to manually cut the above-mentioned scanned images into separate lines and provide the corresponding texts, say generating some 100K-300K tuples... Are there knowledgeable people in the community who will volunteer to finish the job of training Tesseract to recognize Rashi script?
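If that manual cutting happens, one common convention (assumption: the tesstrain-style layout of paired line images and .gt.txt transcriptions) is a directory of side-by-side files. A minimal sketch; `line_pairs` is a hypothetical list of already-cropped Pillow images with their transcriptions, and the directory/file names are placeholders.

```python
# Write tesstrain-style ground-truth pairs: line_000001.png + line_000001.gt.txt
from pathlib import Path

def write_ground_truth(line_pairs, out_dir="ground-truth"):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, (img, text) in enumerate(line_pairs):
        img.save(out / f"line_{i:06d}.png")
        (out / f"line_{i:06d}.gt.txt").write_text(text + "\n", encoding="utf-8")
```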

Doragon-S commented 3 years ago

I'm new to coding, and I'm just trying to understand:

Why is the language important? If the letters are the same, shouldn't it just grab the letters and symbols and stick them in order? I know I'm dramatically oversimplifying the whole process, but I thought that was all it was trying to do: process the image to derive the letters and then put them in a text file.
And if Tesseract can recognize letters in various handwritings (can it?), then why does a different script matter? It's just another "handwriting", right?

benemanuel commented 3 years ago

What defines a letter? Some are close but not the same: are O and Q the same? O and 0? l, I, and 1?

Doragon-S commented 3 years ago

I guess I didn't explain my question right.

  1. I know that Tesseract has to differentiate between letters. But I thought that was what it was built for. What it seemed like you were saying was a problem was that Aramaic is different from Hebrew, which is different from Yiddish, etc. But I didn't understand why Tesseract needs to know the language. For example: why does Tesseract need to know whether the 'm' it is dealing with is an English 'm' rather than a Spanish or French 'm'? If it needs it for the squiggles and dots above the letters, then maybe, but why does it need to know the dictionary?

  2. Also, for the various scripts: why should Rashi script be different from normal script, or English script? Tesseract can still process those, right? And if it is different, why not just set it up as a second language with a different set of input characters and the same set of output characters? (I'm assuming that Tesseract works something like this. I have no idea how it works, so I'm probably very wrong.)

EastEriq commented 3 years ago

Because recognition success of isolated characters without context would be ridiculously low. Language files tell which character associations are most frequent, and which vocabulary words are to be expected.

Doragon-S commented 3 years ago

That makes sense. But then how does Tesseract deal with non-dictionary words, like names, or maybe an intentionally misspelled word (h8, l33t, etc.)? Can it deal with those? Once Tesseract has the text in black and white, what is so hard about identifying each letter individually? ('u' is different from 'a' because it isn't connected on top, etc.) Isn't that how children learn the letters? Is it possible to program computers on the differences?

EastEriq commented 3 years ago

But then how does Tesseract deal with non-dictionary words, like names, or maybe an intentionally misspelled word (h8, l33t, etc.)? Can it deal with those?

It gives weights. If the recognition score of the non-dictionary word exceeds that of the possible alternatives, it outputs it; if the vocabulary alternative seems more likely, it goes for that. This is why the training corpus should be representative of the texts being targeted.

Once Tesseract has the text in black and white, what is so hard about identifying each letter individually? ('u' is different from 'a' because it isn't connected on top, etc.) Isn't that how children learn the letters?

Not on real-world unregistered text that is faint, distorted, faded, and whatnot. Anyway, children who are learning single letters aren't readers yet.

Doragon-S commented 3 years ago

OK. I thought I had seen that Tesseract de-skews the image and also turns it into binary black and white.

EastEriq commented 3 years ago

Problems don't end there.

Doragon-S commented 3 years ago

I saw that you said that someone needs to 'train' Tesseract (I did find out what 'training' is). I was confused, though, about why you couldn't just take a database that is already in text form, make a program to save some of the lines as pictures, and have it train Tesseract automatically. If it has both the image and the answer, it should only have to check the answer against Tesseract's response, which is what a human would be doing anyway, right?

EastEriq commented 3 years ago

If you're going to undertake that, it would certainly be a great contribution. See what has been said in the messages above.

chsanch commented 3 years ago

We are working with some Rashi-based books (in Judeo-Spanish); we have the scanned PDFs from the originals. I tried to use Tesseract to extract some text from the images, but it didn't work that well (of course). I'm not sure if someone is already working on adding support for Rashi fonts, but it would be nice. If some help is needed, just let me know.

matantech commented 3 years ago

Hi all, I can't get Tesseract to scan Hebrew at all and keep getting the error

actual_tessdata_num_entries_ <= TESSDATA_NUM_ENTRIES:Error:Assert failed:in file ../../ccutil/tessdatamanager.cpp, line 53

Does anyone have a valid Hebrew traineddata file or any other solution?

Thanks

Shreeshrii commented 3 years ago

@matantech

See https://github.com/tesseract-ocr/tesseract/issues/1613

https://stackoverflow.com/questions/21555887/tesseract-3-01-actual-tessdata-num-entries-tessdata-num-entries

I suggest you try the latest version of tesseract.

matantech commented 3 years ago

@Shreeshrii I didn't mention I'm using Tesseract for iOS, my bad. Is there any valid Hebrew traineddata for it? Thanks!

amitdo commented 3 years ago

@matantech

Please use the forum for technical support.

MDjavaheri commented 3 years ago

You should know that ABBYY FineReader does a good but not perfect job with Rashi script. Even after training it beyond the standard recognition pattern, I still can't get it to tell the difference between a final Mem and a Samekh, which is understandable but somewhat of a nuisance; it works maybe 10% of the time. A similar problem exists with a Taf and a Het, but training has improved that to only be an issue about 20% of the time. If you're going to teach Tesseract what to do, keep that in mind.

amirbachar commented 3 years ago

What would be the best way to train Tesseract on these fonts (http://freefonts.co.il/)? We've tried manually creating a TIFF image for each one and then tagging the bounding boxes, but it's a tedious process, and the results are not optimal (perhaps due to inconsistencies in the tagging). Is there a library that automates the whole training process using the font file? (Bounding boxes also intersect for some fonts.)
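One possibility worth checking: Tesseract ships a text2image tool that renders a training text in a given font and emits the page image and box file, which avoids hand-tagging boxes. A hedged sketch follows; the flag names are from memory and should be verified against your Tesseract version, and the text file, font name, and fonts directory are placeholders.

```python
# Render training text with text2image instead of tagging boxes manually.
import subprocess

subprocess.run([
    "text2image",
    "--text=heb.training_text",
    "--outputbase=heb.SomeHebrewFont.exp0",
    "--font=Some Hebrew Font",
    "--fonts_dir=/usr/share/fonts",
], check=True)
```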

AvtechScientific commented 2 years ago

Hebrew has two writing forms:

* Hebrew with nikud

* Hebrew without nikud

@amitdo - is there a way to tell tesseract whether to recognize or to ignore nikud during OCR with the current official heb.traineddata?

If there is no way to ignore nikud during OCR, is there an easy way to delete it after recognition? And what needs to be done to make nikud recognition optional?

Thank you!

amitdo commented 2 years ago

Tesseract has an option to blacklist characters. Consult the docs and/or ask in the forum about this option, but note that people have reported it does not work well with the LSTM-based models.
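For illustration, a sketch assuming the option meant here is the run-time variable tessedit_char_blacklist (an assumption, not confirmed in this thread). Given the caveat above about the LSTM models, treat it as an experiment; the file names are placeholders.

```python
# Pass the nikud/cantillation range as a blacklist on the command line.
import subprocess

nikud = "".join(chr(c) for c in range(0x05B0, 0x05C8))  # U+05B0..U+05C7
subprocess.run([
    "tesseract", "page.png", "page", "-l", "heb",
    "-c", f"tessedit_char_blacklist={nikud}",
], check=True)
```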

is there an easy way to delete it after recognition?

Yes. With a few lines of bash/perl/python script that removes the diacritics from the txt/hocr output. You'll have to write this yourself...
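For illustration, a minimal Python sketch of such a post-processing script ("output.txt" is a placeholder). It strips points and cantillation marks but deliberately keeps maqaf, paseq, and sof pasuq, since those are punctuation rather than diacritics.

```python
# Strip Hebrew points (nikud) and cantillation marks from plain-text OCR output.
import re

DIACRITICS = re.compile(r"[\u0591-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7]")

def strip_nikud(text: str) -> str:
    return DIACRITICS.sub("", text)

with open("output.txt", encoding="utf-8") as f:
    print(strip_nikud(f.read()))
```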

Please use the forum to ask questions.

AvtechScientific commented 2 years ago

@amitdo - I asked here because my question relates directly to this issue and contributes to its treatment.

By blacklisting, do you mean forbidden_characters? If yes, then it seems like it only has meaning during training data generation. If so, then it looks like in order to make Tesseract recognize text while ignoring nikud (the feature you've requested above), you have to create two separate files: heb.traineddata (like the current one) and heb_nikudless.traineddata (with training data lacking nikud). Am I right? Or is there a way to "blacklist" certain characters during recognition? If so, you could run Tesseract with the nikud-aware traineddata but tell it to ignore the blacklisted letters (despite the fact that they are being recognized)...