tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Yiddish #85

Open amitdo opened 6 years ago

amitdo commented 6 years ago

From #82

@theraysmith commented

OK, I have added desired/forbidden characters for heb and yid. I assume that, apart from the 3 unique characters that you listed (for each), the list of nikuds should be the same?

I'm not sure. I don't speak (or write) Yiddish.

amitdo commented 6 years ago

https://en.wikipedia.org/wiki/Yiddish_orthography

amitdo commented 6 years ago

https://en.wikipedia.org/wiki/Yiddish_orthography#Punctuation

So, the 3 punctuation marks I mentioned are used for Yiddish too.

05BE ־ HEBREW PUNCTUATION MAQAF
05F3 ׳ HEBREW PUNCTUATION GERESH
05F4 ״ HEBREW PUNCTUATION GERSHAYIM
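The three code points listed above can be checked against the Unicode character database. The following is a minimal sketch using Python's standard `unicodedata` module; the `MARKS` list and the `describe` helper are illustrative names, not part of langdata or Tesseract.

```python
import unicodedata

# Code points quoted in the comment above (maqaf, geresh, gershayim).
MARKS = [0x05BE, 0x05F3, 0x05F4]

def describe(cp):
    """Return 'XXXX NAME' for a code point, using the Unicode database."""
    return f"{cp:04X} {unicodedata.name(chr(cp))}"

for cp in MARKS:
    print(describe(cp))
```

Running it prints one line per code point in the same hex-plus-name layout used in the comment.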

theraysmith commented 6 years ago

Thanks for looking it up!

amitdo commented 6 years ago

I'm still not sure about the nikud for Yiddish.

It seems that it does not use all the Hebrew nikud signs.
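One way to settle which nikud signs actually occur would be to tally the combining marks in a sample of Yiddish text. The sketch below assumes the text is available as a Python string; `count_hebrew_marks` and the `sample` string are hypothetical and are not part of the langdata tooling. It counts every nonspacing mark (category `Mn`) in the Hebrew block, so cantillation accents would be counted as well.

```python
import unicodedata
from collections import Counter

def count_hebrew_marks(text):
    """Count nonspacing marks (category Mn) in the Hebrew block U+0590..U+05FF."""
    counts = Counter()
    for ch in text:
        if "\u0590" <= ch <= "\u05FF" and unicodedata.category(ch) == "Mn":
            counts[unicodedata.name(ch)] += 1
    return counts

# Hypothetical sample (alef + patah, alef + qamats); replace with real training text.
sample = "\u05d0\u05b7 \u05d0\u05b8"
for name, n in count_hebrew_marks(sample).most_common():
    print(n, name)
```

Marks that never appear in a representative corpus would be candidates to leave out of the desired-characters list; marks that appear only rarely may come from Hebrew words embedded in the text.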

theraysmith commented 6 years ago

I don't think it will be harmful, unless there are a lot of Hebrew words marked as Yiddish, which is possible I imagine.

amitdo commented 6 years ago

Actually, it seems that there are quite a lot of Hebrew words in Yiddish.