tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Superscripts & subscripts #62

Open amitdo opened 7 years ago

amitdo commented 7 years ago

Copied from #59:


@Shreeshrii commented

Just checking whether this new training will also address:

  1. Correct handling of superscripts

@theraysmith commented

  1. Correct handling of superscripts

Beyond the scope of this change. Sub/superscripts are much harder to deal with, as they have to be trained, and that means incorporating them correctly into the training path and working out how to pass the information back out of the line recognizer to the output. At the moment it seems the iterator supports discovery of sub/super, but there is no output renderer that handles it. (Not even hOCR?)

Question: For which languages/scripts is it desirable to support sub/super?


Shreeshrii commented

Regarding superscripts/subscripts etc, I can point out three cases based on the languages I know.

a. English - books, theses, etc. have a number of footnotes referred to in the text with superscripts. I guess this will apply to all languages written in Latin script. Usually this will be at the end of words.

b. Tamil - Sanskrit texts transliterated in Tamil script use superscripts/subscripts 2, 3, 4 (sometimes 1 also) to distinguish between different sounds (to support the Sanskrit alphabet, which does not have a direct mapping in Tamil script). These can actually be in the middle of Tamil words.

c. Hindi, Sanskrit and other Indian languages - Hindi books, theses, etc. use superscripts for referring to footnotes (similar to English above). The difference is that in some cases these use the Latin-script digits 0-9 and in some cases Devanagari digits (in the case of Hindi, Sanskrit, etc.). Unicode has superscripts 0-9 for Latin script but not for Devanagari script. I would suggest support for the Latin script superscript numbers.

Scanned pages with Devanagari superscripts should also be mapped to the Latin script superscript numbers. Similarly for other Indian languages.
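The mapping suggested above could be applied to ground-truth text roughly as follows. This is only a sketch: the helper name `footnote_marker_to_superscript` is hypothetical, and it assumes the footnote marker has already been isolated from the surrounding word, since ordinary Devanagari digits in running text must not be converted.

```python
# Devanagari digits occupy U+0966..U+096F (zero through nine).
DEVANAGARI_DIGITS = "".join(chr(0x0966 + d) for d in range(10))

# Unicode superscript digits 0-9: 1, 2, 3 live in Latin-1 (U+00B9, U+00B2, U+00B3),
# the rest in the Superscripts and Subscripts block.
SUPERSCRIPT_DIGITS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"

DEVANAGARI_TO_SUPERSCRIPT = str.maketrans(DEVANAGARI_DIGITS, SUPERSCRIPT_DIGITS)


def footnote_marker_to_superscript(marker: str) -> str:
    """Rewrite a footnote marker written in Devanagari digits as Latin superscript digits.

    Hypothetical helper: the caller is assumed to pass only the marker itself.
    """
    return marker.translate(DEVANAGARI_TO_SUPERSCRIPT)


if __name__ == "__main__":
    # Devanagari "12" becomes superscript "12".
    print(footnote_marker_to_superscript("\u0967\u0968"))
```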


@stweil commented

English - books, theses, etc. have a number of footnotes referred to in the text with superscripts. I guess this will apply to all languages written in Latin script. Usually this will be at the end of words.

At least it applies to German. There are also superscripts after punctuation characters at the end of sentences.

Should all superscripts be handled in the same way, or do we need a different handling for those superscripts which have their own Unicode code point, like ¹, ² or ³?


Shreeshrii commented

See page 3 in http://sanskritdocuments.org/doc_ganesha/gaNanAyak8-ta.pdf for superscripts usage in Tamil.

Unicode has subscripted and superscripted versions of a number of characters including a full set of Arabic numerals.

The most common superscript digits (1, 2, and 3) were in ISO-8859-1 and were therefore carried over into those positions in the Latin-1 range of Unicode. The rest were placed in a dedicated section of Unicode at U+2070 to U+209F.
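To make the split between the two ranges concrete, here is a small sketch using Python's standard `unicodedata` module. The helper name `describe` is just for illustration; it lists each superscript and subscript digit with its code point, its Unicode name, and whether it falls in the Latin-1 range.

```python
import unicodedata

# Superscript digits 0-9 in numeric order; note that 1, 2 and 3 are not contiguous
# with the others.
SUPERSCRIPT_DIGITS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
# Subscript digits 0-9 are contiguous at U+2080..U+2089.
SUBSCRIPT_DIGITS = "".join(chr(0x2080 + d) for d in range(10))


def describe(chars):
    """Return (code point, Unicode name, in_latin1) for each character."""
    return [
        (f"U+{ord(c):04X}", unicodedata.name(c), ord(c) <= 0xFF)
        for c in chars
    ]


if __name__ == "__main__":
    for code, name, in_latin1 in describe(SUPERSCRIPT_DIGITS + SUBSCRIPT_DIGITS):
        block = "Latin-1 Supplement" if in_latin1 else "Superscripts and Subscripts"
        print(f"{code}  {name}  ({block})")
```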


Shreeshrii commented

Should all superscripts be handled in the same way, or do we need a different handling for those superscripts which have their own Unicode code point, like ¹, ² or ³?

All superscript digits have their own Unicode code points, though in different ranges. Not all fonts have support for all superscripts and subscripts.

Shreeshrii commented 7 years ago

Thanks, Amit!

Unicode Ranges

http://www.alanwood.net/unicode/latin_1_supplement.html

http://www.alanwood.net/unicode/superscripts_and_subscripts.html

Samples

See page 3 in http://sanskritdocuments.org/doc_ganesha/gaNanAyak8-ta.pdf for superscripts usage in Tamil.

Sample of subscript numbers usage in Tamil - http://srivaishnavam.com/stotras/sristuti_tamil.pdf

Shreeshrii commented 7 years ago

Sample of Sanskrit text with superscripts in Devanagari digits

bhaktimanjari61

Shreeshrii commented 7 years ago

Please see https://github.com/tesseract-ocr/langdata/issues/40 for more Devanagari samples

Shreeshrii commented 7 years ago

Sample of English text with numbers as well as asterisks as superscripts for links to footnotes

pages 26-28 in http://gretil.sub.uni-goettingen.de/gretil_elib/Suk9441__Sukthankar_MemorialEd_1_CritStud_Mbh_1944.pdf

amitdo commented 7 years ago

Hebrew also uses superscripts for referring to footnotes.

הפנייה[12]

amitdo commented 7 years ago

At the moment it seems the iterator supports discovery of sub/super, but there is no output renderer that handles it. (Not even hocr?)

For hOCR see https://kba.github.io/hocr-spec/1.2/#sub-sup

Shreeshrii commented 7 years ago

The Devanagari Extended range has combining Devanagari digits, but these are not 'superscripts' but rather Vedic accent signs. See

http://www.alanwood.net/unicode/devanagari-extended.html

supported in Siddhanta font.

Shreeshrii commented 7 years ago

I would suggest adding them to https://github.com/tesseract-ocr/langdata/blob/master/eng/desired_characters

²
³
¹
⁰
⁴
⁵
⁶
⁷
⁸
⁹
₀
₁
₂
₃
₄
₅
₆
₇
₈
₉

but maybe it is too late for this training run.

x² x³ x¹
x⁰ x⁴ x⁵ x⁶ x⁷ x⁸ x⁹
x₀ x₁ x₂ x₃ x₄ x₅ x₆ x₇ x₈ x₉

Shreeshrii commented 7 years ago

Sample of an English page

http://www.britishmuseum.org/research/collection_online/collection_object_details.aspx?objectId=72549&partId=1&searchText=roberts+bentley&page=1

theraysmith commented 7 years ago

Thanks for the answers on sub/superscripts. Those Hindi superscripts are very different to the way they are shown in Latin. They seem to appear at the beginning of words. How would you expect them to appear in the output character sequence? Before the word?

I previously had the Unicode superscripts 1-9 in the desired characters, but took them out, as they didn't occur in the training corpus text. I don't think that is a sufficiently general solution. The network needs to learn a sub/superscript begin and end code, analogous to the way they are encoded in HTML, and learn to allow at least 0-9, [], +/-, a-z as sub/superscript characters, learning to put them between the start and end codes. This method could generalize to other scripts, but I think I either need to go back to the www crawl and keep the <sub>/<sup> tags or make totally artificial sub/superscript content. The former could open a whole can of worms, as it could impact all the language model generation, and the latter is just a mess in generating the training text. Hence the question: it would be easy enough to make fake content for Latin, but not so easy for other scripts. The Hindi examples don't encourage me that it is easy to generate realistic fake content for non-Latin scripts.
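As an illustration of the begin/end-code idea, the sketch below rewrites runs of Unicode superscript and subscript digits as ordinary digits wrapped in HTML-style markers. It is not Tesseract's training pipeline: the function name `to_tagged` and the choice of `<sup>`/`<sub>` strings as the start and end codes are assumptions made for the example, and only digits are covered.

```python
import re

SUPERSCRIPT_DIGITS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
SUBSCRIPT_DIGITS = "".join(chr(0x2080 + d) for d in range(10))

# Translation tables from super/subscript digits back to plain ASCII digits.
SUP_TO_PLAIN = str.maketrans(SUPERSCRIPT_DIGITS, "0123456789")
SUB_TO_PLAIN = str.maketrans(SUBSCRIPT_DIGITS, "0123456789")

# One or more consecutive superscript (or subscript) digits form a single run.
SUP_RUN = re.compile("[" + SUPERSCRIPT_DIGITS + "]+")
SUB_RUN = re.compile("[" + SUBSCRIPT_DIGITS + "]+")


def to_tagged(text: str) -> str:
    """Replace super/subscript digit runs with plain digits between start and end codes.

    The marker strings are placeholders chosen for this sketch.
    """
    text = SUP_RUN.sub(
        lambda m: "<sup>" + m.group().translate(SUP_TO_PLAIN) + "</sup>", text
    )
    text = SUB_RUN.sub(
        lambda m: "<sub>" + m.group().translate(SUB_TO_PLAIN) + "</sub>", text
    )
    return text


if __name__ == "__main__":
    print(to_tagged("x\u00b2 + H\u2082O, see note\u00b9\u00b2."))
```

With this encoding the model would only need the plain digits plus four extra codes, rather than a separate class for every super/subscript character.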

Shreeshrii commented 7 years ago

There is no standard regarding superscripts with Hindi, it seems.

Please see the other sample at https://cloud.githubusercontent.com/assets/5095331/21923440/1d19cc32-d99a-11e6-9a2b-ccaaa8cc86f2.png also.

I think if they can be supported for end of words or after punctuation, similar to Latin scripts, that in itself would be great.

Shreeshrii commented 7 years ago

For cases similar to the first sample, where they seem to be at the beginning of words, it will be OK to put them before the word.

amitdo commented 7 years ago

For Hebrew 0-9 [] would cover most common cases for superscript.

It always appears at the end of the word, to its left.

Shreeshrii commented 7 years ago

https://hi.m.wikipedia.org/wiki/%E0%A4%AD%E0%A4%BE%E0%A4%B0%E0%A4%A4

Shows superscript links to footnotes/sources, within square brackets.

Shreeshrii commented 6 years ago

@theraysmith

I thought I could add these via plusminus type of finetune training, but the number superscripts are getting normalized to the regular numbers.

I changed the Common.unicharset to use the number superscripts as the normalized form, but it is getting overwritten when the starter traineddata is created.

Lines from the modified Common.unicharset:

² 0 123,156,224,255,0.638911,0.105973,0.12853,0.0857674,0.752544,0.121773 Common 103 2 103 ²    # ² [b2 ]
³ 0 120,155,224,255,0.627422,0.109856,0.124728,0.0940968,0.720739,0.123547 Common 124 2 124 ³   # ³ [b3 ]
¹ 0 124,158,221,255,0.420234,0.124322,0.182872,0.109747,0.550712,0.155342 Common 121 2 121 ¹    # ¹ [b9 ]
⁰ 0 134,167,235,255,0.697626,0.0792543,0.133325,0.0942472,0.723922,0.120788 Common 79 2 79 ⁰    # ⁰ [2070 ]
⁴ 0 131,171,233,255,0.670567,0.0739127,0.104638,0.115274,0.692275,0.10575 Common 22 2 22 ⁴  # ⁴ [2074 ]
⁵ 0 129,167,235,255,0.634866,0.0973672,0.352014,0.125467,0.641662,0.108137 Common 40 2 40 ⁵ # ⁵ [2075 ]
⁶ 0 134,167,235,255,0.679818,0.0828177,0.322725,0.141146,0.695846,0.119094 Common 66 2 66 ⁶ # ⁶ [2076 ]
⁷ 0 132,171,235,255,0.623803,0.0793659,0.573739,0.832972,0.638961,0.106538 Common 116 2 116 ⁷   # ⁷ [2077 ]
⁸ 0 134,167,235,255,0.670599,0.0863238,0.362624,0.18757,0.681561,0.107008 Common 65 2 65 ⁸  # ⁸ [2078 ]
⁹ 0 131,170,235,255,0.678503,0.0852865,0.330318,0.10444,0.690038,0.108946 Common 60 2 60 ⁹  # ⁹ [2079 ]

But the generated starterdata unicharset has the following:

² 0 126,192,205,255,46,116,0,92,61,236 Common 92 2 92 2 # ² [b2 ]
³ 0 122,192,205,255,46,110,0,90,63,236 Common 9 2 9 3   # ³ [b3 ]
¹ 0 126,192,207,255,22,84,1,97,48,236 Common 108 2 108 1    # ¹ [b9 ]
⁰ 0 125,144,226,249,53,99,0,52,75,167 Common 106 2 106 0    # ⁰ [2070 ]
⁴ 0 128,151,224,247,57,92,0,54,75,167 Common 111 2 111 4    # ⁴ [2074 ]
⁵ 0 125,144,224,249,51,92,0,50,75,167 Common 112 2 112 5    # ⁵ [2075 ]
⁶ 0 125,144,226,249,55,96,0,54,75,167 Common 113 2 113 6    # ⁶ [2076 ]
⁷ 0 128,144,224,247,53,88,0,56,75,167 Common 114 2 114 7    # ⁷ [2077 ]
⁸ 0 125,144,226,249,57,99,0,52,75,167 Common 107 2 107 8    # ⁸ [2078 ]
⁹ 0 127,142,228,249,55,98,0,56,75,167 Common 115 2 115 9    # ⁹ [2079 ]

So, these values are NOT coming from Common.unicharset, but being generated.
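The difference between the two listings can be checked mechanically. The sketch below assumes, based only on the lines shown above, that the normalized form is the last whitespace-separated field before the `#` comment; the helper names `normalized_form` and `lost_superscripts` are hypothetical, and a unicharset entry for the `#` character itself would need special handling.

```python
def normalized_form(line: str):
    """Return (character, normalized form) for one unicharset line.

    Assumes the normalized form is the last field before the '#' comment.
    """
    fields = line.split("#", 1)[0].split()
    return fields[0], fields[-1]


def lost_superscripts(lines):
    """List the entries whose normalized form differs from the character itself."""
    pairs = (normalized_form(line) for line in lines if line.strip())
    return [(ch, norm) for ch, norm in pairs if ch != norm]


if __name__ == "__main__":
    # First line of each listing quoted in this comment.
    modified = [
        "\u00b2 0 123,156,224,255,0.638911,0.105973,0.12853,0.0857674,"
        "0.752544,0.121773 Common 103 2 103 \u00b2    # \u00b2 [b2 ]",
    ]
    generated = [
        "\u00b2 0 126,192,205,255,46,116,0,92,61,236 Common 92 2 92 2 # \u00b2 [b2 ]",
    ]
    print(lost_superscripts(modified))   # nothing rewritten in the hand-edited file
    print(lost_superscripts(generated))  # the superscript was normalized to a plain digit
```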

Please see https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization

Unicode provides standard normalization algorithms that produce a unique (normal) code point sequence for all sequences that are equivalent; the equivalence criteria can be either canonical (NF) or compatibility (NFK). Since one can arbitrarily choose the representative element of an equivalence class, multiple canonical forms are possible for each equivalence criterion. Unicode provides two normal forms that are semantically meaningful for each of the two compatibility criteria: the composed forms NFC and NFKC, and the decomposed forms NFD and NFKD. Both the composed and decomposed forms impose a canonical ordering on the code point sequence, which is necessary for the normal forms to be unique.

In order to compare or search Unicode strings, software can use either composed or decomposed forms; this choice does not matter as long as it is the same for all strings involved in a search, comparison, etc. On the other hand, the choice of equivalence criteria can affect search results. For instance some typographic ligatures like U+FB03 (ffi), roman numerals like U+2168 (Ⅸ) and even subscripts and superscripts, e.g. U+2075 (⁵) have their own Unicode code points. Canonical normalization (NF) does not affect any of these, but compatibility normalization (NFK) will decompose the ffi ligature into the constituent letters, so a search for U+0066 (f) as substring would succeed in an NFKC normalization of U+FB03 but not in NFC normalization of U+FB03. Likewise when searching for the Latin letter I (U+0049) in the precomposed Roman Numeral Ⅸ (U+2168). Similarly the superscript "⁵" (U+2075) is transformed to "5" (U+0035) by compatibility mapping.

Should canonical normalization be used instead of compatibility mapping?
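The effect described in the quoted passage is easy to reproduce with Python's standard `unicodedata.normalize`. This only demonstrates the Unicode normal forms themselves; whether Tesseract's training tools apply exactly this normalization is the open question of this comment, not something the sketch settles. The helper name `compare` is just for illustration.

```python
import unicodedata


def compare(ch: str):
    """Return the (NFC, NFKC) normalizations of a string."""
    return unicodedata.normalize("NFC", ch), unicodedata.normalize("NFKC", ch)


if __name__ == "__main__":
    samples = {
        "SUPERSCRIPT TWO": "\u00b2",
        "SUPERSCRIPT FIVE": "\u2075",
        "SUBSCRIPT TWO": "\u2082",
        "LATIN SMALL LIGATURE FFI": "\ufb03",
        "ROMAN NUMERAL NINE": "\u2168",
    }
    for name, ch in samples.items():
        nfc, nfkc = compare(ch)
        # NFC leaves these characters alone; NFKC replaces them with their
        # compatibility equivalents, which is how the superscript is lost.
        print(f"{name}: NFC={nfc!r} NFKC={nfkc!r}")
```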

@stweil Have you tried training for these?

Shreeshrii commented 6 years ago

FYI, I was looking at this in the context of a test training for handling mathematical formulae. Here is the training text that I was using.

eng.superscripts.txt

Shreeshrii commented 6 years ago

Related: NFKC normalization https://github.com/tesseract-ocr/tesseract/issues/1852