Counted Gurmukhi Glyphs Analysis (Excel worksheet based on my TextPipe filter output)

DavidHaslam commented 7 years ago

I've already referred to this worksheet in two other issues, and it overlaps with some of the more specific issues, but it's probably useful to give it its own issue as a general topic.

I've just updated my Excel worksheet to include Column E for the Unicode Names of the original Gurmukhi Unicode codepoints in each counted glyph. In addition, I have formatted in red font the names of the invalid parts of the 99 glyphs that break the rules for the Gurmukhi script as an Abugida.

Gurmukhi Glyphs Before & After NFC.xlsx

It's conceivable that some of the 99 badly formed glyph types were not reported in my earlier issues. This report therefore serves as a checklist or reference point for search and replace operations.
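For anyone wanting to repeat this kind of count without TextPipe, here is a minimal Python sketch of the idea. It is not the filter I used: the function names are made up for illustration, and it assumes a "glyph" is simply one base character plus whatever combining marks follow it.

```python
import unicodedata
from collections import Counter

def split_glyphs(text):
    """Split text into clusters: one base character plus the marks that follow it."""
    glyphs = []
    for ch in text:
        # Unicode categories Mn/Mc/Me are combining marks (vowel signs, tippi, bindi, ...).
        if unicodedata.category(ch).startswith("M") and glyphs:
            glyphs[-1] += ch
        else:
            glyphs.append(ch)
    return glyphs

def count_gurmukhi_glyphs(text):
    """Count clusters that contain at least one character from the Gurmukhi block."""
    return Counter(
        g for g in split_glyphs(text)
        if any("\u0A00" <= c <= "\u0A7F" for c in g)
    )

def describe(glyph):
    """Unicode names of the codepoints in a cluster, without the GURMUKHI prefix."""
    return " ".join(
        unicodedata.name(c, "UNNAMED").replace("GURMUKHI ", "") for c in glyph
    )

sample = "\u0A32\u0A3F\u0A70 \u0A32\u0A3F\u0A70 \u0A2E\u0A48\u0A02"
for glyph, n in count_gurmukhi_glyphs(sample).most_common():
    print(n, glyph, describe(glyph))
```

The `describe` helper gives the same kind of name string as Column E, e.g. LETTER LA VOWEL SIGN I TIPPI.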

NB. The worksheet is protected (with no password) merely to prevent accidental edits. Use of AutoFilter is permitted while it's protected.

DavidHaslam commented 7 years ago

If you filter on column B or C by colour, to display 99 of the 979 records, you can get the total number of badly formed glyphs by selecting the count range in column A and reading off the sum. It comes to 343.

The total number of glyphs is 1,725,674. Thus the error rate is about 199 PPM (343 / 1,725,674).
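The arithmetic behind that figure, as a quick check (the variable names are only for illustration):

```python
bad_glyphs = 343          # sum of column A over the 99 badly formed records
total_glyphs = 1_725_674  # total number of counted glyphs

# Parts per million = (bad / total) * 1,000,000
ppm = bad_glyphs / total_glyphs * 1_000_000
print(f"{ppm:.1f} PPM")   # rounds to 199 PPM
```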

Only 5 of the records are common to the bad glyphs and those that change due to NFC normalization.
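To see whether a given glyph is one of those that change under NFC, a check along these lines should do. This is a sketch using Python's standard `unicodedata` module, not part of my TextPipe filter; the function name is hypothetical. The example relies on GURMUKHI LETTER SHA (U+0A36) being a composition exclusion, so NFC leaves it as LETTER SA plus SIGN NUKTA.

```python
import unicodedata

def changes_under_nfc(glyph):
    """True if NFC normalization would alter the codepoint sequence of this glyph."""
    return unicodedata.normalize("NFC", glyph) != glyph

sha = "\u0A36"  # GURMUKHI LETTER SHA as a single precomposed codepoint
print(changes_under_nfc(sha))
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", sha)])
```

Note that NFC only decomposes, recomposes and reorders according to the Unicode tables. It does not repair a glyph that breaks the script rules, which fits with the small overlap between the two sets of records.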

DavidHaslam commented 7 years ago

The difficulty proofreaders face in detecting these badly formed glyphs should not be underestimated.

It's quite likely that the editing environment does not clearly display the letter placeholder dotted circle for any vowel or other sign that is either wrongly attached to a valid glyph or completely unattached.
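One way to make the completely unattached signs visible outside the editor is to scan for them and print each one on an explicit DOTTED CIRCLE (U+25CC). The sketch below is only an illustration with made-up function names. It assumes a sign is "attached" when it follows a Gurmukhi letter (directly or via other signs), so it does not catch a sign that is wrongly attached to an otherwise valid glyph.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"

def in_gurmukhi_block(ch):
    return "\u0A00" <= ch <= "\u0A7F"

def is_gurmukhi_mark(ch):
    # Vowel signs, tippi, addak, bindi, nukta, virama all have a category starting with M.
    return in_gurmukhi_block(ch) and unicodedata.category(ch).startswith("M")

def is_gurmukhi_letter(ch):
    # Consonants and independent vowels have category Lo.
    return in_gurmukhi_block(ch) and unicodedata.category(ch) == "Lo"

def find_unattached_signs(text):
    """Return (index, visible form) for each sign that has no letter to sit on."""
    hits = []
    attached = False  # True while inside a cluster that began with a letter
    for i, ch in enumerate(text):
        if is_gurmukhi_letter(ch):
            attached = True
        elif is_gurmukhi_mark(ch):
            if not attached:
                hits.append((i, DOTTED_CIRCLE + ch))
        else:
            attached = False
    return hits

# LETTER MA + VOWEL SIGN AI + BINDI, then a space and a stray TIPPI
print(find_unattached_signs("\u0A2E\u0A48\u0A02 \u0A70"))
```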

DavidHaslam commented 7 years ago

Row 666 of the worksheet is peculiar. I have formatted cell E666 with yellow fill. This badly formed glyph is ਲੰਿ LETTER LA TIPPI VOWEL SIGN I

The signs are in the reverse order! They should be swapped to become valid, thus: ਲਿੰ LETTER LA VOWEL SIGN I TIPPI

This occurs just once in Zephaniah 3:19 which reads:

\v 19 ਵੇਖੋ, ਮੈਂ ਉਸ ਸਮੇਂ ਤੇਰੇ ਸਭ ਦੁਖ ਦੇਣ ਵਾਲਿਆਂ ਨਾਲ ਨਜਿੱਠਾਂਗਾ, ਮੈਂ ਲੰਿਙਆਂ ਨੂੰ ਬਚਾਵਾਂਗਾ, ਅਤੇ ਹੱਕੇ ਹੋਇਆਂ ਨੂੰ ਇੱਕਠਾ ਕਰਾਂਗਾ, ਅਤੇ ਮੈਂ ਸਾਰੀ ਧਰਤੀ ਵਿੱਚ ਓਹਨਾਂ ਦੀ ਸ਼ਰਮ ਉਸਤਤ ਅਤੇ ਜਸ ਬਣਾਵਾਂਗਾ |

The word with the bad glyph is ਲੰਿਙਆਂ.

Even if this is "corrected" by swapping the two diacritic signs, the word still doesn't get translated properly by Google, so it's more likely that a better solution needs to be found here. A missing letter, maybe?
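For reference, the mechanical swap of the two signs could be done as follows. This is a sketch with a hypothetical function name. It only reorders TIPPI (U+0A70) and VOWEL SIGN I (U+0A3F) and does nothing about a possibly missing letter, so the result would still need checking by someone who reads Punjabi.

```python
TIPPI = "\u0A70"         # GURMUKHI TIPPI
VOWEL_SIGN_I = "\u0A3F"  # GURMUKHI VOWEL SIGN I

def swap_tippi_before_vowel_sign_i(text):
    """Reorder every TIPPI + VOWEL SIGN I pair to VOWEL SIGN I + TIPPI."""
    return text.replace(TIPPI + VOWEL_SIGN_I, VOWEL_SIGN_I + TIPPI)

bad = "\u0A32" + TIPPI + VOWEL_SIGN_I  # LETTER LA TIPPI VOWEL SIGN I
fixed = swap_tippi_before_vowel_sign_i(bad)
print([f"U+{ord(c):04X}" for c in fixed])
```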

This example also illustrates what I already observed above, that my systematic analysis has detected something that my earlier manual searches failed to find.

DavidHaslam commented 7 years ago

NB. My counted glyphs filter cannot in principle detect the occurrence of a duplicated Gurmukhi vowel letter. Those occurrences reported earlier were found by manual searches.
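A regular expression search can cover that gap. The sketch below is an assumption about what counts as a "vowel letter": it takes the independent vowels in the range U+0A05 to U+0A14 plus the vowel bearers IRI (U+0A72) and URA (U+0A73), and flags any of them that is immediately repeated. The pattern and function names are made up for illustration.

```python
import re

# Independent vowel letters (U+0A05..U+0A14) plus IRI and URA.
# This choice of range is an assumption, not taken from the worksheet.
DUPLICATED_VOWEL = re.compile(r"([\u0A05-\u0A14\u0A72\u0A73])\1")

def find_duplicated_vowel_letters(text):
    """Return (index, matched pair) for each immediately repeated vowel letter."""
    return [(m.start(), m.group(0)) for m in DUPLICATED_VOWEL.finditer(text)]

# LETTER A twice, then LETTER MA
print(find_duplicated_vowel_letters("\u0A05\u0A05\u0A2E"))
```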

DavidHaslam commented 7 years ago

Note that this analysis can be readily repeated once the related issues have been fixed. This will serve as a confirmation test for closing those issues.

DavidHaslam commented 7 years ago

Superseded by more recent analysis.

tfbf / Bible-Punjabi-Pavitr-Bible-1945

Counted Gurmukhi Glyphs Analysis (Excel worksheet based on my TextPipe filter output) #49