The results are certainly of interest.
File merged.words.count.txt has 25190 lines. File merged.nfc.words.count.txt has 24898 lines.
Difference: 292 word forms disappeared from the counted list as a result of the normalization, presumably because they became identical to other words already in the list. This is approximately 1.16% of the original number of counted words.
Which words were changed by the Unicode normalization can be examined using WinMerge or any other good comparison utility.
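The same check can also be scripted. Here is a minimal Python sketch (assuming the counted word list is a plain UTF-8 text file; the codepoints are printed because the two forms can look identical on screen):

```python
# List every line of the counted word list whose NFC form differs from
# the original, i.e. the words affected by normalization.
import unicodedata

with open("merged.words.count.txt", encoding="utf-8") as f:
    for line in f:
        line = line.rstrip("\n")
        nfc = unicodedata.normalize("NFC", line)
        if nfc != line:
            # Print codepoints, since the two forms may render identically.
            print(" ".join(f"U+{ord(c):04X}" for c in line))
            print(" ".join(f"U+{ord(c):04X}" for c in nfc))
```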
This is an important issue for module making.
If the Punjabi Bible text turns out to be in any way incorrect as a consequence of normalization to NFC, then the instructions to CrossWire must make this very clear when the project is submitted for building the SWORD module.
There is a command line switch (-N) for osis2mod that prevents normalization to NFC, and it can be used. But because this is NOT the default, the instructions must spell it out clearly.
The comparison of the word count character frequencies is done in the attached Excel file. It merits detailed consideration.
merged.vs.merged.nfc.words.count.character.frequency.xlsx
Six Gurmukhi codepoints disappear as a result of the conversion to NFC. See below.
Unicode Normalization does have a significant effect on these four GURMUKHI LETTERS: KHHA, GHHA, ZA, FA (code points in the range U+0A59 to U+0A5E).
These complex letters are split into a simpler letter plus a sign: the lower dot that is part of the original letter becomes the separate codepoint U+0A3C GURMUKHI SIGN NUKTA (= pairin bindi). NB. This transformation can be inspected using Fonts | Simple Rendering in BabelPad.
The same effect also occurs with GURMUKHI LETTERS LLA, SHA.
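The six decompositions can be demonstrated with a few lines of Python (a sketch using the standard unicodedata module):

```python
# Show how NFC splits each of the six precomposed Gurmukhi letters
# into a base letter plus U+0A3C GURMUKHI SIGN NUKTA.
import unicodedata

for ch in "\u0A33\u0A36\u0A59\u0A5A\u0A5B\u0A5E":  # LLA SHA KHHA GHHA ZA FA
    nfc = unicodedata.normalize("NFC", ch)
    parts = " + ".join(f"U+{ord(c):04X} {unicodedata.name(c)}" for c in nfc)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {parts}")
```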
Of particular significance is that some Unicode fonts (e.g. Code2000) then show the NUKTA dot displaced to the right.
However, this also depends on which font engine the application uses; e.g. BabelPad displays these letters differently from SIL FieldWorks WorldPad.
I am still concerned that converting to NFC may not meet the requirements of accuracy.
FIO. Information on the Gurmukhi alphabet.
Today, I have enhanced my TextPipe filter to include a secondary output that counts all the unique Glyphs in verse text. The method was to divert the uncounted words to a T-filter which then split each word by inserting a line feed before each Gurmukhi LETTER, thus leaving all signs as part of a Glyph. Unique Glyphs were then counted and sent in sorted order to the secondary output file.
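A rough Python equivalent of that T-filter (a sketch, not the actual TextPipe code; the letter ranges are taken from the Gurmukhi block and may need adjusting):

```python
# Split words into glyph clusters by breaking before each Gurmukhi LETTER,
# so that vowel signs, NUKTA, etc. stay attached to their letter.
import re
from collections import Counter

LETTER = re.compile(r"(?=[\u0A05-\u0A0A\u0A0F\u0A10\u0A13-\u0A28\u0A2A-\u0A30"
                    r"\u0A32\u0A33\u0A35\u0A36\u0A38\u0A39\u0A59-\u0A5C\u0A5E])")

def glyphs(word: str) -> list[str]:
    return [g for g in LETTER.split(word) if g]

counts = Counter()
for word in ["\u0A2A\u0A30\u0A2E\u0A47\u0A36\u0A41\u0A30"]:  # e.g. ਪਰਮੇਸ਼ੁਰ
    counts.update(glyphs(word))

for glyph in sorted(counts):
    print(" ".join(f"U+{ord(c):04X}" for c in glyph), counts[glyph])
```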
Further analysis of this followed. A copy of the counted Glyphs file was converted to NFC so that it could be compared. 74 normalizations occurred, all due to the separation of the GURMUKHI SIGN NUKTA from the six complex letters noted earlier.
I have pasted the before and after results into an Excel worksheet, and added a column that uses the LEN function to check whether the cell texts have equal length.
Gurmukhi Glyphs Before & After NFC.xlsx
Use Data | Filter | AutoFilter to select those where column D = FALSE to see these 74 items.
The same worksheet also illustrates the issues of spurious vowel signs and other signs. I manually formatted these cells with a red font. AutoFilter can be used to select cells by colour.
Initial test builds of a SWORD module should be done with -N as an option in the osis2mod command line.
i.e. At least until we have reached a decision regarding normalization.
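For example (a sketch; the module folder and OSIS file names here are placeholders only):

```
osis2mod ./modules/texts/rawtext/punjabi/ punjabi.osis.xml -N
```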
Note for making the OSIS XML file, please take care!

- usfm2osis.py does not normalize UTF-8 to NFC.
- u2o.py has an option not to normalize to NFC.
- The language code for Eastern Punjabi is pan.
If the SWORD module is made with the Unicode not normalized, users of most front-end apps should encounter no difficulty in searching for a word that contains (say) the Gurmukhi letter SHA (e.g. ਪਰਮੇਸ਼ੁਰ). However, this is not the full story.
If the SWORD module is made with the Unicode normalized to NFC, searching for ਪਰਮੇਸ਼ੁਰ (which contains the letter SHA) will not give ANY matches at all. Users would have to search instead for ਪਰਮੇਸ਼ੁਰ (in which the SHA has been normalized to SA plus NUKTA).
NB. Not all front-ends have a (hidden?) feature to automatically normalize the search string when it is entered.
Users accustomed to writing the letters LLA SHA KHHA GHHA ZA FA instead of the letters LA SA KHA GA JA PHA with a NUKTA sign will thus experience difficulties with the search feature.
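The mismatch is easy to reproduce, and it also shows the obvious remedy for a front-end: normalize the search string to the same form as the indexed text. A Python sketch (no particular front-end API is implied):

```python
# The precomposed and decomposed spellings of the same word do not match
# as raw strings, but do match once both are normalized to NFC.
import unicodedata

indexed = "\u0A2A\u0A30\u0A2E\u0A47\u0A38\u0A3C\u0A41\u0A30"  # SA + NUKTA
query   = "\u0A2A\u0A30\u0A2E\u0A47\u0A36\u0A41\u0A30"        # precomposed SHA

print(indexed == query)   # False: the codepoint sequences differ
print(unicodedata.normalize("NFC", indexed) ==
      unicodedata.normalize("NFC", query))  # True
```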
I would still recommend that the module be made with the Unicode normalized to NFC, because some front-ends (e.g. Xiphos) will not display the search results correctly otherwise.
The much more significant observation is that none of the four Normalization Forms (NFC, NFD, NFKC, NFKD) preserves the compound letters LLA SHA KHHA GHHA ZA FA. They all separate out the NUKTA sign.
The Gurmukhi block has been part of Unicode since version 1.0.0 (October 1991).
It seems to me that Unicode Normalization treats these six letters as if they were more like Presentation Forms. They are certainly precomposed characters.
I imagine we'll face a similar problem for other North Indian languages whose script is descended from the ancient Brahmi script.
Indeed, the composition exclusions include characters from these Indic scripts: Devanagari, Bengali, Gurmukhi, and Oriya (q.v.):

> Canonically decomposable characters that are generally not the preferred form for particular scripts.
I have since updated the issue in Xiphos.
The preview pane display issue is not caused by normalization.
It's caused by the existence of the module's Lucene search index.
I wrote to an expert to see if he could shed any light on the subject.
The thing that puzzles me is why NFC (or NFKC) does not leave these precomposed characters alone.
Here's the helpful reply I had from Andrew West of BabelStone.
> I don't know the reasons why, but there are certain precomposed Indic letters that decompose but do not recompose under any normalization form. The list of these characters is given here:
> http://unicode.org/Public/UNIDATA/CompositionExclusions.txt
>
> As indicated at http://www.unicode.org/reports/tr15/#Primary_Exclusion_List_Table the decomposed forms are preferred over the precomposed forms.
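That behaviour is easy to verify (a Python sketch; the six codepoints are those identified above): each letter has a canonical decomposition, yet NFC does not recompose the decomposed pair, precisely because the letter is on the composition-exclusion list.

```python
# Show that the six Gurmukhi letters decompose but never recompose.
import unicodedata

for cp in (0x0A33, 0x0A36, 0x0A59, 0x0A5A, 0x0A5B, 0x0A5E):
    decomp = unicodedata.decomposition(chr(cp))   # e.g. '0A38 0A3C' for SHA
    pair = "".join(chr(int(h, 16)) for h in decomp.split())
    recomposed = unicodedata.normalize("NFC", pair)
    print(f"U+{cp:04X} decomposes to {decomp}; "
          f"NFC recomposes it: {recomposed == chr(cp)}")
```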
I updated my Excel file to make use of the EXACT function to test for cell equality.
Gurmukhi Glyphs Before & After NFC.xlsx
That's better than the LEN function I had resorted to earlier.
The following text ought to be added to the README.md file.
This is very important for anyone processing this translation for use with Bible software.
> Canonically decomposable characters ... are generally not the preferred form for particular scripts.
I have just added that note to the README.md file in the process branch of my fork.
Pull request to follow.
See pull request #113
By default, the module making tool osis2mod normalizes the input Unicode text to NFC.
Using BabelPad to convert the concatenated USFM file to NFC gives rise to 24022 normalizations.
I imagine that this should have no semantic or presentational issues for Punjabi in the Gurmukhi script. However, I'm not the Punjabi expert.
The way to check would be to perform a detailed comparison between the original text and the file saved after conversion to NFC.
Doing that comparison on the concatenated USFM file would be very tedious. It would be more instructive to do it on the counted word list; converting that gave only 2142 normalizations.
Likewise, generating the counted word list from the normalized concatenated USFM file should yield fewer lines, because any words that had been keyed with different codepoint sequences would become identical as a result of normalization.
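That prediction can be tested with a short script (a sketch; the file name and the crude tokenizer are placeholders, not the TextPipe filter used above):

```python
# Count unique Gurmukhi words before and after NFC normalization.
# The second count should never be larger than the first.
import re
import unicodedata

text = open("merged.usfm", encoding="utf-8").read()   # hypothetical file name
words = re.findall(r"[\u0A00-\u0A7F]+", text)

unique_before = set(words)
unique_after = {unicodedata.normalize("NFC", w) for w in words}
print(len(unique_before), len(unique_after))
```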