IMPORTANT: Analysis results after the last merge #81 by Joshy on 2017-01-11

DavidHaslam commented 7 years ago

After Joshy's merge #81 today, I did another pass on the USFM files using my various TextPipe filters.

Here are my analysis results.

Character frequency of concatenated USFM files: merged.usfm.character.frequency.txt

USFM Tag Statistics: merged.usfm.tags.count.txt

Gurmukhi words count: merged.words.count.txt

NB. Hyphenated words were split for this count.

Gurmukhi glyphs count: merged.glyphs.count.txt

As above but analysed in Excel: Gurmukhi glyphs count.xlsx

NB. The worksheet is protected, but without a password. Autofilter is still permitted.

With Valid?=FALSE, there are 87 types of invalid glyph, in a sum total of 323 locations.

Those figures do not include instances of repeated vowel letters, as opposed to repeated diacritics.

btw. The total number of glyphs in the text is 1726418. Hence 323 locations is equivalent to 187 parts per million.
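
The parts-per-million figure can be checked with a few lines of Python. This is only a sketch of the arithmetic; the two numbers are those quoted above, and the function name is made up for illustration.

```python
# Figures quoted in the comment above.
invalid_locations = 323
total_glyphs = 1_726_418


def parts_per_million(count: int, total: int) -> float:
    """Return count/total expressed in parts per million."""
    return count / total * 1_000_000


ppm = parts_per_million(invalid_locations, total_glyphs)
print(f"{ppm:.1f} ppm")  # rounds to 187 ppm
```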

DavidHaslam commented 7 years ago

Extracted from the Excel worksheet, filtered on Valid?=FALSE.

merged.glyphs.invalid.count.txt

I just did a search count in this file using regexp [\x{0A3E}-\x{0A4C}]{2,} and found only 22 matches.

This means we need a much more complex regexp search pattern to find all 87 types.
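
For anyone without TextPipe or Notepad++, the same simple search can be sketched in Python. This is an assumption-laden port, not the filter actually used: Python's re module has no \x{HHHH} escape, so the range is written with \u escapes, and the function name and sample string are invented for illustration.

```python
import re

# U+0A3E..U+0A4C spans the Gurmukhi dependent vowel signs
# (the range also includes a few unassigned code points).
REPEATED_VOWEL_SIGNS = re.compile("[\u0A3E-\u0A4C]{2,}")


def count_repeated_vowel_signs(text: str) -> int:
    """Count runs of two or more consecutive dependent vowel signs."""
    return len(REPEATED_VOWEL_SIGNS.findall(text))


# Made-up sample: KA + U + U (caught), then NA HA II BINDI BINDI (missed,
# because BINDI U+0A02 lies outside the vowel-sign range).
sample = "\u0A15\u0A41\u0A41 \u0A28\u0A39\u0A40\u0A02\u0A02"
print(count_repeated_vowel_signs(sample))
```

The second sample word shows why this pattern undercounts: a doubled BINDI is not a doubled vowel sign, so it needs its own alternative in the pattern.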

Further analysis required on my part.

DavidHaslam commented 7 years ago

The following complicated regexp has exactly 323 matches in the concatenated USFM file.

(\x{0A02}\x{0A02}|\x{0A05}\x{0A3E}|\x{0A05}\x{0A48}|\x{0A05}\x{0A4C}|\x{0A41}\x{0A41}|\x{0A41}\x{0A42}|\x{0A42}\x{0A42}|\x{0A47}\x{0A47}|\x{0A47}\x{0A48}|\x{0A4B}\x{0A4B}|\x{0A4C}\x{0A4C}|\x{0A70}\x{0A70}|\x{0A71}\x{0A02}|\x{0A71}\x{0A71})

It can be shortened to:

(\x{0A02}{2,}|\x{0A05}\x{0A3E}|\x{0A05}\x{0A48}|\x{0A05}\x{0A4C}|\x{0A41}{2,}|\x{0A41}\x{0A42}|\x{0A42}{2,}|\x{0A47}{2,}|\x{0A47}\x{0A48}|\x{0A4B}{2,}|\x{0A4C}{2,}|\x{0A70}{2,}|\x{0A71}\x{0A02}|\x{0A71}{2,})

Or the even shorter:

(\x{0A02}{2,}|\x{0A05}(\x{0A3E}|\x{0A48}|\x{0A4C})|\x{0A41}(\x{0A41}|\x{0A42})|\x{0A42}{2,}|\x{0A47}(\x{0A47}|\x{0A48})|\x{0A4B}{2,}|\x{0A4C}{2,}|\x{0A70}{2,}|\x{0A71}(\x{0A02}|\x{0A71}))
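
The three forms can be compared outside the editor with a short Python sketch. The patterns are copied from above; the helper names (pcre_to_python, count_matches) and the sample text are hypothetical, and the conversion step is needed only because Python's re module does not accept \x{HHHH}.

```python
import re


def pcre_to_python(pattern: str) -> str:
    r"""Replace each \x{HHHH} escape with the literal character it denotes."""
    return re.sub(r"\\x\{([0-9A-Fa-f]+)\}",
                  lambda m: chr(int(m.group(1), 16)), pattern)


PATTERNS = {
    "long": r"(\x{0A02}\x{0A02}|\x{0A05}\x{0A3E}|\x{0A05}\x{0A48}|\x{0A05}\x{0A4C}|\x{0A41}\x{0A41}|\x{0A41}\x{0A42}|\x{0A42}\x{0A42}|\x{0A47}\x{0A47}|\x{0A47}\x{0A48}|\x{0A4B}\x{0A4B}|\x{0A4C}\x{0A4C}|\x{0A70}\x{0A70}|\x{0A71}\x{0A02}|\x{0A71}\x{0A71})",
    "short": r"(\x{0A02}{2,}|\x{0A05}\x{0A3E}|\x{0A05}\x{0A48}|\x{0A05}\x{0A4C}|\x{0A41}{2,}|\x{0A41}\x{0A42}|\x{0A42}{2,}|\x{0A47}{2,}|\x{0A47}\x{0A48}|\x{0A4B}{2,}|\x{0A4C}{2,}|\x{0A70}{2,}|\x{0A71}\x{0A02}|\x{0A71}{2,})",
    "shortest": r"(\x{0A02}{2,}|\x{0A05}(\x{0A3E}|\x{0A48}|\x{0A4C})|\x{0A41}(\x{0A41}|\x{0A42})|\x{0A42}{2,}|\x{0A47}(\x{0A47}|\x{0A48})|\x{0A4B}{2,}|\x{0A4C}{2,}|\x{0A70}{2,}|\x{0A71}(\x{0A02}|\x{0A71}))",
}


def count_matches(pattern: str, text: str) -> int:
    """Count non-overlapping matches of a \\x{HHHH}-style pattern in text."""
    regex = re.compile(pcre_to_python(pattern))
    return sum(1 for _ in regex.finditer(text))


# Made-up sample: four invalid sequences followed by one valid syllable.
sample = " ".join([
    "\u0A28\u0A39\u0A40\u0A02\u0A02",  # doubled BINDI
    "\u0A05\u0A3E",                    # letter A + vowel sign AA
    "\u0A15\u0A47\u0A48",              # vowel sign EE + vowel sign AI
    "\u0A15\u0A71\u0A02",              # ADDAK + BINDI
    "\u0A15\u0A3E",                    # valid: KA + vowel sign AA
])
counts = {name: count_matches(p, sample) for name, p in PATTERNS.items()}
print(counts)
```

One caveat with the {2,} forms: they agree with the long form only while no mark is repeated four or more times in a row. A run of four BINDI signs counts as two matches in the long form but one in the shortened forms.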

DavidHaslam commented 7 years ago

Now here's a more creative application of the regexp search.

From the SWORD module, I had already exported the text using the diatheke utility such that each verse has its full scripture reference at the start of the line.

This is Verse Per Line (VPL) format with full book names rather than abbreviations.

Searching this file with Notepad++ gives the 323 invalid glyph verse locations:

Search results for complex regexp to locate 323 invalid glyphs.txt

Ignore the line numbers. It's the references that are the key.

Armed with this information, the brothers should now be in a much better position to make the corrections.

NB. These results do not include instances of repeated vowel letters.
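
The same kind of search over a VPL export can be sketched in Python for those not using Notepad++. Several things here are assumptions: that each line starts with "Book chapter:verse" followed by a space, the names INVALID, REFERENCE and invalid_verses, and the sample lines (which are invented, not real verses). The pattern is the shortest regexp above rewritten with character classes and \u escapes.

```python
import re
from typing import Iterable, Iterator, Tuple

# Shortest pattern from the earlier comment, rewritten for Python.
INVALID = re.compile(
    "\u0A02{2,}|\u0A05[\u0A3E\u0A48\u0A4C]|\u0A41[\u0A41\u0A42]|\u0A42{2,}"
    "|\u0A47[\u0A47\u0A48]|\u0A4B{2,}|\u0A4C{2,}|\u0A70{2,}|\u0A71[\u0A02\u0A71]"
)

# Assumed VPL layout: "<book name> <chapter>:<verse> <verse text>".
REFERENCE = re.compile(r"^(.+? \d+:\d+)\s")


def invalid_verses(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (reference, number of invalid sequences) for each affected line."""
    for line in lines:
        hits = len(INVALID.findall(line))
        if hits:
            ref = REFERENCE.match(line)
            yield (ref.group(1) if ref else line[:20], hits)


# Invented sample lines, for illustration only.
sample_lines = [
    "Samplebook 1:1 \u0A15\u0A3E",
    "Samplebook 1:2 \u0A28\u0A39\u0A40\u0A02\u0A02 \u0A15\u0A47\u0A48",
]
for reference, hits in invalid_verses(sample_lines):
    print(reference, hits)
```

Reporting the count per verse makes verses with more than one invalid sequence stand out.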

DavidHaslam commented 7 years ago

For convenience, and to be less distracting, here's the search results file without the line numbers.

Search results for complex regexp to locate 323 invalid glyphs.txt

Look for the dotted circle placeholder[s] in each line of the search results. btw. At least one verse has more than one invalid glyph (e.g. Deuteronomy 6:11 has 3).

Notes:

Best opened with a good Unicode text editor such as BabelPad.
Use Raavi or Code2000 as the font.

DavidHaslam commented 7 years ago

In order to make it even simpler for the anomalous glyphs to be located, here is a derived copy of the search results.

Search results for complex regexp to locate 323 invalid glyphs.rdl.tags.txt

Notes:

Duplicate lines were removed.
Each regexp match location is tagged by inserting the degree symbol ° immediately after the glyph.

So all you have to do now is search for the ° tag to find each of the 323 tagged items, decide what needs doing, and correct the text at the same place in the USFM file.
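
The two steps in the notes (removing duplicate lines and tagging each match) can be sketched in Python as follows. This is not the TextPipe filter itself; the function names are invented, and the pattern is the shortest regexp from the earlier comment rewritten with \u escapes. The tag goes immediately after the matched sequence, which is assumed to correspond to "immediately after the glyph".

```python
import re
from typing import Iterable, List

INVALID = re.compile(
    "\u0A02{2,}|\u0A05[\u0A3E\u0A48\u0A4C]|\u0A41[\u0A41\u0A42]|\u0A42{2,}"
    "|\u0A47[\u0A47\u0A48]|\u0A4B{2,}|\u0A4C{2,}|\u0A70{2,}|\u0A71[\u0A02\u0A71]"
)
TAG = "\u00B0"  # degree sign


def tag_matches(line: str) -> str:
    """Insert the degree sign immediately after every invalid sequence."""
    return INVALID.sub(lambda m: m.group(0) + TAG, line)


def dedupe(lines: Iterable[str]) -> List[str]:
    """Drop repeated lines while keeping first-seen order."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique


# Invented sample: the same line twice, containing one doubled BINDI.
lines = ["\u0A28\u0A39\u0A40\u0A02\u0A02 x"] * 2
tagged = [tag_matches(line) for line in dedupe(lines)]
print(tagged)
```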

If you view the file with BabelPad, I suggest you set Options | Display Colours | Colour Code by Script

Gurmukhi text will be coloured red, and the tags will be black - just like numbers and other punctuation.

DavidHaslam commented 7 years ago

It might help for me to go even further....

I just made a TextPipe filter to process the 66 USFM files and tag every suspect glyph in exactly the same manner, with the degree symbol °.

The output filter is set to "Only output modified files".

There were just 25 files output. The Excel worksheet records the details.

Invalid.Glyphs.Tagged.Count.xlsx

The total number of replaces was 323.

NB. My processed input files had already been tidied up as described in issue #20 and issue #26
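
For readers without TextPipe, a rough Python equivalent of this batch step might look like the sketch below. It is not the actual filter: the directory arguments, the "*.usfm" glob and the function name tag_usfm_files are assumptions, and the pattern is the shortest regexp from the earlier comment rewritten with \u escapes. Only files with at least one replacement are written, mirroring the "only output modified files" setting.

```python
import re
from pathlib import Path

INVALID = re.compile(
    "\u0A02{2,}|\u0A05[\u0A3E\u0A48\u0A4C]|\u0A41[\u0A41\u0A42]|\u0A42{2,}"
    "|\u0A47[\u0A47\u0A48]|\u0A4B{2,}|\u0A4C{2,}|\u0A70{2,}|\u0A71[\u0A02\u0A71]"
)
TAG = "\u00B0"  # degree sign


def tag_usfm_files(src_dir: Path, out_dir: Path, glob: str = "*.usfm") -> dict:
    """Tag invalid sequences in every matching file under src_dir.

    Writes a tagged copy to out_dir only for files that changed, and
    returns {file name: number of replacements} for those files.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    counts = {}
    for path in sorted(src_dir.glob(glob)):
        text = path.read_text(encoding="utf-8")
        tagged, n = INVALID.subn(lambda m: m.group(0) + TAG, text)
        if n:
            (out_dir / path.name).write_text(tagged, encoding="utf-8")
            counts[path.name] = n
    return counts
```

A call such as tag_usfm_files(Path("usfm"), Path("tagged")) (both directory names hypothetical) returns the per-file counts; sum(counts.values()) can then be compared with the total number of replaces recorded in the worksheet.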

DavidHaslam commented 7 years ago

Observation:

118 of the 323 replace(s) were in Deuteronomy.

These were all for the same word, ਨਹੀਂਂ, which means "not".

This might look right here in GitHub, but in fact there are two BINDI signs where there should be only one.

I imagine this typo was propagated by a copy-and-paste error.
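
Because the doubled BINDI is hard to see on screen, listing the code points makes it explicit. The sketch below assumes the word is encoded as NA, HA, vowel sign II, BINDI, BINDI; the helper name describe is made up, and collapsing the run to a single BINDI is shown only as one possible correction.

```python
import re
import unicodedata

# Assumed encoding of the word as it appears in the affected verses.
word = "\u0A28\u0A39\u0A40\u0A02\u0A02"


def describe(text: str) -> list:
    """Return 'U+XXXX NAME' for each code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for entry in describe(word):
    print(entry)

# One possible fix: collapse any run of BINDI signs to a single BINDI.
collapsed = re.sub("\u0A02{2,}", "\u0A02", word)
```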

DavidHaslam commented 7 years ago

I will await a response to pull request #96 before I provide a copy of the 25 tagged usfm files.

Before I do that, I will rerun the TextPipe filter that outputs the tagged usfm files.

DavidHaslam commented 7 years ago

I have now rerun the TextPipe filter. The tagged 25 usfm files are in the uploaded Zip file.

Tagged.zip (replaced 2017-01-23)

Please use these to facilitate making suitable corrections to each of the 323 defective locations.

Search for the degree sign ° (\xB0).

Use a Unicode text editor that's capable of displaying the dotted circle placeholder[s] in an invalid glyph.

tfbf / Bible-Punjabi-Pavitr-Bible-1945

IMPORTANT: Analysis results after the last merge #81 by Joshy on 2017-01-11 #83