openscriptures / morphhb

Open Scriptures Hebrew Bible
https://hb.openscriptures.org

Downstream uses of the text and Unicode Normalization #22

Closed DavidHaslam closed 7 years ago

DavidHaslam commented 7 years ago

According to the SBL Hebrew Font User Manual:

Unicode normalisation can easily break Biblical Hebrew text. (page 9)

The implications of this need to be clearly understood by any agency making a downstream use of the text. The ordering of the diacritics in Biblical Hebrew is important for the proper rendering of several composite characters.

Some Bible software utilities normalize the source text to NFC as a matter of course during the module build.

If the detailed arguments in the aforesaid User Manual are sound, this conversion should be avoided.
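The reordering can be seen with a minimal Python sketch using the standard `unicodedata` module. The sample sequence (shin, shin dot, dagesh, qamats) is an illustrative choice, not a word taken from this issue, and the helper name `describe` is made up for the example. NFC sorts combining marks by their canonical combining class, so the order in which the marks were typed is not preserved.

```python
import unicodedata

def describe(text):
    """Return (code point, canonical combining class) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in text]

# Illustrative sample (an assumption, not from the WLC text):
# shin, shin dot, dagesh, qamats.
sample = "\u05E9\u05C1\u05BC\u05B8"
nfc = unicodedata.normalize("NFC", sample)

print(describe(sample))  # marks in the order they were entered
print(describe(nfc))     # marks sorted by combining class
print(sample == nfc)     # False: NFC changed the code point sequence
```

The two strings are canonically equivalent, so a conforming renderer should in principle treat them alike; the concern raised in the manual is that rendering of some mark combinations depends on the stored order.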

Something about this significant issue should be included in the project README file.

DavidTroidl commented 7 years ago

The readme file has been updated.

DavidHaslam commented 7 years ago

Thanks, David.

As regards the custom normalization of Hebrew, the implementation in BabelPad differs only from the WLC source files in regard to the relative ordering of the puncta extraordinaria found in Psalm 27:13.

The OSMHB module at Tyndale STEP was normalized during build and has:

$$$Psalms 27:13
<w lemma="strong:H3884" n="1.1"><seg>לׅׄוּלֵׅׄ֗אׅׄ</seg></w> <note n="4">Puncta extraordinaria -- a \u05c4 is used to mark such marks in the text when they are above the line and a \u05c5 when they are below the line. </note> <w lemma="strong:H539" n="1.0"><seg>הֶ֭אֱמַנְתִּי</seg></w> <w lemma="strong:H7200"><seg>לִ</seg><seg>רְא֥וֹת</seg></w> <w lemma="strong:H2898"><seg>בְּֽ</seg><seg>טוּב</seg></w><seg type="x-maqqef">־</seg><w lemma="strong:H3068" n="1"><seg>יְהוָ֗ה</seg></w> <w lemma="strong:H776"><seg>בְּ</seg><seg>אֶ֣רֶץ</seg></w> <w lemma="strong:H2416" n="0"><seg>חַיִּֽים</seg></w><seg type="x-sof-pasuq">׃</seg>

Try the first word in BabelPad. Compare it after custom normalization with the same word from the original XML file at tanach.us. I think you'll find that the latter differs from the result of either NFC or custom normalization, which happen to agree here.
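For anyone without BabelPad, the same comparison can be sketched in Python. The two strings below are hypothetical stand-ins (a lamed carrying only the two puncta), not the actual code point sequences from tanach.us or the STEP module; paste the real word in place of them to compare. The helper name `marks` is made up for the example.

```python
import unicodedata

def marks(text):
    """List code point, Unicode name and combining class per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "?"), unicodedata.combining(ch))
        for ch in text
    ]

# Hypothetical inputs: lamed with the upper dot (U+05C4) and lower dot (U+05C5)
# stored in the two possible orders.
upper_first = "\u05DC\u05C4\u05C5"
lower_first = "\u05DC\u05C5\u05C4"

for label, word in (("upper first", upper_first), ("lower first", lower_first)):
    print(label)
    for entry in marks(word):
        print("   ", entry)

# NFC sorts by combining class (220 before 230), so both orders end up
# as the lower-dot-first sequence.
print(unicodedata.normalize("NFC", upper_first) == lower_first)
print(unicodedata.normalize("NFC", lower_first) == lower_first)
```

Dumping the marks this way makes an ordering difference visible even when the rendered glyphs look identical.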

I believe this particular difference is what lies behind the phrase "a minor expansion" in their page about Coding.

This ordering is a minor expansion of the custom mark ordering proposed by John Hudson of http://www.tiro.com in his SBL Hebrew Hebrew Font User Manual that is part of the SBL Hebrew font release.

Visually, with the SBL Hebrew font and complex rendering enabled, it's virtually impossible to see any difference. It's not clear to me why they diverged from Hudson's advice in this one particular area.

BTW, some of Hudson's advice was based on earlier research done in 2003 by Peter Kirk. See http://www.qaya.org/academic/hebrew/Issues-Hebrew-Unicode.html

DavidHaslam commented 7 years ago

NB. The above details have also been communicated to the Tyndale STEP development team.

DavidTroidl commented 7 years ago

Thanks, David. I wasn't aware of the difference between tanach.us and the SBL Manual. The original OSIS files for the module follow tanach.us in mark ordering.

It should also be noted that OSMHB is a legacy module, produced with KJV versification at a time when alternate versification was not widely implemented in SWORD. It is superseded by OSHB, in the SWORD repository, using MT versification. The Tyndale STEP module OHB is derived from the OSHB.

DavidHaslam commented 7 years ago

I've already noted in the CrossWire tracker the need to rebuild the WLC module without using NFC. There were other mistakes made by Chris Little when he built it, such as there being a space after every maqaf.

I've recently added a comment in issue 287 to remind CrossWire that the OSHB module should also be rebuilt without using NFC.
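Both problems could be checked for mechanically before a rebuild. The following is only a sketch under stated assumptions: the function names are hypothetical, it is not part of any CrossWire tooling, and it works on plain strings rather than on a real module format.

```python
import re
import unicodedata

MAQAF = "\u05BE"  # HEBREW PUNCTUATION MAQAF

def space_after_maqaf(text):
    """Return the offset of every maqaf that is directly followed by a space."""
    return [m.start() for m in re.finditer(MAQAF + " ", text)]

def changed_by_nfc(text):
    """True if applying NFC would alter the code point sequence of text."""
    return unicodedata.normalize("NFC", text) != text

# Hypothetical samples, not taken from the WLC module.
joined = "\u05D0\u05BE\u05D1"    # alef, maqaf, bet
spaced = "\u05D0\u05BE \u05D1"   # same, with a stray space after the maqaf

print(space_after_maqaf(joined))
print(space_after_maqaf(spaced))
print(changed_by_nfc("\u05E9\u05C1\u05BC\u05B8"))  # marks not in NFC order
```

Running `changed_by_nfc` over each word of the source text would list the words whose mark order a normalizing build step would silently rewrite.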