openscriptures / morphhb

Open Scriptures Hebrew Bible
https://hb.openscriptures.org

Using the solidus to separate morpheme segments is against OSIS philosophy #50

Open DavidHaslam opened 6 years ago

DavidHaslam commented 6 years ago

The general philosophy of OSIS is to use XML elements for all the semantic markup.

Using the solidus within the text to separate morpheme segments within Hebrew words goes against this OSIS philosophy. One friend has described this as "bad, bad, very bad".

cf. The XML files for the CrossWire WLC module conform more closely to this principle: they use the XML seg element for this purpose. The original data was obtained from the website tanach.us, but further preprocessing was done before building the latest version of the module, which differs from its earliest version in this respect.

e.g. In the mod2imp output of the CrossWire WLC module, the words generally look like this:

$$$Genesis 1:1
<w><seg type="x-morph">בְּ</seg><seg type="x-morph">רֵאשִׁ֖ית</seg> </w>
<w><seg type="x-morph">בָּרָ֣א</seg> </w>
<w><seg type="x-morph">אֱלֹהִ֑ים</seg> </w>
<w><seg type="x-morph">אֵ֥ת</seg> </w>
<w><seg type="x-morph">הַ</seg><seg type="x-morph">שָּׁמַ֖יִם</seg> </w>
<w><seg type="x-morph">וְ</seg><seg type="x-morph">אֵ֥ת</seg> </w>
<w><seg type="x-morph">הָ</seg><seg type="x-morph">אָֽרֶץ</seg> </w>
<w type="x-sofpasuq">׃ </w>

NB. In this extract, the output was also converted to Word Per Line format afterwards.
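
A conversion from the solidus form to seg markup like the above could be sketched roughly as follows. This is only an illustration, not code from either project: the function name split_morphemes is hypothetical, and it assumes the solidus appears in the word text only as a segment separator.

```python
import xml.etree.ElementTree as ET


def split_morphemes(word_text: str, sep: str = "/") -> ET.Element:
    """Build a <w> element with one <seg type="x-morph"> child per segment.

    Hypothetical helper: assumes `sep` never occurs as literal text.
    """
    w = ET.Element("w")
    for segment in word_text.split(sep):
        seg = ET.SubElement(w, "seg", {"type": "x-morph"})
        seg.text = segment
    return w


# First word of Genesis 1:1 with a solidus between its two segments.
word = "בְּ/רֵאשִׁ֖ית"
print(ET.tostring(split_morphemes(word), encoding="unicode"))
```

A word with no solidus simply yields a single seg child, so the same function can be applied to every word.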

Aside: That is not to say that the WLC module is perfect. Irrespective of any text-critical issues, at least these mistakes were made when it was first built.

  1. The Hebrew text should not have been normalized to NFC.
  2. There should not be a space either before or after each MAQAF.
  3. The space between Hebrew words should be outside the w elements.

These are not your responsibility. I mention them merely in passing.

Those defects were rectified in the WLC module after I created this issue in 2017.
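
For anyone wanting to check their own data for the first two defects, a rough sketch follows. The function names are hypothetical. The reason given in the comments for avoiding NFC (that it reorders Hebrew combining marks by canonical combining class) is my assumption about the concern, not something stated above.

```python
import re
import unicodedata

MAQAF = "\u05BE"  # HEBREW PUNCTUATION MAQAF


def changed_by_nfc(text: str) -> bool:
    """True if NFC normalization would alter the text."""
    return unicodedata.normalize("NFC", text) != text


def has_spaced_maqaf(text: str) -> bool:
    """True if whitespace occurs directly before or after a maqaf."""
    return re.search(rf"\s{MAQAF}|{MAQAF}\s", text) is not None


# bet + dagesh (combining class 21) + sheva (combining class 10):
# NFC sorts the marks by combining class, so the sheva moves before the dagesh.
sample = "\u05D1\u05BC\u05B0"
print(changed_by_nfc(sample))
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", sample)])
```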

DavidHaslam commented 1 year ago

@dowens76 @DavidTroidl

Does nobody involved in this project take any notice of issues?

This was posted in December 2017 so what's going on?

jag3773 commented 12 months ago

Hi @DavidHaslam, I suspect many people agree with you on that, myself included. Making such a change in the text as it is now would certainly cause all sorts of backwards incompatibility issues.

I'd be in favor of offering an alternate version of the files in the repo that has the fields separated according to OSIS philosophy. If you want to put in a PR with the changes you suggest, I think we'd be willing to incorporate it.

DavidHaslam commented 12 months ago

@jag3773

Since I added this issue in 2017, the website tanach.us has had a change of title.

There are other significant changes, but one relevant to this issue is that the solidus / markers that used to separate morphological segments have all been removed!