openscriptures / morphhb

Open Scriptures Hebrew Bible
https://hb.openscriptures.org

Implicit Articles and Syntax Trees #77

Closed: jonathanrobie closed this issue 3 years ago

jonathanrobie commented 3 years ago

For words with an implicit article ("Rd"), WLC and OSHB differ in the number of morphemes. Here is a comparison for one word:

<w morphshort="01001005003" osisref="Gen.1.5">
  <oshb>
    <w lemma="l/216" n="1.1.0" morph="HRd/Ncbsa" id="01Wkf">לָ/אוֹר֙</w>
  </oshb>
  <wlc>
    <m morphid="010010050031" text="לָ‎"/>
    <m morphid="010010050032" text="‎"/>
    <m morphid="010010050033" text="אוֹר֙‎"/>
  </wlc>
</w>
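The difference in the comparison above can be checked mechanically. What follows is only a minimal sketch: it assumes the comparison fragment is available as a string (copied here with any invisible directional marks in the `text` attributes dropped), and the helper name `morpheme_counts` is hypothetical, not part of OSHB or WLC tooling.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the comparison above; the wrapper layout is
# taken from this thread and is not an official OSHB or WLC format.
FRAGMENT = """
<w morphshort="01001005003" osisref="Gen.1.5">
  <oshb>
    <w lemma="l/216" n="1.1.0" morph="HRd/Ncbsa" id="01Wkf">לָ/אוֹר֙</w>
  </oshb>
  <wlc>
    <m morphid="010010050031" text="לָ"/>
    <m morphid="010010050032" text=""/>
    <m morphid="010010050033" text="אוֹר֙"/>
  </wlc>
</w>
"""


def morpheme_counts(fragment):
    """Return (OSHB count, WLC count) for one comparison fragment."""
    root = ET.fromstring(fragment)
    # OSHB separates morphemes inside the word text with "/".
    oshb_word = root.find("./oshb/w")
    oshb_count = len(oshb_word.text.split("/"))
    # WLC lists one <m> element per morpheme, including the implicit article.
    wlc_count = len(root.findall("./wlc/m"))
    return oshb_count, wlc_count


print(morpheme_counts(FRAGMENT))  # (2, 3)
```

The extra WLC morpheme is the second `<m>` element, whose `text` attribute carries no consonant.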

This is not a showstopper for our syntax trees, but it would be convenient if OSHB could add an explicit representation for the implicit article. One approach would be to change this:

<w lemma="l/216" n="1.1.0" morph="HRd/Ncbsa" id="01Wkf">לָ/אוֹר֙</w>

To this:

<w lemma="l/216" n="1.1.0" morph="HRd/Ncbsa" id="01Wkf">לָ//אוֹר֙</w>

I don't know if it would be useful to change the @morph attribute as well.

jdejoode commented 3 years ago

Thanks @jonathanrobie for your clear example.

Here are my two cents.

Whether a prefixed preposition hides the consonantal definite article is currently captured in the morphology and not in the tokenization. This is probably where it should be, I think, because a) the token isn't consonantally realized, and b) the information is intrinsically part of the morphology (and after all this is a morphological database).

If we were to add an extra token, that token would not have a consonant, but it would have a vowel (a patah or qamets, for instance, indicates definiteness).

I'm not a big fan of working with empty tokens or with vowel-only tokens, for a number of reasons: a) if you tokenize the text, you'll be tempted to get rid of empty tokens; b) the token is only implicit, and hence it would not have a position in a string representation (you can't use content[2:3] to find the definite article when it is only realized as a vowel); c) adding a new convention (for instance, \) would mean users have to go back to the docs (and we know they don't).
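Points a) and b) can be illustrated with a few lines of Python. This is a sketch only, assuming the `//` form proposed earlier in this thread and a plain split on `/`; the variable names are made up for the illustration.

```python
# The "//" form proposed earlier in this thread for Gen.1.5.
proposed = "לָ//אוֹר֙"

# Splitting on "/" yields three tokens; the middle one is the
# implicit article and is an empty string.
tokens = proposed.split("/")

# A common cleanup step drops empty strings, and with them the
# only trace of the article in the token list.
filtered = [t for t in tokens if t]

print(len(tokens), len(filtered))  # 3 2
print(tokens[1] == "")             # True
```

Because the middle token is empty, no slice of the surface string corresponds to the article; only the morphology code records it.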

All that said, this is a matter of preference. There probably is no right-wrong way to do this.

What would be great, I think, is to include an XSLT transform that does exactly what you propose.
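As a rough idea of what such a transform would do, here is the core step sketched in Python rather than XSLT. It is not an existing OSHB tool: the function name `mark_implicit_article` is hypothetical, and the sketch assumes that the first character of `@morph` is a language letter (as in `HRd/Ncbsa`) and that a segment equal to `Rd` marks a preposition carrying the implicit article.

```python
def mark_implicit_article(morph, text):
    """Insert an empty slot ("//") after each prefix whose code is Rd.

    morph: an OSHB-style code such as "HRd/Ncbsa"
    text:  the word text with "/" between morphemes
    """
    # Assumption: the leading letter is the language and is not a segment.
    morph_segments = morph[1:].split("/")
    text_segments = text.split("/")

    # If the two segmentations disagree (for example, the text was
    # already transformed), return the text unchanged.
    if len(morph_segments) != len(text_segments):
        return text

    out = []
    for code, segment in zip(morph_segments, text_segments):
        out.append(segment)
        if code == "Rd":
            out.append("")  # empty slot standing in for the article
    return "/".join(out)


# ASCII placeholders stand in for the Hebrew morphemes.
print(mark_implicit_article("HRd/Ncbsa", "la/or"))  # la//or
print(mark_implicit_article("HR/Ncbsa", "le/or"))   # le/or
```

Running the function a second time on its own output leaves the text unchanged, because the segment counts no longer match. The `@morph` attribute is left as it is.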

jonathanrobie commented 3 years ago

> Whether a prefixed preposition hides the consonantal definite article is currently captured in the morphology and not in the tokenization. This is probably where it should be, I think, because a) the token isn't consonantally realized, and b) the information is intrinsically part of the morphology (and after all this is a morphological database).

That makes sense to me. And this is not at all a showstopper for me. Perhaps this is worth more careful documentation?

I see no need for an XSLT transform, because:

So I would recommend just documenting this carefully in the overview of morphology codes, with an example in Hebrew.