Closed jonathanrobie closed 3 years ago
Thanks @jonathanrobie for your clear example.
Here are my two cents.
Whether a prefixed preposition hides the consonantal definite article is currently captured in the morphology and not in the tokenization. This is probably where it should be, I think, because a) the token isn't consonantally realized, and b) the information is intrinsically part of the morphology (and after all this is a morphological database).
If we were to add an extra token, that token would not have a consonant, but it would have a vowel (the patah or qamets, for instance, indicate definiteness).
I'm not a big fan of working with empty tokens or with vowel-only tokens, for a number of reasons: a) if you tokenize the text you'll be tempted to get rid of empty tokens; b) the token is only implicit, and hence it would not have a position in a string representation (you can't use content[2:3] to find the definite article when it is only realized as a vowel); c) adding a new convention (for instance, \) would mean users have to go back to the docs (and we know they don't).
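To make point (b) concrete, here is a minimal Python sketch. The sample word and its morph string are assumptions written in the OSHB style (segments separated by `/`, with `Rd` marking a preposition that carries the article), and the helper name `segments` is hypothetical.

```python
# Assumed OSHB-style sample for "in the house": preposition + noun.
# Both strings are illustrative assumptions, not copied from the data.
content = "בַּ/בַּיִת"
morph = "HRd/Ncmsa"


def segments(content: str, morph: str) -> list[tuple[str, str]]:
    """Pair each '/'-separated text segment with its morph code."""
    codes = morph[1:].split("/")  # drop the leading language letter ('H')
    parts = content.split("/")
    if len(parts) != len(codes):
        raise ValueError("text and morph have different segment counts")
    return list(zip(parts, codes))


pairs = segments(content, morph)

# Only two segments exist: the article is folded into the 'Rd' code,
# so no slice of `content` corresponds to the article on its own.
article_only = [text for text, code in pairs if code == "Td"]
```

Here `article_only` comes back empty, which is the practical consequence: anything looking for a standalone article segment has to read the morph code instead of the string.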
All that said, this is a matter of preference. There is probably no right or wrong way to do this.
What would be great, I think, is to include an XSLT transform that does exactly what you propose.
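As a rough illustration of what such a transform might do (in Python rather than XSLT, and only as a sketch), the hypothetical function below expands an `Rd` code into a preposition `R` followed by an article `Td` with empty text. The sample strings and the choice of an empty string as placeholder are assumptions, not something the data or the docs prescribe.

```python
def expand_implicit_article(content: str, morph: str) -> list[tuple[str, str]]:
    """Return (text, code) pairs with 'Rd' split into 'R' plus an empty 'Td'.

    Assumes OSHB-style input: '/'-separated segments in both strings and a
    one-letter language prefix on the morph string.
    """
    codes = morph[1:].split("/")  # drop the language letter, e.g. 'H'
    expanded = []
    for text, code in zip(content.split("/"), codes):
        if code == "Rd":
            expanded.append((text, "R"))
            # Placeholder only: the article has no consonant of its own.
            expanded.append(("", "Td"))
        else:
            expanded.append((text, code))
    return expanded


# Assumed sample word ("in the house"); not copied from the data.
expanded = expand_implicit_article("בַּ/בַּיִת", "HRd/Ncmsa")
```

Because this runs downstream, the source files keep a single convention, and anyone who wants an explicit article can derive it.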
> Whether a prefixed preposition hides the consonantal definite article is currently captured in the morphology and not in the tokenization. This is probably where it should be, I think, because a) the token isn't consonantally realized, and b) the information is intrinsically part of the morphology (and after all this is a morphological database).
That makes sense to me. And this is not at all a showstopper for me. Perhaps this is worth more careful documentation?
I see no need for an XSLT transform, because:
So I would recommend just documenting this carefully in the overview of morphology codes, with an example in Hebrew.
For words with an implicit article ("Rd"), WLC and OSHB differ in the number of morphemes. Here is a comparison for one word:
This is not a showstopper for our syntax trees, but it would be convenient if OSHB could add an explicit representation for the implicit article. One approach would be to change this:
To this:
I don't know if it would be useful to change the @morph attribute as well.