openscriptures / morphhb

Open Scriptures Hebrew Bible
https://hb.openscriptures.org

What are the n attribute values for? #46

Open DavidHaslam opened 6 years ago

DavidHaslam commented 6 years ago

The w elements in OSIS XML files contain an n attribute with various values.
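For anyone who wants to see which values occur, here is a minimal sketch that tallies the n attribute on w elements. The XML fragment is a made-up stand-in (placeholder words, invented n values), not a copy of any OSHB file, and `n_values` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Illustrative fragment only: placeholder words and invented n values.
SAMPLE = """<osis xmlns="http://www.bibletechnologies.net/2003/OSIS/namespace">
  <verse osisID="Gen.1.1">
    <w n="1.0">word1</w>
    <w>word2</w>
    <w n="1">word3</w>
    <w n="0">word4</w>
  </verse>
</osis>"""


def n_values(xml_text):
    """Count each distinct n value found on w elements (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for el in root.iter():
        # Strip the namespace part of the tag, if any, so only the local name is compared.
        local_name = el.tag.rsplit("}", 1)[-1]
        if local_name == "w" and "n" in el.attrib:
            counts[el.get("n")] += 1
    return counts


counts = n_values(SAMPLE)
print(counts)
```

Words without an n attribute are simply skipped, so the tally only reflects the words that carry one.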

The n attribute does not seem to be documented anywhere, or, if it is mentioned, it's not easy to find.

What do these values signify in this context?

DavidTroidl commented 6 years ago

The "n" marks the hierarchy of the disjunctive cantillation marks in the verse. Its use is demonstrated in our Verse demo: OshbVerse Demo. We have a new version of the demo in the main repository, just waiting for some technical details to be worked out so we can deploy it.

DavidHaslam commented 6 years ago

That might make sense to you, but what the n values actually mean is no clearer from your answer or from the demo. The demo didn't seem to be obviously related to the n values for each word.

For newcomers to the repository, I think you need a much more detailed description of what these mean and how they were actually generated.

DavidHaslam commented 6 years ago

Thinking further, it might seem worthwhile to consider if the SWORD API might be enhanced to make use of the n attribute. Much work for just one module? That's as may be. One thing is clear. We would need to understand its function first.

DavidTroidl commented 6 years ago

There is a much fuller description of the "n" attributes and the hierarchy of the verse they represent, under structure. The identification of the cantillation marks, in the popups of the demo, is another example of the usage of the "n". SWORD does have a UTF8Cantillation option already, so it may be useful in that context.

DavidHaslam commented 6 years ago

The n attribute is already assigned in SWORD for marking enumerating words.

This would mean that any module using the n attribute for a different function could not at the same time support the enumerated words feature.

i.e. While OSHB uses n for accents structure, it could not also be enhanced to enumerate words.

DavidHaslam commented 6 years ago

The SWORD feature

GlobalOptionFilter=UTF8Cantillation

merely strips the cantillation points from the Hebrew text.

jonathanrobie commented 3 years ago

The n attribute is already assigned in SWORD for marking enumerating words.

This would mean that any module using the n attribute for a different function could not at the same time support the enumerated words feature.

i.e. While OSHB uses n for accents structure, it could not also be enhanced to enumerate words.

A lot of TEI vocabularies use the @n attribute to identify the element that contains it. That's consistent with the definition in the TEI specification.

Is there a better way to encode cantillation hierarchy in OSIS?


DavidHaslam commented 3 years ago

Is there a better way to encode cantillation hierarchy in OSIS?

In theory, cantillation marks and vowel accents in Biblical Hebrew ought to be the sole domain of Unicode Normalisation rather than something to be implemented in XML.

The issue was that "Unicode Normalisation breaks Biblical Hebrew", as Peter Kirk described in detail in the SBL Hebrew Font Manual. His proposal was to define a custom normalisation for Biblical Hebrew: one which does not change the order of Hebrew diacritics, provided they were keyed in the same order as in the earliest digitisation of the Hebrew Bible, made before Unicode was developed (i.e. the Michigan-Claremont encoding of the Westminster Electronic Hebrew Bible by Alan Groves).

NB. BabelPad - (a Unicode text editor for Windows developed by Andrew West) - supports such a custom normalisation of Hebrew. This feature was added at my suggestion about seven years ago.

This reply may seem to be tangential to the context of XML schema and encoding cantillation by means of attributes, but it does address the more fundamental underlying issue (assuming I've understood the question correctly).

jonathanrobie commented 3 years ago

Is there a better way to encode cantillation hierarchy in OSIS?

In theory, cantillation marks and vowel accents in Biblical Hebrew ought to be the sole domain of Unicode Normalisation rather than something to be implemented in XML.

The issue was that "Unicode Normalisation breaks Biblical Hebrew", as Peter Kirk described in detail in the SBL Hebrew Font Manual.

There just might be a better way to do that now, at least according to the Unicode Consortium. I have just started looking into this, though, and I may not understand this correctly.

According to this FAQ, a former problem was fixed:

Q: But isn't there is still a problem with Biblical Hebrew?

A: There was a problem, but it has been addressed. Because the Hebrew points are defined to have distinct combining classes, their character semantics is such that their ordering is immaterial in the standard. To handle those cases where visual ordering is material, see the discussion of the Combining Grapheme Joiner (CGJ) in Section 23.2, Layout Controls, in the Unicode Standard.
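As a rough illustration of what the FAQ describes, the sketch below normalises a meteg-then-vowel sequence with and without a CGJ, using Python's standard `unicodedata` module. The three-character string is my own minimal test case modelled on the leading-meteg pattern, not text copied from any edition.

```python
import unicodedata

CGJ = "\u034F"     # COMBINING GRAPHEME JOINER, combining class 0
QOF = "\u05E7"     # base letter
METEG = "\u05BD"   # Hebrew point with a higher combining class than qamats
QAMATS = "\u05B8"  # Hebrew vowel point

# Meteg keyed before the vowel, first without and then with a CGJ between them.
plain = QOF + METEG + QAMATS
joined = QOF + METEG + CGJ + QAMATS

nfc_plain = unicodedata.normalize("NFC", plain)
nfc_joined = unicodedata.normalize("NFC", joined)

# Without the CGJ, canonical ordering moves the vowel in front of the meteg.
print([f"U+{ord(c):04X}" for c in nfc_plain])
# With the CGJ, the keyed order survives normalisation.
print([f"U+{ord(c):04X}" for c in nfc_joined])
```

If I understand the mechanism correctly, the CGJ works here only because its combining class is 0, which stops the marks on either side from being reordered relative to each other.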

It looks like they are using CGJs for this purpose at Tanach.us. According to their Change Log:

Combining Grapheme Joiners (CGJs) have been added in 728 instances to make leading meteg-then-vowel displays more reliable. Ben Denckla suggested this change. The Coding page has been substantially updated.

I can use ugrep to confirm that these characters are there in the text from Tanach.us:

% ugrep '[\x{034F}]' *
Daniel.xml:        <w>יְרוּשָׁלַ֖͏ִם</w>
Daniel.xml:        <w>קֽ͏ָדָמַ֖י</w>
Daniel.xml:        <q>עֽ͏ָלִּ֔ין</q>
Daniel.xml:        <w>וֽ͏ַאֲחַשְׁדַּרְפְּנַיָּא֙</w>
Daniel.xml:        <w>קֽ͏ָדָמַ֔יהּ</w>
Daniel.xml:        <w>קֽ͏ָדָמ֣וֹהִי</w>
Daniel.xml:        <w>עֽ͏ַד־</w>
Daniel.xml:        <w>וְנֽ͏ֶחֱלֵ֙יתִי֙</w>
!!! SNIP !!!

If you take one of those strings and put it into Tim Whitlock's Unicode inspector, you can confirm that it is there, e.g.

יְרוּשָׁלַ֖͏ִם
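The same check can be done locally. This is a small sketch using only the standard library. The sample string is the tail of the word above spelled out with escapes as I read it (lamed, patah, tipeha, CGJ, hiriq, final mem), so treat the exact sequence as an assumption. `describe` and `cgj_positions` are hypothetical helper names.

```python
import unicodedata

CGJ = "\u034F"


def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


def cgj_positions(text):
    """Return the indexes at which a Combining Grapheme Joiner occurs."""
    return [i for i, ch in enumerate(text) if ch == CGJ]


# Assumed sequence: lamed, patah, tipeha, CGJ, hiriq, final mem.
sample = "\u05DC\u05B7\u0596\u034F\u05B4\u05DD"

for code, name in describe(sample):
    print(code, name)
print(cgj_positions(sample))
```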

But the CGJs are not present in OSHB files.

One way to check this would be to take the output of ugrep and search the corresponding Tanach.us webpages, e.g. comparing the instances from Daniel against this representation of it to see whether the cantillation displays correctly:

Daniel (Tanach.us)

I have not yet done this; it's on my to-do list.

But does anyone have a list of issues with cantillation that are not correctly handled by this? Or does anyone have time to take a good look at whether this fixed the problem?

jonathanrobie commented 3 years ago

Question: Is there a way I can use the @n attribute from OSHB to identify things to check in order to make sure they are now fixed in Unicode if CGJs are added at the right place? Or am I completely off-base on this? If I am missing something, please let me know.

DavidTroidl commented 3 years ago

The @n in the OSHB has nothing to do with character placement. They are used to record the cantillation hierarchy of the verse, as illustrated in OshbVerse.

jonathanrobie commented 3 years ago

The @n in the OSHB has nothing to do with character placement. They are used to record the cantillation hierarchy of the verse, as illustrated in OshbVerse.

Thanks for the clarification. And sorry for the bunny trail. If I want to raise the CGJ issue, should I start a new issue for that and copy the text over?

David's point about the use of @n still matters, though:

The n attribute is already assigned in SWORD for marking enumerating words.

And as I pointed out, that's a normal TEI usage. Is there a better way to do cantillation hierarchy?

DavidTroidl commented 3 years ago

The SWORD usage is not a standard, just a choice. As I understand it, the point is to show the correspondence of a translation to the original text word order. If that is the case, it would be useless for the original language text itself.

jonathanrobie commented 3 years ago

But TEI is a standard, OSIS is a standard. I think this is relevant:

The @n attribute is very helpful for queries - if I get a bunch of results for a search or a query, it gives me an easy key to sort on to get it back in document order and to identify the BCV where it came from.

Not a showstopper. If you use @n for cantillation hierarchy, I can add a different variable for that, or do something else, when I import your upstream into my system.

DavidTroidl commented 3 years ago

The OSIS manual says:

This attribute is identical to the TEI n attribute, and may be used to provide a name or number to identify the particular element instance.

From the TEI specification for @n you quoted earlier, the attribute is flexible and not pinned to any one specific interpretation. If you are not using the cantillation structure in the OSHB, you are free to reassign the attribute as you please.

pdurusau commented 3 years ago

Jonathan - I've read the thread and remain puzzled over the use of an encoded cantillation hierarchy using @n.

Even if you intend to query: cantillation (word separator) cantillation (word separator) cantillation

(reading right to left)

So you can search for a pattern of cantillation marks occurring on separate words, I'm not sure what an @n attribute would add?
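To make concrete what I mean, here is a minimal sketch of such a search that works on the text alone, with no @n involved. It assumes the accents of interest fall in the Unicode range U+0591 to U+05AE and that words arrive as a list in reading order. The function names and the toy word list are hypothetical.

```python
# Assumed range of Hebrew accent code points (etnahta through zinor).
ACCENT_FIRST, ACCENT_LAST = 0x0591, 0x05AE

TIPEHA = "\u0596"
ETNAHTA = "\u0591"


def accents(word):
    """Return the accent characters found in one word."""
    return [ch for ch in word if ACCENT_FIRST <= ord(ch) <= ACCENT_LAST]


def find_accent_pattern(words, pattern):
    """Return start indexes where consecutive words carry the given accents in order."""
    hits = []
    for start in range(len(words) - len(pattern) + 1):
        window = words[start:start + len(pattern)]
        if all(mark in accents(w) for w, mark in zip(window, pattern)):
            hits.append(start)
    return hits


# Toy data: single letters with or without an accent, not real verse text.
words = ["\u05D0\u0596", "\u05D1", "\u05D2\u0596", "\u05D3\u0591"]
hits = find_accent_pattern(words, [TIPEHA, ETNAHTA])
print(hits)
```

This version requires the marks to sit on adjacent words. Allowing intervening words would need a looser matching rule.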

The hierarchy between marks isn't consistently applied, but unless you are working with textual witnesses, that's unlikely to be an issue.

What am I missing about your question? Do you have an example of the result you are seeking? (Noting that you could have an operator that allows multiple words to occur between cantillation marks, assuming you want to study larger cantillation structures in the text. I've seen that done in some research in England.)