usfm-bible / tcdocs

Technical Committee Documents

Michael Johnson on + Notation #6

Closed jonathanrobie closed 5 months ago

jonathanrobie commented 2 years ago

Michael Johnson said this to me in an email response:

The worst problem with USFM is the + notation, and especially the + notation in footnotes. It is incomprehensible to the ordinary working linguist. Although technically, I can unambiguously convert USFM to USFX and back in a similar manner that Paratext converts from USFM to USX and back, far more often than not I have to edit the text to make it comply with the broken standard. (USFX is an alternate way to represent USFM in XML that I invented before USX was invented, and which I still use as an internal hub standard, because it sheds some of the defects of USFM that USX retains.)

Here are the roots of the problem:

At one point in time, no nesting of character attributes was allowed. This was inadequate, but it made it easy to tell when a character attribute ended: either with the next character attribute (which may have been a "revert to normal" tag), or when the paragraph style ended, or when the footnote/cross reference/end note ended.

When conversion to XML was anticipated, we switched to explicit end tags in the main body of the text, using a * on the end of the opening tag, like \bk ...\bk*, instead of a "return to normal" character tag. Except we didn't do that in footnotes, so the "end a character style with the next character style" rule still applied there.

When it became apparent that nesting of character tags was needed for real (and it is, to avoid a huge set of combined tags that do the same thing in a less understandable way), then instead of doing like XML does and explicitly ending all character styles and requiring proper nesting, we introduced the + notation that I hate. I hate it because although it solved a backward compatibility issue (but wasn't the only possible solution), it has wasted many man-hours of my time tediously correcting problems caused by ordinary working linguists not understanding that any non-footnote character style in a footnote needs the "+" even though it doesn't seem to be nested from their point of view, like \f + \fr 1:1 \ft xxx \tl yyy\tl* zzz\f*. The intent is clear, but Paratext chokes on it with a schema check spewing incomprehensible barf because there was no + between the \ and the tl.

Haiola has an option to relax the "+" rules and assume all non-footnote-specific character styles are nested if begun before another is ended, then generate the + syntax on export.

The requirement of the "+" notation is accepted by the Paratext team as a good solution for backward compatibility. I think it is not, because (1) its rules are inconsistent because of the inconsistent handling of footnote vs. main text character styles with respect to end tags, (2) it is not handled automatically by Paratext and most other USFM software, and (3) very few people really understand it.

USFM and USX suffer another technical compatibility issue which seems to be less of a practical problem. Strict nesting of markers is required by XML syntax, but not USFM tag syntax, although Paratext seems to handle enforcing that OK. In other words, \qt \wj XXX\qt* YYY\wj* would make sense to a human, but it should be \wj \qt XXX\qt* YYY\wj* in an ideal world, or \wj \+qt XXX\+qt* YYY\wj* in the fantasy land where OWLs (ordinary working linguists) understand "+" notation.

Now pointing out the problem without suggesting a solution would be obnoxious, right? So I won't do that.

I suggest making the "+" notation optional. Full stop.

Instead of making the linguists (or more likely, publishing personnel, like me) do the disambiguation work, implement some more logic based on class of tag to determine where the XML end tags belong. USX, being XML, doesn't suffer from this ambiguity. It is just getting from USFM to XML that is the trick. Here is some logic that could solve backward compatibility issues:

  1. If there are "+" marks in the USFM, they can be used as they are now, or ignored.
  2. Best practice is to always explicitly end character styles in USFM as is done in XML, but this is not always required because of the following rules.
  3. Character styles each have their own class, except for footnote/cross reference/end note character styles, which are all in the same class.
  4. If a character style is begun before another character style of the same class ends, it is assumed to be nested.
  5. If there is no character style explicit end marker before the end of a paragraph style or verse marker, the character style is assumed to end there but a warning is generated.
  6. If a character style of the same class is started before another of the same class ends, the first one is assumed to end before the next one started.
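
The rules above can be sketched roughly as follows. This is a minimal illustration, not Haiola's or Paratext's actual logic. It assumes the reading given later in this thread: note-specific character styles (\fr, \ft, \fq, ...) are mutually exclusive, so starting one ends the previous one, while every other character style nests. The name `explicit_events` and the partial `NOTE_STYLES` list are hypothetical, and the rule 5 warnings and paragraph/verse boundaries are left out.

```python
import re

# A marker (optional "+", name, optional "*") or a run of plain text.
TOKEN = re.compile(r"\\(\+?)([a-z]+\d*)(\*?)|([^\\]+)")

# Assumption: a partial list of note-specific styles that replace one another.
NOTE_STYLES = {"fr", "ft", "fq", "fqa", "fk", "fl", "fv", "xo", "xt", "xq", "xk"}


def explicit_events(usfm):
    """Turn relaxed USFM into open/close/text events with strict nesting."""
    events, stack = [], []

    def close_down_to(index):
        # Close the innermost styles until the stack has `index` entries.
        while len(stack) > index:
            events.append(("close", stack.pop()))

    for match in TOKEN.finditer(usfm):
        _plus, name, star, text = match.groups()  # rule 1: "+" is ignored
        if text is not None:
            events.append(("text", text))
        elif star:
            # Explicit end marker: close it and anything still open inside it.
            if name in stack:
                close_down_to(len(stack) - 1 - stack[::-1].index(name))
        else:
            if name in NOTE_STYLES:
                # Rule 6: a new note style ends the note style already open.
                open_note = [i for i, s in enumerate(stack) if s in NOTE_STYLES]
                if open_note:
                    close_down_to(open_note[0])
            # Rule 4: anything else is treated as nested.
            stack.append(name)
            events.append(("open", name))
    close_down_to(0)  # whatever is still open ends with the input
    return events
```

Under these assumptions, \f + \fr 1:1 \ft xxx \tl yyy\tl* zzz\f* and the same footnote written with \+tl ...\+tl* yield the same sequence of open and close events, which is the sense in which the "+" becomes optional.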
jonathanrobie commented 2 years ago

A follow-up email from Michael:

Hello, Jonathan and all.

I currently have 1,257 Bible translations online, so I have seen some things. To see the markup of most of them, see https://ebible.org/Scriptures/ (with apologies for the ones that copyright and permission issues preclude me sharing USFM source).

In my last email (preserved, below), I spoke of the mess that is the + notation for nested character styles. Here are six more: quotation start/stop; treatment of character styles when footnotes, cross references, end notes, sidebars, or figures are inserted; treatment of custom tags; handling of links (especially in cross references); glossaries; and the optional line break, //.

Logically, the text of complete Bibles, and especially study Bibles, defies adherence to a single hierarchy, because it is inherently overlapped and/or parallel in presentation. The following hierarchies do not naturally nest: (1) testament or extra matter/book/chapter/verse, (2) testament or extra matter/book/paragraphs or poetry stanzas and lines/headings, (3) quotations (especially of Jesus for red letter editions), and (4) reference material (footnotes/cross references/end notes/inline references/word markers like Strong's numbers and glossary entries/sidebars for commentaries, word studies, etc). Bible texts also refuse to stick to a standard versification or even a small number of versifications. Verse numbers aren't even always in order, nor are they always numeric. Chapter numbers can be out of order and non-numeric, too, in the Apocrypha/Deuterocanon.

When shoehorning a Bible text into an XML document, current USX practice as I understand it is to structure the XML as one file per book (thus evading the issue of the testament/extra matter level), and within that file to structure the XML primarily as a print publication view: book chapters, each with paragraph-level styles, with character-level styles nested in them. We then use milestones to indicate verse markers (beginning only in USFM, beginning and end in USX). We stuff notes into an element that can be anywhere in a paragraph. Same with sidebars, etc. We don't actually mark quotation begin and end like OSIS attempted to (and failed for various reasons), but paint just one type of quotation (\wj) as a character style that ends and restarts to jump across paragraph and verse boundaries. But now there is the brand new \qt#-s |sid="##" who="Petros"\*quotation\qt#-e |eid="##"\* notation that breaks my current USFM parser, and contains another ambiguity, to wit: which natural language should the speaker be identified in, and what about Cephas vs. Peter? We will count that as one ambiguity, although it is arguably two, with now two ways to mark Jesus' words.

Two: when I insert a footnote at a point with an active character style, do I need to also stop and restart active character styles around the footnote? I should not have to. Paratext seems to think otherwise. For example: \v 1 Aaaaa bbb: \wj "ccc ddd \+qt eee fff\x + \xt EXO 20:1\x*gggg\+qt* hhhh.\wj* will flunk error checks, even though there should be no ambiguity, in that the cross reference note has its own separate character style context that should never inherit the character attributes of the main text.
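
One way to picture the "separate context" idea is a nesting check that gives every note its own stack of open character styles. The sketch below is only an illustration of that idea, not how Paratext checks markup. The function name `check_nesting` and the tiny marker sets are hypothetical, and the sample verse assumes the example was meant to read with \+qt, \x* and \+qt*.

```python
import re

MARKER = re.compile(r"\\(\+?)([a-z]+\d*)(\*?)")
NOTE_MARKERS = {"f", "fe", "x"}               # note containers
NON_CHAR_MARKERS = {"c", "v", "p", "m", "q"}  # tiny illustrative subset


def check_nesting(usfm):
    """Return nesting errors, treating each note as its own style context."""
    errors = []
    stacks = [[]]  # stacks[0] is the main text; each open note adds a stack
    for match in MARKER.finditer(usfm):
        _plus, name, star = match.groups()
        closing = star == "*"
        if name in NON_CHAR_MARKERS:
            continue
        if name in NOTE_MARKERS:
            if not closing:
                stacks.append([])   # fresh context: nothing is inherited
            elif len(stacks) > 1:
                stacks.pop()        # styles left open in the note end with it
            else:
                errors.append(f"\\{name}* closes a note that was never opened")
            continue
        current = stacks[-1]
        if not closing:
            current.append(name)
        elif current and current[-1] == name:
            current.pop()
        else:
            errors.append(f"\\{name}* does not match the innermost open style")
    errors.extend(f"\\{name} is never closed" for name in stacks[0])
    return errors


verse = r'\v 1 Aaaaa bbb: \wj "ccc ddd \+qt eee fff\x + \xt EXO 20:1\x*gggg\+qt* hhhh.\wj*'
print(check_nesting(verse))
```

With this design the cross reference opens and closes without disturbing the \wj and \qt styles that are active around it, while genuinely overlapping main-text markup such as \qt \wj XXX\qt* YYY\wj* is still reported.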

Three: treatment of custom tags. In https://ubsicap.github.io/usfm/about/syntax.html#syntax-znamespace, I find that I as a software developer reading USFM am free to ignore any tag starting with \z. Great, but does that mean I discard the tag only, the tag and following text, or what? Does this thing have an end marker? Is it paragraph, character, or metadata markup? In general, if you send me USFM with \zAnything, it MUST come with a full, unambiguous description of what it is and how to properly ignore it without damaging the actual canon of Scripture. Right now, when I encounter something like that, it generally isn't necessary markup, and can be stripped out some way with regular expressions before I import it. I can't write general software for it, though, because I have no idea how it will be used. It is truly ambiguous.

Four: handling of links. There are five kinds of links: external URLs, links to a verse in another Paratext project, links to an article or tag in the current work, links to a specific Bible verse explicitly declared, and human-readable links that may or may not have been hammered into compliance with Paratext reference checks. The first few are handled by the \jmp ...\jmp* tag. This is roughly equivalent to the USFX ref tag, but not quite. The last one causes all kinds of problems in encoding historical texts, because honestly, the assumptions made by the programmers and designers just don't hold true of all the ways humans have encoded references. For starters, it shouldn't matter if I refer to Psalm 23 instead of Psalms 23. The first is grammatically correct, but the second is more likely to pass the Paratext reference checks. It also should not be a hindrance to internal link generation if I specify "Lk" or "Luke" in a reference, provided they are in \toc3 and \toc2. I should be able to preserve notes in historical texts that put chapter numbers in Roman numerals. There is more, but you get the idea. As for the links to verses in another Paratext project, those are pretty much useless outside of Paratext (i.e. in a published format). The \jmp tag says something about it being for use when no character style is active. Huh? Why?

Five: reference material. There is no standard way to format a glossary. It is pretty much a wild West of ideas. I try to deal with anything reasonable, but the results are sometimes strange. The weirdest ones are the ones where people insert chapter and verse numbers into glossaries. This one needs work. I took a good stab at it in engWEB14, but it isn't standardized well enough that I feel ready to properly automate conversion to a digital glossary for a Bible study app without doing custom preprocessing on a case-by-case basis. Another one that choked my processes was a hymnal inserted into XXA with chapter and verse numbers: and more songs than Psalms has chapters. I fixed my limits, but this gets strange.

Six: the optional line break, //, is a bit ambiguous as to just how optional it is. It is also not specified how to write // that is to be output literally, as in a URL for a web site where the Bible is hosted or where the translators have more information, like https://png.bible/ or https://wycliffe.org/.
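
As an illustration of the URL problem, here is one possible heuristic, which is an assumption and not anything USFM specifies: treat // as a break unless it directly follows a colon, as it does in a URL scheme such as https://. The function name `apply_optional_breaks` is hypothetical.

```python
import re

# "//" with optional surrounding spaces, but not when it follows ":".
OPTIONAL_BREAK = re.compile(r"\s*(?<!:)//\s*")


def apply_optional_breaks(text):
    """Replace optional line breaks with newlines, leaving URLs alone."""
    return OPTIONAL_BREAK.sub("\n", text)


print(apply_optional_breaks("Aaa // bbb, see https://png.bible/"))
```

This only sidesteps the most common case. A literal // that is not part of a URL would still need some escape convention that the standard would have to define.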

Seven: the incomprehensible "+" notation, which is almost never understood by ordinary working linguists. The worst problem with USFM is the + notation, and especially the + notation in footnotes. Although technically, I can unambiguously convert USFM to USFX and back in a similar manner that Paratext converts from USFM to USX and back, far more often than not I have to edit the text to make it comply with the broken standard. (USFX is an alternate way to represent USFM in XML that I invented before USX was invented, and which I still use as an internal hub standard, because it sheds some of the defects of USFM that USX retains.)

KentSpiel commented 2 years ago

Items 4. and 6. appear to be contradictory:

  4. If a character style is begun before another character style of the same class ends, it is assumed to be nested.

  6. If a character style of the same class is started before another of the same class ends, the first one is assumed to end before the next one started.

I do not think we should make an exception for sloppy markup in order to make life easier for technicians and publishers.

My solution is to clean up the SFM. It is possible to clean up common markup errors relatively easily. Perhaps there should be some methods added to PT to clean these up.

  1. This regex removes a closing \fx* tag and replaces it with the pre-preceding \fx_ tag: `Fix footnote closers#r#(?<=\(e?f)\s)(?s).?(?=\\1*):::(?<=\(f\w+\s)([^\]|\(?!f))?\(f\w+).?)\\3*(\s)(\\1)?#\4\\1` \ft text1 \fq text2\fq* text3 => \ft text1 \fq text2 \ft text3

  2. This regex looks for embedded markers and adds \+ markup where necessary: Fix embedded markers #r#(?<=(\\\w+)\s)(?s).*?(?=\1\*):::\\(\w+)(?s)(.*?)(\s*)\\\1\*#\\+\1\2\\+\1*\3

  3. This regex checks and then closes and reopens \wj where necessary: wj Cleanup#r#(?<=\\wj\s)(?s)((?!\\wj\*).)*(?=\u200F*\\wj\*):::((\s*(\\(b|p\w*|mi?|q\w*)\s|(\\(m?r|m?s\w*)\s.*)+|\s+\\v\s\S+\s|\\(x|ef|f|add)\s.*?\\(x|ef|f|add)\*))+)(\s*)#\\wj*\1\9\\wj

kahunapule commented 2 years ago

I was not real clear with 4. and 6. By "same class", I mean character styles that are always mutually exclusive (like the original footnote character styles, such as \ft, \fq, \fl, etc.), and those that normally are nested and always terminated, like \wj , \add , and \qt. Actually, Haiola already has automatic disambiguation logic in it that handles most cases. In my view, this is not a matter of sloppy markup creation by users, but sloppy markup design by designers and implementers who don't fully consider just how illogical their design is to the ordinary working linguist. (Yes, I have strong feelings about this, based on a great deal of experience "correcting" such markup.)

KentSpiel commented 2 years ago

@kahunapule I am not saying the syntax is not confusing or that it does not need to be cleaned up. There does need to be better definition of how it is to be applied. For example \ft xxx \fq yyy\fq* zzz and \ft xxx \fq yyy \ft zzz are equivalent in USFM but are not equivalent in USX (Note how the real space moves from _zzz to yyy_.)

I think I understand what you are saying if 4. is written:

  4. If a character style is begun before another character style of a different class ends, it is assumed to be nested.

If I am wrong about this please clarify.

In USFM the \+ notation also has the effect of telling PT that Cascading Style Sheet (CSS) styling should be applied to the text. In the default style sheet, regular text is 12pt and footnote text is 10pt. If I do not put the +, as in \f + \fr 1:1 \ft xxx \tl yyy\tl* zzz\f*, then \tl yyy\tl* is formatted as 12pt. When I add the +, \+tl yyy\+tl* is 10pt. I wonder if that difference could be reflected in USFM. If not, then I agree that the + notation would appear to be an artificial distinction (a technical requirement without any meaning).

kahunapule commented 2 years ago

To my way of thinking, nested character styles always imply using a cascading style sheet or equivalent, no matter if "+" syntax is used or not. Flattening texts to include all possible combinations of nested styles is inefficient, inelegant, and error-prone. For example, the \add ...\add* style in KJV tradition would add italics to whatever is underlying, be that canonical text or a footnote in a different size text quoting some portion of canonical text. Likewise, \wj ...\wj* would add optional red print to the underlying text. Trying to figure out what combinations of character styles can be combined will ultimately fail, if for no other reason than that someone will eventually add yet another character style, or another word-level attribute.

Anyway, the real issue is that once upon a time, no character style nesting was allowed, so any new character style was assumed to cancel any previous character style, and there were no end tags (like \fq*). That did not meet the needs of all real Bible translations, and it did not map well to XML. When end tags were introduced, it was done while trying to keep backwards compatibility (a good thing) and also trying to keep the very flat, one-style-at-a-time, no CSS mindset (a bad thing). When the flat model proved inadequate (predictably), the "+" syntax was applied like a band-aid over a compound fracture. It sort of helped, but left most end users terminally confused. While pretty much anyone reading this comment fully understands it, the vast majority of my customers do not, and never will.

On 3/28/22 04:46, Kent Spielmann wrote:

One of my biggest bugaboos is the need to flatten text before publishing. The only way to properly style a text in HTML or IDML is to create all of the styles that result from nesting. For example, we commonly have a transliterated Hebrew word in the Psalm superscriptions: \d xxx \tl yyy\tl*. In publication, yyy needs to be not italic. The need to transition between hierarchical and flattened expressions of the USX should not be overlooked, since the final goal of the format is not archiving but publication.
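
The flattening step described here can be sketched as a small function that turns nested markers into runs of text, each labeled with the combined style it sits in. This is a sketch under stated assumptions: the function name `flatten` and the "d+tl" naming scheme for combined styles are hypothetical, and real markup would also need paragraph boundaries and notes handled.

```python
import re

# Marker name, then either "*" (end marker) or the single space that
# terminates an opening marker; otherwise a run of plain text.
TOKEN = re.compile(r"\\\+?([a-z]+\d*)(?:(\*)| ?)|([^\\]+)")


def flatten(usfm):
    """Return (combined_style, text) runs for nested character styles."""
    runs, stack = [], []
    for match in TOKEN.finditer(usfm):
        name, star, text = match.groups()
        if text is not None:
            runs.append(("+".join(stack), text))
        elif star:
            # Close the named style and anything still open inside it.
            if name in stack:
                while stack.pop() != name:
                    pass
        else:
            stack.append(name)
    return runs


print(flatten(r"\d xxx \tl yyy\tl*"))
```

A publishing tool could then map each combined name to a concrete style, for example defining "d+tl" as non-italic inside an italic "d" paragraph, and convert back to the hierarchical form by splitting the name on "+".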


mhosken commented 5 months ago

The good news is that by requiring character styles to close, we have removed the need for +. It is now optional in USFM and deprecated.