usfm-bible / tcdocs

Technical Committee Documents

Implicit marker closure #38

Closed mhosken closed 3 months ago

mhosken commented 1 year ago

Implicit closure is nice. Given we always know what is a starting marker and we also know whether something is embedded, it is possible to implicitly close things. For example, the start of a new paragraph implicitly closes all character styles in the previous paragraph. Starting a new character style closes all open character styles including any currently open embedded character styles.

The difficulty is with parsing. Most parsers are based on some notion of recursive descent. This makes true implicit closure hard to implement, and a simple grammar can end up treating a sequence of sibling runs as embedded runs. For example:

\f + \fr 1:17 \ft This is a footnote\fr*\f*

This is obviously invalid, since the \ft closes the \fr. But if we simply say that \fr and \ft are optional and use a typical recursive descent parser, then this example is usually accepted as valid and the \ft section is assumed to be embedded within the \fr. Adding support to reject this example takes a lot of work in a grammar. One has to say that the end of a run is either the closing marker or the start of what might possibly come next. The list of 'what might possibly come next' can be long and tricky to work out for each context.
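To make the point concrete, here is a minimal Python sketch of the non-recursive view of a footnote, where each structural marker ends the previous run. It is only an illustration: the marker set `NOTE_STRUCTURAL` is a hypothetical subset, the names `tokenize` and `split_note_runs` are invented for this sketch, and it handles a single \f note and nothing else.

```python
import re

# Hypothetical subset of footnote-internal structural markers (illustration only).
NOTE_STRUCTURAL = {"fr", "ft", "fk", "fq", "fqa"}
MARKER = re.compile(r"\\(\+?[a-z]+\d*)(\*?)")


def tokenize(text):
    """Yield ("open" | "close" | "text", value) tokens for a USFM fragment."""
    pos = 0
    for m in MARKER.finditer(text):
        if m.start() > pos:
            yield ("text", text[pos:m.start()])
        yield ("close" if m.group(2) else "open", m.group(1))
        pos = m.end()
    if pos < len(text):
        yield ("text", text[pos:])


def split_note_runs(note):
    """Split one \\f ... \\f* note into (marker, text) runs.

    A new structural marker implicitly ends the run before it, so a closing
    marker for a run that has already ended is reported as an error.
    """
    runs, current, buf = [], None, []
    for kind, value in tokenize(note):
        if kind == "text":
            buf.append(value)
        elif value == "f":
            if kind == "close":
                if current is not None:
                    runs.append((current, "".join(buf).strip()))
                return runs
            # Opening \f: the caller text that follows is dropped later.
        elif value in NOTE_STRUCTURAL:
            if kind == "close" and value != current:
                raise ValueError(f"\\{value}* closes a run that is not open")
            if current is not None:
                runs.append((current, "".join(buf).strip()))
            # An explicit close leaves no run open; an open starts a new one.
            current, buf = (None if kind == "close" else value), []
        else:
            # Any other marker is kept verbatim as part of the run text.
            buf.append("\\" + value + ("*" if kind == "close" else ""))
    raise ValueError("note is missing its closing \\f*")
```

With this view, `\f + \fr 1:17 \ft This is a footnote\f*` splits into an \fr run and an \ft run, while the example above raises an error at `\fr*` because the \ft has already ended the \fr run.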

Based on this, it is proposed to tighten the USFM specification to remove more implicit closure than has already been removed. The proposed rules are:

  1. All embedded and non-embedded character styles must be explicitly closed.
  2. Notes (footnotes, cross references, etc.) have internal structural markers like \fr and \ft. These must not be explicitly closed; each run ends at the next structural marker or at the end-of-note marker. Note that \xt is structural within a cross reference, but may also be used as a character style elsewhere, where it must be explicitly closed. The note itself must be explicitly closed.
  3. Paragraphs have no end paragraph marker and are implicitly closed by the start of another paragraph or by a chapter milestone.
  4. Table rows are terminated by another table row or by the start of a non-table paragraph marker.
  5. Table cells are terminated by the start of another cell or row or by a non-table paragraph marker. This is problematic if we want to support multi-paragraph table cells.
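As a rough illustration of how rules 1 and 3 interact, the following Python sketch reports character styles that are still open when a paragraph or chapter marker arrives. It rests on assumptions that are not part of the proposal: the two marker sets are small hypothetical subsets, the function name `unclosed_character_styles` is invented, and + markers are simply treated as their plain equivalents.

```python
import re

# Hypothetical subsets of the real marker inventories (illustration only).
PARAGRAPH_MARKERS = {"p", "m", "q1", "q2", "s1"}
CHARACTER_MARKERS = {"nd", "wj", "add", "bd", "it"}
# An optional "+" is skipped, so \+nd is handled exactly like \nd.
MARKER = re.compile(r"\\\+?([a-z]+\d*)(\*?)")


def unclosed_character_styles(usfm):
    """Return a list of messages for character styles lacking an explicit close."""
    problems = []
    open_styles = []  # stack of currently open character styles
    for m in MARKER.finditer(usfm):
        name, is_close = m.group(1), bool(m.group(2))
        if name in CHARACTER_MARKERS:
            if not is_close:
                open_styles.append(name)
            elif open_styles and open_styles[-1] == name:
                open_styles.pop()
            else:
                problems.append(f"\\{name}* does not match the innermost open style")
        elif name in PARAGRAPH_MARKERS or name == "c":
            # Rule 3: the paragraph ends here. Rule 1: nothing may still be open.
            problems.extend(f"\\{s} not closed before \\{name}" for s in open_styles)
            open_styles.clear()
    problems.extend(f"\\{s} not closed at end of text" for s in open_styles)
    return problems
```

For example, `unclosed_character_styles(r"\p the \nd Lord \p next")` reports that \nd was not closed before the second \p, whereas the same text with `\nd*` added returns an empty list. Table rows and cells (rules 4 and 5) are not covered by this sketch.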

The astute reader will have caught the implication of rule 1. By explicitly closing character styles, the need for + type markers is removed. While it is planned to remove them (or at least treat them redundantly as equivalent to their non-plussed cousins), this change is not planned as part of the first phase of documenting the USFM standard as it stands.

RobH123 commented 1 year ago

Paragraphs have no end paragraph marker and are implicitly closed by the start of another paragraph or by a chapter milestone. (emphasis added)

Are there no paragraphs in the Bible that cross a chapter boundary? I think there might be. (I know that "sections" do.)

Paragraphs should also be closed by things like section headings. (Or do you regard them as paragraphs?)

davidg-sil commented 8 months ago

Yes, there are certainly mid-paragraph chapters. The \nb continuation marker exists to tell the typesetting engine that the verse text continues without a paragraph break.

mhosken commented 8 months ago

Section heads, etc. are all paragraphs. The basic structure of a USX/USFM document is a sequence of paragraphs of various kinds: section heads, main text, header fields, etc. There can also be milestones interspersed between paragraphs. But that's about it. It's actually pretty simple, although seeing the wood for the trees can be tricky!
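The flat structure described here can be sketched in a few lines of Python. This is an illustration under assumptions, not a reference implementation: the marker sets are hypothetical subsets, `flat_structure` is an invented name, and everything inside a paragraph (verses, character styles, notes) is left as raw text.

```python
import re

# Hypothetical subsets (illustration only): paragraph-level markers and milestones.
PARAGRAPH_MARKERS = {"id", "h", "s1", "p", "m", "q1", "nb"}
MILESTONE_MARKERS = {"c"}
MARKER = re.compile(r"\\([a-z]+\d*)")


def flat_structure(usfm):
    """Return a flat list of (kind, marker, content) items in document order."""
    block_markers = PARAGRAPH_MARKERS | MILESTONE_MARKERS
    starts = [m for m in MARKER.finditer(usfm) if m.group(1) in block_markers]
    items = []
    for i, m in enumerate(starts):
        # Each item runs up to the next paragraph or milestone marker.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(usfm)
        kind = "milestone" if m.group(1) in MILESTONE_MARKERS else "paragraph"
        items.append((kind, m.group(1), usfm[m.end():end].strip()))
    return items
```

Given `\s1 A heading \p \v 30 end of chapter one \c 2 \nb \v 1 continues here`, this yields a section-head paragraph, a \p paragraph, a chapter milestone, and an \nb paragraph, with no nesting between them.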