usfm-bible / tcdocs

Technical Committee Documents

Text Definition #8

Closed KentSpiel closed 6 months ago

KentSpiel commented 2 years ago

Define Text. I suggest Text (Text) is a string of characters that does not begin with a White Space (WS) but can end with an optional WS. Two or more consecutive WS within a Text are treated (normalized?) as a single regular space (U+0020). Moreover Text cannot contain

USFM canonical form should allow only a single normal space at the end of a Text. Paratext has a tool under Project Menu > Advanced > Standardized whitespace which normalizes WS.
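A minimal sketch of the normalization proposed above, not a description of what Paratext's tool actually does. It assumes WS means space, tab, carriage return, or line feed, and the function name `normalize_text` is hypothetical:

```python
import re

# Assumption: WS is one of space, tab, carriage return, or line feed.
_WS_RUN = re.compile(r"[ \t\r\n]+")


def normalize_text(raw: str) -> str:
    """Return a canonical Text under the proposal in this issue.

    - every run of WS becomes a single U+0020
    - a Text may not begin with WS
    - a Text may end with at most one space
    """
    # Collapsing runs first guarantees at most one leading and one trailing space.
    collapsed = _WS_RUN.sub(" ", raw)
    # Drop the leading space, if any; a single trailing space is allowed to stay.
    return collapsed.lstrip(" ")


print(repr(normalize_text("  In the\n  beginning \t ")))
```

Because the function is idempotent, running it on already canonical Text leaves the Text unchanged.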

Currently Paratext allows and retains spaces and paragraph returns in the underlying USFM. This is for the convenience, and at the request, of certain power users who find this feature useful for working on the underlying USFM. Our USFM standard needs to allow this as a non-canonical form of USFM. However, normalization of WS should occur before transformation of USFM to USX. Non-canonical WS will not be round-tripped: USFM->USX->USFM

mhosken commented 2 years ago

I'll share my informal understanding of what a Text terminal is. At its core a Text terminal is a string of characters, which may include whitespace and even backslash-delimited characters such as *, \, etc. Thus Text also includes ~ and //. Whether we therefore treat Text as a non-terminal in a parser is up to the implementation.

An additional part of the definition of Text is that it is the minimum string of characters such that the rest of the parse succeeds. Thus in the example of a category we have: category: ("\cat" WS Text WS "\cat*"). This says that the single whitespace character following the '\cat' is not part of the Text and that any whitespace characters preceding the '\cat*' are also not part of the Text. Thus, for example, if there are two spaces following the '\cat', the second is considered part of the Text terminal. Since a category is not vernacular text, it is probable that we want all initial and final whitespace to be stripped from the Text in this context, and the grammar should be adjusted accordingly.
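The "minimum string" behaviour can be illustrated with a non-greedy regular expression. This is only a sketch under assumptions: WS is taken to be space, tab, CR, or LF; the opening WS is a single character; the closing WS is a run of one or more; and the names `CATEGORY` and `category_text` are hypothetical, not part of any USFM tooling.

```python
import re

# Assumption: WS is one of space, tab, carriage return, or line feed.
WS = r"[ \t\r\n]"

# "\cat" WS Text WS "\cat*", with Text matched lazily so it is the shortest
# string that still lets the rest of the rule succeed.
CATEGORY = re.compile(
    r"\\cat" + WS + r"(?P<text>.*?)" + WS + r"+\\cat\*",
    re.DOTALL,
)


def category_text(usfm: str):
    """Return the Text inside the first category, or None if there is none."""
    match = CATEGORY.search(usfm)
    return match.group("text") if match else None


# One space on each side: neither belongs to the Text.
print(repr(category_text(r"\cat People \cat*")))
# Two spaces after \cat: the second one becomes part of the Text,
# while both spaces before \cat* are excluded.
print(repr(category_text("\\cat  People  \\cat*")))
```

If categories should never carry surrounding whitespace, as suggested above, a caller could apply `.strip()` to the result, or the grammar could allow a run of WS after `\cat` as well.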

In a situation where vernacular text is involved, whitespace before a marker is considered part of the Text and therefore care should be taken that the grammar definition not have a WS terminal that might remove such a significant space.
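To make the vernacular case concrete, here is a small tokenizer sketch in which the space before a marker stays in the Text. The marker pattern is a deliberately simplified assumption (lowercase letters plus optional digits, with closing markers ending in `*`), and `MARKER` and `tokenize` are hypothetical names:

```python
import re

# Simplified, hypothetical marker pattern:
# - closing markers end in "*" and consume nothing else
# - opening markers consume the single WS character that follows them
MARKER = re.compile(r"(\\[a-z]+[0-9]*\*|\\[a-z]+[0-9]*[ \t\r\n]?)")


def tokenize(usfm: str):
    """Split a USFM fragment into marker tokens and Text tokens."""
    # re.split keeps the captured markers; empty strings are dropped.
    return [token for token in MARKER.split(usfm) if token]


tokens = tokenize("said the \\nd Lord\\nd* to")
print(tokens)
```

Here the Text before `\nd` is `"said the "`, with its trailing space intact. A rule that placed a WS terminal in front of the marker would swallow that space and join the two words on output.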