usfm-bible / tcdocs

Technical Committee Documents

USFM has an escape problem. #19

Closed kahunapule closed 6 months ago

kahunapule commented 2 years ago

If you wanted to put a \ in Bible text, could you? No. Not even with \\. If you wanted to put a ~ in Bible text, could you? Note: this is in the orthography for one Papua New Guinean language. If you wanted to put a // in Bible text, could you? This is often needed in front matter when an Internet URL is included. If you wanted to put a | in Bible text, could you? Note: this is used as punctuation in at least 3 Bible translations.

Sure, there are work-arounds. For the PNG language, I replaced tilde with math operator tilde. Wrong symbol, but it looks right. Bad use of Unicode. For //, I programmed Haiola to ignore // as a line break when preceded by P: or S: (case insensitive). For |, I am narrowing the scope of where I recognize this as markup. Does Paratext do it the same way? I doubt it.

KentSpiel commented 2 years ago

Great comment. I am wondering, though: is this a USFM problem, a USX problem, a Paratext editor problem, or a combination of all three? It seems this is a case where the tool is driving the standard. Most problematic to me is the //. Not only because it is common in URLs, but also because its use as an <optbreak /> is not well documented. It is apparently understood by PubAssist, but I do not think it is a feature that would be correctly interpreted by HTML or native InDesign. Am I wrong? Another issue is that I want to be able to write documentation in Paratext about using Paratext in the XXG book. I am not able to do that because I can't actually display the markers I want to talk about. It would be great if there were a way to display markers as literal text, such as \p. Perhaps a ` tick would be a good choice for an escape character and a way to display literal markers.

kahunapule commented 2 years ago

The escape problem is first a USFM problem, but it is also a Paratext and DBL problem, and (to a lesser degree) a USX problem. As you observed, the tools drive the standard, and the biggest driver of the standard is Paratext. Yes, // is the most problematic. It almost assumes a human operator to interpret every instance. When converting to HTML, I just convert it to <br> UNLESS it is immediately preceded by p: or s: (case insensitive), as in https://wycliffe.org or http://biblica.org.
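
A minimal Python sketch of that heuristic, for illustration only. This is not Haiola's actual code; the function name `optbreak_to_html` and the choice of `<br>` as the output tag are assumptions made for the example.

```python
import re

# Match "//" only when it is NOT directly preceded by "p:" or "s:"
# (either case), so "http://" and "https://" are left alone.
_OPTBREAK = re.compile(r"(?<![pPsS]:)//")

def optbreak_to_html(text: str) -> str:
    """Replace the USFM optional line break "//" with an HTML line break.

    Hypothetical helper: a heuristic, not a real escape mechanism.
    """
    return _OPTBREAK.sub("<br>", text)

if __name__ == "__main__":
    print(optbreak_to_html("The Good News//according to Mark"))
    print(optbreak_to_html("See https://wycliffe.org or HTTP://biblica.org"))
```

Because the rule guesses from context, any other text ending in "p:" or "s:" directly before // is also left untouched, which is exactly the kind of ambiguity a real escape sequence would remove.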

Picking another character, like backtick, is not a good solution to the escape problem. Someone will use that character in text (and already has), just like tilde and vertical bar. The standard information theory approach is to make a way to escape back to the unescaped character. For example, \\ could be defined to be a literal single \, like with regular expressions. There are lots of good options. Just saying "nobody needs to do that" is not satisfying, nor is it likely to remain true. This is already a solved problem in XML, in that <, >, and & have special meanings that are not part of the encoded text, but each one of them can be encoded with entities, like &lt;, &gt;, and &amp;. The only reason this is a problem in USX, an XML format, is that it is crippled by its round tripping to USFM.
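
To illustrate the XML side of that comparison, here is a small sketch using Python's standard `xml.sax.saxutils` module. The sample string is made up for the example; the point is only that every reserved character has an encoded form and the round trip is lossless.

```python
from xml.sax.saxutils import escape, unescape

# Sample text containing all three characters that XML reserves.
sample = "Gen < Exo & Lev > Num"

# escape() encodes &, < and > as entities; unescape() reverses it.
encoded = escape(sample)
decoded = unescape(encoded)

if __name__ == "__main__":
    print(encoded)
    print(decoded == sample)
```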

KentSpiel commented 1 year ago

I would propose backslash as the escape character, so a backslash as text is \\ = \ (this would be useful for allowing documentation of USFM in USFM format), tilde is \~, and double forward slash as text is \/\/. If a pipe needs to be escaped, then \|. I do not think any other characters need to be escaped.
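
A rough Python sketch of how such a scheme could round-trip, assuming exactly the four sequences listed above (\\, \~, \/ and \|). The function names are hypothetical, and nothing here reflects how Paratext or any other tool actually behaves.

```python
import re

# Characters that would need escaping in literal text: backslash, tilde,
# pipe, and any run of two or more forward slashes (a lone "/" is harmless).
_SPECIAL = re.compile(r"[\\~|]|/{2,}")

# A backslash followed by one of the escapable characters.
_ESCAPED = re.compile(r"\\([\\~|/])")

def escape_literal(text: str) -> str:
    """Prefix every character of each special match with a backslash."""
    return _SPECIAL.sub(lambda m: "".join("\\" + ch for ch in m.group(0)), text)

def unescape_literal(text: str) -> str:
    """Turn escape sequences back into the literal characters."""
    return _ESCAPED.sub(r"\1", text)

if __name__ == "__main__":
    for raw in ["https://eBible.org", "a\\b", "x|y~z"]:
        escaped = escape_literal(raw)
        print(raw, "->", escaped, "->", unescape_literal(escaped))
```

Since ordinary markers such as \p do not start with one of these four sequences, `unescape_literal` leaves them untouched. A real parser would still need to apply the unescaping only to text runs, after markers have been recognised.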

davidg-sil commented 11 months ago

PTXprint response to these challenges:

For //, a zero width non-breaking space (U+FEFF) in the middle breaks its magic properties. This can be entered manually or applied with a changes.txt entry (regex search/replace), e.g. "(\S+)://" > "\1:/\uFEFF/"

For \, I agree, this hasn't been possible. If Paratext allowed it, making it possible (via \\) would not be very hard.

For ~, TeX defaults to interpreting \~ as an accent. I don't know if anyone uses it like this, but it is certainly easy to fix.

| is recognised as a begin-attributes marker in character styles, and interpreted literally outside them. \| is always a vertical bar / pipe.
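
For readers who want to try the // workaround outside PTXprint, the changes.txt rule above can be approximated in Python. This is only a sketch of the same search/replace; the function name `neutralize_url_slashes` is made up for the example, and PTXprint's own regex handling may differ in detail.

```python
import re

# Same pattern as the changes.txt rule: a run of non-space characters
# followed by "://".
_URL_SCHEME = re.compile(r"(\S+)://")

def neutralize_url_slashes(text: str) -> str:
    """Insert U+FEFF between the two slashes so "//" no longer appears."""
    return _URL_SCHEME.sub(lambda m: m.group(1) + ":/\ufeff/", text)

if __name__ == "__main__":
    fixed = neutralize_url_slashes("See https://eBible.org for details")
    print("//" in fixed)  # the literal "//" sequence is gone
    print(fixed.replace("\ufeff", "") == "See https://eBible.org for details")
```

The inserted character is invisible, so the URL looks the same in print, but copying it out of the typeset text may carry the extra code point along.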

Thus I propose that the backslash be the escape code. However we could, I suppose, allow a milestone \backslash\* to insert backslashes with no ambiguity about spaces after it.

mhosken commented 6 months ago

Done