ubsicap / usfm

Unified Standard Format Markers

Scope of \rem #130

Open mhosken opened 3 years ago

mhosken commented 3 years ago

The specification document gives no indication of the range of coverage of a \rem. The stylesheet specifies it as being a paragraph marker. But my impression is that \rem covers everything up to the next newline.

If the specification is that it is a paragraph marker, then it is impossible to put \rem around any paragraph marker, and the content must be a well-formed paragraph (with appropriate closing markers where necessary). If the specification is everything up to the next newline, then any marker can be remarked away, including paragraph markers and the like. My impression is that this is how most people think of it.

PTXprint currently treats it as a paragraph marker with no consideration of the impact of a newline.

All clarification welcomed.

If we say that the \rem marker scope is a single line, then that raises the question of whether there are other markers that are similarly scoped (\toc#, \h). But that is a wider debate with each one being taken case by case.

cmahte commented 3 years ago

The important bit about \rem being a paragraph marker is that it begins after a newline. The function of \rem is to disable parsing until the next newline.

The comment about it being a paragraph is to suggest that it does not belong mid-sentence, and that there is no well-formed \rem* end-tag.

Your understanding seems correct to me:

  1. \rem disables USFM parsing until the next end of line, with the further understanding that
  2. \rem must begin after a newline/return;
  3. there is no such thing as a \rem ... \rem* character tag;
  4. there is no multi-line \rem tag. If the comment spans multiple lines, multiple \rem tags are required.
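
The four points above can be sketched in a few lines of Python. This is only an illustration of the line-scoped reading, not code from PTXprint or any other tool; the function name `strip_rem_lines` is hypothetical.

```python
def strip_rem_lines(usfm: str) -> str:
    """Remove every line whose first marker is rem (line-scoped reading)."""
    kept = []
    for line in usfm.splitlines():
        # A remark must start the line; it then runs to the end of that line only.
        if line == "\\rem" or line.startswith("\\rem "):
            continue
        kept.append(line)
    return "\n".join(kept)


sample = "\\rem \\is1 Sometime I'll write the heading\n\\p In the beginning\n\\rem note\ncontinued text"
print(strip_rem_lines(sample))
```

Under this reading the \is1 inside the first remark is discarded along with the rest of its line, while "continued text" survives because the second remark ended at its own newline.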

klassenjm commented 3 years ago

A bit of additional input:

As noted, \rem is a paragraph type marker. As with other paragraph markers, the scope would continue until the next paragraph type marker, or the end of the file.

An attempt has been made to describe whitespace in USFM. In a well-formed USFM document all paragraph markers should be preceded by a newline -- but a newline does not indicate the end of the current paragraph or the start of a new paragraph. The whitespace notes indicate that USFM considers space (U+0020), tab (U+0009), and newline characters to be whitespace, and that multiple whitespace within the body text of a paragraph are not significant and should be normalized when trying to produce a 'well-formed' document (vs just valid USFM).
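
For contrast, here is a rough sketch of the paragraph-scoped reading described above, where whitespace is normalized and a paragraph runs until the next paragraph marker. The `PARAGRAPH_MARKERS` set is a hypothetical subset chosen for illustration (a real tool would take the list from the stylesheet), and `split_paragraphs` is an assumed name, not an existing API.

```python
import re

# Hypothetical subset of paragraph-type markers, for illustration only.
PARAGRAPH_MARKERS = {"rem", "p", "m", "ip", "is1", "h"}


def split_paragraphs(usfm: str) -> list[tuple[str, str]]:
    """Return (marker, content) pairs, each paragraph ending at the next paragraph marker."""
    # Space, tab and newline are all whitespace; runs of them are collapsed to one space.
    text = re.sub(r"[ \t\r\n]+", " ", usfm).strip()
    names = "|".join(sorted(PARAGRAPH_MARKERS, key=len, reverse=True))
    pattern = re.compile(r"\\(" + names + r")(?=\s|$)")
    matches = list(pattern.finditer(text))
    paragraphs = []
    for i, m in enumerate(matches):
        # Content runs up to the start of the next paragraph marker, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        paragraphs.append((m.group(1), text[m.end():end].strip()))
    return paragraphs


print(split_paragraphs("\\rem first line\nsecond line\n\\p Body text"))
```

With this reading a newline inside a remark does not end it, but a paragraph marker on the same line does: `\rem \is1 Heading` splits into an empty remark followed by a live \is1 paragraph.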

davidg-sil commented 1 year ago

So... to summarise:

\rem Someone needs to write this section
\rem \is1 Sometime I'll write the heading
\rem \ip We need an introduction too!

(a) is badly formed USFM; (b) should produce output; (c) therefore PTXprint's current code (requested by people who think \rem kills everything until the newline) is in error; (d) if it were written out as properly formed USFM (with a newline before \is1, etc.), then PTXprint's current interpretation would not give an error.

mnjames commented 1 year ago

My own 2 cents is that PTXprint's interpretation (that \rem kills everything until the new line) should be considered the correct one, and that the USFM documentation should be updated to clarify that.

If that's the case, then either \rem becomes a new, unique kind of marker, or it's a paragraph marker but with some significant caveats (like the fact that other paragraph markers can occur on the same line and don't have to start a new line in that case).

If the above is not true, then the result is that there is no way within USFM to have a comment in the text which has a meta-reference to a paragraph marker. E.g.: \rem The following \p marker was changed to \m because...

I find these sorts of comments particularly helpful when creating Bible Modules.

cmahte commented 1 year ago

I agree with the sentiment to have clarification that \rem is terminated at the first end of the actual paragraph, regardless of tags that might appear within the remark.

However, I regularly scrub backslashes out of \rem lines and replace the USFM tag with bracketed information: \p becomes [p] within the comment. This is done in the USFM itself, not on conversion. That is, any backslash matched by a "(^\\rem [^\n]+)\\(\w+)" search is an error, and is replaced with $1 [$2]. I do this specifically to limit overprocessing the file on conversion. This check occurs after Jeff's described "processing into well formed USFM whitespace", which means stray line endings with no tag following them have already been replaced with a single space.
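
A minimal Python sketch of that scrub follows. It is an approximation, not the actual filter described above: the name `bracket_markers_in_rem` is hypothetical, the prefix uses `[^\n]*` so that a marker directly after `\rem ` is also caught, the replacement omits the extra space, and the substitution is repeated because each pass only rewrites the last marker on a line.

```python
import re

# A backslash marker anywhere after the rem marker on the same line.
REM_MARKER = re.compile(r"^(\\rem [^\n]*)\\(\w+)", re.MULTILINE)


def bracket_markers_in_rem(usfm: str) -> str:
    """Rewrite markers inside remark lines as bracketed names, e.g. [p]."""
    while True:
        # The greedy prefix matches up to the last marker on the line, so loop until none remain.
        usfm, count = REM_MARKER.subn(r"\1[\2]", usfm)
        if count == 0:
            return usfm


print(bracket_markers_in_rem("\\rem The following \\p marker was changed to \\m because..."))
```

Lines that do not begin with \rem are left untouched, so markers in ordinary text are unaffected.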

My USFM parser is extensible, meaning any slash it finds following a newline is treated like a paragraph tag, whether it has a style for it or not. And any tag it finds midline that hasn't already been processed is treated like a character style. This limits the entire USFM->XML filter to < 200 replacements, and that's for all 800ish tags that the USFM manual(s) imply.
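
The position-based rule in the previous paragraph can be sketched roughly as below. This is an assumed illustration, not the parser being described: `classify_markers` is a hypothetical name, and details such as nested `\+` markers are ignored.

```python
import re

# A backslash, a marker name, and an optional closing asterisk.
MARKER = re.compile(r"\\(\w+\*?)")


def classify_markers(usfm: str) -> list[tuple[str, str]]:
    """Label each marker by position alone, without consulting a stylesheet."""
    result = []
    for m in MARKER.finditer(usfm):
        # A marker at the start of the text or right after a newline counts as a paragraph tag.
        at_line_start = m.start() == 0 or usfm[m.start() - 1] == "\n"
        result.append((m.group(1), "paragraph" if at_line_start else "character"))
    return result


print(classify_markers("\\p In the \\nd Lord\\nd* spoke\n\\q1 Poetry"))
```

Because the classification depends only on where the backslash sits, a stray marker inside a \rem line would be treated as a character style, which is why scrubbing such markers beforehand matters for this kind of parser.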

So "well-formed" PSFM in your case would look like

\rem The following [p] marker was changed to [m] because...

Which to me doesn't affect readability, but does ensure the tags don't escape into print.
