Markup recognition rules

mbakeranalecta commented 7 years ago

This issue is provoked by #108. Some of the details there are worth looking at in considering this issue.

The issue is how the SAM parser should recognize when the writer is attempting to create a block vs when they are just creating a paragraph that happens to contain a colon. The issue arises in cases where a colon appears in a paragraph before the first space in that paragraph. Block names cannot contain spaces, so as long as there is a space before the colon, there is no confusion.

If the writer wishes to begin a paragraph with a string that contains a colon before the first space, they have to escape the colon.

In the case where the colon appears inside a code annotation, it is impossible to escape the colon, since code annotations are interpreted literally. So the name recognition rules must preclude recognition of a backtick in a name.

SAM's name rules require all names to be valid XML element names. This simplifies compatibility with XML, so that we can more easily use existing XML tool chains to process SAM. (This could include plugging a SAM parser into an XSLT engine, for instance, so that you could use XSLT to process SAM directly.)

However, there is the issue of when and if SAM should recognize that the writer is trying to create a block even if the name they are giving it is not a valid block name. For instance, if they write:

foo+bar: Foo bar

Plus is not a valid XML name character, so this is not a valid SAM block name. But the question is whether we should recognize that the writer is trying to create a block and warn them that the name is invalid, or whether we should just interpret this line as a paragraph.

Note that there will be cases where we cannot detect that they intended to create a block and must interpret the line as a paragraph:

foo bar: Foo bar

foo`bar`: Foo bar

Any attempt to recognize the attempt to create a block requires some rule about what constitutes a recognizable attempt vs what is actually a valid name. The questions are:

Should we make this distinction at all?
If so, what should the rule for recognizing the attempt be?
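
To make the two questions concrete, here is a minimal Python sketch of one possible answer. It is not the SAM parser's actual code. The function name `classify_line`, the three result labels, the ASCII-only approximation of an XML name, and the use of a backslash as the escape character are all assumptions made for illustration.

```python
import re

# ASCII-only approximation of an XML element name (assumption: the real
# XML Name production also permits many non-ASCII characters). The colon
# is left out because it terminates the block name here.
XML_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")

# Characters that rule out even an *attempted* block name in this sketch:
# whitespace (names cannot contain spaces) and the backtick, since a colon
# inside a code annotation cannot be escaped.
NOT_AN_ATTEMPT = re.compile(r"[\s`]")


def classify_line(line):
    """Return 'block', 'invalid-block-name', or 'paragraph' (hypothetical labels)."""
    text = line.lstrip()
    head, sep, _rest = text.partition(":")
    if not sep or not head:
        return "paragraph"            # no colon, or nothing before it
    if head.endswith("\\"):
        return "paragraph"            # assumed backslash escape of the colon
    if NOT_AN_ATTEMPT.search(head):
        return "paragraph"            # space or backtick before the colon
    if XML_NAME.fullmatch(head):
        return "block"                # well-formed block name
    return "invalid-block-name"       # looks like a block, but bad name


for sample in ("foo: Foo bar", "foo+bar: Foo bar",
               "foo bar: Foo bar", "foo`bar`: Foo bar"):
    print(f"{sample!r} -> {classify_line(sample)}")
```

Under these assumptions, the `foo+bar:` example is the only one flagged as an attempted block with an invalid name; the other two problem cases fall through to paragraphs, as described above.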

mbakeranalecta commented 7 years ago

Changed the title to Markup Recognition Rules because the same issue applies to annotations. For example, given the following markup, what should the parser do?

This document is the {output}(foo of the test document

1. Recognize the phrase and treat the opening ( immediately afterwards as plain text, thus producing the following XML:
```
<p>This document is the <phrase>output</phrase>(foo of the test document</p>
```

2. Treat the entire thing as plain text, thus producing:

 <p>This document is the {output}(foo of the test document</p>

3. Treat it as an incomplete annotation and raise a markup error?

I am inclined towards option 3. The current implementation does option 1.

mbakeranalecta commented 7 years ago

It is worth noting that there is another option to consider. We could have strict rules for markup recognition, which would make it easy to define what is and is not a well-formed SAM document, but allow the parser to optionally warn the user if it sees a sequence that might have been intended as markup. Thus, in the example from the previous comment, it would treat (foo as plain text but could optionally issue a warning that it might be an incomplete annotation.
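
As a rough illustration of the "strict rules plus optional warning" idea, the Python sketch below scans a line for a phrase in braces that is immediately followed by an opening parenthesis that is never closed. The function name `find_incomplete_annotations`, the `PHRASE` pattern, and the "no closing parenthesis later in the text" heuristic are assumptions for this example, not how the SAM parser works. The text itself is left untouched; the function only reports candidates a parser could warn about.

```python
import re

# A phrase is approximated as a non-empty run of characters in braces
# (assumption: nested or escaped braces are ignored in this sketch).
PHRASE = re.compile(r"\{([^{}]+)\}")


def find_incomplete_annotations(text):
    """Return (offset, phrase) pairs where '(' follows a phrase but never closes."""
    suspects = []
    for match in PHRASE.finditer(text):
        end = match.end()
        opens_annotation = text[end:end + 1] == "("
        if opens_annotation and ")" not in text[end:]:
            suspects.append((end, match.group(1)))
    return suspects


line = "This document is the {output}(foo of the test document"
for offset, phrase in find_incomplete_annotations(line):
    print(f"warning: possible incomplete annotation on {{{phrase}}} at column {offset}")
```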

mbakeranalecta commented 7 years ago

Overall my preference here is not to come up with a mathematically pure definition of markup recognition, but one that is most likely to allow intentional text constructs to be entered without modification but also spot accidental typos in what were intended to be markup constructs. This is very much a matter of common usage rather than logical patterns. Thus the likelihood that {output}(foo represents the intention to create an unannotated phrase output followed, with no spaces, by the string (foo, is very slight, and so it would be preferable to raise an error, or at least a warning.

mbakeranalecta commented 7 years ago

When it comes to annotations, though, the rules for recognizing the intent to annotate might be complex. Do we step through all the various forms an annotation can take in determining whether a well-formed annotation exists? This would probably require a parser rewrite to reduce the reliance on complex regular expressions. (Though this should happen anyway at some point.)

mbakeranalecta commented 7 years ago

Adding @dram's suggestion from #108:

How about define SAM's own rule for valid block names? Something like:

 all characters except special characters already used as markup in SAM. e.g. `, {, (.
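
For comparison with the XML-name rule, a looser rule along the lines of this suggestion might look like the Python sketch below. The exact set of excluded characters is an assumption: the suggestion only names the backtick, {, and ( as examples, and whitespace and the colon are added here because block names cannot contain spaces and the colon ends the name. The function name `is_loose_block_name` is hypothetical.

```python
import re

# One or more characters, none of which is whitespace, a colon, or one of
# the markup characters named in the suggestion (backtick, '{', '(').
LOOSE_BLOCK_NAME = re.compile(r"[^\s:`{(]+")


def is_loose_block_name(name):
    """True if `name` passes the looser, SAM-specific rule sketched here."""
    return LOOSE_BLOCK_NAME.fullmatch(name) is not None


for candidate in ("foo", "foo+bar", "foo`bar`", "foo bar"):
    print(f"{candidate!r}: {is_loose_block_name(candidate)}")
```

Under this rule `foo+bar` would become a valid block name, at the cost of no longer guaranteeing that every SAM name is a valid XML element name.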

mbakeranalecta commented 6 years ago

On reflection, the most sensible way to handle issues of markup intent is not with parser errors or warnings, but with syntax highlighting at the editor level. If SAM goes anywhere, people will write syntax highlighters for it for common editors, and that will handle 99% of this problem.

I would rather minimize the number of times writers have to escape characters to avoid false markup recognition, so I don't want to raise an error in these cases. Any parser implementation, however, should feel free to warn about it if it wants to. Closing.

mbakeranalecta / sam

Markup recognition rules #109