mbakeranalecta / sam

Semantic Authoring Markdown

Invalid block name for `foo::bar` #108

Closed dram closed 7 years ago

dram commented 7 years ago

The following content causes an error in SAM:

section: foo

  `foo::bar` baz.

Error message:

SAM parser ERROR: Invalid block name: `foo
Process completed with 1 errors.

mbakeranalecta commented 7 years ago

There is an interesting issue here about how the SAM parser should recognize block names. Block names have to be valid XML names. But the definition of a valid XML name is complex, mostly because the definition of the XML character set is complex. This is not an issue in XML because XML's explicit markup syntax means it is always clear when you are trying to create an element, so you can treat the validation of the name separately from the recognition of an element start.

But in SAM's compact syntax, the recognition of the intent to create a block and the validation of the block name get mixed together. There are several possible choices:

  1. Only recognize valid XML names followed by a colon as an attempt to create a block. In this case, if someone does this:

     foo+bar: Foo and Bar

    This will be interpreted silently as:

     <p>foo+bar: Foo and Bar</p>
  2. (The current implementation) Recognize almost any string of non-whitespace characters ending in a colon at the beginning of a line as an attempt to create a block. Then if someone does this:

     foo+bar: Foo and Bar

    This will be recognized as an attempt to create a block with an invalid name, and an error will be reported. However, it also means that if someone does this:

     section: foo
           `foo::bar` baz.

    It will also recognize `foo: as an attempt to create a block and raise an error, forcing the writer to write:

     section: foo
           `foo\:\:bar` baz.
  3. Come up with an intermediate protocol of recognizing the intent to create a block, without restricting recognition only to valid XML names. Thus, for instance, come up with a protocol that recognizes:

     foo+bar: Foo and Bar

    as an attempt to create a block and raises an error on the name, but does not recognize

      `foo::bar` baz.

    as an attempt to create a block and lets it pass as regular text.
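To make the difference between options 1 and 2 concrete, here is a minimal Python sketch. It is not the SAM parser's actual code: the `classify` function, the pattern names, and the ASCII-only `XML_NAME` pattern are assumptions made for illustration (the real XML name rule admits many more Unicode characters).

```python
import re

# Assumption: a simplified, ASCII-only stand-in for a valid XML name.
XML_NAME = r"[A-Za-z_][A-Za-z0-9_.-]*"

# Option 1: only a valid name followed by a colon is seen as a block start.
OPTION_1 = re.compile(rf"^\s*(?P<name>{XML_NAME}):")

# Option 2: any run of non-whitespace characters up to the first colon
# is seen as a block start; the name is validated afterwards.
OPTION_2 = re.compile(r"^\s*(?P<name>\S+?):")


def classify(line: str, option: int) -> str:
    """Return 'block', 'error', or 'paragraph' for a single line."""
    if option == 1:
        return "block" if OPTION_1.match(line) else "paragraph"
    match = OPTION_2.match(line)
    if match is None:
        return "paragraph"
    # Recognition succeeded, so an invalid name is reported as an error.
    if re.fullmatch(XML_NAME, match.group("name")):
        return "block"
    return "error"


for sample in ("section: foo", "foo+bar: Foo and Bar", "  `foo::bar` baz."):
    print(repr(sample), classify(sample, 1), classify(sample, 2))
```

Under option 1 the two problem lines both fall through as paragraphs; under option 2 both are reported as errors, including the code phrase that was never meant to be a block.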

The problem with option 1 is that it will silently suppress errors in naming blocks.

The problem with option 2 is that while it catches more errors, it also requires escaping some sequences that you would not normally think would be recognized as an attempt to start a block.

Option 3 attempts to strike a balance, but then the question becomes where exactly the balance should be struck. What is the right recognition protocol to catch the most naming errors while avoiding unnecessary escaping of colons in ordinary paragraph text?

One possible protocol is that the first character of the sequence must be a Unicode letter or number. That would eliminate cases like

 `foo::bar` baz.

while still catching

  foo+bar: Foo and Bar

Another possible rule in the protocol would be that there must be a space (or the opening ( of an annotation) after the colon. This would help with recognition, but it would mean that if someone forgot the space, they would not get an error; their text would just be silently recognized as a paragraph.
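The two candidate rules can be sketched in Python as follows. This is only an illustration under stated assumptions: `classify`, `CANDIDATE`, and the `require_space` flag are hypothetical names, and `XML_NAME` is a simplified ASCII-only stand-in for the real XML name rule.

```python
import re

# Assumption: simplified, ASCII-only stand-in for a valid XML name.
XML_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")

# Rule 1: the candidate name must start with a Unicode letter or digit.
# [^\W_] is "a word character other than underscore".
CANDIDATE = re.compile(r"^\s*(?P<name>[^\W_]\S*?):")


def classify(line: str, require_space: bool = False) -> str:
    """Return 'block', 'error', or 'paragraph' for a single line."""
    match = CANDIDATE.match(line)
    if match is None:
        return "paragraph"
    rest = line[match.end():]
    # Rule 2 (optional): the colon must be followed by a space, an
    # opening parenthesis, or the end of the line.
    if require_space and not re.match(r"[ (]|$", rest):
        return "paragraph"
    return "block" if XML_NAME.fullmatch(match.group("name")) else "error"


print(classify("  `foo::bar` baz."))
print(classify("foo+bar: Foo and Bar"))
print(classify("foo+bar:Foo and Bar", require_space=True))
```

The last call shows the trade-off described above: with the space rule enabled, a forgotten space turns what would have been a naming error into a silent paragraph.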

All markup languages have edge cases like this, where what is intended as text is recognized as markup (an unescaped & character in an XML document, for instance) or vice versa. The question is where to locate the edge for the best balance of error recognition and ease of use. Any thoughts?

mbakeranalecta commented 7 years ago

Another thought: It makes a difference whether you have a dedicated editor, or even syntax coloring, for a markup language. With a SAM-aware editor, even if all it provided was syntax coloring, you would have an immediate indication of whether your text was being recognized as a block name or not, and that could shift the usability argument in favor of a stricter recognition protocol. (This may require, though, that syntax coloring is based on full SAM parsing, not just on a set of syntax coloring rules such as many editors provide.)

However, there is currently no SAM-aware editor, and SAM, overall, is very easy to write without any form of support from the editor. So should the possibility of future SAM-aware editors affect the language definition now? It is worth noting that XML was designed on the assumption that it would always be hidden behind an XML-aware editor interface, and that has not really worked, which is why SAM was designed in the first place.

dram commented 7 years ago

I don't really like option 1, as I think SAM should be neutral about the output markup (maybe XML, maybe something else), so it should not be coupled to XML.

How about defining SAM's own rule for valid block names? Something like:

all characters except special characters already used as markup in SAM. e.g. `, {, (.

BTW, the following example still has a problem: the \ is retained in the result, maybe because of the backquotes?

section: foo

  `foo\:\:bar` baz.

Output:

<?xml version="1.0" encoding="UTF-8"?>
<section>
<title>foo</title>
<p><phrase><annotation type="code">foo\:\:bar</annotation></phrase> baz.</p>
</section>

mbakeranalecta commented 7 years ago

The reason for making SAM compatible with XML is to take advantage of the existing XML tool chain, which is vast and powerful. I think SAM has a much better chance of being adopted for its intended purpose if it is easy to direct content written in SAM into an XML back end. And I don't really think this limits the usefulness of SAM as a standalone language. XML's name rules are generous enough for any practical purpose I can think of.

But the example you provide certainly tells us that the current implementation is not satisfactory, since I can't think of any way to get around the problem of escaping colons inside a code annotation. The rule for code annotations is that everything is literal (apart from backticks), which is obviously easier for writing code. But it means that we cannot allow backticks to be recognized as part of a potential block name, as there is no way to escape the colon within them.

So I think at minimum we have to exclude them from the block name pattern. And extending that to "all characters except special characters already used as markup in SAM" makes sense as well. That list would include | as well as `, {, (.

I'm going to implement that and open another issue on the block recognition protocol.
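As a rough idea of what such a rule could look like, here is a Python sketch. It is an assumption for illustration only, not the pattern from the actual commit; the `block_name` helper and `CANDIDATE` pattern are hypothetical names.

```python
import re
from typing import Optional

# Assumption: a candidate block name is a run of characters that are not
# whitespace, not a colon, and not one of the SAM markup characters
# ` { ( | mentioned above.
CANDIDATE = re.compile(r"^\s*(?P<name>[^\s:`{(|]+):")


def block_name(line: str) -> Optional[str]:
    """Return the candidate block name, or None if the line is plain text."""
    match = CANDIDATE.match(line)
    return match.group("name") if match else None


print(block_name("section: foo"))
print(block_name("foo+bar: Foo and Bar"))
print(block_name("  `foo::bar` baz."))
```

With this pattern the code phrase is no longer seen as a block start, while `foo+bar` is still picked up as a candidate name that a later validation step can reject.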

mbakeranalecta commented 7 years ago

Changes to name rule in 0c470683ced22c8c58ab141d4827ae7808f49707 address the specific case in this issue but not the overall issue of the precise rules for name recognition. That requires a new issue.

mbakeranalecta commented 7 years ago

Last commit introduced a bug. Fixed in 7e79e323f3a84ac202b6dc1432159f46426a4291.

mbakeranalecta commented 7 years ago

Closing this as it is superseded by #109.