Invalid tag name? - Githubissues

dram commented 8 years ago

Currently, the tag name is matched with the following regexp:

re_name = r'(?P<name>\S+?)'

This makes the following document fail to parse, because foo{bar}(http is treated as a tag name, which lxml does not accept.

article: Foo

    foo{bar}(http://www.example.com/).
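To illustrate the behaviour described above, here is a minimal sketch. It assumes that re_name is simply followed by a colon; the real parser embeds it in a larger block pattern, so block_pattern and candidate_tag_name are hypothetical names used only for this example.

```python
import re

# re_name is quoted from the issue. Everything else here is an assumption
# made for illustration, not the parser's actual block pattern.
re_name = r'(?P<name>\S+?)'
block_pattern = re.compile(re_name + r':')


def candidate_tag_name(line):
    """Return the text the sketch pattern would take as a tag name, or None."""
    match = block_pattern.match(line.strip())
    return match.group('name') if match else None


print(candidate_tag_name("article: Foo"))
print(candidate_tag_name("foo{bar}(http://www.example.com/)."))
```

The second call returns foo{bar}(http, because the lazy \S+? stops at the first colon, which here is the one inside the URL.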

Can we make re_name test more strict?

mbakeranalecta commented 8 years ago

It's a bit of a trade-off. The current regex catches some errors where the user types a tag name that they think is valid but isn't. So if they type:

foo(bar:

The parser recognizes that they are trying to type a tag and informs them that the tag name is invalid.

If the parser only recognizes valid tag names, then

 foo(bar:

will be treated as a paragraph and no error will be raised.

On the other hand, if someone wants to type

foo(bar:

as a paragraph, they have to escape the colon:

foo(bar\:

While this is inconvenient, they do at least get an error message so that they know there is a problem, as opposed to the error passing silently.

So it is a trade-off between more fluid and natural writing (not needing to use escapes for these situations), vs. avoiding errors passing silently.
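The trade-off can be sketched roughly as follows. This is only an illustration: the patterns, the XML-like VALID_NAME rule, and the classify function are assumptions made for the example, not the parser's actual code.

```python
import re

# Hypothetical patterns, for illustration only.
# Permissive: any run of non-space characters before an unescaped colon.
PERMISSIVE = re.compile(r'(?P<name>\S+?)(?<!\\):')
# Strict: only an XML-like name may precede the colon.
STRICT = re.compile(r'(?P<name>[A-Za-z_][\w.-]*):')
VALID_NAME = re.compile(r'[A-Za-z_][\w.-]*')


def classify(line, strict):
    """Classify a line as 'tag', 'paragraph', or an invalid-tag-name error."""
    pattern = STRICT if strict else PERMISSIVE
    match = pattern.match(line)
    if match is None:
        # Nothing that looks like a tag: the line passes silently as text.
        return "paragraph"
    if VALID_NAME.fullmatch(match.group('name')) is None:
        # It looked like a tag attempt, so the user is told about it.
        return "error: invalid tag name"
    return "tag"


print(classify("foo(bar:", strict=False))    # permissive: reports an error
print(classify("foo(bar:", strict=True))     # strict: silently a paragraph
print(classify("foo(bar\\:", strict=False))  # escaped colon: a paragraph
```

With the permissive pattern the mistyped tag is reported. With the strict pattern the same line becomes a paragraph and no error is raised.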

This trade-off is inherent in any lightweight markup language because the distinction between markup and text is less explicit than it is in XML. It is easy for a single incorrect character to make what was supposed to be markup turn into plain text without an error being raised.

Because SAM is intended for structured writing, I am inclined to come down on the side of not letting errors pass silently. But I am very willing to hear arguments for the opposite position.

dram commented 8 years ago

Quite reasonable.

In languages where words do not need to be separated by spaces (e.g. Chinese), this inconvenience becomes greater. See this for example.

The problem also exists in languages like English if you try to add a link to the first word of a paragraph.

For that reason, I'd like to use {foo}(link "http://...") instead.

BTW, I have created a simple SAM plugin for Pelican (a static site generator), which may interest you. :)

I will close this issue.

mbakeranalecta commented 8 years ago

Thanks for the link to the Pelican plugin. That was on the to-do list, so it's great to be able to check that off.

The point about languages like Chinese where there are no spaces between words is a good one. Essentially, SAM is using a space as a tag delimiter, which works in languages where spaces are a delimiter, but not so much in languages where they are not. I will open issue #95 to discuss this.

mbakeranalecta commented 8 years ago

The proposal in #95 to require a space after the colon for block tag recognition would seem to address this case. I'm reopening this issue to track that possible solution as it is potentially more limited in scope than #95, even if it is actually adequate for that case too.

mbakeranalecta commented 8 years ago

Added requirement for space after colon or attribute list in block markup in e6cb3d389fdae447b3c033a702c40cbf3dbc0341 in branch require-space-after-colon-in-block-tags. Added test based on this issue. That test passes with this change. This change does not break any existing tests but could conceivably break documents which don't have the space. (Not that we are guaranteeing backward compatibility at this point.)
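As a rough sketch of the idea (not the regex from the commit), requiring whitespace or end of line after the colon might look like this. OLD, NEW, and the two helper functions are hypothetical names, and the attribute-list case is left out.

```python
import re

# Hypothetical before/after patterns, not the ones in the commit.
OLD = re.compile(r'(?P<name>\S+?):')
# The colon must now be followed by whitespace or the end of the line.
NEW = re.compile(r'(?P<name>\S+?):(?:\s|$)')


def tag_name_old(line):
    """Tag name candidate under the original rule, or None."""
    match = OLD.match(line)
    return match.group('name') if match else None


def tag_name_new(line):
    """Tag name candidate when a space must follow the colon, or None."""
    match = NEW.match(line)
    return match.group('name') if match else None


line = "foo{bar}(http://www.example.com/)."
print(tag_name_old(line))            # the URL colon is taken as a tag colon
print(tag_name_new(line))            # None: the line is left as a paragraph
print(tag_name_new("article: Foo"))  # ordinary block tags still match
```

Under this sketch, foo(bar: at the end of a line still matches and can still be reported as an invalid tag name, so that error does not start passing silently.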

dram commented 8 years ago

Great! I have tried that branch, and it works fine.

mbakeranalecta commented 7 years ago

The branch that implements that change has now been merged to master.

mbakeranalecta / sam

Invalid tag name? #94