ggrossetie / asciidoctor-inline-parser

use posix character class to represent word boundary character #8

Closed (mojavelinux closed this issue 6 years ago)

mojavelinux commented 6 years ago

A constrained formatting mark may not be followed by any word character or an underscore.

mojavelinux commented 6 years ago

Technically this still isn't right because the following should not be parsed:

*foo*_

mojavelinux commented 6 years ago

The rules for a constrained formatting mark are as follows:

Underscore is a word character. It's not allowed for strong, but obviously it's allowed for emphasis. Therefore, we may need two different matchers.

ggrossetie commented 6 years ago

There cannot be a word character (\p{Word}) immediately outside the formatting marks

Ok, I thought it was a limitation of the current implementation that should be resolved with the new parser.

Underscore is a word character. It's not allowed for strong, but obviously it's allowed for emphasis. Therefore, we may need two different matchers.

Indeed...

mojavelinux commented 6 years ago

I thought it was a limitation of the current implementation that should be resolved with the new parser.

I'd say this is a very logical definition of what constrained is. No space inside, no word character outside.
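As a rough illustration of "no space inside, no word character outside", the rule can be sketched as a Ruby regular expression. `CONSTRAINED_STRONG` and `strong?` are hypothetical names, and the pattern is a simplification for discussion, not the grammar used by this parser or by Asciidoctor.

```ruby
# Hypothetical sketch of a constrained strong matcher (*text*).
# - (?<!\p{Word}) : no word character immediately before the opening mark
# - (\S|\S.*?\S)  : the enclosed text neither starts nor ends with a space
# - (?!\p{Word})  : no word character immediately after the closing mark
CONSTRAINED_STRONG = /(?<!\p{Word})\*(\S|\S.*?\S)\*(?!\p{Word})/

def strong?(text)
  CONSTRAINED_STRONG.match?(text)
end

strong?("*foo*")      # matches: nothing adjacent to the marks
strong?("*foo*_")     # no match: "_" is a word character
strong?("*formula*1") # no match: "1" is a word character
strong?("*fin*.")     # matches: "." is punctuation, not a word character
strong?("* foo*")     # no match: space just inside the opening mark
```

Because `\p{Word}` includes the underscore, the same lookarounds reject `*foo*_` without a special case.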

ggrossetie commented 6 years ago

I'd say this is a very logical definition of what constrained is. No space inside, no word character outside.

The definition is indeed very logical, but I find it odd that a number is not allowed. In fact, we are using the Ruby Regexp engine's definition of a word character:

/\p{Word}/ - A member of one of the following Unicode general category Letter, Mark, Number, Connector_Punctuation

https://ruby-doc.org/core-2.1.1/Regexp.html#class-Regexp-label-Character+Properties
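To make the shared definition concrete, here is a small Ruby check of which single characters `\p{Word}` accepts. The helper name `word_char?` and the sample characters are chosen only for illustration.

```ruby
# Returns true when the single character belongs to \p{Word}.
def word_char?(char)
  char.match?(/\A\p{Word}\z/)
end

samples = ["a", "\u00E9", "1", "_", ".", "*", "-"]
results = samples.map { |c| [c, word_char?(c)] }.to_h
# "a" and "\u00E9" (e with acute accent) are letters, "1" is a number and
# "_" is connector punctuation, so all four count as word characters.
# ".", "*" and "-" are other kinds of punctuation, so they do not.
```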

I don't know, maybe it's the most reasonable definition but I find this rule a bit restrictive on this one use case. Anyway I don't think we should change the behavior, we just need to make sure that we (the writers) share the same definition of a word character :nerd_face:

mojavelinux commented 6 years ago

The definition is indeed very logical but I find it odd that a number is not allowed.

But a number is a word character. Therefore, if the number is immediately adjacent to a formatting mark, that's not a word boundary. In other words:

*formula*1

It's completely logical that the formatting is not applied in this case because the * is in the middle of the "word". If we violated that rule, it would very likely break tons of AsciiDoc documents.

I find this rule a bit restrictive on this one use case.

I think it would be much harder to explain that a number is a word boundary. Right now, you look at the sequence of characters, see that the * is sandwiched inside the "word", and conclude that you would need unconstrained marks. That's much easier to understand IMO.
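For reference, unconstrained marks in AsciiDoc are the doubled form, so the example becomes `**formula**1`. A minimal Ruby sketch of that case follows. `UNCONSTRAINED_STRONG` and `render_unconstrained_strong` are hypothetical names, the HTML output is only for illustration, and Asciidoctor's real patterns are more involved.

```ruby
# Hypothetical sketch: unconstrained strong uses doubled marks and
# does not look at the characters on either side.
UNCONSTRAINED_STRONG = /\*\*(.+?)\*\*/

def render_unconstrained_strong(text)
  text.gsub(UNCONSTRAINED_STRONG) { "<strong>#{$1}</strong>" }
end

render_unconstrained_strong("**formula**1") # => "<strong>formula</strong>1"
render_unconstrained_strong("*formula*1")   # => "*formula*1" (single marks are left alone)
```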

It makes sense that punctuation is a word boundary, like in this case:

*fin*.

ggrossetie commented 6 years ago

I was reading 2 as a single character (i.e. not part of the "word" mc2); it makes sense with *formula*1.

It's completely logical that the formatting is not applied in this case because the * is in the middle of the "word". If we violated that rule, it would very likely break tons of AsciiDoc documents.

Absolutely.

I think it would be much harder to explain that a number is a word boundary. Right now, you look at the sequence of characters, see that the * is sandwiched inside the "word", and conclude that you would need unconstrained marks. That's much easier to understand IMO.

I think it's easier to understand with boundary.

Quoted must be bounded by white space or commonly adjoining punctuation characters.

https://www.methods.co.nz/asciidoc/chunked/ch10.html