Open mojavelinux opened 5 years ago
Now that we've proven that the parser is going to work (breadth), I think we need to focus on parsing downwards all the way and producing the model we want. In order to do that, I think we should spin off a branch that just handles strong, emphasis, link, and passthrough, and ignore all the rest for now. Once we have that working perfectly, then we can go back and restore the rest of the syntax. Otherwise, I think it's going to be too complicated to think about.
In particular, the we should think about how the parser handles the following source:
**strong _emphasis_ strong**
I think it should be handled by defining a recursive grammar. Currently, it's parsing the top-level elements, then invoking the parser again on each element.
It may turn out that we need to do it that way in some cases (or maybe all cases). But I think we should at least try to have the parser handle the recursion in the rules.
You can see an example of the recursive parsing grammar in Asciidoctor PDF:
I also think we should switch to compiling the grammar using tt
instead of loading it at runtime. This gets us closer to the parser we will put into core.
I think it should be handled by defining a recursive grammar. Currently, it's parsing the top-level elements, then invoking the parser again on each element.
I didn't know we could define a recursive grammar 😀
What are the benefits (and drawbacks) of this solution ?
I also think we should switch to compiling the grammar using tt instead of loading it at runtime. This gets us closer to the parser we will put into core.
I can extract the Rake task from https://github.com/Mogztter/asciidoctor-inline-parser/pull/1/files#diff-52c976fc38ed2b4e3b1192f8a8e24cff and submit a pull request ?
Actually, I'm rethinking the idea of a recursive grammar. Stay tuned.
As I thought more about it, I realized that the recursive grammar would quickly get too complicated to reason about, at least for me right now. The treetop parser is greedy by default, so we'd end up needing extremely complex negation rules to prevent the parser from matching too much.
But it goes deeper than that. In many cases, the AsciiDoc syntax isn't recursive. Consider the following source:
**one **two** three**
If we parsed recursively, we'd end up with:
<strong>one <strong>two</strong> three</strong>
However, that would not be AsciiDoc compliant. Instead, it should be:
<strong>one </strong>two<strong> three</strong>
Admittedly, the first interpretation seems more correct. But the goal here isn't to change the AsciiDoc syntax rules (even if we really want to).
Therefore, it's probably best to stick with the flat parser grammar, then recurse into those nodes as necessary. It feels less sophisticated, but ultimately will be easier to manage.
We might still use recursion in the grammar in some places, but it would be places where it is obvious. One example might be in the attribute list of a macro, which is more like an XML tag.
To answer your other question, yes, let's switch to compiling the grammar now.
Now that I've had a chance to evaluate where we are, I've concluded that this is going to be a long journey. It could take a year or more for this change to arrive in core because we have to address some fundamental assumptions in the parser. We just need to take it a step at a time.
But don't despair, we'll be able to start using it a lot sooner. The first place we'll use the inline grammar is in the validator. That will allow us not only to give it a real world test, but also give us plenty of time to refine the rules before we have to commit to using it in the processor.
Since we plan to build the validator in JavaScript to take advantage of eslint / textlint, I think it makes sense after all to start tinkering with transpiling to JavaScript. If that doesn't work out, then we might need another parser written directly in JavaScript. But let's make that plan B.
I'd also like to come at this from the other end and design the AST/DOM we want to make. That gives us the freedom to explore how it can be adapted to the existing converters even before the parsing is totally correct.
One thing I realized is that once we find an unconstrained formatted phrase (like unconstrained strong), we don't have to look for that when recursing. There may be other cases like this.
So I proved myself wrong. It's actually possible to support recursive parsing of the formatted text. It is different than how Asciidoctor currently works because it searches downwards for matches. But nonetheless, this demonstrates how it could be done if we agree that's how it should be parsed.
module Asciidoctor
grammar Inline
rule root
(
strong_unconstrained /
emphasis_unconstrained /
code_unconstrained /
open_unconstrained /
strong /
emphasis /
code /
open /
normal
)+
end
rule strong_unconstrained
'**' content:strong_unconstrained_content '**'
end
rule strong_unconstrained_content
( emphasis_unconstrained / code_unconstrained / open_unconstrained / strong / emphasis / code / open / '[^*]|\*(?!\*)'r )+
end
rule emphasis_unconstrained
'__' content:emphasis_unconstrained_content '__'
end
rule emphasis_unconstrained_content
( strong_unconstrained / code_unconstrained / open_unconstrained / strong / emphasis / code / open / '[^_]|_(?!_)'r )+
end
rule code_unconstrained
'``' content:code_unconstrained_content '``'
end
rule code_unconstrained_content
( strong_unconstrained / emphasis_unconstrained / open_unconstrained / strong / emphasis / code / open / '[^`]|`(?!`)'r )+
end
rule open_unconstrained
'##' content:open_unconstrained_content '##'
end
rule open_unconstrained_content
( strong_unconstrained / emphasis_unconstrained / code_unconstrained / strong / emphasis / code / open / '[^#]|#(?!#)'r )+
end
rule strong
!word_char '*' !space content:strong_content !space '*' !word_char
end
rule strong_content
( emphasis / code / open / '[^*]|\*(?=\p{Alnum})'r )+
end
rule emphasis
!word_char '_' !space content:emphasis_content !space '_' !word_char
end
rule emphasis_content
( strong / code / open / '[^_]|_(?=\p{Alnum})'r )+
end
rule code
!word_char '`' !space content:code_content !space '`' !word_char
end
rule code_content
( strong / emphasis / open / '[^`]|`(?=\p{Alnum})'r )+
end
rule open
!word_char '#' !space content:open_content !space '#' !word_char
end
rule open_content
( strong / emphasis / code / '[^#]|#(?=\p{Alnum})'r )+
end
rule space
' ' / "\t"
end
rule word_char
'\p{Alnum}'r
end
rule normal
.
end
end
end
NOTE: \p{Alnum}
is how I'm representing a word character. That will need to be refined to be more accurate.
The one rule currently is that you can't nested one type of formatting inside of itself (no constrained strong inside of constrained strong). That seems like a reasonable trade-off.
It's also missing unconstrained inside of constrained. I'll see if that can be added without breaking the whole thing.
Okay, I got all the permutations except for the same type of formatting within itself.
module Asciidoctor
grammar Inline
rule root
( formatted / normal )+
end
rule formatted
( formatted_unconstrained / formatted_constrained )
end
rule formatted_unconstrained
( strong_unconstrained / emphasis_unconstrained / code_unconstrained / open_unconstrained )
end
rule formatted_constrained
( strong / emphasis / code / open )
end
rule strong_unconstrained
'**' content:strong_unconstrained_content '**'
end
rule strong_unconstrained_content
( emphasis_unconstrained / code_unconstrained / open_unconstrained / formatted_constrained / '[^*]|\*(?!\*)'r )+
end
rule emphasis_unconstrained
'__' content:emphasis_unconstrained_content '__'
end
rule emphasis_unconstrained_content
( strong_unconstrained / code_unconstrained / open_unconstrained / formatted_constrained / '[^_]|_(?!_)'r )+
end
rule code_unconstrained
'``' content:code_unconstrained_content '``'
end
rule code_unconstrained_content
( strong_unconstrained / emphasis_unconstrained / open_unconstrained / formatted_constrained / '[^`]|`(?!`)'r )+
end
rule open_unconstrained
'##' content:open_unconstrained_content '##'
end
rule open_unconstrained_content
( strong_unconstrained / emphasis_unconstrained / code_unconstrained / formatted_constrained / '[^#]|#(?!#)'r )+
end
rule strong
!word_char '*' !space content:strong_content !space '*' !word_char
end
rule strong_content
( formatted_unconstrained / emphasis / code / open / '[^*]|\*(?=\p{Alnum})'r )+
end
rule emphasis
!word_char '_' !space content:emphasis_content !space '_' !word_char
end
rule emphasis_content
( formatted_unconstrained / strong / code / open / '[^_]|_(?=\p{Alnum})'r )+
end
rule code
!word_char '`' !space content:code_content !space '`' !word_char
end
rule code_content
( formatted_unconstrained / strong / emphasis / open / '[^`]|`(?=\p{Alnum})'r )+
end
rule open
!word_char '#' !space content:open_content !space '#' !word_char
end
rule open_content
( formatted_unconstrained / strong / emphasis / code / '[^#]|#(?=\p{Alnum})'r )+
end
rule space
' ' / "\t"
end
rule word_char
'\p{Alnum}'r
end
rule normal
.
end
end
end
@abelsromero We're going to continue the AsciiDoc grammar work here. The focus is primarily on the inline parsing since that's the hardest part and the part that needs to be most urgently addressed in core. There's hardly a need to make a grammar for the block parsing since that's so well defined already.
I'd also like to come at this from the other end and design the AST/DOM we want to make.
I think it's a good idea because it will define what we aim for. We should create a dedicated issue for that. I think you have the most experience on this subject and hopefully you have a clear vision of the "perfect" AST for Asciidoctor. I could give it a try but I don't think it will be very efficient :neutral_face:
Sounds like a plan. I'll start a dedicated issue where we can start designing.
We won't aim for perfect. That's a recipe for failure. I think we just need something that's workable and hopefully will allow us to reuse existing converters.
This issue hosts the planning discussion for how take the current code from a prototype state to the point its ready to use in core. Once we've established a plan of action and created the issues to track the individual tasks, this issue can be closed.
First, let's remind ourselves why we're doing this. The existing regex-based parser uses multiple passes over the same source text while performing conversion at the same time. That means it operates on a hybrid of source and output text. It's also blind to context, so it ends up processing the same text multiple times, which makes escaping nearly impossible to get right.
There are two cases in particular we really want to solve:
At a high level, here are some of the goals we want to meet:
It's very likely we're going to have to take a hybrid approach to parsing. The idea is to use the treetop parser/grammar as a tool. It does not mean the parser has to grab everything in one shot. If we have to call it recursively, we can do that.