ggrossetie / asciidoctor-inline-parser

15 stars 4 forks source link

Plan to get from prototype to core #9

Open mojavelinux opened 5 years ago

mojavelinux commented 5 years ago

This issue hosts the planning discussion for how take the current code from a prototype state to the point its ready to use in core. Once we've established a plan of action and created the issues to track the individual tasks, this issue can be closed.

First, let's remind ourselves why we're doing this. The existing regex-based parser uses multiple passes over the same source text while performing conversion at the same time. That means it operates on a hybrid of source and output text. It's also blind to context, so it ends up processing the same text multiple times, which makes escaping nearly impossible to get right.

There are two cases in particular we really want to solve:

  1. Be able to parse an inline passthrough without having to extract the text. This is just a matter of terminating the recursion when a passthrough is found.
  2. Be able to parse a link macro without risk of substitutions being applied to the target URL. This is similar to the passthrough in that the recursion just needs to be terminated.

At a high level, here are some of the goals we want to meet:

It's very likely we're going to have to take a hybrid approach to parsing. The idea is to use the treetop parser/grammar as a tool. It does not mean the parser has to grab everything in one shot. If we have to call it recursively, we can do that.

mojavelinux commented 5 years ago

Now that we've proven that the parser is going to work (breadth), I think we need to focus on parsing downwards all the way and producing the model we want. In order to do that, I think we should spin off a branch that just handles strong, emphasis, link, and passthrough, and ignore all the rest for now. Once we have that working perfectly, then we can go back and restore the rest of the syntax. Otherwise, I think it's going to be too complicated to think about.

In particular, the we should think about how the parser handles the following source:

**strong _emphasis_ strong**

I think it should be handled by defining a recursive grammar. Currently, it's parsing the top-level elements, then invoking the parser again on each element.

It may turn out that we need to do it that way in some cases (or maybe all cases). But I think we should at least try to have the parser handle the recursion in the rules.

mojavelinux commented 5 years ago

You can see an example of the recursive parsing grammar in Asciidoctor PDF:

https://github.com/asciidoctor/asciidoctor-pdf/blob/master/lib/asciidoctor-pdf/formatted_text/parser.treetop#L10-L28

mojavelinux commented 5 years ago

I also think we should switch to compiling the grammar using tt instead of loading it at runtime. This gets us closer to the parser we will put into core.

ggrossetie commented 5 years ago

I think it should be handled by defining a recursive grammar. Currently, it's parsing the top-level elements, then invoking the parser again on each element.

I didn't know we could define a recursive grammar 😀
What are the benefits (and drawbacks) of this solution ?

I also think we should switch to compiling the grammar using tt instead of loading it at runtime. This gets us closer to the parser we will put into core.

I can extract the Rake task from https://github.com/Mogztter/asciidoctor-inline-parser/pull/1/files#diff-52c976fc38ed2b4e3b1192f8a8e24cff and submit a pull request ?

mojavelinux commented 5 years ago

Actually, I'm rethinking the idea of a recursive grammar. Stay tuned.

mojavelinux commented 5 years ago

As I thought more about it, I realized that the recursive grammar would quickly get too complicated to reason about, at least for me right now. The treetop parser is greedy by default, so we'd end up needing extremely complex negation rules to prevent the parser from matching too much.

But it goes deeper than that. In many cases, the AsciiDoc syntax isn't recursive. Consider the following source:

**one **two** three**

If we parsed recursively, we'd end up with:

<strong>one <strong>two</strong> three</strong>

However, that would not be AsciiDoc compliant. Instead, it should be:

<strong>one </strong>two<strong> three</strong>

Admittedly, the first interpretation seems more correct. But the goal here isn't to change the AsciiDoc syntax rules (even if we really want to).

Therefore, it's probably best to stick with the flat parser grammar, then recurse into those nodes as necessary. It feels less sophisticated, but ultimately will be easier to manage.

We might still use recursion in the grammar in some places, but it would be places where it is obvious. One example might be in the attribute list of a macro, which is more like an XML tag.

mojavelinux commented 5 years ago

To answer your other question, yes, let's switch to compiling the grammar now.

mojavelinux commented 5 years ago

Now that I've had a chance to evaluate where we are, I've concluded that this is going to be a long journey. It could take a year or more for this change to arrive in core because we have to address some fundamental assumptions in the parser. We just need to take it a step at a time.

But don't despair, we'll be able to start using it a lot sooner. The first place we'll use the inline grammar is in the validator. That will allow us not only to give it a real world test, but also give us plenty of time to refine the rules before we have to commit to using it in the processor.

Since we plan to build the validator in JavaScript to take advantage of eslint / textlint, I think it makes sense after all to start tinkering with transpiling to JavaScript. If that doesn't work out, then we might need another parser written directly in JavaScript. But let's make that plan B.

mojavelinux commented 5 years ago

I'd also like to come at this from the other end and design the AST/DOM we want to make. That gives us the freedom to explore how it can be adapted to the existing converters even before the parsing is totally correct.

mojavelinux commented 5 years ago

One thing I realized is that once we find an unconstrained formatted phrase (like unconstrained strong), we don't have to look for that when recursing. There may be other cases like this.

mojavelinux commented 5 years ago

So I proved myself wrong. It's actually possible to support recursive parsing of the formatted text. It is different than how Asciidoctor currently works because it searches downwards for matches. But nonetheless, this demonstrates how it could be done if we agree that's how it should be parsed.

module Asciidoctor
  grammar Inline
    rule root
      (
        strong_unconstrained /
        emphasis_unconstrained /
        code_unconstrained /
        open_unconstrained /
        strong /
        emphasis /
        code /
        open /
        normal
      )+
    end

    rule strong_unconstrained
      '**' content:strong_unconstrained_content '**'
    end

    rule strong_unconstrained_content
      ( emphasis_unconstrained / code_unconstrained / open_unconstrained / strong / emphasis / code / open / '[^*]|\*(?!\*)'r )+
    end

    rule emphasis_unconstrained
      '__' content:emphasis_unconstrained_content '__'
    end

    rule emphasis_unconstrained_content
      ( strong_unconstrained / code_unconstrained / open_unconstrained / strong / emphasis / code / open / '[^_]|_(?!_)'r )+
    end

    rule code_unconstrained
      '``' content:code_unconstrained_content '``'
    end

    rule code_unconstrained_content
      ( strong_unconstrained / emphasis_unconstrained / open_unconstrained / strong / emphasis / code / open / '[^`]|`(?!`)'r )+
    end

    rule open_unconstrained
      '##' content:open_unconstrained_content '##'
    end

    rule open_unconstrained_content
      ( strong_unconstrained / emphasis_unconstrained / code_unconstrained / strong / emphasis / code / open / '[^#]|#(?!#)'r )+
    end

    rule strong
      !word_char '*' !space content:strong_content !space '*' !word_char
    end

    rule strong_content
      ( emphasis / code / open / '[^*]|\*(?=\p{Alnum})'r )+
    end

    rule emphasis
      !word_char '_' !space content:emphasis_content !space '_' !word_char
    end

    rule emphasis_content
      ( strong / code / open / '[^_]|_(?=\p{Alnum})'r )+
    end

    rule code
      !word_char '`' !space content:code_content !space '`' !word_char
    end

    rule code_content
      ( strong / emphasis / open / '[^`]|`(?=\p{Alnum})'r )+
    end

    rule open
      !word_char '#' !space content:open_content !space '#' !word_char
    end

    rule open_content
      ( strong / emphasis / code / '[^#]|#(?=\p{Alnum})'r )+
    end

    rule space
      ' ' / "\t"
    end

    rule word_char
      '\p{Alnum}'r
    end

    rule normal
      .
    end
  end
end

NOTE: \p{Alnum} is how I'm representing a word character. That will need to be refined to be more accurate.

The one rule currently is that you can't nested one type of formatting inside of itself (no constrained strong inside of constrained strong). That seems like a reasonable trade-off.

It's also missing unconstrained inside of constrained. I'll see if that can be added without breaking the whole thing.

mojavelinux commented 5 years ago

Okay, I got all the permutations except for the same type of formatting within itself.

module Asciidoctor
  grammar Inline
    rule root
      ( formatted / normal )+
    end

    rule formatted
      ( formatted_unconstrained / formatted_constrained )
    end

    rule formatted_unconstrained
      ( strong_unconstrained / emphasis_unconstrained / code_unconstrained / open_unconstrained )
    end

    rule formatted_constrained
      ( strong / emphasis / code / open )
    end

    rule strong_unconstrained
      '**' content:strong_unconstrained_content '**'
    end

    rule strong_unconstrained_content
      ( emphasis_unconstrained / code_unconstrained / open_unconstrained / formatted_constrained / '[^*]|\*(?!\*)'r )+
    end

    rule emphasis_unconstrained
      '__' content:emphasis_unconstrained_content '__'
    end

    rule emphasis_unconstrained_content
      ( strong_unconstrained / code_unconstrained / open_unconstrained / formatted_constrained / '[^_]|_(?!_)'r )+
    end

    rule code_unconstrained
      '``' content:code_unconstrained_content '``'
    end

    rule code_unconstrained_content
      ( strong_unconstrained / emphasis_unconstrained / open_unconstrained / formatted_constrained / '[^`]|`(?!`)'r )+
    end

    rule open_unconstrained
      '##' content:open_unconstrained_content '##'
    end

    rule open_unconstrained_content
      ( strong_unconstrained / emphasis_unconstrained / code_unconstrained / formatted_constrained / '[^#]|#(?!#)'r )+
    end

    rule strong
      !word_char '*' !space content:strong_content !space '*' !word_char
    end

    rule strong_content
      ( formatted_unconstrained / emphasis / code / open / '[^*]|\*(?=\p{Alnum})'r )+
    end

    rule emphasis
      !word_char '_' !space content:emphasis_content !space '_' !word_char
    end

    rule emphasis_content
      ( formatted_unconstrained / strong / code / open / '[^_]|_(?=\p{Alnum})'r )+
    end

    rule code
      !word_char '`' !space content:code_content !space '`' !word_char
    end

    rule code_content
      ( formatted_unconstrained / strong / emphasis / open / '[^`]|`(?=\p{Alnum})'r )+
    end

    rule open
      !word_char '#' !space content:open_content !space '#' !word_char
    end

    rule open_content
      ( formatted_unconstrained / strong / emphasis / code / '[^#]|#(?=\p{Alnum})'r )+
    end

    rule space
      ' ' / "\t"
    end

    rule word_char
      '\p{Alnum}'r
    end

    rule normal
      .
    end
  end
end
mojavelinux commented 5 years ago

@abelsromero We're going to continue the AsciiDoc grammar work here. The focus is primarily on the inline parsing since that's the hardest part and the part that needs to be most urgently addressed in core. There's hardly a need to make a grammar for the block parsing since that's so well defined already.

ggrossetie commented 5 years ago

I'd also like to come at this from the other end and design the AST/DOM we want to make.

I think it's a good idea because it will define what we aim for. We should create a dedicated issue for that. I think you have the most experience on this subject and hopefully you have a clear vision of the "perfect" AST for Asciidoctor. I could give it a try but I don't think it will be very efficient :neutral_face:

mojavelinux commented 5 years ago

Sounds like a plan. I'll start a dedicated issue where we can start designing.

We won't aim for perfect. That's a recipe for failure. I think we just need something that's workable and hopefully will allow us to reuse existing converters.