pragdave / earmark

Markdown parser for Elixir

bug parsing HTML (<pre><code>) #356

Open eksperimental opened 4 years ago

eksperimental commented 4 years ago

Two semantically equal expressions, but the second one fails.

iex(1)> string = """
...(1)> <pre><code>
...(1)> 1 & 2
...(1)> 1 > 2
...(1)> </code>
...(1)> </pre>
...(1)> """
iex(2)> Earmark.as_ast(string)
{:ok,
 [
   {"pre", [], ["<code>", "1 & 2", "1 > 2", "</code>"],
    %{meta: %{verbatim: true}}}
 ], []}

iex(4)> string = """
...(4)> <pre><code>
...(4)> 1 & 2
...(4)> 1 > 2
...(4)> </code></pre>
...(4)> """
iex(5)> Earmark.as_ast(string)
{:error,
 [
   {"pre", [], ["<code>", "1 & 2", "1 > 2", "</code></pre>"],
    %{meta: %{verbatim: true}}}
 ], [{:warning, 1, "Failed to find closing <pre>"}]}

RobertDober commented 4 years ago

Status Quo

Here is the kind of HTML Earmark supports. I will update the documentation, which is not good (and was even missing lately).

One-line HTML tags

   <tag...>{content}</tag>{suffix}

which will render

   {"tag", [], ["content"], %{verbatim: true}} # 1.4.6 format

One level of a block

<tag>
    {content}
</tag>

{"tag", [], [content], %{verbatim: true}}

where both <tag> and </tag> must be on their own line (the original definition by Dave). However, your first example works as a result of permissive parsing, so to avoid regressions I may rephrase the documentation accordingly.

So I will take two actions:

definitely add the above paragraph to the documentation --> 1.4.6
investigate the second example (the rule would be: the opening tag must be at the start of a line, the closing tag must be at the end of a line)

Ok with you?
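To make the difference between the two rules concrete, here is a minimal sketch in plain Elixir. It is not Earmark's parser: the module `HtmlBlockSketch` and all of its functions are hypothetical names invented for illustration, and it only checks whether a block whose first line starts with an opening tag has a line that closes it under a given rule.

```elixir
defmodule HtmlBlockSketch do
  # Hypothetical helper, NOT Earmark's implementation.

  # Returns the tag name if the line starts with an opening tag, else nil.
  def opening_tag(line) do
    case Regex.run(~r/^<([a-zA-Z][a-zA-Z0-9]*)[\s>]/, line) do
      [_, tag] -> tag
      nil -> nil
    end
  end

  # Strict rule: the closing tag is the only thing on its line.
  def closes_strict?(line, tag), do: String.trim(line) == "</#{tag}>"

  # Relaxed rule: the closing tag only has to end the line.
  def closes_relaxed?(line, tag) do
    line |> String.trim_trailing() |> String.ends_with?("</#{tag}>")
  end

  # True if the first line opens a tag and a later line closes it under `rule`.
  def closed?(markdown, rule) do
    case String.split(markdown, "\n", trim: true) do
      [first | rest] ->
        case opening_tag(first) do
          nil -> false
          tag -> Enum.any?(rest, fn line -> rule.(line, tag) end)
        end

      [] ->
        false
    end
  end
end

separate = "<pre><code>\n1 & 2\n1 > 2\n</code>\n</pre>\n"
joined = "<pre><code>\n1 & 2\n1 > 2\n</code></pre>\n"

HtmlBlockSketch.closed?(separate, &HtmlBlockSketch.closes_strict?/2)
HtmlBlockSketch.closed?(joined, &HtmlBlockSketch.closes_strict?/2)
HtmlBlockSketch.closed?(joined, &HtmlBlockSketch.closes_relaxed?/2)
```

Under the strict rule only the first input is considered closed, which mirrors the "Failed to find closing <pre>" warning reported for the second one. Under the relaxed rule both inputs are considered closed.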

eksperimental commented 4 years ago

I just found this while testing and thought it would be good to report it. There are no worries about regressions. Thanks for the info.

RobertDober commented 4 years ago

You have just named the game. I believe all the issues you brought up are very valid, and while investigating I have some hope of parsing HTML recursively with cleaner code, though I am not sure yet. This cannot go into 1.4.6, but I will try to treat HTML nicely (against my will :wink:) in 1.5, simply because of GFM.

eksperimental commented 4 years ago

Would it be possible to have an option to leave a copy of the original HTML element in the metadata whenever verbatim: true? I think it would be useful in case we want to delegate to a specialized library, such as Floki, to deal with the HTML parsing. I'm experimenting with that idea in ExDoc. Thank you.

RobertDober commented 4 years ago

Do you mean

   {"div", [{"class", "elixir"}], ["best code ever"], %{verbatim: true}}

--->

   {"div", [{"class", "elixir"}], ["best code ever"], %{verbatim: true, html: ~s[<div class="elixir">best code ever</div>]}}

Sure, that sounds like a sound idea to me.
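As a rough illustration of the proposed annotation, the sketch below adds an `:html` entry to the metadata of a verbatim node. Everything here is an assumption made for the example: the module `VerbatimHtmlSketch` and its functions are hypothetical, the node is assumed to carry `%{verbatim: true}` directly in its fourth element, and the HTML is rebuilt from the node rather than copied from the original input, so it is only an approximation of what a parser-level implementation could store.

```elixir
defmodule VerbatimHtmlSketch do
  # Hypothetical helper, NOT part of Earmark.

  # Verbatim nodes get an :html entry rebuilt from tag, attributes and content.
  def annotate({tag, attrs, content, %{verbatim: true} = meta}) do
    {tag, attrs, content, Map.put(meta, :html, to_html(tag, attrs, content))}
  end

  # Every other node is returned unchanged.
  def annotate(node), do: node

  defp to_html(tag, attrs, content) do
    # Attributes are rendered as ` name="value"`; no escaping is attempted here.
    attr_string = Enum.map_join(attrs, fn {name, value} -> ~s( #{name}="#{value}") end)
    "<#{tag}#{attr_string}>#{Enum.join(content, "\n")}</#{tag}>"
  end
end

node = {"div", [{"class", "elixir"}], ["best code ever"], %{verbatim: true}}
VerbatimHtmlSketch.annotate(node)
```

A consumer such as ExDoc could then hand the `:html` string to a dedicated HTML library instead of working with the verbatim lines.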

eksperimental commented 4 years ago

yes. exactly that!

RobertDober commented 4 years ago

This issue should be obsoleted by #358 (which is https://github.com/RobertDober/earmark_parser/issues/7), and the verbatim annotation part is implemented by https://github.com/RobertDober/earmark_parser/issues/8