thephpleague / commonmark

Highly-extensible PHP Markdown parser which fully supports the CommonMark and GFM specs.
https://commonmark.thephpleague.com
BSD 3-Clause "New" or "Revised" License
2.74k stars 194 forks source link

Heading is not parsed correctly on line following HTML tag #400

Closed taylorotwell closed 4 years ago

taylorotwell commented 4 years ago

Version(s) affected: ^1.x

Description

A simple heading tag does not parse correctly if it is on a line following an HTML tag. I see nothing in the CommonMark spec that states this should not work. In fact, the spec explicitly states that headings may be sandwiched in between other content with no blank lines in between.

How to reproduce

$environment = Environment::createCommonMarkEnvironment();

$converter = new CommonMarkConverter([
    'allow_unsafe_links' => false,
], $environment);

dump($converter->convertToHtml("<table></table>\n## Heading"));

Output:

<table></table>
## Heading

Additional context

Noticed by reports of broken behavior in Laravel Markdown mailable objects.

GrahamCampbell commented 4 years ago

NB Expected output is:

<table></table>
<h2>Heading</h2>
colinodell commented 4 years ago

This behavior does seem to be correct per the CommonMark spec. <table></table> is a type 6 HTML block per the spec which only ends when there's a completely blank line after it:

image

Example 131 in the spec demonstrates this:

image

It seems that all CommonMark-compliant parsers do follow this same behavior: https://babelmark.github.io/?text=%3Ctable%3E%3C%2Ftable%3E%0A%23%23+Heading

Inserting a blank line between the HTML block and the heading will produce the expected output that Graham shared: https://babelmark.github.io/?text=%3Ctable%3E%3C%2Ftable%3E%0A%0A%23%23+Heading

GrahamCampbell commented 4 years ago

Is there an easy way to turn off this behaviour, so we can get something closer to how Parsedown handled blocks?

colinodell commented 4 years ago

None that I can think of, unfortunately :-/

If you wanted to try and modify that behavior, the key to doing so would be these two methods in HtmlBlock:

image

isCode() determines whether or not that block should be "greedy" and attempt to include any subsequent lines instead of parsing them as new blocks. matchesNextLine() ultimately determines whether or not the next line is a continuation of the current HtmlBlock or whether something else should try to parse it.

To help illustrate some of the difficulty, imagine the logic you'd need to properly handle these cases:

<pre>
# This new line looks like Markdown, but it's instead a pre tag
</pre>
<pre>
# This is similar, but there's no closing tag
So where do you draw the line between where the HTML block opens and where it ends?
Did the HTML end with the pre tag?

Does it end here because there's a blank line?

What if several lines later there was a new tag?
<hr>
Does the HTML continue down to here?

<pre>What about here?</pre>
<pre>
# What if the HTML is malformed?
<pre>
Would you consider this line to be in the original HTML block, a new HTML block, or plain Markdown?
<p>
Should *this* line be parsed as HTML? It's not on the same line as the paragraph tag.
</p>
<p>
If you said that example above would not be an HTML block, then what about this one?</p>
<p>

Or *this* one?</p>

To quote the author of the CommonMark spec:

Given that we're not doing a full HTML parse, there are always going to be cases where the spec doesn't "do the right thing" as a human would judge it.

That's why the spec ultimately settled on requiring a blank line to separate them - of all the possible ways to differentiate between HTML and Markdown content, this is arguably the simplest to implement and understand.

I'm definitely open to any ideas you might have, I'm just struggling to identify an easy solution that would:

  1. Deviate from the spec in this one specific case
  2. Be something that Laravel could easily opt-in to without affecting other users of this library
  3. Not break other aspects of the spec
  4. Not require significant overhead to parse and understand the HTML

Is there any chance that something could be done on the Laravel side to add newlines after (or inside of) a component if it knows Markdown is being used? I apologize for not being as familiar with Laravel as I'd like to be, I'm just wondering whether producing "correct" CommonMark that this library would handle "correctly" might be a viable way to handle this particular case.

taylorotwell commented 4 years ago

I think I can just work around it on our end for now. Thanks.

GrahamCampbell commented 4 years ago

Thanks for the detailed replies.