thephpleague / html-to-markdown

Convert HTML to Markdown with PHP
MIT License
1.77k stars 205 forks source link

Headers are not always on a new line #143

Closed bkuhl closed 5 years ago

bkuhl commented 6 years ago

This library is incredibly useful, thanks to the maintainers for the countless hours they've invested in making this available.

Using this library I ran into an issue where if the HTML header had text before it that wasn't wrapped in a tag, it would include the Markdown header on the same line. For example:

<br />\n
This is sample text that for whatever reason the author has not decided to wrap in any kind of HTML tag!\n
<h3>My Header</h3>\n

was turned into

This is sample text that for whatever reason the author has not decided to wrap in any kind of HTML tag!  ###My Header

I worked around this by adding the below method and test to my parsing steps:

    public function headersNeverAtEndOfLine(string $markdown) : string
    {
        // any non-whitespace charcater
        $regex = "([\S])";

        // don't match pound signs that are part of a markdown link, are already next to a newline
        // or are part of a larger markdown header, such as "##### My Header"
        $regex .= "[^\n(#]";

        // match h1-h5 equivalents
        $regex .= "(#{1,5})";

        return trim(preg_replace("/".$regex."/", "$1".PHP_EOL."$2", $markdown));
    }

the test:

    /** @test */
    public function verifyHeadersNeverAtEndOfLine()
    {
        $original = <<<EOF
This is a quick # Header test to ##See how
### Headers are #### Never at the end of an #####Existing line
EOF;
        $expected = <<<EOF
This is a quick
# Header test to
##See how
### Headers are
#### Never at the end of an
#####Existing line
EOF;

        $this->assertEquals($expected, $this->parsesMarkdown->headersNeverAtEndOfLine($original));
    }

I don't know how to proceed with submitting a PR or contributing to this project for this issue. It seems everything is very object-based rather than using regex to parse things. So I'll leave it up to the maintainers to determine the appropriate course of action.

andreskrey commented 6 years ago

I think this is not a bug.

In your example, the text without a tag gets parsed by the DefaultConverter because it doesn't have a tag. DefaultConverter only gets the text and dumps it as markdown. Adding a <br /> or directly a \n breaks the resulting MD because your text is wrapped around a <body> tag (assumed, because the text is outside any node) and adding the br there casts another closing body tag, breaking everything.

If you parse the following text:

<br />
<p>This is sample text that for whatever reason the author has not decided to wrap in any kind of HTML tag!</p>
<h3>My Header</h3>

(Notice the p tag surrounding the initial text). You get the following MD:

This is sample text that for whatever reason the author has not decided to wrap in any kind of HTML tag!

### My Header

So the behavior is right when we have the proper tag. So I think this is a matter if we should care about badly formatted HTML or not. I think is your responsibility to feed the parser the correct HTML, instead of the parser having to make an educated guess of what do you want.

My 2 cents :)

bkuhl commented 6 years ago

That makes sense... I wasn't sure if this would be considered a bug but thought I'd bring it up anyways.

On Sun, Oct 29, 2017, 2:13 PM Andres Rey notifications@github.com wrote:

I think this is not a bug.

In your example, the text without a tag gets parsed by the DefaultConverter because it doesn't have a tag. DefaultConverter only gets the text and dumps it as markdown. Adding a
or directly a \n breaks the resulting MD because your text is wrapped around a tag (assumed, because the text is outside any node) and adding the br there casts another closing body tag, breaking everything.

If you parse the following text:


This is sample text that for whatever reason the author has not decided to wrap in any kind of HTML tag!

My Header

(Notice the p tag surrounding the initial text). You get the following MD:

This is sample text that for whatever reason the author has not decided to wrap in any kind of HTML tag!

My Header

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/thephpleague/html-to-markdown/issues/143#issuecomment-340282240, or mute the thread https://github.com/notifications/unsubscribe-auth/AAgChU-9JC86qXID5Wdw7W7oal3oiKCLks5sxMBdgaJpZM4QKCzb .

-- Ben Kuhl

colinodell commented 6 years ago

While this may not be a bug in the strictest sense, it's certainly not the output one would expect.

I believe this is happening because certain block-level elements like paragraphs and headers will automatically add a newline (\n) afterwards - plain text will not.

To resolve this, we'd need to detect situations where a block-level element comes immediately after plain text and insert the missing new line. This would be slightly kludgy though - a cleaner approach would be implementing an intermediate AST (via #72) and using that to determine where new lines should go.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.