Closed bkuhl closed 5 years ago
I think this is not a bug.
In your example, the text without a tag gets parsed by the DefaultConverter because it doesn't have a tag. DefaultConverter only gets the text and dumps it as markdown. Adding a <br />
or directly a \n
breaks the resulting MD because your text is wrapped around a <body>
tag (assumed, because the text is outside any node) and adding the br
there casts another closing body tag, breaking everything.
If you parse the following text:
<br />
<p>This is sample text that for whatever reason the author has not decided to wrap in any kind of HTML tag!</p>
<h3>My Header</h3>
(Notice the p tag surrounding the initial text). You get the following MD:
This is sample text that for whatever reason the author has not decided to wrap in any kind of HTML tag!
### My Header
So the behavior is right when we have the proper tag. So I think this is a matter if we should care about badly formatted HTML or not. I think is your responsibility to feed the parser the correct HTML, instead of the parser having to make an educated guess of what do you want.
My 2 cents :)
That makes sense... I wasn't sure if this would be considered a bug but thought I'd bring it up anyways.
On Sun, Oct 29, 2017, 2:13 PM Andres Rey notifications@github.com wrote:
I think this is not a bug.
In your example, the text without a tag gets parsed by the DefaultConverter because it doesn't have a tag. DefaultConverter only gets the text and dumps it as markdown. Adding a
tag (assumed, because the text is outside any node) and adding the br there casts another closing body tag, breaking everything.
or directly a \n breaks the resulting MD because your text is wrapped around aIf you parse the following text:
This is sample text that for whatever reason the author has not decided to wrap in any kind of HTML tag!
My Header
(Notice the p tag surrounding the initial text). You get the following MD:
This is sample text that for whatever reason the author has not decided to wrap in any kind of HTML tag!
My Header
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/thephpleague/html-to-markdown/issues/143#issuecomment-340282240, or mute the thread https://github.com/notifications/unsubscribe-auth/AAgChU-9JC86qXID5Wdw7W7oal3oiKCLks5sxMBdgaJpZM4QKCzb .
-- Ben Kuhl
While this may not be a bug in the strictest sense, it's certainly not the output one would expect.
I believe this is happening because certain block-level elements like paragraphs and headers will automatically add a newline (\n
) afterwards - plain text will not.
To resolve this, we'd need to detect situations where a block-level element comes immediately after plain text and insert the missing new line. This would be slightly kludgy though - a cleaner approach would be implementing an intermediate AST (via #72) and using that to determine where new lines should go.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This library is incredibly useful, thanks to the maintainers for the countless hours they've invested in making this available.
Using this library I ran into an issue where if the HTML header had text before it that wasn't wrapped in a tag, it would include the Markdown header on the same line. For example:
was turned into
I worked around this by adding the below method and test to my parsing steps:
the test:
I don't know how to proceed with submitting a PR or contributing to this project for this issue. It seems everything is very object-based rather than using regex to parse things. So I'll leave it up to the maintainers to determine the appropriate course of action.