mixmark-io / turndown

🛏 An HTML to Markdown converter written in JavaScript
https://mixmark-io.github.io/turndown
MIT License

HTML newlines leak into Markdown #394

Closed NfNitLoop closed 3 years ago

NfNitLoop commented 3 years ago

Newlines in HTML do not represent line breaks; they're just whitespace, so they should be collapsed before the HTML is rendered. But Turndown is letting them leak through to the Markdown, which breaks rendering:

import tds from "https://cdn.skypack.dev/turndown@7.1.1"
import {DOMParser} from "https://github.com/b-fuze/deno-dom/raw/188d7240e5371caf1b4add8bb7183933d142337e/deno-dom-wasm.ts"

const parser = new DOMParser()
const service = new tds()

function example(html: string) {
    const doc = parser.parseFromString(html, "text/html")
    // Throw an Error (not a bare string) so the failure carries a stack trace.
    if (!doc) { throw new Error("failed to parse doc") }
    const result = service.turndown(doc)

    console.log("output:")
    console.log(JSON.stringify(result))
    console.log()
}

// Works as expected:
example(`<p>Foo<br>bar`)

// The bug: This creates separate paragraphs in Markdown:
example(`<p>Foo
<br>bar`)

// It seems to be because Turndown is just passing through newlines,
// though newlines have different semantics in Markdown.
example(`<p>Foo
bar`)

example(`<p>Foo

bar`)

> deno run .\turndown.ts
output:
"Foo  \nbar"

output:
"Foo\n  \nbar"

output:
"Foo\nbar"

output:
"Foo\n\nbar"

NfNitLoop commented 3 years ago

I'm working around this by just removing newlines in my input HTML:

    // Work around: https://github.com/mixmark-io/turndown/issues/394
    html = html.trim().replaceAll(/\s+/g, " ")

NfNitLoop commented 3 years ago

Though, this workaround would break the contents of any <pre> tags, where whitespace is significant. (Thankfully I know my input doesn't have any.)
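
For input that does contain <pre>, a minimal sketch of a variant that leaves those blocks alone might look like the following. The helper name collapseOutsidePre is hypothetical and not part of Turndown. It is regex-based, so it assumes well-formed, non-nested <pre> elements and should not be treated as a general HTML whitespace normalizer.

```typescript
// Hypothetical helper (not part of Turndown): collapse runs of whitespace
// everywhere except inside <pre>...</pre> blocks.
function collapseOutsidePre(html: string): string {
  // Because the pattern has one capturing group, split() keeps the matched
  // <pre> blocks in the result: odd indices are <pre> blocks, even indices
  // are the text between them.
  const parts = html.split(/(<pre\b[^>]*>[\s\S]*?<\/pre>)/i);
  return parts
    .map((part, i) => (i % 2 === 1 ? part : part.replace(/\s+/g, " ")))
    .join("")
    .trim();
}

// The newline before <br> becomes a single space:
console.log(JSON.stringify(collapseOutsidePre("<p>Foo\n<br>bar</p>")));
// prints "<p>Foo <br>bar</p>"
```

This only changes what happens around <pre>; everything else is still collapsed to single spaces, exactly as in the one-line workaround.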

NfNitLoop commented 3 years ago

Hmm, ok, my workaround introduces other issues. For example,

// This works with newlines separating tags:
example(`<blockquote>
<p>Test</p>
</blockquote>`)

// But spaces result in empty paragraphs:
example(`<blockquote> <p>Test</p> </blockquote>`)

output:
"> Test"

output:
">  \n> \n> Test\n> \n>"

I'm starting to wonder if this is an issue w/ Turndown or an issue w/ the bleeding-edge DOMParser I'm using. 😛

martincizek commented 3 years ago

> I'm starting to wonder if this is an issue w/ Turndown or an issue w/ the bleeding-edge DOMParser I'm using. 😛

The latter is likely the case. My humble guess is that it does not support DOM modifications, and that's why collapseWhitespace() does not do its job. When a DOM node is passed, Turndown calls cloneNode() on it and then expects the clone to be modifiable (see root-node.js).

You can always check the behavior here: https://mixmark-io.github.io/turndown/ (uses your browser's DOM parser).
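
One rough way to test that guess against a particular DOM implementation is to clone a node and see whether a write to a text node sticks. The sketch below is a hypothetical diagnostic, not part of Turndown. It assumes text nodes expose a writable `data` property and only probes that one kind of modification, so a `true` result does not prove every mutation Turndown needs is supported.

```typescript
// Structural types, so the probe works with any DOM implementation
// (browser, deno-dom, ...) without importing its type definitions.
interface TextLike { data: string }
interface NodeLike {
  cloneNode(deep?: boolean): NodeLike;
  firstChild: unknown;
}

// Hypothetical diagnostic: returns true if the first child of a deep clone
// is a text-like node that accepts a write to `data`.
function cloneIsMutable(element: NodeLike): boolean {
  const clone = element.cloneNode(true);
  const text = clone.firstChild as TextLike | null;
  if (!text || typeof text.data !== "string") return false;
  const probe = text.data + "!";
  try {
    text.data = probe; // may throw on a read-only implementation
  } catch {
    return false;
  }
  // Some implementations ignore the write silently, so read it back.
  return text.data === probe;
}
```

With the setup from the first comment, this could be called on an element whose first child is a text node, for example the <p> obtained from parser.parseFromString("<p>Foo\nbar</p>", "text/html").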

NfNitLoop commented 3 years ago

Ah, darn. OK, I guess we can close this issue then. Sorry for the false alarm. But thanks for the link, that's handy!