mixmark-io / turndown

🛏 An HTML to Markdown converter written in JavaScript
https://mixmark-io.github.io/turndown
MIT License

referenced shortcut links can be broken #393

Open NfNitLoop opened 3 years ago

NfNitLoop commented 3 years ago

If the input HTML contains several links with the same link text but different URLs, the "shortcut" reference style produces duplicate link reference labels:

import tds from "https://cdn.skypack.dev/turndown@7.1.1"
import {DOMParser} from "https://github.com/b-fuze/deno-dom/raw/188d7240e5371caf1b4add8bb7183933d142337e/deno-dom-wasm.ts"

const parser = new DOMParser()
const service = new tds({
    linkStyle: "referenced",
    linkReferenceStyle: "shortcut",
})

function example(html: string) {
    const doc = parser.parseFromString(html, "text/html")
    if (!doc) { throw new Error("failed to parse doc") }
    const result = service.turndown(doc)

    console.log("input:")
    console.log(html)
    console.log("output:")
    console.log(result)
}

example(`<a href="https://www.google.com">Link</a> <a href="https://www.wikipedia.org">Link</a>`)

outputs:

[Link] [Link]

[Link]: https://www.google.com
[Link]: https://www.wikipedia.org

To work around this, turndown should probably keep a map from each shortcut label to its URL, and fall back to "full" style link references whenever a label would otherwise be reused for a different URL.
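For illustration, here is a minimal sketch of that idea, written as a standalone function rather than as a turndown rule. The names `Link`, `Rendered`, and `renderReferencedLinks` are hypothetical and not part of turndown's API, and the label normalization (trim plus lowercase) is a simplification of how CommonMark matches labels. As far as I understand the CommonMark spec, when several definitions share a label the first one wins, so in the output above both links would resolve to google.com.

```typescript
// Hypothetical types and helper for illustration only; not turndown's API.
interface Link { text: string; href: string }
interface Rendered { inline: string[]; references: string[] }

function renderReferencedLinks(links: Link[]): Rendered {
  // Normalized label -> URL it is already bound to.
  const labelToHref = new Map<string, string>();
  const inline: string[] = [];
  const references: string[] = [];
  let nextId = 1;

  for (const { text, href } of links) {
    // Simplified normalization (assumption): labels match case-insensitively.
    const key = text.trim().toLowerCase();
    const known = labelToHref.get(key);

    if (known === undefined) {
      // First use of this label: shortcut style is safe.
      labelToHref.set(key, href);
      inline.push(`[${text}]`);
      references.push(`[${text}]: ${href}`);
    } else if (known === href) {
      // Same label, same URL: reuse the existing definition.
      inline.push(`[${text}]`);
    } else {
      // Same label, different URL: fall back to "full" style
      // with a numeric label that is not already taken.
      let id = String(nextId++);
      while (labelToHref.has(id)) id = String(nextId++);
      labelToHref.set(id, href);
      inline.push(`[${text}][${id}]`);
      references.push(`[${id}]: ${href}`);
    }
  }
  return { inline, references };
}

const out = renderReferencedLinks([
  { text: "Link", href: "https://www.google.com" },
  { text: "Link", href: "https://www.wikipedia.org" },
]);
console.log(out.inline.join(" "));
console.log();
console.log(out.references.join("\n"));
// [Link] [Link][1]
//
// [Link]: https://www.google.com
// [1]: https://www.wikipedia.org
```

Only the conflicting link pays the cost of the longer "full" syntax; links with unique text, or repeated links to the same URL, keep the shortcut form.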