mixmark-io / turndown

🛏 An HTML to Markdown converter written in JavaScript
https://mixmark-io.github.io/turndown
MIT License

Converting HTML to markdown doesn't appear to preserve HTML entities #431

Open sebpowell opened 1 year ago

sebpowell commented 1 year ago

Consider the following HTML example:

<p>I think &amp;</p>

When I try converting this to Markdown using Turndown, I get the following output:

I think &

I guess I would expect Turndown to preserve HTML entities and to output something like this instead:

I think &amp;

I couldn't see an option to turn this on, so I assume I need to use something like https://www.npmjs.com/package/html-entities. I just wanted to check that I'm not missing anything obvious.

Here's the config I'm using:

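// Assumes the default export of the "turndown" package is imported as Turndown.
import Turndown from "turndown";

const INITIAL_TURNDOWN_OPTIONS: Turndown.Options = {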
  headingStyle: "atx",
  hr: "---",
  bulletListMarker: "-",
  codeBlockStyle: "fenced",
  fence: "```",
  emDelimiter: "_",
  strongDelimiter: "**",
  linkStyle: "inlined",
};

Any help much appreciated!

bjones1 commented 1 year ago

Using the CommonMark dingus, entering either "I think &" or "I think &amp;" renders to <p>I think &amp;</p>. So the HTML entity in the HTML source doesn't need to be preserved in the resulting Markdown for it to render properly. Are you asking for a way to preserve HTML entities even when they aren't needed to render correctly?

Aloso commented 1 year ago

@bjones1 It does need to be preserved in this case:

&lt;br&gt;

which is converted to

<br>

and in this case:

&amp;amp; is an ampersand

and in this case:

A big &nbsp; space

and in this case:

&nbsp; &nbsp; Not a code block
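
One possible workaround for the cases above is to re-encode the affected characters in the decoded text before it ends up in the Markdown. The following is only a minimal sketch: `reencodeEntities` and `ENTITY_MAP` are hypothetical names, not part of Turndown, and the sketch covers only the four characters that appear in these examples.

```typescript
// Hypothetical helper (not part of Turndown): re-encode characters whose
// meaning changes once the HTML entities in the source have been decoded.
const ENTITY_MAP: Record<string, string> = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  "\u00A0": "&nbsp;", // non-breaking space
};

function reencodeEntities(text: string): string {
  // A single pass over the input, so the "&" of an entity that was just
  // produced is never encoded a second time.
  return text.replace(/[&<>\u00A0]/g, (ch) => ENTITY_MAP[ch] ?? ch);
}

// Inputs are the decoded text contents of the HTML examples above.
console.log(reencodeEntities("<br>"));
console.log(reencodeEntities("&amp; is an ampersand"));
console.log(reencodeEntities("A big \u00A0 space"));
console.log(reencodeEntities("\u00A0 \u00A0 Not a code block"));
```

This would have to run on the text of each element rather than on the finished Markdown, since otherwise the `>` of a blockquote would be encoded too. The Turndown README describes an overridable `escape` method that receives each element's text, which may be a suitable place to call it, ideally alongside the default escaping rather than instead of it.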