html2markdown converts HTML entities to Unicode

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. Run pandoc on <a href="">foo&nbsp;bar</a>

What is the expected output? What do you see instead?

Expected that the HTML &nbsp; entity would be preserved; instead, it is
translated as a Unicode non-breaking space.  Uggh.

What version of the product are you using? On what operating system?

pandoc 0.3 Debian

Please provide any additional information below.

Original issue reported on code.google.com by bart.mas...@gmail.com on 25 Jan 2007 at 7:31

GoogleCodeExporter commented 8 years ago

Pandoc is working as it is supposed to here:

When reading Markdown or HTML, it converts all entities to Unicode characters.
When writing HTML, it converts these characters to entities as needed:
<>"& are escaped; for everything else, UTF-8 is used.  (As of r540, nonbreaking
spaces are also escaped as entities in HTML output.)  When writing Markdown,
Pandoc uses UTF-8 for everything, using backslash-escapes when necessary.
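
The reader/writer behaviour described above can be sketched roughly as follows. This is only an illustration in Python, not Pandoc's actual code (Pandoc is written in Haskell); the function names `read_entities` and `write_html` are hypothetical, and the sketch assumes the post-r540 behaviour in which nonbreaking spaces are written as entities in HTML output.

```python
import html

NBSP = "\u00a0"  # Unicode non-breaking space

def read_entities(text):
    """Reader side: turn every HTML entity into its Unicode character."""
    return html.unescape(text)

def write_html(text):
    """Writer side (hypothetical): escape only < > " & and the
    non-breaking space; everything else is left as UTF-8 text."""
    replacements = (
        ("&", "&amp;"),   # must come first so later entities are not re-escaped
        ("<", "&lt;"),
        (">", "&gt;"),
        ('"', "&quot;"),
        (NBSP, "&nbsp;"),
    )
    for char, entity in replacements:
        text = text.replace(char, entity)
    return text

internal = read_entities("foo&nbsp;bar &amp; caf&eacute;")
print(repr(internal))        # the entities are now plain Unicode characters
print(write_html(internal))  # only the special characters are entities again
```

Note that `&eacute;` does not survive the round trip as an entity: it is read as "é" and written back as UTF-8, which is the behaviour the comment describes.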

In this respect, Pandoc behaves differently from Markdown.pl, which just
leaves entities alone.  One reason for this difference is that Pandoc must
handle LaTeX output, and entities are meaningless in LaTeX.

Original comment by fiddloso...@gmail.com on 17 Feb 2007 at 4:04

Changed state: WontFix

GoogleCodeExporter commented 8 years ago

I can see why having Unicode nonbreaking spaces in the Markdown output is
problematic.  As of r541, the Markdown writer uses "&nbsp;" for nonbreaking
spaces.
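
For illustration, the effect of this change on the original report might look something like the sketch below. It is written in Python rather than Pandoc's Haskell, and `html_text_to_markdown` is a hypothetical helper that models only entity handling (no tag parsing, no backslash-escaping).

```python
import html

NBSP = "\u00a0"  # Unicode non-breaking space

def html_text_to_markdown(fragment):
    """Hypothetical model of the r541 behaviour for plain text runs."""
    # Reader: all entities become Unicode characters.
    text = html.unescape(fragment)
    # Markdown writer: everything stays UTF-8, except that the
    # non-breaking space is written back out as an entity.
    return text.replace(NBSP, "&nbsp;")

print(html_text_to_markdown("foo&nbsp;bar"))
```

Under these assumptions the `&nbsp;` from the report is preserved in the Markdown output, while other entities such as `&eacute;` would still come out as UTF-8 characters.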

Original comment by fiddloso...@gmail.com on 17 Feb 2007 at 5:00

Changed state: Fixed
