htacg / tidy-html5

The granddaddy of HTML tools, with support for modern standards
http://www.html-tidy.org

html symbols problem #87

Closed angryfishcake closed 9 years ago

angryfishcake commented 11 years ago

Tidy does not handle HTML symbols like £ or the non-breaking space (U+00A0).

They come out as xA3 and xA0...

I tried changing the input/output/character encoding and enabling ASCII chars.

acdha commented 11 years ago

This works if you specify the input/output encoding (e.g. tidy -utf8) and provide correctly encoded input. If you're having problems, you should post a sample file and the exact command-line options used.
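As a rough illustration of why the encoding matters here (not a diagnosis of the reporter's setup), the Python sketch below shows how the two characters from the report are stored as bytes. The helper name `encoded_bytes` is made up for this example. In Latin-1 each character is the single byte A3 or A0, which matches the values quoted above; in UTF-8 each takes two bytes, and a lone A3 byte is not valid UTF-8 at all.

```python
# Characters mentioned in the report: U+00A3 (pound sign), U+00A0 (no-break space).
SYMBOLS = {"pound sign": "\u00a3", "no-break space": "\u00a0"}


def encoded_bytes(text):
    """Return the hex byte sequence of `text` in UTF-8 and in Latin-1."""
    return {
        "utf-8": text.encode("utf-8").hex(" "),
        "latin-1": text.encode("latin-1").hex(" "),
    }


def is_valid_utf8(data):
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


for name, ch in SYMBOLS.items():
    print(name, encoded_bytes(ch))

# A Latin-1 encoded pound sign is a bare 0xA3 byte, which UTF-8 rejects.
print(is_valid_utf8(b"\xa3"))      # False
print(is_valid_utf8(b"\xc2\xa3"))  # True
```

If the input file is really saved as Latin-1 (or Windows-1252) while Tidy is told to expect UTF-8, or the other way round, a mismatch of this kind is one plausible source of the symptoms.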

angryfishcake commented 11 years ago

```
hide-comments: true
tidy-mark: false
indent: true
indent-spaces: 4
new-blocklevel-tags: articleheaderfootersectionnav
new-inline-tags: videoaudiocanvasrubyrtrp
doctype: <!DOCTYPE HTML>
sort-attributes: alpha
vertical-space: false
output-xhtml: true
wrap: 0
wrap-attributes: false
break-before-br: false
numeric-entities: yes
```

Those are my settings. The file's encoding is set to UTF-8 without BOM, and the default input/output encoding is set to utf8. Am I missing something?

craigbarnes commented 11 years ago

Tidy normally uses UTF-8 as the default encoding but you could try @acdha's suggestion above or adding char-encoding: utf8 to your config file. If that doesn't work, it'd be easier to figure out the problem if you told us what platform you're using and posted a small sample of input and the output you get for it, maybe as a gist.

angryfishcake commented 11 years ago

Hmm, that didn't seem to make a difference. It's just a normal .html file. I'm using Notepad++ with the Tidy2 plugin, which uses tidy-html5.

ghost commented 11 years ago

I have come across this issue as well, but only on files which are big-endian UTF-8 without a BOM.

This is occurring on a Windows install of Notepad++.

acdha commented 11 years ago

@jonapgar UTF-8 does not have big- or little-endian modes, and the use of a BOM is not recommended with UTF-8. If your text is UTF-8 without a BOM and you use either -utf8 or char-encoding: utf8, it works as expected. Perhaps the problem is that Notepad++, which appears to be the common factor, is either not setting the encoding or is injecting an unnecessary BOM?
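One way to check what an editor actually wrote to disk is to look at the raw bytes. The Python sketch below is only a suggestion: the function name `inspect_encoding` and the sample data are hypothetical, and it reports just two things, whether the data starts with the UTF-8 BOM (EF BB BF) and whether it decodes as UTF-8.

```python
import codecs


def inspect_encoding(data):
    """Report whether `data` starts with a UTF-8 BOM and decodes as UTF-8."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    try:
        data.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    return {"has_bom": has_bom, "valid_utf8": valid_utf8}


# Hypothetical samples: the same markup saved three different ways.
markup = "<p>\u00a3\u00a0</p>"
samples = {
    "utf-8, no BOM": markup.encode("utf-8"),
    "utf-8 with BOM": codecs.BOM_UTF8 + markup.encode("utf-8"),
    "latin-1": markup.encode("latin-1"),
}

for label, data in samples.items():
    print(label, inspect_encoding(data))
```

For a real file, the bytes can be read with `open(path, "rb").read()` and passed to the same function. A result of `valid_utf8: False` would suggest the file was saved in a different encoding than the one Tidy was told to expect.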

balthisar commented 9 years ago

I will close this due to age. I don't see evidence of an issue, but please feel free to open this again, @angryfishcake, if the problem persists.