svenkreiss / html5validator

Command line tool to validate HTML5 files. Great for continuous integration.
MIT License

UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 164: ordinal not in range(128) #6

Closed by SPFZ 8 years ago

SPFZ commented 9 years ago

Starting with version 0.1.12 I get the following error; version 0.1.11 works fine. It is related to this commit: https://github.com/svenkreiss/html5validator/commit/647ad9131c95a5b85de27039301a7e43ca346288

html5validator --root build/html/
Found files to validate: 180
Traceback (most recent call last):
  File "/usr/local/bin/html5validator", line 9, in <module>
    load_entry_point('html5validator==0.1.12', 'console_scripts', 'html5validator')()
  File "/usr/local/lib/python2.7/dist-packages/html5validator/cli.py", line 28, in main
    sys.exit(v.validate(files, errors_only=args.error_only))
  File "/usr/local/lib/python2.7/dist-packages/html5validator/validator.py", line 79, in validate
    print(b'\n'.join(o).decode('utf-8'))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 164: ordinal not in range(128)

I set the locale to try to avoid the issue, but it didn't help.

python --version
Python 2.7.6

# set locale
export LC_ALL=de_DE.UTF-8
export LANG=de_DE.UTF-8
export LANGUAGE=de_DE.UTF-8
locale
LANG=de_DE.UTF-8
LANGUAGE=de_DE.UTF-8
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=de_DE.UTF-8
svenkreiss commented 9 years ago

Thanks for reporting.

I am having trouble reproducing this. Can you share the HTML files you are running this on? Or can you isolate the failing file and share just that one? You can also send the file to me privately at me@svenkreiss.com.

Or does this happen for all of your files?

If you have a second, can you confirm that the following file validates without errors for you:

<!doctype html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <title>Test</title>
</head>
<body>
    <p>This is a boring test.</p>
</body>
</html>
Mikaela commented 8 years ago

I got this with Travis too. https://travis-ci.org/Mikaela/mikaela.github.io/builds/84475677#L239

The command is piped to true because it reports too many errors, most of which come from the validator being too old to recognize Upgrade Insecure Requests.

svenkreiss commented 8 years ago

@SPampel I wasn't able to reproduce this, but the new 0.1.14 handles unicode differently and might fix the problem you had.
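
For anyone who lands here with the same traceback, here is a minimal sketch of the failure mode it suggests: text containing a curly quote (u'\u201c') is written to an output stream whose encoding is ASCII. This is an illustration in Python 3, not the change made in 0.1.14. The helper name safe_print and the in-memory ascii_stdout stream are hypothetical stand-ins for the real stdout on the affected machine.

```python
import io
import sys


def safe_print(text, stream=None):
    """Write text to a stream, replacing characters its encoding cannot represent.

    Hypothetical helper for illustration; not part of html5validator.
    """
    stream = stream if stream is not None else sys.stdout
    encoding = getattr(stream, "encoding", None) or "ascii"
    # Round-trip through the stream's encoding so that unencodable
    # characters become '?' instead of raising UnicodeEncodeError.
    stream.write(text.encode(encoding, errors="replace").decode(encoding) + "\n")


# Simulate a stdout whose encoding is ASCII (assumed to mirror the CI environment).
ascii_stdout = io.TextIOWrapper(io.BytesIO(), encoding="ascii", newline="\n")

# A validator-style message containing curly quotes (U+201C and U+201D).
message = "error: \u201cfoo\u201d is not allowed"

# A plain write fails in the same way as the traceback in this issue.
try:
    ascii_stdout.write(message)
    ascii_stdout.flush()
    failed = False
except UnicodeEncodeError:
    failed = True

# The helper degrades gracefully instead of crashing.
safe_print(message, ascii_stdout)
ascii_stdout.flush()
captured = ascii_stdout.buffer.getvalue()
print(failed, captured)
```

Replacing unencodable characters loses information but keeps the run alive, which may be acceptable for CI logs. If the environment supports it, setting the PYTHONIOENCODING environment variable to utf-8 before running the tool is another option worth trying.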