gristlabs / asttokens

Annotate Python AST trees with source text and token information
Apache License 2.0

Does not work properly on non-ASCII source code #7

fyrestone closed 6 years ago

fyrestone commented 7 years ago

image

dsagal commented 7 years ago

It looks like the standard ast module (Python 2.7 here) handles an encoding declaration (like # -*- coding: UTF-8 -*-) when the source is passed as bytes, but rejects the declaration when the source is passed as unicode. Here's an example:

>>> import ast
>>> ast.parse("# -*- coding: UTF-8 -*-\nprint 'foo'")
<_ast.Module object at 0x101231c50>
>>> ast.parse(u"# -*- coding: UTF-8 -*-\nprint 'foo'")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ast.py", line 37, in parse
    return compile(source, filename, mode, PyCF_ONLY_AST)
  File "<unknown>", line 0
SyntaxError: encoding declaration in Unicode string

So either pass in your source code as bytes, or if you pass in unicode, don't include encoding declarations in the source.
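
A minimal sketch of the first option: pass bytes so the parser applies the encoding declaration itself. It assumes Python 3 (the session above is Python 2.7), where ast.parse also accepts bytes. The sample source and the variable names are made up for illustration.

```python
import ast

# Hypothetical sample: Latin-1 encoded source with a matching declaration.
source_text = "# -*- coding: latin-1 -*-\nname = 'caf\u00e9'\n"
source_bytes = source_text.encode("latin-1")

# Passing bytes lets the parser decode the source using the declaration.
tree = ast.parse(source_bytes)

# Execute the parsed module to check that the non-ASCII literal survived.
namespace = {}
exec(compile(tree, "<example>", "exec"), namespace)
print(namespace["name"] == "caf\u00e9")
```

If the bytes and the declaration disagree (for example, UTF-8 bytes under a latin-1 declaration), the parse may still succeed but the string literals will come out wrong, so the two need to match.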

dsagal commented 7 years ago

Try calling asttokens.ASTTokens(s.encode('utf8'), parse=True). If that's not enough, please include a reproducible example, and I'll try to help.

fyrestone commented 7 years ago

Thank you very much. I found that it was caused by chardet's automatic encoding detection: some files were converted to UTF-8 using the wrong codec.
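
For Python source files, one alternative to statistical detection is to read the encoding from the file's own BOM or coding declaration. This is a rough sketch, assuming Python 3 and the standard library's tokenize.detect_encoding. The helper name decode_source and the sample bytes are hypothetical, and this is not necessarily how the files here were fixed.

```python
import io
import tokenize


def decode_source(raw):
    """Decode Python source bytes using their BOM or coding declaration.

    Hypothetical helper. When nothing is declared, detect_encoding
    falls back to UTF-8.
    """
    encoding, _ = tokenize.detect_encoding(io.BytesIO(raw).readline)
    return raw.decode(encoding)


# Made-up sample: GBK-encoded source that declares its own encoding.
raw = "# -*- coding: gbk -*-\ntitle = '\u4e2d\u6587'\n".encode("gbk")
text = decode_source(raw)
print("\u4e2d\u6587" in text)
```

A file that has no declaration and is not UTF-8 would still need some other way to determine its encoding.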

dsagal commented 6 years ago

Glad it's resolved.