python / cpython

The Python programming language
https://www.python.org

tokenize: add support for tokenizing 'str' objects #54178

Open · meadori opened 14 years ago

meadori commented 14 years ago
BPO 9969
Nosy @ncoghlan, @vstinner, @voidspace, @meadori, @takluyver, @vadmium
Files
  • issue9969.patch: Patch against tip (3.3.0a0)
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    ```python
    assignee = None
    closed_at = None
    created_at =
    labels = ['type-feature', 'library']
    title = "tokenize: add support for tokenizing 'str' objects"
    updated_at =
    user = 'https://github.com/meadori'
    ```

    bugs.python.org fields:

    ```python
    activity =
    actor = 'takluyver'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation =
    creator = 'meador.inge'
    dependencies = []
    files = ['23099']
    hgrepos = []
    issue_num = 9969
    keywords = ['patch']
    message_count = 11.0
    messages = ['117516', '117523', '117554', '117571', '117652', '121712', '121843', '143506', '252299', '252303', '316983']
    nosy_count = 7.0
    nosy_names = ['ncoghlan', 'vstinner', 'michael.foord', 'meador.inge', 'ark3', 'takluyver', 'martin.panter']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue9969'
    versions = ['Python 3.4']
    ```

    meadori commented 14 years ago

    Currently, with 'py3k', only 'bytes' objects are accepted for tokenization:

    >>> import io
    >>> import tokenize
    >>> tokenize.tokenize(io.StringIO("1+1").readline)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/minge/Code/python/py3k/Lib/tokenize.py", line 360, in tokenize
        encoding, consumed = detect_encoding(readline)
      File "/Users/minge/Code/python/py3k/Lib/tokenize.py", line 316, in detect_encoding
        if first.startswith(BOM_UTF8):
    TypeError: Can't convert 'bytes' object to str implicitly
    >>> tokenize.tokenize(io.BytesIO(b"1+1").readline)
    <generator object _tokenize at 0x1007566e0>

    In a discussion on python-dev (http://www.mail-archive.com/python-dev@python.org/msg52107.html) it was generally considered to be a good idea to add support for tokenizing 'str' objects as well.
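
    For context, the workaround this request aims to replace is to encode the text yourself and tunnel it through the bytes-based API. A minimal sketch of that pattern (my example, not from the original report, assuming UTF-8 as the intermediate encoding):

    import io
    import tokenize

    source = "1+1"
    # Re-encode the already-decoded text so the bytes-only API accepts it;
    # detect_encoding() will see plain UTF-8 and decode it straight back.
    for tok in tokenize.tokenize(io.BytesIO(source.encode("utf-8")).readline):
        print(tok)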

    voidspace commented 14 years ago

    Note from Nick Coghlan from the Python-dev discussion:

    A very quick scan of _tokenize suggests it is designed to support detect_encoding returning None to indicate that the line iterator will return already decoded lines. This is confirmed by the fact that the standard library uses it that way (via generate_tokens).

    An API that accepts a string, wraps a StringIO around it, then calls _tokenize with an encoding of None would appear to be the answer here. A feature request on the tracker is the best way to make that happen.

    ncoghlan commented 14 years ago

    Possible approach (untested):

    import io
    from tokenize import tokenize, _tokenize  # note: _tokenize is a private helper

    def get_tokens(source):
        if hasattr(source, "encode"):
            # Already decoded, so bypass encoding detection
            return _tokenize(io.StringIO(source).readline, None)
        # Otherwise attempt to detect the correct encoding
        return tokenize(io.BytesIO(source).readline)

    vstinner commented 14 years ago

    See also bpo-4626, which introduced the PyCF_IGNORE_COOKIE and PyPARSE_IGNORE_COOKIE flags to support unicode strings in the builtin compile() function.
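
    A quick illustration of that behaviour (my example, not from the issue): compile() accepts already-decoded str source, and a coding cookie in the text is ignored rather than applied a second time:

    # The latin-1 cookie is ignored for str input because the source is
    # already decoded; this is what the IGNORE_COOKIE flags enable.
    code = compile("# -*- coding: latin-1 -*-\nresult = 1 + 1\n", "<demo>", "exec")
    ns = {}
    exec(code, ns)
    assert ns["result"] == 2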

    ncoghlan commented 14 years ago

    As per Antoine's comment on bpo-9873, requiring a real string via isinstance(source, str) to trigger the string IO version is likely to be cleaner than attempting to duck-type this. Strings are an area where we make so many assumptions about the way their internals work that duck-typing generally isn't all that effective.
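
    Adapting the earlier sketch to that suggestion (untested, and _tokenize remains a private helper whose signature has changed in later Python versions):

    import io
    from tokenize import tokenize, _tokenize

    def get_tokens(source):
        if isinstance(source, str):
            # A real str: bypass encoding detection entirely
            return _tokenize(io.StringIO(source).readline, None)
        # Otherwise assume a bytes-like source and detect the encoding
        return tokenize(io.BytesIO(source).readline)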

    ark3 commented 13 years ago

    If the goal is tokenize(...) accepting a text I/O readline, we already have the (undocumented) generate_tokens(readline).
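
    For example (usage sketch, not part of the original message):

    import io
    import tokenize

    # generate_tokens() takes a readline that returns str, so decoded
    # source tokenizes directly, with no encoding detection step.
    for tok in tokenize.generate_tokens(io.StringIO("1+1").readline):
        print(tok)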

    ncoghlan commented 13 years ago

    The idea is to bring the API up a level, and also to take care of wrapping the file-like object around the source string/byte sequence.

    meadori commented 13 years ago

    Attached is a first cut at a patch.

    vadmium commented 9 years ago

    I left some comments. Also, it would be nice to use the new function in the documentation example, which currently suggests tunnelling through UTF-8 without adding an encoding comment. And see the patch for bpo-12486, which highlights a couple of other places that would benefit from this function.

    vadmium commented 9 years ago

    Actually, maybe bpo-12486 is good enough to fix this too. With the patch proposed there, tokenize_basestring("source") would just be equivalent to

    tokenize(StringIO("source").readline)

    takluyver commented 6 years ago

    I've opened a PR for issue bpo-12486, which would make the existing but undocumented 'generate_tokens' function public:

    https://github.com/python/cpython/pull/6957

    I agree that it would be good to design a nicer API for this, but the perfect shouldn't be the enemy of the good.