python / cpython

The Python programming language
https://www.python.org
Other
63.1k stars 30.22k forks source link

json dump fails for mixed-type keys when sort_keys is specified #69643

Open f76bdadd-b2f2-433c-a46a-828024cc9a08 opened 9 years ago

f76bdadd-b2f2-433c-a46a-828024cc9a08 commented 9 years ago
BPO 25457
Nosy @zpincus, @bitdancer, @MojoVampire, @jheiv, @aaronchall, @naught101, @akulakov
PRs
  • python/cpython#8011
  • python/cpython#15691
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['3.7', '3.8', 'type-bug', 'library'] title = 'json dump fails for mixed-type keys when sort_keys is specified' updated_at = user = 'https://bugs.python.org/tanzerswingcoat' ``` bugs.python.org fields: ```python activity = actor = 'tanzer@swing.co.at' assignee = 'none' closed = False closed_date = None closer = None components = ['Library (Lib)'] creation = creator = 'tanzer@swing.co.at' dependencies = [] files = [] hgrepos = [] issue_num = 25457 keywords = ['patch'] message_count = 17.0 messages = ['253324', '253360', '253362', '253363', '253368', '253369', '297246', '317187', '317216', '317240', '317241', '317243', '317246', '320706', '384635', '397249', '397250'] nosy_count = 8.0 nosy_names = ['zachrahan', 'r.david.murray', 'josh.r', 'jedwards', 'tanzer@swing.co.at', 'Aaron Hall', 'naught101', 'andrei.avk'] pr_nums = ['8011', '15691'] priority = 'normal' resolution = None stage = 'patch review' status = 'open' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue25457' versions = ['Python 3.7', 'Python 3.8'] ```

    f76bdadd-b2f2-433c-a46a-828024cc9a08 commented 9 years ago

    In Python 3, trying to json-dump a dict with keys of different types fails with a TypeError when sort_keys is specified:

    python2.7 \===========

    Python 2.7.10 (default, May 29 2015, 10:02:30) 
    [GCC 4.8.4] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import json
    >>> json.dumps({1 : 42, "foo" : "bar", None : "nada"}, sort_keys = True)
    '{"null": "nada", "1": 42, "foo": "bar"}'

    python3.5 \============

    Python 3.5.0 (default, Oct  5 2015, 12:03:13) 
    [GCC 4.8.5] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import json
    >>> json.dumps({1 : 42, "foo" : "bar", None : "nada"}, sort_keys = True)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib64/python3.5/json/__init__.py", line 237, in dumps
        **kw).encode(obj)
      File "/usr/lib64/python3.5/json/encoder.py", line 199, in encode
        chunks = self.iterencode(o, _one_shot=True)
      File "/usr/lib64/python3.5/json/encoder.py", line 257, in iterencode
        return _iterencode(o, 0)
    TypeError: unorderable types: str() < int()

    Note that the documentation explicitly allows keys of different, if basic, types:

    If skipkeys is True (default: False), then dict keys that are not of a basic type (str, int, float, bool, None) will be skipped instead of raising a TypeError.

    As all they keys are dumped as strings, a simple solution would be to sort after converting to strings. Looking closely at the output of Python 2, the sort order is a bit strange!

    99ffcaa5-b43b-4e8e-a35e-9c890007b9cd commented 9 years ago

    The Python 2 sort order is a result of the "arbitrary but consistent fallback comparison" (omitting details, it's comparing the names of the types), thus the "strange" sort order. Python 3 (justifiably) said that incomparable types should be incomparable rather than silently behaving in non-intuitive ways, hiding errors.

    Python is being rather generous by allowing non-string keys, because the JSON spec ( http://json.org/ ) only allows the keys ("names" in JSON parlance) to be strings. So you're already a bit in the weeds as far as compliant JSON goes if you have non-string keys.

    Since mixed type keys lack meaningful sort order, I'm not sure it's wrong to reject attempts to sort them. Converting to string is as arbitrary and full of potential for silently incorrect comparisons as the Python 2 behavior, and reintroducing it seems like a bad idea.

    99ffcaa5-b43b-4e8e-a35e-9c890007b9cd commented 9 years ago

    As a workaround (should you absolutely need to sort keys by some arbitrary criteria), you can initialize a collections.OrderedDict from the sorted items of your original dict (using whatever key function you like), then dump without using sort_keys=True. For example, your suggested behavior (treat all keys as str) could be achieved by the user by replacing:

        json.dumps(mydict, sort_keys=True)

    with:

    json.dumps(collections.OrderedDict(sorted(mydict.items(), key=str)))

    Particularly in 3.5 (where OrderedDict is a C builtin), this shouldn't incur too much additional overhead (sort_keys has to make a sorted list intermediate anyway), and the output is the same, without introducing implicit hiding of errors.

    99ffcaa5-b43b-4e8e-a35e-9c890007b9cd commented 9 years ago

    Oops, minor flaw with that. It's str-ifying the tuples, not the keys, which could (in some cases) cause issues with keys whose reprs have different quoting. So you'd end up with lambdas. Boo. Anyway, corrected version (which would probably not be one-lined in real code):

    json.dumps(collections.OrderedDict(sorted(mydict.items(), key=lambda x: str(x[0]))))

    f76bdadd-b2f2-433c-a46a-828024cc9a08 commented 9 years ago

    Josh Rosenberg wrote at Fri, 23 Oct 2015 02:45:51 +0000:

    The Python 2 sort order is a result of the "arbitrary but consistent fallback comparison" (omitting details, it's comparing the names of the types), thus the "strange" sort order.

    Thanks. I knew that.

    Python 3 (justifiably) said that incomparable types should be incomparable rather than silently behaving in non-intuitive ways, hiding errors.

    "justifiably" is debatable. I consider the change ill-conveived.

    Displaying a dictionary (or just its keys) in a readable, or just reproducible, way is useful in many contexts. Python 3 broke that for very little, INMNSHO, gain.

    I consider "hiding errors" a myth, to say it politely!

    Python is being rather generous by allowing non-string keys, because the JSON spec ( http://json.org/ ) only allows the keys ("names" in JSON parlance) to be strings. So you're already a bit in the weeds as far as compliant JSON goes if you have non-string keys.

    There are two possibilities:

    1) Accepting non-string keys is intended. Then sort_keys shouldn't break like it does.

    As far as JSON goes, the output of json.dump[s] contains string keys.

    2) Accepting non-string keys is a bug. Then json.dump[s] should be changed to not accept them.

    Mixing both approaches is the worst of all worlds.

    Since mixed type keys lack meaningful sort order, I'm not sure it's wrong to reject attempts to sort them.

    The documentation says:

    If sort_keys is True (default False), then the output of dictionaries
    will be sorted by key; this is useful for regression tests to ensure
    that JSON serializations can be compared on a day-to-day basis.

    **Reproducible** is the keyword here.

    **Readability** is another one. Even if the sort order is "strange", it is much better than random order, if you are looking for a specific key.

    For the record, it was a test failing under Python 3.5 that triggered this bug report.

    > As all they keys are dumped as strings, a simple solution would be to > sort after converting to strings. Converting to string is as arbitrary and full of potential for silently incorrect comparisons as the Python 2 behavior, and reintroducing it seems like a bad idea.

    json.dumps already does the conversion::

        >>> json.dumps({1 : 42, "foo" : "bar", None : "nada"})
        '{"foo": "bar", "1": 42, "null": "nada"}'

    Another run::

        >>> json.dumps({1 : 42, "foo" : "bar", None : "nada"})
        '{"1": 42, "foo": "bar", "null": "nada"}'

    That difference is exactly the reason for sort_keys.

    f76bdadd-b2f2-433c-a46a-828024cc9a08 commented 9 years ago

    Josh Rosenberg wrote at Fri, 23 Oct 2015 02:56:30 +0000:

    As a workaround (should you absolutely need to sort keys by some arbitrary criteria), you can initialize a collections.OrderedDict from the sorted items of your original dict (using whatever key function you like), then dump without using sort_keys=True.

    Sigh...

    I already implemented a workaround but it's not as simple as you think — the dictionary in question is nested.

    The problem is that this is just another unnecessary difficulty when trying to move to Python 3.x.

    100b7c8c-82ad-411a-8e39-8d1e82dc9b22 commented 7 years ago

    This one just bit me too. It seems that if JSON serialization accepts non-string dict keys, it should make sure to accept them in all circumstances. Currently, there is an error *only with mixed-type dicts, *only when sort_keys=True.

    In addition, the error raised in such cases is especially unhelpful. Running the following: json.dumps({3:1, 'foo':'bar'}, sort_keys=True)

    produces a stack trace that terminates in a function defined in C, with this error: TypeError: '\<' not supported between instances of 'str' and 'int'

    That error doesn't give non-experts very much to go on...!

    The fix is reasonably simple: coerce dict keys to strings *before* trying to sort the keys, not after. The only fuss in making such a patch is that the behavior has to be fixed in both _json.c and in json/encode.py.

    The only other consistent behavior would be to disallow non-string keys, but that behavior is at this point very well entrenched. So it only makes sense that encoding should be patched to not fail in corner cases.

    affb16c2-6fdb-403f-b328-86b7e719c99e commented 6 years ago

    Now that dicts are sortable, does that make the sort_keys argument redundant?

    Should this bug be changed to "won't fix"?

    100b7c8c-82ad-411a-8e39-8d1e82dc9b22 commented 6 years ago

    Well, "wontfix" would be appropriate in the context of deprecating the sort_keys option (over the course of however many releases) and documenting that the new procedure for getting JSON output in a specific order is to ensure that the input dict was created in that order.

    Certainly for regression testing, sort_keys is no longer needed, but that's not the only reason people are using that option. (It's certainly not why I use the option -- my use stems from sort_keys improving human readability of the JSON.)

    But outside of deprecating sort_keys wholesale, it is still a bug that sort_keys=True can cause an error on input that would otherwise be valid for json.dump[s].

    f76bdadd-b2f2-433c-a46a-828024cc9a08 commented 6 years ago

    Aaron Hall wrote at Sun, 20 May 2018 16:49:06 +0000:

    Now that dicts are sortable, does that make the sort_keys argument redundant?

    Should this bug be changed to "won't fix"?

    https://bugs.python.org/issue25457#msg317216 is as good an answer as I could give.

    Considering that I openend the bug more than 2.5 years ago, it doesn't really matter though.

    bitdancer commented 6 years ago

    I'm fairly certain (though not 100%, obviously :) that coercing first and then sorting would be accepted if someone wants to create a PR for this.

    affb16c2-6fdb-403f-b328-86b7e719c99e commented 6 years ago

    From a design standpoint, I'm fairly certain the sort_keys argument was created due to Python's dicts being arbitrarily ordered.

    Coercing to strings before sorting is unsatisfactory because, e.g. numbers sort lexicographically instead of by numeric value when strings.

    >>> import json
    >>> json.dumps({i:i**2 for i in range(15)}, sort_keys=True)
    '{"0": 0, "1": 1, "2": 4, "3": 9, "4": 16, "5": 25, "6": 36, "7": 49, "8": 64, "9": 81, "10": 100, "11": 121, "12": 144, "13": 169, "14": 196}'
    >>> json.dumps({str(i):i**2 for i in range(15)}, sort_keys=True)
    '{"0": 0, "1": 1, "10": 100, "11": 121, "12": 144, "13": 169, "14": 196, "2": 4, "3": 9, "4": 16, "5": 25, "6": 36, "7": 49, "8": 64, "9": 81}'

    Changing the order of operations is just going to create more issues, IMHO.

    Now that users can sort their dicts prior to providing them to the function, e.g.:

    >>> json.dumps({str(i):i**2 for i in range(15)})
    '{"0": 0, "1": 1, "2": 4, "3": 9, "4": 16, "5": 25, "6": 36, "7": 49, "8": 64, "9": 81, "10": 100, "11": 121, "12": 144, "13": 169, "14": 196}'

    we could deprecate the argument, or just keep it as-is for hysterical raisins.

    Regardless, I'd close this as "won't fix".

    bitdancer commented 6 years ago

    json keys *are strings, so the fact that we support other object types as keys and coerce them to strings is an "extra feature" of python, and is actually a somewhat questionable feature. The reproducible use case is solved by the fact that dicts are now ordered, with no extra work on the part of the programmer. Likewise, if you want custom sorting you can ensure your dict is ordered the way you want it to be, as you indicate. The remaining use case for sort_keys, then (and one for which it is *commonly used) is sorting the keys lexicographically, and for that, sorting the coereced strings is correct per the json standard (in which all keys are required to be strings).

    Note that sort_keys will not be removed for backward compatibility reasons, so the only question is whether or not the increased functionality of coercing first is worth the trouble to implement. I'm actually only +0 on it, since I don't consider it good practice to json-ize dicts that have non-string keys. The reason I'm + is because it would increase backward compatibility with python2 (not the ordering of the output, we can't make that work, but in the fact that it would no longer raise an error in python3).

    We'll see if other core developers agree or disagree.

    ff7ad99c-3047-464f-8e61-d0a2adcf8411 commented 6 years ago

    This came up in a StackOverflow question[1] today, so I took a stab at addressing the error. The changes don't restore the 2.x behavior, but just do as R. David Murray suggested and coerce the keys to strings prior to sorting to prevent the error.

    The changes in _json.c and json.decoder are handled slightly differently in the case of skipkeys.

    Both create a list of (coerced_key, value) pairs, sorts it (when specified), and uses that in place of the PyDict_Items / .items().

    When skipkeys=True and invalid (uncoercible) keys are found, the c code will just not append that item to the coerced_items list while the python code uses None to signal that item should be filtered out.

    (That being said, I'm not a huge fan of the approach I used in the Python code and may rewrite using .append instead of a generator.

    The c code could definitely use a review when it comes to reference counts.

    Fork commit: https://github.com/jheiv/cpython/commit/8d3612f56a137da0d26b83d00507ff2f11bca9bb

    [1] https://stackoverflow.com/questions/51093268/why-am-i-getting-typeerror-unorderable-types-str-int-in-this-code

    7046984c-badd-4b36-8833-1965deca539b commented 3 years ago

    I want to do something like this:

        hashlib.md5(json.dumps(d, sort_keys=True))

    So I can check if a dict's contents are the same as a DB version, but I can't guarantee that all the keys are strings, so it breaks, annoyingly. I would very much like the apply-default-function-then-sort approach. Until then, my work-around is this:

        def deep_stringize_dict_keys(item):
            """Converts all keys to strings in a nested dictionary"""
            if isinstance(item, dict):
                return {str(k): deep_stringize_dict_keys(v) for k, v in item.items()}
    
            if isinstance(item, list):
                # This will check only ONE layer deep for nested dictionaries inside lists.
                # If you need deeper than that, you're probably doing something stupid.
                if any(isinstance(v, dict) for v in item):
                    return [deep_stringize_dict_keys(v) if isinstance(v, dict) else v
                            for v in item]
        # I don't care about tuples, since they don't exist in JSON
            return item

    Maybe it can be of some use for others.

    akulakov commented 3 years ago

    Some observations:

    It depends on the perspective: you may think of the json as a representation of a dict of objects, that just happen to be in json format; or you can think of it as a json document with string keys (of course) that just happen to come from a dict of objects. Both can be valid depending on the use case.

    Given all of this, I propose keeping current behavior for the existing arg, and adding another arg for 'stringify then sort' behavior. Then we'll have no silent guessing and the "unorderable" type error message can point the user directly to the new argument.

    If the user reads the docs before using this method, they will see two clear options with respective tradeoffs and decide which one to use.

    So either by reading the docs or running into the error, the user will have a clear explanation and a clear and convenient solution.

    f76bdadd-b2f2-433c-a46a-828024cc9a08 commented 3 years ago

    Json keys *are strings*.

    That‘s why json.dump stringifies all keys. If you want to argue that this behavior is wrong I wouldn’t protest except for that it breaks extant code.

    But arguing that sorting the stringified keys would violate user’s expectations or lead to problems down the line makes no sense. The user is asking for an object with string keys and they want the keys sorted. That is unambiguous and well defined.

    Neither does adding a second argument make any sense, it would just increase the confusion.

    My problem was that Python 3.x threw an exception about this for a complex json object in a context where it was not at all obvious what was going on. And the code in question had worked for years in Python 2.

    This bug report is many, many years old and I don’t much care one way or another but I am very sad that

     Practicality beats purity 

    got utterly lost in and after the transition to Python 3.

    Christian Tanzer

    On 10.07.2021, at 16:12, Andrei Kulakov report@bugs.python.org wrote:  Andrei Kulakov andrei.avk@gmail.com added the comment:

    Some observations:

    • sort_keys arg does a deep sorting of nested dictionaries. It's a bit too much to ask users to do this type of preprocessing manually before dumping to json.

    • the error doesn't seem too onerous to me. 'unorderable types: str() < int()' If uncertain, a user can go to interactive shell and try 1 < "2", and then the issue is obvious.

    • to me, current behaviour seems preferable to silently guessing that users wants stringified sorting, especially since it can surface as a problem way down the line.

    • what makes this issue interesting is that in roughly half of cases (I'm guessing) the user will want object sorted and then cast to string and would be surprised if the reverse happened, and in the other half cases the user would want them stringified, then sorted, and would be surprised if that didn't happen.

    It depends on the perspective: you may think of the json as a representation of a dict of objects, that just happen to be in json format; or you can think of it as a json document with string keys (of course) that just happen to come from a dict of objects. Both can be valid depending on the use case.

    Given all of this, I propose keeping current behavior for the existing arg, and adding another arg for 'stringify then sort' behavior. Then we'll have no silent guessing and the "unorderable" type error message can point the user directly to the new argument.

    If the user reads the docs before using this method, they will see two clear options with respective tradeoffs and decide which one to use.

    So either by reading the docs or running into the error, the user will have a clear explanation and a clear and convenient solution.

    ---------- nosy: +andrei.avk


    Python tracker \report@bugs.python.org\ \https://bugs.python.org/issue25457\