python / cpython

The Python programming language
https://www.python.org/
Other
61.17k stars 29.52k forks source link

Add a case-insensitive case-preserving dict #63186

Closed pitrou closed 9 years ago

pitrou commented 10 years ago
BPO 18986
Nosy @tim-one, @warsaw, @theller, @birkenfeld, @rhettinger, @mdickinson, @jaraco, @pitrou, @vstinner, @ericvsmith, @merwok, @bitdancer, @ethanfurman, @vadmium, @serhiy-storchaka, @demianbrecht
Files
  • transform.patch
  • transformdict.patch
  • ctransformdict.patch
  • dicttransform.patch: Add dict.transform
  • transformdict2.patch
  • transformdict3.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = 'https://github.com/rhettinger' closed_at = created_at = labels = ['type-feature', 'library'] title = 'Add a case-insensitive case-preserving dict' updated_at = user = 'https://github.com/pitrou' ``` bugs.python.org fields: ```python activity = actor = 'ethan.furman' assignee = 'rhettinger' closed = True closed_date = closer = 'ethan.furman' components = ['Library (Lib)'] creation = creator = 'pitrou' dependencies = [] files = ['31713', '31727', '31729', '31749', '31757', '31761'] hgrepos = [] issue_num = 18986 keywords = ['patch'] message_count = 86.0 messages = ['197359', '197360', '197361', '197362', '197366', '197367', '197368', '197369', '197370', '197376', '197377', '197379', '197380', '197381', '197387', '197389', '197390', '197391', '197392', '197393', '197398', '197399', '197400', '197401', '197402', '197403', '197405', '197406', '197410', '197411', '197412', '197430', '197434', '197445', '197446', '197457', '197464', '197469', '197479', '197516', '197525', '197526', '197527', '197528', '197529', '197531', '197533', '197635', '197637', '197644', '197648', '197710', '197711', '197733', '197969', '197970', '197973', '197975', '197980', '197981', '197982', '197983', '197984', '197986', '197987', '197989', '197990', '198281', '198912', '199652', '199708', '205979', '205995', '206027', '206195', '206196', '234050', '234062', '234077', '234087', '236105', '236161', '236163', '236178', '236909', '243370'] nosy_count = 19.0 nosy_names = ['tim.peters', 'barry', 'theller', 'georg.brandl', 'rhettinger', 'mark.dickinson', 'jaraco', 'pitrou', 'vstinner', 'eric.smith', 'eric.araujo', 'mrabarnett', 'Arfrever', 'r.david.murray', 'ethan.furman', 'sbt', 'martin.panter', 'serhiy.storchaka', 'demian.brecht'] pr_nums = [] priority = 'normal' resolution = 'rejected' stage = 'resolved' status = 'closed' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue18986' versions = ['Python 3.5'] ```

    pitrou commented 10 years ago

    This is a very common need when implementing network protocols. You want to match keys case-insensitively but also preserve the original casing (e.g. for presentation).

    When searching on the Web, you see many people reimplementing their own variant (often incomplete, or buggy). For example, Twisted has its own, the email package has something resembling it, WebOb also.

    Having an implementation in the stdlib would spare many people the effort, ensure the implementation is complete and well-tested, and perhaps also add some optimizations to mitigate the overhead compared to a plain dict.

    Note this is an instance of a more general pattern, where they key used for matching is derived from the lookup key using a constant derivation function. So maybe we want to implement the more general pattern and let users specify str.lower as the key derivation function.

    39d85a87-36ea-41b2-b2bb-2be43abb500e commented 10 years ago

    Surely a case-insensitive dict should use str.casefold, not str.lower?

    pitrou commented 10 years ago

    Surely a case-insensitive dict should use str.casefold, not str.lower?

    Perhaps. Network protocols will usually only allow ASCII in parts where case is insensitive (e.g. header names), so it shouldn't make a difference.

    Implementing the generic pattern means this is left at the user's discretion, though.

    pitrou commented 10 years ago

    For the record, I have my own implementation here: https://bitbucket.org/optiflowsrd/obelus/src/tip/obelus/casedict.py?at=default https://bitbucket.org/optiflowsrd/obelus/src/tip/obelus/test/test_casedict.py?at=default

    bitdancer commented 10 years ago

    For the record, email is not a good argument for this, since email could not use this data structure (its data structure is *not* a dict, but a list with dict-like features grafted on).

    I do think this would be useful, and the generic version (analogous to defaultdict) would seem to make sense.

    pitrou commented 10 years ago

    Ok, let's bikeshed this a bit. What should be the name?

    ethanfurman commented 10 years ago

    I would say

    Too bad we can't just add an extra 'transform_key' keyword to defaultdict.

    ericvsmith commented 10 years ago

    It would be nice to combine the behaviors that defaultdict and the case-insensitive comparisons.

    pitrou commented 10 years ago

    It would be nice to combine the behaviors that defaultdict and the case-insensitive comparisons.

    Any use case? In my experience they are used in completely different situations. defaultdict mostly to use the writing of some (internal) algorithms, a case-insensitive dict to store user-visible data.

    ericvsmith commented 10 years ago

    Just today I was using a defaultdict where the keys are stock symbols. They're case insensitive (at least for this particular application).

    In this case I just str.upper everything, but it would be a nice feature to case-preserve the keys that I pre-populate. I care less about the keys from user data.

    pitrou commented 10 years ago

    In this case I just str.upper everything, but it would be a nice feature to case-preserve the keys that I pre-populate. I care less about the keys from user data.

    Well, stock symbols are what I would call user data :-)

    ericvsmith commented 10 years ago

    True enough!

    I was trying to distinguish keys that I populate with initial values (mostly stock indexes) versus those where I just read values from a user-supplied file. When I populate the index values, I'd like to preserve the case I initially used. When I use user-supplied values, I don't know that the first value I use to populate the defaultdict has any more meaning that the last one I see.

    It would just be a nice-to-have feature for which I have a real use case. It's not so critical that I can't work around it.

    ethanfurman commented 10 years ago

    To the point, however, Eric's example would make use of both the defaultdict portion and the transformkey portion in a single dict.

    39d85a87-36ea-41b2-b2bb-2be43abb500e commented 10 years ago

    mappeddict?

    Re defaultdict, you could write a dict that does all of these things, called superdict! :-)

    bitdancer commented 10 years ago

    coercekeydict

    pitrou commented 10 years ago

    I would like to shorten the proposals to "transformdict" and "coercedict". (after all, transforming the values would have little sense: you can do it yourself trivially)

    vstinner commented 10 years ago

    FYI os.environ uses something similar: keys and values are encoded and decoded using functions. So any transformation is supported.

    http://hg.python.org/cpython/file/eac63e7ceb03/Lib/os.py#l636

    On UNIX, the encoder and decoder are os.fsencode() and os.fsdecode() (not exactly, the real functions are more strict on the input type).

    On Windows, the encoder converts the key to uppercase.

    pitrou commented 10 years ago

    FYI os.environ uses something similar: keys and values are encoded and decoded using functions. So any transformation is supported.

    I don't think this is the same situation. os.environ has bijective transformations, which don't pose any implementation challenge.

    The whole point of a "transformdict" is to allow for multiple keys to actually map to the same dict entry.

    786d3f11-b763-4414-a03f-abc264e0b72d commented 10 years ago

    +1 For the general idea +1 For the more generic approach of which "lowercase" is just one special case

    +10 to make this a PEP so that more people have a chance to express their opinion (currently only those who noticed it on the issues mailing list). I find the issue tracker a very bad medium for any kind of brain-storming or bikeshedding.

    pitrou commented 10 years ago

    +10 to make this a PEP so that more people have a chance to express their opinion (currently only those who noticed it on the issues mailing list). I find the issue tracker a very bad medium for any kind of brain-storming or bikeshedding.

    Well, I don't think there is a lot of brainstorming to be done here: we are talking about a single well-defined functionality. I don't mind writing a PEP if other people ask for it, but I'd rather spend my time on more important things.

    As for bikeshedding, there is no "good medium" for it, really :-) At worse we could ask python-dev for their naming contributions (a great idea, surely).

    bitdancer commented 10 years ago

    Indeed. Although there was apparently some call for it, it doesn't sound from a quick google like defaultdict was deemed to require a PEP. Presumably the informed audience should be wider than this issue, though.

    I also note that defaultdict is implemented via a special method on dict itself (missing), and if this one was implemented the same way it would be easy to combine the features.

    ethanfurman commented 10 years ago

    Precisely what I was thinking. :)

    pitrou commented 10 years ago

    I also note that defaultdict is implemented via a special method on dict itself (missing), and if this one was implemented the same way it would be easy to combine the features.

    It's not that simple: to remember the original casing you need either a second container, or to use (original_key, value) tuples as values. Both approaches have non-trivial repercussions on the implementation of many methods.

    pitrou commented 10 years ago

    (as a sidenote, you might want a case-insensitive OrderedDict as well, I see no reason to make a special case for defaultdict here)

    serhiy-storchaka commented 10 years ago

    See also discussion on a topic: http://comments.gmane.org/gmane.comp.python.ideas/18469 .

    Proposed names: custom_dict, KeyedDictionary, Dictionary.

    It will be confused if this dict will not be compatible with PyDict API. It is possible to add such feature directly into the dict class (I experimented with IdentityDict).

    pitrou commented 10 years ago

    Proposed names: custom_dict, KeyedDictionary, Dictionary.

    Sounds much too vague and un-specific.

    It will be confused if this dict will not be compatible with PyDict API.

    Why? Many custom dict-like classes aren't.

    It is possible to add such feature directly into the dict class (I experimented with IdentityDict).

    Can you explain how?

    serhiy-storchaka commented 10 years ago

    Why? Many custom dict-like classes aren't.

    And this is weird (bpo-10977).

    Can you explain how?

    By patching Objects/dictobject.c of course. I suppose it should require changing about 400 lines of code, a little more than for IdentityDict. When you provides the specification perhaps I will provide a patch.

    pitrou commented 10 years ago

    By patching Objects/dictobject.c of course.

    Am I stupid :-)

    I suppose it should require changing about 400 lines of code, a little more than for IdentityDict. When you provides the specification perhaps I will provide a patch.

    Well, take a look at the code I've pointed to above. I'm curious to know how you'll integrate it in dictobject.c without slowing down normal dict objects, and without making them bigger.

    serhiy-storchaka commented 10 years ago

    I'm curious to know how you'll integrate it in dictobject.c without slowing down normal dict objects, and without making them bigger.

    It of course will make a size of source file bigger, but shouldn't affect a size or performance of normal dicts. A dict object contains dk_lookup. Constructor for keyed dict (subclass of ) should initialize it with specialized function which calls the "key" function and recalculate a hash (yes, with this simple approach a hash will be calculated twice, for original and for transformed keys). Hmm, actually it can be even simpler than for IdentityDict (for which not calculating a hash was important). Also some other methods which relies on dict implementation details (e.g. making a copy of dict) should be modified.

    The most cumbersome part is the tests. Unfortunately I lost my tests for IdentityDict (used hg diff without --git). It will be good if your provide complete test suite.

    pitrou commented 10 years ago

    It of course will make a size of source file bigger, but shouldn't affect a size or performance of normal dicts. A dict object contains dk_lookup.

    You need to keep both the original keys and the transformed keys. It's not only about transforming keys on lookup. Otherwise, yes, it's trivial.

    pitrou commented 10 years ago

    It will be good if your provide complete test suite.

    ... Again, take a look at the code I've posted above ...

    pitrou commented 10 years ago

    (now relayed on python-dev)

    pitrou commented 10 years ago

    I'm uploading a pure Python transformdict implementation + tests, for Serhiy's benefits (and others') :-)

    e26428b1-70cf-4e9f-ae3c-9ef0478633fb commented 10 years ago

    With the current patch __repr__() will fail if the untransformed key is unhashable:

    >>> d = collections.transformdict(id)
    >>> L = [1,2,3]
    >>> d[L] = None 
    >>> d.keys()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Repos\cpython-dirty\lib\collections\abc.py", line 444, in __repr__
        return '{0.__class__.__name__}({0._mapping!r})'.format(self)
      File "C:\Repos\cpython-dirty\lib\collections\__init__.py", line 944, in __repr__
        self._transform, repr(dict(self)))
    TypeError: unhashable type: 'list'
    pitrou commented 10 years ago

    With the current patch __repr__() will fail if the untransformed key is unhashable:

    Yeah, the __repr__() implementation will be a bit annoying to get right :-)

    merwok commented 10 years ago

    FTR this is one message from the previous thread about this: https://mail.python.org/pipermail/python-ideas/2010-June/007332.html

    pitrou commented 10 years ago

    Updated patch: fixes repr(), and retains the first key not the last.

    rhettinger commented 10 years ago

    I would *really* like for this to start outside the standard library. It needs to mature with user feedback before being dumped in the collections module (which was never intended to be a giant pile of every collection a person could think of).

    Adding yet more dictionary variants is an example of way-too-many-ways-to-do-it.

    serhiy-storchaka commented 10 years ago

    Here is a preliminary C implementation.

    pitrou commented 10 years ago

    Serhiy, any benchmarks for your implementation? Does it slow down regular dicts?

    ethanfurman commented 10 years ago

    On 09/11/2013 02:39 PM, Tim Delaney wrote on PyDev:

    I would think that retrieving the keys from the dict would return the transformed keys (I'd call them canonical keys).

    The more I think about this the more I agree. A canonicaldict with a key function that simply stored the transformed key and it's value would seem to be a lot simpler:

    Further, in order to store the non-canonical keys a separate list must be kept of the keys to preseed the canonicaldict; if we store the canonical keys a separate list must be kept for presentation purposes -- so worst case scenario we're keeping the same amount of information and best-case scenario the presentation of the keys doesn't matter and we just saved ourselves an extra data structure.

    bitdancer commented 10 years ago

    It would be simpler, but it would also be useless for the actual use case for which this issue was opened.

    ethanfurman commented 10 years ago

    True, but how big a deal is that?

    For one, it seems questionable to have the presentation portion of the data be part of the key.

    For two, when presentation is important a separate list must be kept anyway to preseed the dict; so just use that list to cycle through the canonicaldict:

    --> some_dict = some_function_that_returns_a_conanicaldict() --> presentation_list = ['IBM','Intel','AMD'] --> for company in presentation_list: ... key = some_dict.key[company] # demo purposes only ... value = some_dict[company] ... print(key, company, value) ibm IBM 2172 intel Intel 3210 amd AMD 4399

    pitrou commented 10 years ago

    Ethan, please don't post the same message *both* on the tracker and on the mailing-list. I'm sure most people here also read the ML thread.

    ethanfurman commented 10 years ago

    Right, sorry.

    bitdancer commented 10 years ago

    You are conceptualizing this very differently. In our view, this data structure is for cases where the original key is the most important piece of information (about the keys). The transformation in the lookup process is entirely in the service of looking up the value paired with that original key when there is more than one possible representation of that key. It is the original key that is critical when re-serializing the data or otherwise making use of the keys for anything other than lookup. So this is about making the data structure succinctly model the problem domain, which is what OO is supposed to be good at :)

    pitrou commented 10 years ago

    R. David Murray added the comment:

    You are conceptualizing this very differently. In our view, this data structure is for cases where the original key is the most important piece of information (about the keys). The transformation in the lookup process is entirely in the service of looking up the value paired with that original key when there is more than one possible representation of that key. It is the original key that is critical when re-serializing the data or otherwise making use of the keys for anything other than lookup. So this is about making the data structure succinctly model the problem domain, which is what OO is supposed to be good at :)

    Thanks for putting it much more convincingly than my python-dev response :-)

    serhiy-storchaka commented 10 years ago

    Here is an alternative C implementation. It adds to the dict class support of the __transform() method. If this method is defined in dict subclass it used to transforming keys. collections.TransformDict is just utilizes this feature as collections.defaultdict utilizes __missing(). This patch changes twice less C code than previous one (227 vs 474 lines).

    serhiy-storchaka commented 10 years ago

    Does it slow down regular dicts?

    I were surprized, but yes. The ComplexPythonFunctionCalls test from pybench is 40% slower with ctransformdict.patch (and I still don't known why). With dicttransform.patch it is only 2% slower. All other pybench tests are approximately equal.

    pitrou commented 10 years ago

    > Does it slow down regular dicts?

    I were surprized, but yes. The ComplexPythonFunctionCalls test from pybench is 40% slower with ctransformdict.patch (and I still don't known why). With dicttransform.patch it is only 2% slower. All other pybench tests are approximately equal.

    Did you try any other microbenchmarks? Your approach sounds promising.