Add a case-insensitive case-preserving dict

pitrou commented 10 years ago

BPO	18986
Nosy	@tim-one, @warsaw, @theller, @birkenfeld, @rhettinger, @mdickinson, @jaraco, @pitrou, @vstinner, @ericvsmith, @merwok, @bitdancer, @ethanfurman, @vadmium, @serhiy-storchaka, @demianbrecht
Files	transform.patch transformdict.patch ctransformdict.patch dicttransform.patch: Add dict.transform transformdict2.patch transformdict3.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields: ```python assignee = 'https://github.com/rhettinger' closed_at = created_at = labels = ['type-feature', 'library'] title = 'Add a case-insensitive case-preserving dict' updated_at = user = 'https://github.com/pitrou' ``` bugs.python.org fields: ```python activity = actor = 'ethan.furman' assignee = 'rhettinger' closed = True closed_date = closer = 'ethan.furman' components = ['Library (Lib)'] creation = creator = 'pitrou' dependencies = [] files = ['31713', '31727', '31729', '31749', '31757', '31761'] hgrepos = [] issue_num = 18986 keywords = ['patch'] message_count = 86.0 messages = ['197359', '197360', '197361', '197362', '197366', '197367', '197368', '197369', '197370', '197376', '197377', '197379', '197380', '197381', '197387', '197389', '197390', '197391', '197392', '197393', '197398', '197399', '197400', '197401', '197402', '197403', '197405', '197406', '197410', '197411', '197412', '197430', '197434', '197445', '197446', '197457', '197464', '197469', '197479', '197516', '197525', '197526', '197527', '197528', '197529', '197531', '197533', '197635', '197637', '197644', '197648', '197710', '197711', '197733', '197969', '197970', '197973', '197975', '197980', '197981', '197982', '197983', '197984', '197986', '197987', '197989', '197990', '198281', '198912', '199652', '199708', '205979', '205995', '206027', '206195', '206196', '234050', '234062', '234077', '234087', '236105', '236161', '236163', '236178', '236909', '243370'] nosy_count = 19.0 nosy_names = ['tim.peters', 'barry', 'theller', 'georg.brandl', 'rhettinger', 'mark.dickinson', 'jaraco', 'pitrou', 'vstinner', 'eric.smith', 'eric.araujo', 'mrabarnett', 'Arfrever', 'r.david.murray', 'ethan.furman', 'sbt', 'martin.panter', 'serhiy.storchaka', 'demian.brecht'] pr_nums = [] priority = 'normal' resolution = 'rejected' stage = 'resolved' status = 'closed' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue18986' versions = ['Python 3.5'] ```

serhiy-storchaka commented 10 years ago

> Did you try any other microbenchmarks? Your approach sounds promising.

None of the microbenchmarks I tried showed anything interesting. Until I find the cause of the slowdown in ComplexPythonFunctionCalls, I have no idea which tests would be representative.

Of course you can run benchmarks yourself.

pitrou commented 10 years ago

Updated patch adding the getitem() method.

pitrou commented 10 years ago

Note: I haven't renamed transformdict to TransformDict yet.

pitrou commented 10 years ago

Uploading new patch with added transform_func property.

birkenfeld commented 10 years ago

Note that I'm strongly against this name of the getitem() method.

pitrou commented 10 years ago

Georg Brandl added the comment:

> Note that I'm strongly against this name of the getitem() method.

Any suggestion?

birkenfeld commented 10 years ago

Not really. Would "entry" be acceptable instead of "item"?

pitrou commented 10 years ago

Georg Brandl added the comment:

> Not really. Would "entry" be acceptable instead of "item"?

getentry() sounds decent to me, but it loses the parallel to popitem() and items().

birkenfeld commented 10 years ago

Hmm, I didn't consider popitem(). Maybe I'm too paranoid about users confusing __getitem__() and getitem() after all :)

serhiy-storchaka commented 10 years ago

But why not getkey()? Why do you need to return the value too?

pitrou commented 10 years ago

> But why not getkey()? Why do you need to return the value too?

Because it's more useful to return both.

serhiy-storchaka commented 10 years ago

Sorry, I don't understand why it's more useful. We need to create a tuple and then index it, or unpack it and drop one of the elements. This only wastes time and the programmer's attention.

bitdancer commented 10 years ago

Because most often the time at which you want the original key is the point at which you are about to re-serialize the data...so you need the value too.
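The re-serialization case can be sketched roughly as follows. This is a minimal illustration, not the code from the attached patches: the class name `TransformDictSketch`, the two internal dicts, and the choice to keep the first spelling stored are all assumptions made for the example. Only the idea of a `getitem()` method returning an `(original key, value)` pair comes from the discussion above.

```python
from collections.abc import MutableMapping


class TransformDictSketch(MutableMapping):
    """Illustrative stand-in; not the implementation from the patches."""

    def __init__(self, transform):
        self._transform = transform
        self._original = {}   # transformed key -> key as first stored (assumed policy)
        self._data = {}       # transformed key -> value

    def __getitem__(self, key):
        return self._data[self._transform(key)]

    def __setitem__(self, key, value):
        tkey = self._transform(key)
        self._original.setdefault(tkey, key)  # keep the first spelling seen
        self._data[tkey] = value

    def __delitem__(self, key):
        tkey = self._transform(key)
        del self._data[tkey]
        del self._original[tkey]

    def __iter__(self):
        return iter(self._original.values())

    def __len__(self):
        return len(self._data)

    def getitem(self, key):
        """Return (original key, value) for any spelling of the key."""
        tkey = self._transform(key)
        return self._original[tkey], self._data[tkey]


headers = TransformDictSketch(str.casefold)
headers["Content-Type"] = "text/plain"
headers["content-type"] = "text/html"   # updates the value, keeps the first spelling

# At serialization time both the stored spelling and the value are needed.
original, value = headers.getitem("CONTENT-TYPE")
line = f"{original}: {value}"
```

A `getkey()` variant would need a second lookup to fetch the value at this point, which is the trade-off being argued here.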

bitdancer commented 10 years ago

I do think getitem is the most natural name for the method.

serhiy-storchaka commented 10 years ago

Oh, could anyone borrow Guido's time machine and rename either __getitem__() to __getvalue__() or items() to entries()?

ericvsmith commented 10 years ago

On 09/17/2013 09:34 AM, R. David Murray wrote:

R. David Murray added the comment:

> Because most often the time at which you want the original key is the point at which you are about to re-serialize the data...so you need the value too.

I can't think of a case where I'd need (original_key, value) where I wouldn't also be iterating over items(). Especially so if I'm serializing.

On the other hand, I don't have a use case for the original key, anyway. So I don't have a strong feeling about this, other than it feels odd that the answer to the original question (I think on python-dev) "how do we get the original key back?" is answered by "by giving you the original key and its value".

ericvsmith commented 10 years ago

On 09/17/2013 10:12 AM, Eric V. Smith wrote:

> On the other hand, I don't have a use case for the original key, anyway. So I don't have a strong feeling about this, other than it feels odd that the answer to the original question (I think on python-dev) "how do we get the original key back?" is answered by "by giving you the original key and its value".

I meant: I don't have a use case for finding the original key outside of iterating over items().

jaraco commented 10 years ago

I just want to say thanks for working on this. I also have needed this functionality for various needs in the past. To fulfill my needs, I wrote this implementation:

https://bitbucket.org/jaraco/jaraco.util/src/1ab3e7061f96bc5e179b6b2c46b06d1c20f87129/jaraco/util/dictlib.py?at=default#cl-221

That implementation is used in the irc library for a case-insensitive dict, but using the IRC-specific standard for case insensitivity (https://bitbucket.org/jaraco/irc/src/1576b10dc2923d4d7234319d2d1e11a5080e1f7d/irc/dict.py?at=default#cl-49).

I share this just to add a +1 for the need and to provide additional use cases and implementations for reference.

pitrou commented 10 years ago

Raymond, have you had time to look at this?

rhettinger commented 10 years ago

Antoine, is the PEP ready for review?

pitrou commented 10 years ago

> Antoine, is the PEP ready for review?

Well, I think it is. Do you think other points should be addressed in it? We still have some time.

mdickinson commented 10 years ago

+1 for this (for Python 3.5, now, I guess). I've just found another place where I'd use it.

Looking at the implementation, one thing surprises me a bit: I'd expect the KeyError from a 'del' or 'pop' operation to have the untransformed key rather than the transformed key in its .args.

How about '_keys' and '_values' for the slot names, in place of '_original' and '_data'?
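The point about `KeyError` can be illustrated with a small sketch. The `FoldedDict` class below is hypothetical and much simpler than the patch; it only shows one way a failed `del` could report the key the caller passed instead of the transformed one.

```python
class FoldedDict(dict):
    """Hypothetical minimal example: keys are case-folded before storage."""

    def __setitem__(self, key, value):
        super().__setitem__(key.casefold(), value)

    def __delitem__(self, key):
        try:
            super().__delitem__(key.casefold())
        except KeyError:
            # Report the key the caller actually passed, not its folded form.
            raise KeyError(key) from None


d = FoldedDict()
d["Spam"] = 1

reported = None
try:
    del d["EGGS"]
except KeyError as exc:
    reported = exc.args[0]   # the untransformed key, "EGGS"
```

Without the `try`/`except`, the exception would carry the folded key `"eggs"`, which the caller never wrote.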

rhettinger commented 10 years ago

Mark, what was the use case you found?

mdickinson commented 10 years ago

> Mark, what was the use case you found?

It's essentially an IdentityDict, though I've found other more specific transforms useful.

I was writing a tool to find reference cycles between Python objects (we have a customer application that's working in a multithreaded COM environment and has to ensure that COM objects are released on the same types of threads they were created on, so we have to be careful about cyclic garbage and delayed garbage collection).

The graph of Python objects (class 'ObjectGraph') is modelled as a fairly standard directed graph (set of vertices, set of edges, two dictionaries mapping each edge to its head and tail), but of course for this application the dict and set have to be based on object identity rather than normal equality. Using a TransformDict (and an IdentitySet) lets me write the standard graph algorithms (e.g., for finding strongly connected components) in a natural way, leaving it to the TransformDict and IdentitySet to do the necessary id() conversions under the hood.)

I also have a similar AnnotatedGraph object (a sort of offline version of the ObjectGraph), where the edges and vertices carry additional information and it's convenient to be able to use a lightweight ID rather than an entire vertex or edge as a dictionary key. Again, using a TransformDict lets one hide the details and present the graph manipulation code readably and naturally.
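For readers unfamiliar with the identity-keyed idea, here is a minimal sketch. It is not taken from refcycle; `IdentityDictSketch` and its layout are assumptions made for illustration. Keys are looked up by `id()`, and the object itself is stored alongside the value so that it stays alive and its id cannot be reused.

```python
class IdentityDictSketch:
    """Maps objects by identity rather than equality (illustrative only)."""

    def __init__(self):
        # id(obj) -> (obj, value); holding obj keeps it alive so its id stays unique
        self._entries = {}

    def __setitem__(self, obj, value):
        self._entries[id(obj)] = (obj, value)

    def __getitem__(self, obj):
        try:
            return self._entries[id(obj)][1]
        except KeyError:
            raise KeyError(obj) from None

    def __contains__(self, obj):
        return id(obj) in self._entries

    def __len__(self):
        return len(self._entries)


a = [1, 2]
b = [1, 2]          # equal to a, but a distinct (and unhashable) object
d = IdentityDictSketch()
d[a] = "first"
d[b] = "second"
```

Equal but distinct lists get separate entries, and unhashable objects can serve as keys, which is what the graph code needs.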

Some code here, if you're interested:

https://github.com/mdickinson/refcycle/blob/refactor/refcycle/object_graph.py

Caveat: it's work in progress.

rhettinger commented 10 years ago

[Mark Dickinson]

> It's essentially an IdentityDict, though I've found other more specific transforms useful.

Have any of the applications had use for the part of the API that looks up the original, untransformed key?

mdickinson commented 10 years ago

Not my applications, no.

ethanfurman commented 9 years ago

3.5 is almost here; Raymond, care to make a ruling?

rhettinger commented 9 years ago

Yes.

I intend to button this one up before long.

vadmium commented 9 years ago

For the record, this is related to PEP-455 (key-transforming dictionary)

vstinner commented 9 years ago

The API is simple and well defined, and the addition is small. I don't understand what the problem with this enhancement is.

bb8bd63d-cf82-41f3-a63e-9703d695cb16 commented 9 years ago

Some refactoring that I'm working on for http.client could use this (currently I have it as part of my patch set). I haven't run into any issues using it and it's definitely useful. Would be nice to get this merged.

rhettinger commented 9 years ago

FYI, the PEP for this isn't going to be accepted (I'm working on the write-up for the reasons why and will post on python-dev). That said, it would be great if the code continues to be improved and then posted on the Python Package Index.

vadmium commented 9 years ago

I will be interested to see those reasons. Another way to do a similar thing might be using a Key(value, transform) class, somewhat along the lines of bpo-20632, but as a separate class rather than part of the core type system. But I have not thought that idea through very much.
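One possible reading of the `Key(value, transform)` idea, sketched under assumptions: the class below is hypothetical, and the comment above does not specify its behavior. Here the wrapper hashes and compares by the transformed value while keeping the original value available, so a plain `dict` can be used unchanged.

```python
class Key:
    """Hypothetical wrapper: hashes and compares by transform(value)."""

    __slots__ = ("value", "_transformed")

    def __init__(self, value, transform):
        self.value = value                    # original spelling, preserved
        self._transformed = transform(value)  # used for hashing and equality

    def __hash__(self):
        return hash(self._transformed)

    def __eq__(self, other):
        if isinstance(other, Key):
            return self._transformed == other._transformed
        return NotImplemented


d = {Key("Content-Type", str.casefold): "text/plain"}

probe = Key("content-type", str.casefold)
value = d[probe]                                   # lookup ignores case
stored = next(k for k in d if k == probe).value    # original spelling
```

The cost is that every lookup has to wrap its key, and recovering the stored spelling needs a scan in this naive form.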

bb8bd63d-cf82-41f3-a63e-9703d695cb16 commented 9 years ago

> I will be interested to see those reasons.

+1. Something like what this PEP proposed would be beneficial in a few places throughout the library (header and cookie implementations would definitely benefit rather than having to deal with buggy normalization themselves). It’s unfortunate that this isn’t going to be approved.

jaraco commented 9 years ago

I'm also eager to hear what limitations prevented the acceptance. Please do link back here when you've posted.

I have to say, I'm not entirely surprised. In my implementation, I struggled with some cases, and it certainly doesn't feel like a fully safe implementation.

That said, since I mentioned the implementation in jaraco.util earlier, I wanted to announce that those implementations (FoldedCase and FoldedCaseKeyedDict) have been moved to two libraries (jaraco.text and jaraco.collections).

ethanfurman commented 9 years ago

From https://mail.python.org/pipermail/python-dev/2015-May/140003.html

======================================================================

Before the Python 3.5 feature freeze, I should step-up and formally reject PEP-455 for "Adding a key-transforming dictionary to collections".

I had completed an involved review effort a long time ago and I apologize for the delay in making the pronouncement.

What made it an interesting choice from the outset is that the idea of a "transformation" is an enticing concept that seems full of possibility. I spent a good deal of time exploring what could be done with it but found that it mostly fell short of its promise.

There were many issues. Here are some that were at the top:

Most use cases don't need or want the reverse lookup feature (what is wanted is a set of one-way canonicalization functions). Those that do would want to have a choice of what is saved (first stored, last stored, n most recent, a set of all inputs, a list of all inputs, nothing, etc). In database terms, it models a many-to-one table (the canonicalization or transformation function) with the one being a primary key into another possibly surjective table of two columns (the key/value store). A surjection into another surjection isn't inherently reversible in a useful way, nor does it seem to be a common way to model data.
People are creative at coming up with use cases for the TD but then find that the resulting code is less clear, slower, less intuitive, more memory intensive, and harder to debug than just using a plain dict with a function call before the lookup: d[func(key)]. It was challenging to find any existing code that would be made better by the availability of the TD.
The TD seems to be all about combining data scrubbing (case-folding, unicode canonicalization, type-folding, object identity, unit-conversion, or finding a canonical member of an equivalence class) with a mapping (looking-up a value for a given key). Those two operations are conceptually orthogonal. The former doesn't get easier when hidden behind a mapping API and the latter loses the flexibility of choosing your preferred mapping (an ordereddict, a persistentdict, a chainmap, etc) and the flexibility of establishing your own rules for whether and how to do a reverse lookup.
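The "plain dict with a function call before the lookup" alternative can be sketched as below. The `canon` helper and the sample data are invented for illustration. Storing the first-seen spelling next to the value is one of several possible reverse-lookup policies, chosen here only as an example.

```python
def canon(name):
    """One-way canonicalization: trim whitespace and case-fold."""
    return name.strip().casefold()


inventory = {}   # canon(name) -> (first spelling seen, running total)

for name, count in [("Apple", 3), ("apple ", 2), ("PEAR", 1)]:
    key = canon(name)
    original, total = inventory.get(key, (name, 0))
    inventory[key] = (original, total + count)

# Lookups apply the same function explicitly: d[func(key)]
apple_entry = inventory[canon("APPLE")]
```

Because the canonicalization is an ordinary function call, the mapping could just as well be an `OrderedDict` or a `ChainMap`, and the policy for what to remember about the original key stays in the caller's hands.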

Raymond Hettinger

P.S. Besides the core conceptual issues listed above, there are a number of smaller issues with the TD that surfaced during design review sessions. In no particular order, here are a few of the observations:

It seems to require above average skill to figure out what can be used as a transform function. It is more expert-friendly than beginner-friendly. It takes a little while to get used to it. It wasn't self-evident that transformations happen both when a key is stored and again when it is looked up (contrast this with key-functions for sorting which are called at most once per key).
The name, TransformDict, suggests that it might transform the value instead of the key or that it might transform the dictionary into something else. The name TransformDict is so general that it would be hard to discover when faced with a specific problem. The name also limits perception of what could be done with it (i.e. a function that logs accesses but doesn't actually change the key).
The tool doesn't describe itself well. Looking at the help(), or the __repr__(), or the tooltips did not provide much insight or clarity. The dir() shows many of the _abc implementation details rather than the API itself.
The original key is stored and if you change it, the change isn't stored. The _original dict is private (perhaps to reduce the risk of putting the TD in an inconsistent state) but this limits access to the stored data.
The TD is unsuitable for bijections because the API is inherently biased with a rich group of operators and methods for forward lookup but has only one method for reverse lookup.
The reverse feature is hard to find (getitem vs __getitem__) and its output pair is surprising and a bit awkward to use. It provides only one accessor method rather than the full dict API that would be given by a second dictionary. The API hides the fact that there are two underlying dictionaries.
It was surprising that when d[k] failed, it failed with a transformation exception rather than a KeyError, violating the expectations of the calling code (for example, if the transformation function is int(), the call d["12"] transforms to d[12] and either succeeds in returning a value or in raising a KeyError, but the call d["12.0"] fails with a ValueError). The latter issue limits its substitutability into existing code that expects real mappings and for exposing to end-users as if it were a normal dictionary.
There were other issues with dict invariants as well and these affected substitutability in a sometimes subtle way. For example, the TD does not work with __missing__(). Also, "k in td" does not imply that "k in list(td.keys())".
The API is at odds with wanting to access the transformations. You pay a transformation cost both when storing and when looking up, but you can't access the transformed value itself. For example, if the transformation is a function that scrubs hand entered mailing addresses and puts them into a standard format with standard abbreviations, you have no way of getting back to the cleaned-up address.
One design reviewer summarized her thoughts like this: "There is a learning curve to be climbed to figure out what it does, how to use it, and what the applications [are]. But, the [working out the same] examples with plain dicts requires only basic knowledge." -- Patricia
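The substitutability point about `int()` as a transform can be reproduced with a small sketch. `IntKeyed` is a hypothetical stand-in for a key-transforming dict, not the TD itself. In current CPython, `int("12.0")` raises `ValueError`, so a failed lookup can surface as something other than `KeyError`.

```python
class IntKeyed(dict):
    """Hypothetical stand-in: every key is passed through int()."""

    def __setitem__(self, key, value):
        super().__setitem__(int(key), value)

    def __getitem__(self, key):
        return super().__getitem__(int(key))


d = IntKeyed()
d["12"] = "twelve"


def lookup_outcome(key):
    """Classify what happens when d[key] is evaluated."""
    try:
        d[key]
    except KeyError:
        return "KeyError"      # what mapping-aware callers expect
    except ValueError:
        return "ValueError"    # raised by the transform, not by the mapping
    return "found"


outcomes = [lookup_outcome(k) for k in ("12", "13", "12.0")]
```

Code written against ordinary mappings typically catches only `KeyError`, so the third case would propagate as an unexpected exception.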

python / cpython

Add a case-insensitive case-preserving dict #63186