python / cpython

The Python programming language
https://www.python.org

Undocumented differences between defaultdict and dict behavior #124875

Open david-andrew opened 1 month ago

david-andrew commented 1 month ago

Documentation

defaultdict seems to call __getitem__ whenever __setitem__ is called (regardless of whether the item was already present), whereas a regular dict does not call __getitem__ when __setitem__ is called. The documentation for defaultdict says that defaultdict and dict are basically identical, except in a few narrow cases:

defaultdict is a subclass of the built-in dict class. It overrides one method and adds one writable instance variable. The remaining functionality is the same as for the dict class and is not documented here.

But the docs say nothing about this difference in how __getitem__ and __setitem__ are called.

This comes up when you subclass either of them to add a preprocessing step that transforms keys before they are used to index into the dictionary, e.g.:

from collections import defaultdict

class Item: ...

def preprocess_item(item: Item) -> str:
    return f'::{item.__class__.__name__}@{hex(id(item))}'

class PreprocessingDefaultDict(defaultdict):
    def __getitem__(self, item: Item):
        key = preprocess_item(item)
        return super().__getitem__(key)

    # def __setitem__(self, item: Item, value):
    #     key = preprocess_item(item)
    #     super().__setitem__(key, value)

class PreprocessingDict(dict):
    def __getitem__(self, item: Item):
        key = preprocess_item(item)
        return super().__getitem__(key)

    def __setitem__(self, key: Item, value):
        key = preprocess_item(key)
        super().__setitem__(key, value)

if __name__ == '__main__':
    item = Item()

    d1 = PreprocessingDefaultDict(dict)
    d1[item]['a'] = 10  # initial creation of dict at d1[item]
    d1[item]['a'] = 20  # updating already existing dict at d1[item]
    print(dict(d1))  # wrap in dict so it prints the same as d2

    d2 = PreprocessingDict()
    d2[item] = {}  # initial creation of dict at d2[item]
    d2[item]['a'] = 10  # first write into the dict at d2[item]
    d2[item]['a'] = 20  # updating already existing dict at d2[item]
    print(d2)

Which prints out something like:

{'::Item@0x7fd29b035280': {'a': 20}}
{'::Item@0x7fd29b035280': {'a': 20}}

In this example, I have a preprocessor function I'd like to run on all keys to convert them from objects into strings that can be used in the dictionary. It is not clear from the docs that you must not override __setitem__ (the override I have commented out), because defaultdict will always call __getitem__ and thus always run the preprocessor. If you do override __setitem__ that way, the item is preprocessed twice, and you end up with results like this:

{'::str@0x7fc55b78a930': {'a': 10}, '::str@0x7fc55b78a970': {'a': 20}}
{'::Item@0x7fc55b754890': {'a': 20}}

or this:

{'::str@0x7f3715686930': {'a': 20}}
{'::Item@0x7f3715650860': {'a': 20}}

(I believe the extra element appears because preprocess_item builds a new string on every call, and that string may or may not be allocated at the same address for an identical input.)

I'm not exactly sure what the underlying cause of this difference is. It doesn't seem to be related to the __missing__ method mentioned in the docs, because the behavior I mentioned happens for keys that are not present in the defaultdict as well as for those that are already present (and presumably wouldn't be calling __missing__).
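One way to probe which special methods actually run is to log them from a subclass. The sketch below is only a diagnostic aid, not part of the original example: `TracingDefaultDict` and the `calls` list are hypothetical names introduced here, and the key is a plain string so that no preprocessing is involved. It performs the same two statements as the example, once for a missing key and once for an existing one.

```python
from collections import defaultdict

calls = []  # hypothetical log of (method name, key) pairs


class TracingDefaultDict(defaultdict):
    """defaultdict subclass that records which special methods are invoked."""

    def __getitem__(self, key):
        calls.append(("__getitem__", key))
        return super().__getitem__(key)

    def __missing__(self, key):
        calls.append(("__missing__", key))
        return super().__missing__(key)

    def __setitem__(self, key, value):
        calls.append(("__setitem__", key))
        super().__setitem__(key, value)


d = TracingDefaultDict(dict)
d["k"]["a"] = 10  # key is missing: the default value gets created and stored
d["k"]["a"] = 20  # key is present: only a lookup on the outer dict

for name, key in calls:
    print(name, key)
# __getitem__ k
# __missing__ k
# __setitem__ k
# __getitem__ k
```

In this trace the outer `__setitem__` shows up only once, during the first lookup, right after `__missing__`. The assignments to `['a']` go to the inner plain dict and are not logged at all.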

python version

I ran my example on Python 3.6 through 3.12 and observed the same behavior in all of them.

MechanicPig commented 1 month ago

When you override __setitem__, after d1[item]['a'] = 10 there is no key preprocess_item(item) in your d1, but the key preprocess_item(preprocess_item(item)) exists. Therefore, __missing__ is still called during d1[item]['a'] = 20.
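Following that explanation, one possible way to keep both overrides is to also override __missing__ so that the default value is stored under the already-preprocessed key, bypassing the overridden __setitem__. This is a minimal sketch under that assumption, not an officially recommended pattern. The custom `__missing__` body is introduced here for illustration, while `Item` and `preprocess_item` follow the example in the report.

```python
from collections import defaultdict


class Item:
    """Placeholder key type, as in the example from the report."""


def preprocess_item(item: object) -> str:
    return f'::{item.__class__.__name__}@{hex(id(item))}'


class PreprocessingDefaultDict(defaultdict):
    def __getitem__(self, item):
        return super().__getitem__(preprocess_item(item))

    def __setitem__(self, item, value):
        super().__setitem__(preprocess_item(item), value)

    def __missing__(self, key):
        # `key` arrives here already preprocessed by __getitem__, so store
        # the default with dict.__setitem__ to avoid preprocessing it again.
        if self.default_factory is None:
            raise KeyError(key)
        value = self.default_factory()
        dict.__setitem__(self, key, value)
        return value


item = Item()
d1 = PreprocessingDefaultDict(dict)
d1[item]['a'] = 10    # creates the inner dict under the preprocessed key
d1[item]['a'] = 20    # reuses the same inner dict
d1[item] = {'a': 30}  # explicit assignment lands on the same key
print(dict(d1))
```

Other dict methods such as get, setdefault, update and the in operator do not go through these overrides, so a complete key-normalizing mapping would need to handle them as well.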