mne-tools / mne-python

MNE: Magnetoencephalography (MEG) and Electroencephalography (EEG) in Python
https://mne.tools

Why are some mutable MNE objects hashable? #7368

Closed cbrnr closed 4 years ago

cbrnr commented 4 years ago

Several (mutable) MNE objects (e.g. Raw) are hashable, which violates one of the key assumptions that Python makes about hashable objects, namely that an object's hash value never changes during its lifetime. This can lead to bugs wherever hashable objects are expected not to change (so essentially everywhere, e.g. when using them as keys in dicts). @larsoner mentioned that this was done to support caching in joblib, but it would probably be a good idea to discuss other (more Pythonic) solutions. Thoughts?
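To make this concrete, here is a minimal sketch (plain Python, no MNE; the Signal class is just a hypothetical stand-in) of how a mutable object with a content-based hash gets "lost" in a dict after mutation:

class Signal:
    """Hypothetical mutable object whose hash depends on its content."""
    def __init__(self, data):
        self.data = data

    def __eq__(self, other):
        return isinstance(other, Signal) and self.data == other.data

    def __hash__(self):
        return hash(tuple(self.data))  # changes whenever self.data changes

s = Signal([1, 2, 3])
d = {s: "metadata"}
print(s in d)     # True

s.data.append(4)  # mutate: the hash changes, but the dict bucket does not
print(s in d)     # False: the object can no longer be found...
print(list(d))    # ...even though it is still visibly stored in the dict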

hoechenberger commented 4 years ago

Without looking at the code, the docs you linked to say:

Objects which are instances of user-defined classes are hashable by default. They all compare unequal (except with themselves), and their hash value is derived from their id().

Maybe that's the reason…
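For illustration, this is what that default behavior looks like (a quick sketch with a throwaway class):

class Thing:
    pass

t = Thing()
before = hash(t)
t.value = 42              # mutate the instance
print(hash(t) == before)  # True: the default id-derived hash never changes

u = Thing()
print(t == u)             # False: default equality is identity
print(t == t)             # True: an object only compares equal to itself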

larsoner commented 4 years ago

Given that usually (or at least, often) user-defined classes are made to be mutable, it seems like a contradiction for Python to make user-defined classes hashable by default, but to say that hashable objects should not change.

So there is at least some inconsistency there. In any case, our __hash__ functions are designed so that if you do change something about the object (e.g., info or data), the __hash__ will also change (where it wouldn't with the default id-based implementation).
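Based on the behavior shown later in this thread, a quick sketch of what this means for an actual Raw object:

import numpy as np
import mne

info = mne.create_info(2, 100., ch_types="eeg")
raw = mne.io.RawArray(np.zeros((2, 100)), info)

before = hash(raw)
raw._data[0, 0] = 1.        # modify the underlying data in place
print(hash(raw) == before)  # False: the hash follows the content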

cbrnr commented 4 years ago

I don't think this is an inconsistency. You have to view both __hash__ and __eq__ methods together. Also, Python doesn't enforce anything, so of course you are free to create an unhashable immutable object or vice versa. But bad things can happen if you violate some basic assumptions. This SO answer explains this very nicely. So I still think we should try to find out why MNE objects have a hash method (what problem does this solve), and can we find a more Pythonic alternative? I don't have time now to play around with the objects, but it would be interesting to find out what happens when testing some Raw objects for equality (I'd assume they never compare equal because their equality is defined by their identity, but I'd have to look it up).

agramfort commented 4 years ago

Can you share a problematic piece of code, to see how easy it is to get trapped?

Yes, it was done for joblib caching, which is a real use case. I would not want to sacrifice this.

larsoner commented 4 years ago

So I still think we should try to find out why MNE objects have a hash method (what problem does this solve),

Joblib caching IIRC

https://github.com/mne-tools/mne-python/issues/1271 https://github.com/mne-tools/mne-python/pull/1338

larsoner commented 4 years ago

(I'd assume they never compare equal because their equality is defined by their identity, but I'd have to look it up).

If you do raw2 = raw.copy() you should find raw == raw2:

>>> import mne
>>> raw = mne.io.read_raw_fif(mne.datasets.testing.data_path() + '/MEG/sample/sample_audvis_trunc_raw.fif', preload=True)
>>> raw2 = raw.copy()
>>> raw == raw2
True
>>> raw2._data[0, 0] = 1
>>> raw == raw2
False
cbrnr commented 4 years ago

It gets problematic as soon as you put mutable hashable objects in a dict or a set (which are implemented with hash tables):

import numpy as np
import mne

info = mne.create_info(5, 256, ch_types="eeg") 
raw1 = mne.io.RawArray(np.ones((5, 1000)), info)
raw2 = mne.io.RawArray(np.ones((5, 1000)), info)

print(id(raw1), "!=", id(raw2))
print(hash(raw1), "==", hash(raw2))

dataset = {raw1, raw2}
print(len(dataset))  # makes sense, raw1 == raw2, so they have the same hash
print(raw1 in dataset)  # OK
print(raw2 in dataset)  # OK

raw1.crop(0, 1)
print(dataset)  # alright, it seems to still be in the set
print(raw1 in dataset)  # False, even though the set shows the cropped raw1
print(raw2 in dataset)  # and this is now also False

What about creating new data types such as FrozenRaw and FrozenEpochs, which are hashable but immutable? These could be used in dicts, and also for joblib caching if needed.

Can you post an example which shows how joblib requires hashable objects? What about the Memory class (https://joblib.readthedocs.io/en/latest/memory.html)? The docs explicitly state that it works with non-hashable objects, and especially with NumPy arrays.

larsoner commented 4 years ago

It gets problematic as soon as you put mutable hashable objects in a dict or a set (which are implemented with hash tables):

Actually here I think that the behavior is at least somewhat reasonable -- once you do raw1.crop above you should see raw1 not in dataset given how we use hashing (conceptually and practically) in MNE to say "is this the same raw data (and info) as before", as cropping makes it so that it is not the same anymore.

Come to think of it, it's possible that people have been relying on this "changed data leads to changed hash" behavior, so changing this behavior now would be backward incompatible (could break people's code). We don't want to make such changes without a good reason, so I'm now -1 on changing the behavior of our __hash__ without sufficiently strong motivation.

I'm all for making things conform to the Python spec or use standard practices in principle, but to me this has to be balanced with the difficulty of doing so, the additional maintenance burden, loss or changes in functionality from conforming (backward compat), etc. For example, I am glad we don't follow the Python idea that methods that modify in place should not return anything (people use our chaining all the time, even though it is not Pythonic). I know it's not satisfactory from a theoretical standpoint, but I'm not sure the costs I foresee coming from going down this path are worth the practical gains.

Instead, I'd be for making doc changes like:

Note that we provide hashing for our mutable objects, despite the fact that the Python docs advocate for hashing only being implemented for immutable objects (despite Python itself providing hash values for all objects by default, regardless of their mutability). For safety, the default hash provided by the object will change when the object itself (data or info) is modified. If you want the object to have a static hash (e.g., for dict use) regardless of whether or not the content of the object changes, you can use id(raw) as a key, but it's likely that there is a simpler and more robust way to code what you need.

Regarding the idea:

What about creating new data types such as FrozenRaw and FrozenEpochs, which are hashable but immutable?

I suspect this will be quite difficult to get right given how many mutable things are part of our classes. Rather than add code to MNE to support this I'd just tell people to use id(raw).
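For example, a minimal sketch of the id(raw) workaround (the stored value is just a placeholder):

import numpy as np
import mne

info = mne.create_info(1, 100., ch_types="eeg")
raw = mne.io.RawArray(np.zeros((1, 100)), info)

# id() is stable across in-place modifications (but only meaningful while
# the object is kept alive)
results = {id(raw): "placeholder result"}
raw.crop(0, 0.5)           # in-place change; hash(raw) changes, id(raw) does not
print(id(raw) in results)  # True: the id-based lookup still works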

Can you post an example which shows how joblib requires hashable objects?

IIRC it was required for any call using the @memory.cache decorator using MNE classes to have it actually recognize that the return value for raw should be different from that of raw.crop(0, 1) -- the builtin Python hash function (which uses id) would not see these as different even though the function you call definitely would inside.

cbrnr commented 4 years ago

Actually here I think that the behavior is at least somewhat reasonable -- once you do raw1.crop above you should see raw1 not in dataset given how we use hashing (conceptually and practically) in MNE to say "is this the same raw data (and info) as before", as cropping makes it so that it is not the same anymore.

Sure, but I would be very confused if I first see that the cropped object is in the set (note that it is now 1 second long):

>>> print(dataset)
{<RawArray  |  None, n_channels x n_times : 5 x 257 (1.0 sec), ~27 kB, data loaded>}

But if I use in to check if it's there, I get False:

>>> print(raw1 in dataset)
False

Come to think of it, it's possible that people have been relying on this "changed data leads to changed hash" behavior, so changing this behavior now would be backward incompatible (could break people's code). We don't want to make such changes without a good reason, so I'm now -1 on changing the behavior of our __hash__ without sufficiently strong motivation.

A good reason would be to not introduce surprising behavior that's totally unexpected in Python.

I'm all for making things conform to the Python spec or use standard practices in principle, but to me this has to be balanced with the difficulty of doing so, the additional maintenance burden, loss or changes in functionality from conforming (backward compat), etc. For example, I am glad we don't follow the Python idea that methods that modify in place should not return anything (people use our chaining all the time, even though it is not Pythonic). I know it's not satisfactory from a theoretical standpoint, but I'm not sure the costs I foresee coming from going down this path are worth the practical gains.

I don't know which "Python idea that methods that modify in place should not return anything" you mean. Some in-place methods, such as list methods, return None, but I have never read or seen any recommendation that this should generally be the case. In fact, operator.iadd, for example, modifies its argument in place and returns the modified value (https://docs.python.org/3/library/operator.html#in-place-operators).
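A quick demonstration of the difference:

import operator

x = [1, 2]
y = operator.iadd(x, [3])  # equivalent to x += [3]
print(y is x, x)           # True [1, 2, 3]: modified in place and returned

print([3, 1, 2].sort())    # None: list.sort() mutates in place, returns nothing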

I'm fine with a doc change (at the very least) with one modification:

Note that we provide hashing for our mutable objects, despite the fact that the Python docs advocate for hashing only being implemented for immutable objects. For safety, the default hash provided by the object will change when the object itself (data or info) is modified. If you want the object to have a static hash (e.g., for dict use) regardless of whether or not the content of the object changes, you can use id(raw) as a key, but it's likely that there is a simpler and more robust way to code what you need.

There's a good reason why custom classes are hashable by default: because their default __eq__ is based on their id (so objects will always compare unequal except to themselves). If you implement a more specific equality check (which you will do as soon as you want to compare objects by value), __hash__ will automatically be set to None, making instances unhashable.
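This is easy to verify with a minimal sketch:

class Point:
    def __init__(self, x):
        self.x = x

    def __eq__(self, other):  # defining __eq__ without __hash__ ...
        return isinstance(other, Point) and self.x == other.x

print(Point.__hash__)  # None: ... sets __hash__ to None automatically
try:
    hash(Point(1))
except TypeError as err:
    print(err)         # unhashable type: 'Point'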

IIRC it was required for any call using the @memory.cache decorator using MNE classes to have it actually recognize that the return value for raw should be different from that of raw.crop(0, 1) -- the builtin Python hash function (which uses id) would not see these as different even though the function you call definitely would inside.

I'm still not sure if this could have been solved differently. Could someone post a MWE of joblib in use with MNE objects?

larsoner commented 4 years ago

I'll close this, assuming we can't come up with a different way to do this. But in case you want to investigate, @cbrnr, here is a snippet (constructed quickly just from looking at the joblib docs); you can imagine the time.sleep standing in for any time-consuming process:

import time
import numpy as np
import joblib
import mne

data = np.random.RandomState(0).randn(1, 1000)
info = mne.create_info(1, 1000., 'eeg')
raw = mne.io.RawArray(data, info)
mem = joblib.Memory('.')  # cache results on disk in the current directory

@mem.cache
def slow(raw):
    time.sleep(1.)  # stand-in for any expensive computation

for _ in range(2):
    t0 = time.time()
    slow(raw)  # the second call should hit the cache instead of sleeping
    print(time.time() - t0)

Produces:

$ python -u cache.py 
________________________________________________________________________________
[Memory] Calling __main__--home-larsoner-Desktop-cache.slow...
slow(<RawArray | 1 x 1000 (1.0 s), ~15 kB, data loaded>)
_____________________________________________________________slow - 1.0s, 0.0min
1.008265733718872
0.001148223876953125
$ python -u cache.py 
0.003644227981567383
0.0012600421905517578