mentalisttraceur / python-macaddress

BSD Zero Clause License
19 stars 5 forks source link

Make HWAddress hashable #2

Closed duynhaaa closed 2 years ago

duynhaaa commented 2 years ago

I only implemented __hash__ method to make HWAddress and its subclasses hashable. This comes from my need of adding MAC to a set.

mentalisttraceur commented 2 years ago

Nice, thanks!

I think I totally agree that HWAddress classes should be hashable.

Before merging this I want to think through the details, a lot of little choices with possibly important ripples in here... I might conclude that we should change exactly what we hash on. I'll share more of my thoughts on it when I have time.

duynhaaa commented 2 years ago

Sure thing, just let me know if there's anything else that needs to be done.

mentalisttraceur commented 2 years ago

Yeah I'll let you know what I want changed as soon as I know. 👍

Right now my gut feel is to either hash on (self.size, self._address) or (type(self), self._address), but I probably won't have time to write out the reasoning for that, or finish thinking it through, until next week.

mentalisttraceur commented 2 years ago

Actually, I am going to merge this now as-is, and we can iteratively improve it later.

That way you don't have to have this open unresolved thing waiting around for me.

Plus for something this small, I would be needlessly spending your time and effort, and multiplying our total time and effort, by telling you what changes I want instead of just making them myself.

I think I was initially keeping this unmerged just in case I decided on something you disagreed with, so that you had a better opportunity to tell me "wait no that's a bad change because [...]" before I overwrote the lines you contributed, but now that I think about it that's not really a good enough reason to not just do it the nimble merge-then-iteratively-improve-as-needed way.

mentalisttraceur commented 2 years ago

Quick question: do you have any thoughts on different classes, or classes with different names, hashing differently vs the same?

I was thinking that the right default is for subclass instances to compare equal to their base class instances by default (they don't right now, despite the __eq__ docstring saying that they do, due to a bug that I haven't gotten around to committing a fix for) and this means that subclasses should probably hash the same way as their parent class.

Couple examples of this:

  1. The MAC and EUI48 classes. I think the most sensible behavior is that they're effectively aliases, so much so that maybe I should have just done MAC = EUI48 instead of class MAC(EUI48): .... However, maybe someone sees a good use-case for them to count as two different members of a set?

  2. MAC and the MACAllowsTrailingDelimiters example - this seems like a totally clear-cut case of a situation where the difference of type or even just type name shouldn't cause objects to compare or hash differently. Of course, it was kinda bad on my part to suggest changing formats by subclassing. If I provide and encourage a better way, then this becomes irrelevant.

But the current idea of class-aware equality comparison creates a weird hashing and set membership situation.

Sets and dictionaries and so on consider things the same if they are equal, but they expect and rely on equal things to hash the same, right?

So for example, if we have two subclasses of MAC, then they both compare equal to MAC instances with the same address (once the bug I described is fixed) but they compare unequal to each other. Even if we get hashing to be the same, this means we can get different sets purely based on the order of adds to those sets! That's messed up.

For another example, if two HWAddress types compare equal, then you have to know how they hash, and if they hash differently then I think it becomes indeterministic if they get counted as one member or two in the set (deterministic only if we know the underlying implementation details and all operations done, like how many buckets are in the underlying hash table at any given moment and how the hash method's hash is mapped to each bucket).

mentalisttraceur commented 2 years ago

My current thoughts are:

  1. If addresses just reject anything that's not exactly the class that they are, it breaks the use of subclasses for various stuff, but is easy to fix by overwriting equality operations as appropriate, which could "cast" self to the class that we want to be equivalent to forward to that equality method. This is consistent with a hash by type and address value. The only reasons right now for classes to do subclassing like this are the MAC/EUI-48 situation (honestly I should probably just alias them), the formatting situation (I should probably expose the relevant logic so people could do it without subclassing - how often is there a good reason for this to actually come up anyway? never?)
  2. If classes compare equal by just size and address value, then this is consistent with hashing by size and value. It might also break situations people have with identically-sized addresses that they want to be distinct (when though? is that a real case? I can imagine cases, but real ones? Like what if you have token ring and Ethernet MAC addresses - you probably want logic to distinguish them at the type level, but for equality or hashing purposes you might not? If you have two different hardware address types that have the same bit pattern, are you more benefitted from a "if equal: bad" case being easy to implement, or from keeping them both from colliding in a set or as dictionary keys?)

Of course I should also keep my eyes on the common cases! Those should guide a lot of these choices. The uncommon cases should just be reasonably doable.

The common case has no subclassing beyond what this library does.

The commonest appropriate use case for subclassing a HWAddress subclass would presumably be to add unrelated enriching behavior, and probably should not effect equality or hashing.

Treating MAC and EUI48 as equivalent in all ways seems like it is the right behavior more often than not. (In particular, a reasonable situation is that a user writes code like if is instance(something, Mac address.MAC), and this should not break if something was initialized by someone who used macaddress.EUI48.) So I think I've decided that MAC and EUI48 should be aliases of each other. This is a minor breaking change that would probably effect no one. The one downside of this is that error messages use the class name, so you could get error messages that refer to the other class name. Maybe this means that MAC just shouldn't exist as a name exported by this library, but that's a much bigger breaking change.

Alright that concludes this chunk of the thought dump.

duynhaaa commented 2 years ago

Actually, I am going to merge this now as-is, and we can iteratively improve it later.

That way you don't have to have this open unresolved thing waiting around for me.

Plus for something this small, I would be needlessly spending your time and effort, and multiplying our total time and effort, by telling you what changes I want instead of just making them myself.

I think I was initially keeping this unmerged just in case I decided on something you disagreed with, so that you had a better opportunity to tell me "wait no that's a bad change because [...]" before I overwrote the lines you contributed, but now that I think about it that's not really a good enough reason to not just do it the nimble merge-then-iteratively-improve-as-needed way.

First and foremost, I would like to thank you for your consideration towards my little contribution. Regarding of whether or not to "merge-then-iteratively-improve-as-needed", I definitely should have opened an issue first so that we can get some general ideas before (either of us) jumping into coding. Well, what's done is done, and I wholeheartedly appreciate your concerns.

Before going any further, I would like to move our discussion to the issue #3 that I just opened. I believe that it is worth doing so because what you want to achieve (mentioned in your thought dump) may span across multiple pull requests (or commits), which is definitely out of scope of this pull request.

Anyway, I would like to conclude this pull request here and continue the discussion in a more appropriate place -> here #3.

mentalisttraceur commented 2 years ago

Jotting down one design thought that I never got around to writing anywhere:

The reason why I initially didn't do this was because users could trivially do this:

my_set.add(bytes(mac_address))
# or
my_set.add((type(mac_address), bytes(mac_address))
# of
my_set.add((mac_address.size, bytes(mac_address))
# or
...

based on their needs.

When you have a concrete use-case for putting a MAC address in a set or as a dictionary key, you know exactly what information you do or don't need to determine uniqueness/sameness of an address.

When you're providing a __hash__ function, there's no single answer to what information a user will or won't need, so you're trying to strike a balance that's good for most cases and isn't overly bad for any cases.