seandenigris / Pharo-Enhancements

Document the meaning of `#=` #22

Open seandenigris opened 2 years ago

seandenigris commented 2 years ago

Asked on Pharo Dev:

Anyone know of a book that goes in depth explaining #=?

I've always wondered what "equal" is supposed to mean.

For example, if I have two objects representing the same person, but one has their webpage cached as state, they represent the same entity, and are interchangeable in many contexts, but also have a difference that may be important in other contexts.

Is “equal” whatever you say it is in your domain, or is it a Smalltalk object system concept? If the latter, what is the exact expectation?

From @jgfoster:

I don’t think there is a formal definition. I’ve heard it described as “answers the same result when sent the same message” but that wouldn’t work for `#identityHash`, so I think it is whatever you say it is in your domain.

By the way, I try to avoid the term “equal” and use “equivalent” and “identical” to describe `#=` and `#==` respectively.

From @nicolas-cellier-aka-nice:

The general expectations are that = should be an equivalence relation

- reflexive: `(a = a) = true`
- symmetric: `(a = b) = (b = a)`
- transitive: `(a = b) & (b = c) ==> (a = c)`

https://en.wikipedia.org/wiki/Equivalence_relation

Note that `=` is currently not an equivalence relation for Float, because `(Float nan = Float nan) = false`. Also note that `=` on Numbers means that the values are equal, not necessarily that the behavior is identical. For example, `(1/2) = 0.5`, but `(1/2) printString ~= 0.5 printString`, and 0.5 does not understand `#denominator`. We expect the behavior to match to some degree, but it's very application specific. In some dialects `(1/10) = 0.1` even though those numbers don't share the same value (and this obviously breaks the transitivity rule).
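
The three expectations can be spot-checked in a Playground. This is only a sketch with arbitrarily chosen sample values (`a`, `b`, `c` are not from the thread), and passing for a few values proves nothing in general. Evaluate each expression separately with "Print it":

```smalltalk
| a b c |
a := 1/2.
b := 0.5.
c := 2/4.

"Each expression should answer true if = behaves as an equivalence relation on these values."
a = a.                                        "reflexive"
(a = b) = (b = a).                            "symmetric"
((a = b) and: [ b = c ]) not or: [ a = c ].   "transitive"

"The Float counterexample from the comment above: reflexivity fails for NaN."
Float nan = Float nan.                        "false"
```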

Another questionable example is `(1 to: 3) = #(1 2 3)`. The answer might differ across Smalltalk dialects. One point of view is that they behave similarly, as two sequenceable collections with equal elements. Another point of view is that the implementation cost is too high, for example for maintaining the hash implication `(a = b) ==> (a hash = b hash)`. Even `'foobar' = #foobar` is dialect specific (a false answer enables hash optimization by using identityHash for Symbol).

A clear sign that there is not a unique definition... If it is about cached state, I see it as an implementation detail (an optimization), and best efforts should be made so that the optimization remains a detail, and does not leak into behavioral differences... Well, except in well known and well documented cases if we can't avoid it.
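
One way to read this for the person example from the original question is sketched below. `Person`, `id`, and `cachedWebpage` are hypothetical names, and choosing `id` as the only thing that participates in `=` is an assumption about the domain, not something the thread prescribes. The class definition is an expression; the methods are shown in the usual `Class >> selector` notation rather than as one runnable script:

```smalltalk
Object subclass: #Person
    instanceVariableNames: 'id cachedWebpage'
    classVariableNames: ''
    package: 'Example'.

Person >> id
    ^ id

Person >> = anObject
    "Equivalent when both are Persons with the same id.
     cachedWebpage is deliberately ignored: it is an optimization detail."
    self == anObject ifTrue: [ ^ true ].
    self class = anObject class ifFalse: [ ^ false ].
    ^ id = anObject id

Person >> hash
    "Must agree with =: objects that are = answer the same hash."
    ^ id hash
```

Because `hash` is derived only from what `=` compares, the implication `(a = b) ==> (a hash = b hash)` holds whether or not the webpage has been cached.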

seandenigris commented 1 year ago

A. Valloud wrote a whole book about this (and more)! His hashing book, to my layperson's ears, summarizes:

`#=` must mean "what hashed collections expect it to be". 

This implies that there will be as few collisions as possible. The fact that it took a whole book hints that the situation is much more complicated than #bitXor:-ing together instVars, as is commonly recommended.
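
A small illustration of one such pitfall, using a hypothetical `FullName` class with `firstName` and `lastName` instance variables (not taken from the book or the thread):

```smalltalk
FullName >> hash
    "The commonly recommended recipe: xor the hashes of the ivars that participate in =."
    ^ firstName hash bitXor: lastName hash
```

Since xor is commutative, swapping first and last name yields the same hash, and whenever both names are equal the result is 0. Neither case violates the contract with `=`, but both produce avoidable collisions.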

After seeing all the pitfalls, I would add an additional rule: avoid redefining `#=` unless you a) really know what you're doing and b) need the optimization that hashed collections offer. For example, I used to use Set a lot, but now I wrap a non-hashed collection and override `add:` to prevent duplicate elements.
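
A minimal sketch of what such a wrapper might look like. The actual implementation isn't shown in this thread, so `UniqueList` and `items` are hypothetical names, and wrapping an OrderedCollection is an assumption. Methods are in `Class >> selector` notation:

```smalltalk
Object subclass: #UniqueList
    instanceVariableNames: 'items'
    classVariableNames: ''
    package: 'Example'.

UniqueList >> initialize
    super initialize.
    items := OrderedCollection new

UniqueList >> add: anObject
    "Add anObject unless an element that is = to it is already present.
     This is a linear scan, so no hash is ever consulted."
    (items includes: anObject) ifFalse: [ items add: anObject ].
    ^ anObject

UniqueList >> do: aBlock
    items do: aBlock
```

The trade-off is that `add:` costs time proportional to the number of elements, in exchange for not depending on a well-behaved `hash`.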

seandenigris commented 1 year ago

Incorporate #21 (closing because it's basically a duplicate)