Fix #161: Implement == and hash for IndexLens and ComposedLens

phipsgabler commented 3 years ago

Now I'm not an expert in these matters, but is seems the bottleneck is not hashing the tuples, but the types or symbols (since hash(::Symbol) falls back to objectid, too). In the commit without adding type symbols to hash (https://github.com/jw3126/Setfield.jl/commit/2ca540462d30de73d3474d80c84fa72aec6115d1):

julia> @benchmark hash($((1, )))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  0.024 ns … 0.074 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     0.028 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   0.029 ns ± 0.003 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                          
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁ ▁
  0.074 ns       Histogram: frequency by time      0.026 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark hash($(@lens(_[1])))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  0.024 ns … 12.903 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     0.028 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   0.030 ns ±  0.129 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                           
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁ ▁
  0.063 ns       Histogram: frequency by time       0.027 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

With the symbols (https://github.com/jw3126/Setfield.jl/commit/880daf445f735b2c94fc1b037688f3daf92dfe30):

julia> @benchmark hash($((1, )))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  0.024 ns … 0.062 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     0.028 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   0.029 ns ± 0.003 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                          
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁ ▁
  0.062 ns       Histogram: frequency by time      0.027 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark hash($(@lens(_[1])))
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):   9.690 ns … 44.821 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.423 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.712 ns ±  1.721 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                            
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁  ▁
  15.1 ns         Histogram: frequency by time        10.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Without the symbol, an index lens hashes to the same as the underlying tuple, though:

julia> hash(@lens(_[1]))
0xf45b969ce741991e

julia> hash((1,))
0xf45b969ce741991e

That could be solved by using a magic constant instead, but is that a good idea?

CC @torfjelde

phipsgabler commented 3 years ago

Do you have any specific drawbacks in mind with the magic constant approach? If not I would prefer to have one. (Also for ComposedLens). Possible definition for the constant:
const SALT_IndexLens = objectid(IndexLens)

None other than me having no idea whether this is accepted good practice. The only objection I could remotely think of is: is it guaranteed that objectid(IndexLens) (or even hash(:IndexLens)) is constant in a distributed setting? Is that ever (considered) a problem?

If it's OK with you, I'll go ahead with it.

jw3126 commented 3 years ago

None other than me having no idea whether this is accepted good practice. The only objection I could remotely think of is: is it guaranteed that objectid(IndexLens) (or even hash(:IndexLens)) is constant in a distributed setting? Is that ever (considered) a problem?

I don't know the answer to these either. To be safer, let's just hardcode the following values:

julia> using Setfield

julia> objectid(Setfield.IndexLens)
0x8aa710fd8cb95e7e

julia> objectid(Setfield.ComposedLens)
0x1c9765b995720e93

jw3126 commented 3 years ago

I also asked about this on discourse.

JeffreySarnoff commented 3 years ago

To be safer, let's just hardcode the following values

If by hardcode you mean const value = objectid(object), objectid(object) is highly likely to change with different uses of the package and also with different runs of some use of the package. In a distributed setting, too. It would be trouble to have hash values for the same item in distinct contexts within a single run differ if they were to be checked for equality:

isequal is the comparison function used by hash tables (Dict). isequal(x,y) must imply that hash(x) == hash(y). ...
Furthermore, isequal is linked with isless, and they work together to define a fixed total ordering, where exactly one of isequal(x, y), isless(x, y), or isless(y, x) must be true (and the other two false).

jw3126 commented 3 years ago

Thanks a lot @JeffreySarnoff for the clarification. By hardcoding I meant

const value = 0x8aa710fd8cb95e7e # objectid(Setfield.IndexLens)

objectid(object) is highly likely to change with different uses of the package and also with different runs of some use of the package.

But the default hash implementation uses objectid right? So the default Base.hash would suffer from the same instabilities?

JeffreySarnoff commented 3 years ago

not for all things -- Strings, Tuples of Ints and other bitstypes for example keep their objectids

jw3126 commented 3 years ago

struct S{T}
    value::T
end

function hash_cheat(obj, h::UInt)
    hash(obj.value, h)
end

const SALT = 0x8aa710fd8cb95e7e

function hash1(obj, h::UInt)
    hash(obj.value, hash(SALT, h))
end

function hash2(obj, h::UInt)
    hash(obj.value, h+SALT)
end

obj = S((1,))
seed = UInt(1)
using BenchmarkTools

@btime hash_cheat($(Ref(obj))[], $(Ref(seed))[])
@btime hash1($(Ref(obj))[], $(Ref(seed))[])
@btime hash2($(Ref(obj))[], $(Ref(seed))[])

#  1.833 ns (0 allocations: 0 bytes)
#  1.813 ns (0 allocations: 0 bytes)
#  1.833 ns (0 allocations: 0 bytes)

I think we should incorporate some type specific salt into our hash and based on the above benchmark, it seems there is no performance difference between using + or hash for the mixing. So I suggest we use the following implementation:

function Base.hash(o::IndexLens, h::UInt)
    salt = 0x8aa710fd8cb95e7e #objectid(Setfield.IndexLens)
    hash(o.indices, hash(salt, h))
end

JeffreySarnoff commented 3 years ago

it seems there is no performance difference between using + or hash for the mixing.

There is under heavy use (Base knows best); simple benchmarking will not reveal that.

You should define your salt constant outside of and just above the hash function (better practice) and you must provide both a 64bit and a 32bit constant as hash uses UInt which may be either one.

jw3126 commented 3 years ago

it seems there is no performance difference between using + or hash for the mixing.

There is under heavy use (Base knows best); simple benchmarking will not reveal that.

Base operates under many constraints like generality and backward compatibility, so "Base knows best" is not obvious to me in this case. But anyway lets do it the Base way:

function make_salt(s64::UInt64)::UInt
    if UInt === UInt64
        return s64
    else
        return UInt32(s64 >> 32) ^ UInt32(s64 & 0x00000000ffffffff)
    end
end

const SALT_IndexLens = make_salt(0x8aa710fd8cb95e7e) # objectid(Setfield.IndexLens)
Base.hash(o::IndexLens, h::UInt) = hash(o.indices, SALT_IndexLens + h)

JeffreySarnoff commented 3 years ago

Cool. Point taken. Base knows best about hashing Julia constructs (I should have been more explicit). A great deal of effort has gone into the hashing approach and its most efficient implementation.

jw3126 commented 3 years ago

Thanks again @JeffreySarnoff for all your patience and explanations, I appreciate them.

phipsgabler commented 3 years ago

Great! I'll update to the latest version in the next couple of days, plus a few more tests.

jw3126 commented 3 years ago

@phipsgabler thanks again! This is now available in Setfield 0.8.

tkf commented 3 years ago

@jw3126 Sorry if you've already thought through this, but, I think this is actually technically non-breaking. The only documented public interface we need to satisfy is isequal(x,y) implies hash(x)==hash(y). So, unless we've already documented that hash(lens) satisfies some additional property, we can freely re-define hash even for patch releases (that'd be annoying, but technically OK) provided it satisfies the above invariance. It's even fine to increment the salt for each release or just use hash(::Lens) = UInt(0) (of course, I'm not suggesting this :slightly_smiling_face:).

(I just thought to ramble this, since I noticed JuliaObjects/Accessors.jl#31)

jw3126 commented 3 years ago

Thanks @tkf So for this PR we already tagged a breaking release. But in the case of Accessors your advise is to only bump the patch version?

tkf commented 3 years ago

Yeah, I think Setfield.jl is good as-is. But I don't think Accessors.jl needs a breaking release.

jw3126 / Setfield.jl

Fix #161: Implement == and hash for IndexLens and ComposedLens #162