Fix hashing of permutation group elements

videlec commented 3 years ago

As noticed on this sage-devel thread the hash function of PermutationGroupElement does not behave coherently with respect to the natural inclusions SymmetricGroup(n) c SymmetricGroup(n+1).

We design a hash function:

whose value on the identity is 1
which does only depend on the support (ie ignores fixed points so that the hash is consistent with permutation restriction)
which is independent of the domain ordering

Component: group theory

Author: Vincent Delecroix

Branch/Commit: u/vdelecroix/31236 @ e2e4d1d

Reviewer: Dima Pasechnik, David Roe

Issue created by migration from https://trac.sagemath.org/ticket/31236

videlec commented 3 years ago

Description changed:

--- 
+++ 
@@ -1,4 +1,4 @@
-As noticed on [this sage-devel thread](https://groups.google.com/g/sage-devel/c/0Ca0m94lfOE) the hash function of `PermutationGroupElement` do not behave coherently with respect to the natural inclusions `SymmetricGroup(n) c SymmetricGroup(n+1)`.
+As noticed on [this sage-devel thread](https://groups.google.com/g/sage-devel/c/0Ca0m94lfOE) the hash function of `PermutationGroupElement` does not behave coherently with respect to the natural inclusions `SymmetricGroup(n) c SymmetricGroup(n+1)`.

 We design a hash function:
 - whose value on the identity is 1

dimpase commented 3 years ago

comment:2

GAP has a function LargestMovedPoint() for permutations, so instead of using self.n, where n is the degree of the permutation group, one can use this value to iterate over while computing the hash.

While this does not ignore all the fixed points, this is still OK in terms of the consistency.

videlec commented 3 years ago

Branch: u/vdelecroix/31236

videlec commented 3 years ago

Commit: 462fdb1

videlec commented 3 years ago

New commits:

462fdb1 31236: fix hash function of permutations

dimpase commented 3 years ago

comment:5

What is 145623773L doing there?

Are you replacing something which looks to me as a perfect hash function (i.e. no collisions) with something that might have collisions?

videlec commented 3 years ago

comment:6

If you find a collision, I pay you a beer.

dimpase commented 3 years ago

comment:7

I asked a theory question.

videlec commented 3 years ago

comment:8

145623773L is an offset to ensure bit shuffling. See https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function (though I might not have used the best constants).

dimpase commented 3 years ago

comment:9

thanks. please add this info in the docs.

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from 462fdb1 to ba9d62d

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

ba9d62d 31326: reference for algorithm

videlec commented 3 years ago

comment:11

done

dimpase commented 3 years ago

comment:12

OK, looks good, thanks.

dimpase commented 3 years ago

Reviewer: Dima Pasechnik

videlec commented 3 years ago

comment:13

sage: S1 = SymmetricGroup(FiniteEnumeratedSet([1,2,3]))
sage: S2 = SymmetricGroup(FiniteEnumeratedSet([2,1,3]))
sage: S1("(1,3,2)") == S2("(1,3,2)")
True
sage: hash(S1("(1,3,2)")) == hash(S2("(1,3,2)"))
False

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from ba9d62d to 3e29948

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

3e29948 31326: change algorithm for consistency

videlec commented 3 years ago

comment:15

Consistent hashing with more tests.

videlec commented 3 years ago

Description changed:

--- 
+++ 
@@ -2,4 +2,5 @@

 We design a hash function:
 - whose value on the identity is 1
-- which does only depend on the support (ie ignores fixed points)
+- which does only depend on the support (ie ignores fixed points so that the hash is consistent with permutation restriction)
+- which is independent of the domain ordering

dimpase commented 3 years ago

comment:17

Can't one do hashing based on the position in the lexicographically ordered set of all permutations?

https://en.wikipedia.org/wiki/Factorial_number_system

videlec commented 3 years ago

comment:18

See comment:13 for why this will not work.

dimpase commented 3 years ago

comment:19

Replying to @videlec:

See comment:13 for why this will not work.

This seems to be an entirely different issue - one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.

Having said this, it's in general a quite bad idea to ignore the nature of the group the element comes from, as it affects the performance of algorithms quite badly. Every permutation group G has a base, a minimal size set B s.t. g_1(B)\=g_2(B) as soon as g_1\=g_2, and for transitive permutation groups this is in general the quickest test, and |B| is usually very small compared to the degree (it's even possible to quantify what "usually" means here - e.g. for primitive groups this is governed by O'Nan-Scott theorem).

videlec commented 3 years ago

comment:20

Replying to @dimpase:

Replying to @videlec:

See comment:13 for why this will not work.

This seems to be an entirely different issue - one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.

How do you canonise [pi, 'a', Partition([3,2,1])] to [1, 2, 3]?

Having said this, it's in general a quite bad idea to ignore the nature of the group the element comes from, as it affects the performance of algorithms quite badly. Every permutation group G has a base, a minimal size set B s.t. g_1(B)\=g_2(B) as soon as g_1\=g_2, and for transitive permutation groups this is in general the quickest test, and |B| is usually very small compared to the degree (it's even possible to quantify what "usually" means here - e.g. for primitive groups this is governed by O'Nan-Scott theorem).

Why is this relevent for hashing?

dimpase commented 3 years ago

comment:21

Replying to @videlec:

Replying to @dimpase:

Replying to @videlec:

See comment:13 for why this will not work.

This seems to be an entirely different issue - one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.

How do you canonise [pi, 'a', Partition([3,2,1])] to [1, 2, 3]?

By their positions in the list, i.e. as [1,2,3], why not?

Having said this, it's in general a quite bad idea to ignore the nature of the group the element comes from, as it affects the performance of algorithms quite badly. Every permutation group G has a base, a minimal size set B s.t. g_1(B)\=g_2(B) as soon as g_1\=g_2, and for transitive permutation groups this is in general the quickest test, and |B| is usually very small compared to the degree (it's even possible to quantify what "usually" means here - e.g. for primitive groups this is governed by O'Nan-Scott theorem).

Why is this relevent for hashing?

I am inclined to say that one should not hash permutation group elements ignoring their parent group. A workaround, to hash Permutation(g) rather than g itself, should work, though - although it does not, due to the usual sloppiness of Permutation.

videlec commented 3 years ago

comment:22

Replying to @dimpase:

Replying to @videlec:

Replying to @dimpase:

Replying to @videlec:

See comment:13 for why this will not work.

This seems to be an entirely different issue - one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.

How do you canonise [pi, 'a', Partition([3,2,1])] to [1, 2, 3]?

By their positions in the list, i.e. as [1,2,3], why not?

Because it won't work. If you do that with [pi, 'a', Partition([3,2,1])] and then [Partition([3,2,1]), 'a', pi] for the same permutation, let say the transposition ('a', pi), you will obtain two different permutations on [1,2,3] with different hashes.

You could also have read comment:13: the order of the elements in the domain should not matter.

videlec commented 3 years ago

comment:23

If you think you have a better way of implementing a hash function, please provide an alternative branch. As far as I tested my branch does solve the issue.

dimpase commented 3 years ago

comment:24

There is no issue IMHO. It's an over-zealous wish to mimic Python numeric tower to the point where it does not make sense any more.

dimpase commented 3 years ago

comment:25

Replying to @videlec:

Replying to @dimpase:

Replying to @videlec:

Replying to @dimpase:

Replying to @videlec:

See comment:13 for why this will not work.

This seems to be an entirely different issue - one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.

How do you canonise [pi, 'a', Partition([3,2,1])] to [1, 2, 3]?

By their positions in the list, i.e. as [1,2,3], why not?

Because it won't work. If you do that with [pi, 'a', Partition([3,2,1])] and then [Partition([3,2,1]), 'a', pi] for the same permutation, let say the transposition ('a', pi), you will obtain two different permutations on [1,2,3] with different hashes.

You could also have read comment:13: the order of the elements in the domain should not matter.

And this only reinforces my suspision that hashing permutations of different domains should return different answers. As you just demonstrated, there is no canonical way to identify the transposition ('a',pi) with (1,2), why would the hashes be the same?

videlec commented 3 years ago

comment:26

Replying to @dimpase:

Replying to @videlec:

Replying to @dimpase:

Replying to @videlec:

Replying to @dimpase:

Replying to @videlec:

See comment:13 for why this will not work.

This seems to be an entirely different issue - one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.

How do you canonise [pi, 'a', Partition([3,2,1])] to [1, 2, 3]?

By their positions in the list, i.e. as [1,2,3], why not?

Because it won't work. If you do that with [pi, 'a', Partition([3,2,1])] and then [Partition([3,2,1]), 'a', pi] for the same permutation, let say the transposition ('a', pi), you will obtain two different permutations on [1,2,3] with different hashes.

You could also have read comment:13: the order of the elements in the domain should not matter.

And this only reinforces my suspision that hashing permutations of different domains should return different answers.

There is a canonical map from the domain {1,2} to the domain {1,2,3}. Hence the transposition (1,2) on both sets should just be equal. It is trivial to implement a compatible hash function. This is what this ticket is about.

As you just demonstrated, there is no canonical way to identify the transposition ('a',pi) with (1,2), why would the hashes be the same?

The hashes are different.

roed314 commented 3 years ago

comment:27

It seems like you have incompatible goals for a hash function.

It should be as fast as possible to compute for elements of a fixed permutation group. Dima's suggestion about the base is an interesting idea for a way to compute hashes by only considering a few elements. But it's clearly not going to be consistent across subgroup relations (including S_n c S_{n+1} considered here), and it requires a precomputation for each permutation group (the base).
It should depend only on the permutation, not the group. Yes, there are other parts of Sage where hashing and equality testing are incompatible

sage: mod(1,3) == -2                                                                                                                                                                                                                         
True
sage: hash(mod(1,3)) == hash(-2)                                                                                                                                                                                                             
False

but this compatibility is still a desirable feature of a hash function.

Vincent's solution seems like a pretty good way to achieve the second goal. Dima, if you'd like to argue that we should drop compatibility of hash functions across different permutation groups in the interest of speed, I think that discussion should be taken back to the sage-devel thread.

I wonder if it's reasonable to have a "hash mode" option for parents where a user can indicate that they want hashes to be as fast as possible, dropping compatibility across parents. Maybe this could disable implicit coercions and require explicit conversions?

videlec commented 3 years ago

comment:28

Replying to @roed314:

It seems like you have incompatible goals for a hash function.

Thanks for your input. The branch I provided is a bit slower than the previous code in case the domain is the standard one, ie {1, 2, ..., n}

sage: S = SymmetricGroup(10)
sage: %timeit -r5 for s in S: h = hash(s)
1.02 s ± 24.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # old version
1.32 s ± 11.3 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version

The time difference is plausibly due to the extra call P._has_natural_domain() which is here repeated 10 factorial times. I do not have any idea to improve the timing without Cythonizing the parent.

As expected the code is significatively slower in case the domain is non standard

sage: S = SymmetricGroup(['a','b','c','d','e','f','g','h','i'])
sage: %timeit -r5 for s in S: h = hash(s)
109 ms ± 511 µs per loop (mean ± std. dev. of 5 runs, 10 loops each) # old version
3.56 s ± 4.19 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version

If the latter is a problem, it is possible to store the hashes of elements in the domain in a C array inside the parent. Should I do that?

dimpase commented 3 years ago

comment:29

PermutationGroupElement hashes have to be compatible w.r.t. to subgroups of this group (and, by the way, using images of the base - which gets computed as soon as almost any kind of nontrivial computation has to be done - to compute the hash is obviously good here), but I fail to understand why this would be desirable for groups with different domains, and to me domains (1,2,3) and (2,1,3) are different, as different as (1,2,3) and (7,8,9).

roed314 commented 3 years ago

comment:30

Replying to @videlec:

Thanks for your input. The branch I provided is a bit slower than the previous code in case the domain is the standard one, ie {1, 2, ..., n}
sage: S = SymmetricGroup(10)
sage: %timeit -r5 for s in S: h = hash(s)
1.02 s ± 24.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # old version
1.32 s ± 11.3 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version
The time difference is plausibly due to the extra call P._has_natural_domain() which is here repeated 10 factorial times. I do not have any idea to improve the timing without Cythonizing the parent.

Since _has_natural_domain is cached, it would surprise me a bit if this was the bottleneck. Maybe try running your code again with that just set to True? I would suspect that the inside of the for loop is just more complicated, yielding a 30% slowdown. Obviously we'd prefer not to have a slowdown, but I think 30% is acceptable in order to fix the bug.

As expected the code is significatively slower in case the domain is non standard
sage: S = SymmetricGroup(['a','b','c','d','e','f','g','h','i'])
sage: %timeit -r5 for s in S: h = hash(s)
109 ms ± 511 µs per loop (mean ± std. dev. of 5 runs, 10 loops each) # old version
3.56 s ± 4.19 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version
If the latter is a problem, it is possible to store the hashes of elements in the domain in a C array inside the parent. Should I do that?

Since the parent isn't Cythonized, where would that array go? I think speeding up this case would be worthwhile, and recording the domain's hashes seems like a good way to do it.

The speed comparison I was thinking about, though, wasn't either of these. Instead, I think Dima was suggesting that we can make a much faster hash function than the current one in the case where n is large. I think speeding things up for large n is worthwhile, but there are plenty of cases where Sage is pretty bad. I think the hash code is fine in comparison:

sage: S = SymmetricGroup(2000)
sage: %timeit a = S.random_element()
639 ms ± 15.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

sage: S = SymmetricGroup(10^8)
sage: %time hash(a)                                                                                                                                                                                                                          
CPU times: user 120 ms, sys: 350 µs, total: 120 ms
Wall time: 120 ms

roed314 commented 3 years ago

comment:31

Replying to @dimpase:

PermutationGroupElement hashes have to be compatible w.r.t. to subgroups of this group (and, by the way, using images of the base - which gets computed as soon as almost any kind of nontrivial computation has to be done - to compute the hash is obviously good here), but I fail to understand why this would be desirable for groups with different domains, and to me domains (1,2,3) and (2,1,3) are different, as different as (1,2,3) and (7,8,9).

I agree that they are different, but Sage currently allows equality testing between them:

sage: S = SymmetricGroup([1,2,3])                                                                                                                                                                                                            
sage: T = SymmetricGroup([2,3,1])                                                                                                                                                                                                            
sage: S('(2,1)')                                                                                                                                                                                                                             
(1,2)
sage: T('(2,1)')                                                                                                                                                                                                                             
(2,1)
sage: T('(2,1)') == S('(2,1)')                                                                                                                                                                                                               
True

I wouldn't object to prohibiting a coercion map between these parents (as you say, the domains are different since permutations are all about the ordering of domain, not just the labels), but I'd say that should also be discussed more broadly.

videlec commented 3 years ago

comment:32

Replying to @roed314:

Replying to @videlec:
Thanks for your input. The branch I provided is a bit slower than the previous code in case the domain is the standard one, ie {1, 2, ..., n}
sage: S = SymmetricGroup(10)
sage: %timeit -r5 for s in S: h = hash(s)
1.02 s ± 24.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # old version
1.32 s ± 11.3 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version
The time difference is plausibly due to the extra call P._has_natural_domain() which is here repeated 10 factorial times. I do not have any idea to improve the timing without Cythonizing the parent.
Since _has_natural_domain is cached, it would surprise me a bit if this was the bottleneck. Maybe try running your code again with that just set to True?

sage: %timeit -r5 for s in S: h = hash(s) # new code without _has_natural_domain call
1.09 s ± 18.3 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

roed314 commented 3 years ago

comment:33

I guess I was wrong. I also don't see an easy way to make this faster.

videlec commented 3 years ago

comment:34

Replying to @roed314:

Replying to @videlec:
As expected the code is significatively slower in case the domain is non standard
sage: S = SymmetricGroup(['a','b','c','d','e','f','g','h','i'])
sage: %timeit -r5 for s in S: h = hash(s)
109 ms ± 511 µs per loop (mean ± std. dev. of 5 runs, 10 loops each) # old version
3.56 s ± 4.19 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version
If the latter is a problem, it is possible to store the hashes of elements in the domain in a C array inside the parent. Should I do that?
Since the parent isn't Cythonized, where would that array go? I think speeding up this case would be worthwhile, and recording the domain's hashes seems like a good way to do it.

There is the Python array object that can store whatever C data you want. We will still suffer from the attribute access to the Python parent. But it will not be a factor x30 as it is currently

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

e2e4d1d 31236: use a hash array for non standard domains

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from 3e29948 to e2e4d1d

videlec commented 3 years ago

comment:36

Better now

sage: S = SymmetricGroup(['a','b','c','d','e','f','g','h','i'])
sage: %timeit -r5 for s in S: h = hash(s)
159 ms ± 1.17 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

Computing a single hash of an element triggers the creation of an array in the parent

sage: S._domain_hash_array
array('l', [-5034579376873000338, 5819885834672024805, -1342306816048582247, 4154873981948319473, -9080196639829811986, 3487762170233258304, -4059262801740968994, -1410151844174446032, -7454342532774124322])

This solution is not ideal as the hash array could be made available on the domain itself and not on the symmetric group. It could end up with many duplications of this array (ie when constructing subgroups).

videlec commented 3 years ago

comment:37

See #31269

videlec commented 3 years ago

comment:38

ping

roed314 commented 3 years ago

comment:39

I'm happy giving this ticket positive review. Since Dima originally reviewed it, I'll give him a couple days to respond with an objection and suggestions for how he'd like to proceed.

videlec commented 3 years ago

Changed reviewer from Dima Pasechnik to Dima Pasechnik, David Roe

dimpase commented 3 years ago

comment:41

My objection is only on the level of documentation. Where the hell these huge constants come from, and why? It's totally unclear.

videlec commented 3 years ago

comment:42

As their names indicate mult1, mult2 are multipliers (for bit shuffling) and mask1, mask2 are masks so that the hasing of the pair (i, j) is non commutative (ie different from the hash of (j, i)). These numbers are not big, they are 64 bit integers.

videlec commented 3 years ago

comment:43

ping

roed314 commented 3 years ago

comment:44

Sorry, I've been busy. This looks good to me.

vbraun commented 3 years ago

comment:45

I got this on Ubuntu 18.04 32 bit

**********************************************************************
File "src/sage/groups/perm_gps/permgroup_element.pyx", line 1485, in sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__
Failed example:
    for n in range(1, 10):
       G = SymmetricGroup(n)
       assert hash(G.one()) == 1
       assert len(set(map(hash, G))) == factorial(n)
Exception raised:
    Traceback (most recent call last):
      File "/var/lib/buildbot/slave/sage_git/build/local/lib/python3.9/site-packages/sage/doctest/forker.py", line 714, in _run
        self.compile_and_execute(example, compiler, test.globs)
      File "/var/lib/buildbot/slave/sage_git/build/local/lib/python3.9/site-packages/sage/doctest/forker.py", line 1133, in compile_and_execute
        exec(compiled, globs)
      File "<doctest sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__[3]>", line 4, in <module>
        assert len(set(map(hash, G))) == factorial(n)
    AssertionError
**********************************************************************
File "src/sage/groups/perm_gps/permgroup_element.pyx", line 1493, in sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__
Failed example:
    assert len(set(map(hash, A))) == A.cardinality()
Exception raised:
    Traceback (most recent call last):
      File "/var/lib/buildbot/slave/sage_git/build/local/lib/python3.9/site-packages/sage/doctest/forker.py", line 714, in _run
        self.compile_and_execute(example, compiler, test.globs)
      File "/var/lib/buildbot/slave/sage_git/build/local/lib/python3.9/site-packages/sage/doctest/forker.py", line 1133, in compile_and_execute
        exec(compiled, globs)
      File "<doctest sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__[7]>", line 1, in <module>
        assert len(set(map(hash, A))) == A.cardinality()
    AssertionError
**********************************************************************
1 item had failures:
   2 of  35 in sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__
    [426 tests, 2 failures, 3.64 s]

as a general bit of advice, consider using assert condition, message so you learn whats wrong from tracebacks...

sagemath / sage

Fix hashing of permutation group elements #31236