sagemath / sage

Main repository of SageMath
https://www.sagemath.org
Other
1.44k stars 480 forks source link

Fix hashing of permutation group elements #31236

Open videlec opened 3 years ago

videlec commented 3 years ago

As noticed on this sage-devel thread the hash function of PermutationGroupElement does not behave coherently with respect to the natural inclusions SymmetricGroup(n) c SymmetricGroup(n+1).

We design a hash function:

Component: group theory

Author: Vincent Delecroix

Branch/Commit: u/vdelecroix/31236 @ e2e4d1d

Reviewer: Dima Pasechnik, ​David Roe

Issue created by migration from https://trac.sagemath.org/ticket/31236

videlec commented 3 years ago

Description changed:

--- 
+++ 
@@ -1,4 +1,4 @@
-As noticed on [this sage-devel thread](https://groups.google.com/g/sage-devel/c/0Ca0m94lfOE) the hash function of `PermutationGroupElement` do not behave coherently with respect to the natural inclusions `SymmetricGroup(n) c SymmetricGroup(n+1)`.
+As noticed on [this sage-devel thread](https://groups.google.com/g/sage-devel/c/0Ca0m94lfOE) the hash function of `PermutationGroupElement` does not behave coherently with respect to the natural inclusions `SymmetricGroup(n) c SymmetricGroup(n+1)`.

 We design a hash function:
 - whose value on the identity is 1
dimpase commented 3 years ago
comment:2

GAP has a function LargestMovedPoint() for permutations, so instead of using self.n, where n is the degree of the permutation group, one can use this value to iterate over while computing the hash.

While this does not ignore all the fixed points, this is still OK in terms of the consistency.

videlec commented 3 years ago

Branch: u/vdelecroix/31236

videlec commented 3 years ago

Commit: 462fdb1

videlec commented 3 years ago

New commits:

462fdb131236: fix hash function of permutations
dimpase commented 3 years ago
comment:5

What is 145623773L doing there?

Are you replacing something which looks to me as a perfect hash function (i.e. no collisions) with something that might have collisions?

videlec commented 3 years ago
comment:6

If you find a collision, I pay you a beer.

dimpase commented 3 years ago
comment:7

I asked a theory question.

videlec commented 3 years ago
comment:8

145623773L is an offset to ensure bit shuffling. See https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function (though I might not have used the best constants).

dimpase commented 3 years ago
comment:9

thanks. please add this info in the docs.

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from 462fdb1 to ba9d62d

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

ba9d62d31326: reference for algorithm
videlec commented 3 years ago
comment:11

done

dimpase commented 3 years ago
comment:12

OK, looks good, thanks.

dimpase commented 3 years ago

Reviewer: Dima Pasechnik

videlec commented 3 years ago
comment:13
sage: S1 = SymmetricGroup(FiniteEnumeratedSet([1,2,3]))
sage: S2 = SymmetricGroup(FiniteEnumeratedSet([2,1,3]))
sage: S1("(1,3,2)") == S2("(1,3,2)")
True
sage: hash(S1("(1,3,2)")) == hash(S2("(1,3,2)"))
False
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from ba9d62d to 3e29948

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

3e2994831326: change algorithm for consistency
videlec commented 3 years ago
comment:15

Consistent hashing with more tests.

videlec commented 3 years ago

Description changed:

--- 
+++ 
@@ -2,4 +2,5 @@

 We design a hash function:
 - whose value on the identity is 1
-- which does only depend on the support (ie ignores fixed points)
+- which does only depend on the support (ie ignores fixed points so that the hash is consistent with permutation restriction)
+- which is independent of the domain ordering
dimpase commented 3 years ago
comment:17

Can't one do hashing based on the position in the lexicographically ordered set of all permutations?

https://en.wikipedia.org/wiki/Factorial_number_system

videlec commented 3 years ago
comment:18

See comment:13 for why this will not work.

dimpase commented 3 years ago
comment:19

Replying to @videlec:

See comment:13 for why this will not work.

This seems to be an entirely different issue - one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.


Having said this, it's in general a quite bad idea to ignore the nature of the group the element comes from, as it affects the performance of algorithms quite badly. Every permutation group G has a base, a minimal size set B s.t. g_1(B)\=g_2(B) as soon as g_1\=g_2, and for transitive permutation groups this is in general the quickest test, and |B| is usually very small compared to the degree (it's even possible to quantify what "usually" means here - e.g. for primitive groups this is governed by O'Nan-Scott theorem).

videlec commented 3 years ago
comment:20

Replying to @dimpase:

Replying to @videlec:

See comment:13 for why this will not work.

This seems to be an entirely different issue - one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.

How do you canonise [pi, 'a', Partition([3,2,1])] to [1, 2, 3]?


Having said this, it's in general a quite bad idea to ignore the nature of the group the element comes from, as it affects the performance of algorithms quite badly. Every permutation group G has a base, a minimal size set B s.t. g_1(B)\=g_2(B) as soon as g_1\=g_2, and for transitive permutation groups this is in general the quickest test, and |B| is usually very small compared to the degree (it's even possible to quantify what "usually" means here - e.g. for primitive groups this is governed by O'Nan-Scott theorem).

Why is this relevent for hashing?

dimpase commented 3 years ago
comment:21

Replying to @videlec:

Replying to @dimpase:

Replying to @videlec:

See comment:13 for why this will not work.

This seems to be an entirely different issue - one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.

How do you canonise [pi, 'a', Partition([3,2,1])] to [1, 2, 3]?

By their positions in the list, i.e. as [1,2,3], why not?


Having said this, it's in general a quite bad idea to ignore the nature of the group the element comes from, as it affects the performance of algorithms quite badly. Every permutation group G has a base, a minimal size set B s.t. g_1(B)\=g_2(B) as soon as g_1\=g_2, and for transitive permutation groups this is in general the quickest test, and |B| is usually very small compared to the degree (it's even possible to quantify what "usually" means here - e.g. for primitive groups this is governed by O'Nan-Scott theorem).

Why is this relevent for hashing?

I am inclined to say that one should not hash permutation group elements ignoring their parent group. A workaround, to hash Permutation(g) rather than g itself, should work, though - although it does not, due to the usual sloppiness of Permutation.

videlec commented 3 years ago
comment:22

Replying to @dimpase:

Replying to @videlec:

Replying to @dimpase:

Replying to @videlec:

See comment:13 for why this will not work.

This seems to be an entirely different issue - one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.

How do you canonise [pi, 'a', Partition([3,2,1])] to [1, 2, 3]?

By their positions in the list, i.e. as [1,2,3], why not?

Because it won't work. If you do that with [pi, 'a', Partition([3,2,1])] and then [Partition([3,2,1]), 'a', pi] for the same permutation, let say the transposition ('a', pi), you will obtain two different permutations on [1,2,3] with different hashes.

You could also have read comment:13: the order of the elements in the domain should not matter.

videlec commented 3 years ago
comment:23

If you think you have a better way of implementing a hash function, please provide an alternative branch. As far as I tested my branch does solve the issue.

dimpase commented 3 years ago
comment:24

There is no issue IMHO. It's an over-zealous wish to mimic Python numeric tower to the point where it does not make sense any more.

dimpase commented 3 years ago
comment:25

Replying to @videlec:

Replying to @dimpase:

Replying to @videlec:

Replying to @dimpase:

Replying to @videlec:

See comment:13 for why this will not work.

This seems to be an entirely different issue - one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.

How do you canonise [pi, 'a', Partition([3,2,1])] to [1, 2, 3]?

By their positions in the list, i.e. as [1,2,3], why not?

Because it won't work. If you do that with [pi, 'a', Partition([3,2,1])] and then [Partition([3,2,1]), 'a', pi] for the same permutation, let say the transposition ('a', pi), you will obtain two different permutations on [1,2,3] with different hashes.

You could also have read comment:13: the order of the elements in the domain should not matter.

And this only reinforces my suspision that hashing permutations of different domains should return different answers. As you just demonstrated, there is no canonical way to identify the transposition ('a',pi) with (1,2), why would the hashes be the same?

videlec commented 3 years ago
comment:26

Replying to @dimpase:

Replying to @videlec:

Replying to @dimpase:

Replying to @videlec:

Replying to @dimpase:

Replying to @videlec:

See comment:13 for why this will not work.

This seems to be an entirely different issue - one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.

How do you canonise [pi, 'a', Partition([3,2,1])] to [1, 2, 3]?

By their positions in the list, i.e. as [1,2,3], why not?

Because it won't work. If you do that with [pi, 'a', Partition([3,2,1])] and then [Partition([3,2,1]), 'a', pi] for the same permutation, let say the transposition ('a', pi), you will obtain two different permutations on [1,2,3] with different hashes.

You could also have read comment:13: the order of the elements in the domain should not matter.

And this only reinforces my suspision that hashing permutations of different domains should return different answers.

There is a canonical map from the domain {1,2} to the domain {1,2,3}. Hence the transposition (1,2) on both sets should just be equal. It is trivial to implement a compatible hash function. This is what this ticket is about.

As you just demonstrated, there is no canonical way to identify the transposition ('a',pi) with (1,2), why would the hashes be the same?

The hashes are different.

roed314 commented 3 years ago
comment:27

It seems like you have incompatible goals for a hash function.

sage: mod(1,3) == -2                                                                                                                                                                                                                         
True
sage: hash(mod(1,3)) == hash(-2)                                                                                                                                                                                                             
False

but this compatibility is still a desirable feature of a hash function.

Vincent's solution seems like a pretty good way to achieve the second goal. Dima, if you'd like to argue that we should drop compatibility of hash functions across different permutation groups in the interest of speed, I think that discussion should be taken back to the sage-devel thread.

I wonder if it's reasonable to have a "hash mode" option for parents where a user can indicate that they want hashes to be as fast as possible, dropping compatibility across parents. Maybe this could disable implicit coercions and require explicit conversions?

videlec commented 3 years ago
comment:28

Replying to @roed314:

It seems like you have incompatible goals for a hash function.

Thanks for your input. The branch I provided is a bit slower than the previous code in case the domain is the standard one, ie {1, 2, ..., n}

sage: S = SymmetricGroup(10)
sage: %timeit -r5 for s in S: h = hash(s)
1.02 s ± 24.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # old version
1.32 s ± 11.3 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version

The time difference is plausibly due to the extra call P._has_natural_domain() which is here repeated 10 factorial times. I do not have any idea to improve the timing without Cythonizing the parent.

As expected the code is significatively slower in case the domain is non standard

sage: S = SymmetricGroup(['a','b','c','d','e','f','g','h','i'])
sage: %timeit -r5 for s in S: h = hash(s)
109 ms ± 511 µs per loop (mean ± std. dev. of 5 runs, 10 loops each) # old version
3.56 s ± 4.19 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version

If the latter is a problem, it is possible to store the hashes of elements in the domain in a C array inside the parent. Should I do that?

dimpase commented 3 years ago
comment:29

PermutationGroupElement hashes have to be compatible w.r.t. to subgroups of this group (and, by the way, using images of the base - which gets computed as soon as almost any kind of nontrivial computation has to be done - to compute the hash is obviously good here), but I fail to understand why this would be desirable for groups with different domains, and to me domains (1,2,3) and (2,1,3) are different, as different as (1,2,3) and (7,8,9).

roed314 commented 3 years ago
comment:30

Replying to @videlec:

Thanks for your input. The branch I provided is a bit slower than the previous code in case the domain is the standard one, ie {1, 2, ..., n}

sage: S = SymmetricGroup(10)
sage: %timeit -r5 for s in S: h = hash(s)
1.02 s ± 24.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # old version
1.32 s ± 11.3 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version

The time difference is plausibly due to the extra call P._has_natural_domain() which is here repeated 10 factorial times. I do not have any idea to improve the timing without Cythonizing the parent.

Since _has_natural_domain is cached, it would surprise me a bit if this was the bottleneck. Maybe try running your code again with that just set to True? I would suspect that the inside of the for loop is just more complicated, yielding a 30% slowdown. Obviously we'd prefer not to have a slowdown, but I think 30% is acceptable in order to fix the bug.

As expected the code is significatively slower in case the domain is non standard

sage: S = SymmetricGroup(['a','b','c','d','e','f','g','h','i'])
sage: %timeit -r5 for s in S: h = hash(s)
109 ms ± 511 µs per loop (mean ± std. dev. of 5 runs, 10 loops each) # old version
3.56 s ± 4.19 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version

If the latter is a problem, it is possible to store the hashes of elements in the domain in a C array inside the parent. Should I do that?

Since the parent isn't Cythonized, where would that array go? I think speeding up this case would be worthwhile, and recording the domain's hashes seems like a good way to do it.

The speed comparison I was thinking about, though, wasn't either of these. Instead, I think Dima was suggesting that we can make a much faster hash function than the current one in the case where n is large. I think speeding things up for large n is worthwhile, but there are plenty of cases where Sage is pretty bad. I think the hash code is fine in comparison:

sage: S = SymmetricGroup(2000)
sage: %timeit a = S.random_element()
639 ms ± 15.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

sage: S = SymmetricGroup(10^8)
sage: %time hash(a)                                                                                                                                                                                                                          
CPU times: user 120 ms, sys: 350 µs, total: 120 ms
Wall time: 120 ms
roed314 commented 3 years ago
comment:31

Replying to @dimpase:

PermutationGroupElement hashes have to be compatible w.r.t. to subgroups of this group (and, by the way, using images of the base - which gets computed as soon as almost any kind of nontrivial computation has to be done - to compute the hash is obviously good here), but I fail to understand why this would be desirable for groups with different domains, and to me domains (1,2,3) and (2,1,3) are different, as different as (1,2,3) and (7,8,9).

I agree that they are different, but Sage currently allows equality testing between them:

sage: S = SymmetricGroup([1,2,3])                                                                                                                                                                                                            
sage: T = SymmetricGroup([2,3,1])                                                                                                                                                                                                            
sage: S('(2,1)')                                                                                                                                                                                                                             
(1,2)
sage: T('(2,1)')                                                                                                                                                                                                                             
(2,1)
sage: T('(2,1)') == S('(2,1)')                                                                                                                                                                                                               
True

I wouldn't object to prohibiting a coercion map between these parents (as you say, the domains are different since permutations are all about the ordering of domain, not just the labels), but I'd say that should also be discussed more broadly.

videlec commented 3 years ago
comment:32

Replying to @roed314:

Replying to @videlec:

Thanks for your input. The branch I provided is a bit slower than the previous code in case the domain is the standard one, ie {1, 2, ..., n}

sage: S = SymmetricGroup(10)
sage: %timeit -r5 for s in S: h = hash(s)
1.02 s ± 24.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # old version
1.32 s ± 11.3 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version

The time difference is plausibly due to the extra call P._has_natural_domain() which is here repeated 10 factorial times. I do not have any idea to improve the timing without Cythonizing the parent.

Since _has_natural_domain is cached, it would surprise me a bit if this was the bottleneck. Maybe try running your code again with that just set to True?

sage: %timeit -r5 for s in S: h = hash(s) # new code without _has_natural_domain call
1.09 s ± 18.3 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
roed314 commented 3 years ago
comment:33

I guess I was wrong. I also don't see an easy way to make this faster.

videlec commented 3 years ago
comment:34

Replying to @roed314:

Replying to @videlec:

As expected the code is significatively slower in case the domain is non standard

sage: S = SymmetricGroup(['a','b','c','d','e','f','g','h','i'])
sage: %timeit -r5 for s in S: h = hash(s)
109 ms ± 511 µs per loop (mean ± std. dev. of 5 runs, 10 loops each) # old version
3.56 s ± 4.19 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version

If the latter is a problem, it is possible to store the hashes of elements in the domain in a C array inside the parent. Should I do that?

Since the parent isn't Cythonized, where would that array go? I think speeding up this case would be worthwhile, and recording the domain's hashes seems like a good way to do it.

There is the Python array object that can store whatever C data you want. We will still suffer from the attribute access to the Python parent. But it will not be a factor x30 as it is currently

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

e2e4d1d31236: use a hash array for non standard domains
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 3 years ago

Changed commit from 3e29948 to e2e4d1d

videlec commented 3 years ago
comment:36

Better now

sage: S = SymmetricGroup(['a','b','c','d','e','f','g','h','i'])
sage: %timeit -r5 for s in S: h = hash(s)
159 ms ± 1.17 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

Computing a single hash of an element triggers the creation of an array in the parent

sage: S._domain_hash_array
array('l', [-5034579376873000338, 5819885834672024805, -1342306816048582247, 4154873981948319473, -9080196639829811986, 3487762170233258304, -4059262801740968994, -1410151844174446032, -7454342532774124322])

This solution is not ideal as the hash array could be made available on the domain itself and not on the symmetric group. It could end up with many duplications of this array (ie when constructing subgroups).

videlec commented 3 years ago
comment:37

See #31269

videlec commented 3 years ago
comment:38

ping

roed314 commented 3 years ago
comment:39

I'm happy giving this ticket positive review. Since Dima originally reviewed it, I'll give him a couple days to respond with an objection and suggestions for how he'd like to proceed.

videlec commented 3 years ago

Changed reviewer from Dima Pasechnik to Dima Pasechnik, ​David Roe

dimpase commented 3 years ago
comment:41

My objection is only on the level of documentation. Where the hell these huge constants come from, and why? It's totally unclear.

videlec commented 3 years ago
comment:42

As their names indicate mult1, mult2 are multipliers (for bit shuffling) and mask1, mask2 are masks so that the hasing of the pair (i, j) is non commutative (ie different from the hash of (j, i)). These numbers are not big, they are 64 bit integers.

videlec commented 3 years ago
comment:43

ping

roed314 commented 3 years ago
comment:44

Sorry, I've been busy. This looks good to me.

vbraun commented 3 years ago
comment:45

I got this on Ubuntu 18.04 32 bit

**********************************************************************
File "src/sage/groups/perm_gps/permgroup_element.pyx", line 1485, in sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__
Failed example:
    for n in range(1, 10):
       G = SymmetricGroup(n)
       assert hash(G.one()) == 1
       assert len(set(map(hash, G))) == factorial(n)
Exception raised:
    Traceback (most recent call last):
      File "/var/lib/buildbot/slave/sage_git/build/local/lib/python3.9/site-packages/sage/doctest/forker.py", line 714, in _run
        self.compile_and_execute(example, compiler, test.globs)
      File "/var/lib/buildbot/slave/sage_git/build/local/lib/python3.9/site-packages/sage/doctest/forker.py", line 1133, in compile_and_execute
        exec(compiled, globs)
      File "<doctest sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__[3]>", line 4, in <module>
        assert len(set(map(hash, G))) == factorial(n)
    AssertionError
**********************************************************************
File "src/sage/groups/perm_gps/permgroup_element.pyx", line 1493, in sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__
Failed example:
    assert len(set(map(hash, A))) == A.cardinality()
Exception raised:
    Traceback (most recent call last):
      File "/var/lib/buildbot/slave/sage_git/build/local/lib/python3.9/site-packages/sage/doctest/forker.py", line 714, in _run
        self.compile_and_execute(example, compiler, test.globs)
      File "/var/lib/buildbot/slave/sage_git/build/local/lib/python3.9/site-packages/sage/doctest/forker.py", line 1133, in compile_and_execute
        exec(compiled, globs)
      File "<doctest sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__[7]>", line 1, in <module>
        assert len(set(map(hash, A))) == A.cardinality()
    AssertionError
**********************************************************************
1 item had failures:
   2 of  35 in sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__
    [426 tests, 2 failures, 3.64 s]

as a general bit of advice, consider using assert condition, message so you learn whats wrong from tracebacks...