Open videlec opened 3 years ago
Description changed:
---
+++
@@ -1,4 +1,4 @@
-As noticed on [this sage-devel thread](https://groups.google.com/g/sage-devel/c/0Ca0m94lfOE) the hash function of `PermutationGroupElement` do not behave coherently with respect to the natural inclusions `SymmetricGroup(n) c SymmetricGroup(n+1)`.
+As noticed on [this sage-devel thread](https://groups.google.com/g/sage-devel/c/0Ca0m94lfOE) the hash function of `PermutationGroupElement` does not behave coherently with respect to the natural inclusions `SymmetricGroup(n) c SymmetricGroup(n+1)`.
We design a hash function:
- whose value on the identity is 1
GAP has a function LargestMovedPoint()
for permutations, so instead of using self.n
, where n
is the degree of the permutation group, one can use this value to iterate over while computing the hash.
While this does not ignore all the fixed points, this is still OK in terms of the consistency.
Branch: u/vdelecroix/31236
What is 145623773L
doing there?
Are you replacing something which looks to me as a perfect hash function (i.e. no collisions) with something that might have collisions?
If you find a collision, I pay you a beer.
I asked a theory question.
145623773L
is an offset to ensure bit shuffling. See https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function (though I might not have used the best constants).
thanks. please add this info in the docs.
Branch pushed to git repo; I updated commit sha1. New commits:
ba9d62d | 31326: reference for algorithm |
done
OK, looks good, thanks.
Reviewer: Dima Pasechnik
sage: S1 = SymmetricGroup(FiniteEnumeratedSet([1,2,3]))
sage: S2 = SymmetricGroup(FiniteEnumeratedSet([2,1,3]))
sage: S1("(1,3,2)") == S2("(1,3,2)")
True
sage: hash(S1("(1,3,2)")) == hash(S2("(1,3,2)"))
False
Branch pushed to git repo; I updated commit sha1. New commits:
3e29948 | 31326: change algorithm for consistency |
Consistent hashing with more tests.
Description changed:
---
+++
@@ -2,4 +2,5 @@
We design a hash function:
- whose value on the identity is 1
-- which does only depend on the support (ie ignores fixed points)
+- which does only depend on the support (ie ignores fixed points so that the hash is consistent with permutation restriction)
+- which is independent of the domain ordering
Can't one do hashing based on the position in the lexicographically ordered set of all permutations?
See comment:13 for why this will not work.
Replying to @videlec:
See comment:13 for why this will not work.
This seems to be an entirely different issue - one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.
Having said this, it's in general a quite bad idea to ignore the nature of the group the element comes from, as it affects the performance of algorithms quite badly. Every permutation group G has a base, a minimal size set B s.t. g_1(B)\=g_2(B) as soon as g_1\=g_2, and for transitive permutation groups this is in general the quickest test, and |B| is usually very small compared to the degree (it's even possible to quantify what "usually" means here - e.g. for primitive groups this is governed by O'Nan-Scott theorem).
Replying to @dimpase:
Replying to @videlec:
See comment:13 for why this will not work.
This seems to be an entirely different issue - one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.
How do you canonise [pi, 'a', Partition([3,2,1])]
to [1, 2, 3]
?
Having said this, it's in general a quite bad idea to ignore the nature of the group the element comes from, as it affects the performance of algorithms quite badly. Every permutation group G has a base, a minimal size set B s.t. g_1(B)\=g_2(B) as soon as g_1\=g_2, and for transitive permutation groups this is in general the quickest test, and |B| is usually very small compared to the degree (it's even possible to quantify what "usually" means here - e.g. for primitive groups this is governed by O'Nan-Scott theorem).
Why is this relevent for hashing?
Replying to @videlec:
Replying to @dimpase:
Replying to @videlec:
See comment:13 for why this will not work.
This seems to be an entirely different issue - one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.
How do you canonise
[pi, 'a', Partition([3,2,1])]
to[1, 2, 3]
?
By their positions in the list, i.e. as [1,2,3]
, why not?
Having said this, it's in general a quite bad idea to ignore the nature of the group the element comes from, as it affects the performance of algorithms quite badly. Every permutation group G has a base, a minimal size set B s.t. g_1(B)\=g_2(B) as soon as g_1\=g_2, and for transitive permutation groups this is in general the quickest test, and |B| is usually very small compared to the degree (it's even possible to quantify what "usually" means here - e.g. for primitive groups this is governed by O'Nan-Scott theorem).
Why is this relevent for hashing?
I am inclined to say that one should not hash permutation group elements ignoring their parent group. A workaround, to hash Permutation(g)
rather than g
itself, should work, though - although it does not, due to the usual sloppiness of Permutation
.
Replying to @dimpase:
Replying to @videlec:
Replying to @dimpase:
Replying to @videlec:
See comment:13 for why this will not work.
This seems to be an entirely different issue - one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.
How do you canonise
[pi, 'a', Partition([3,2,1])]
to[1, 2, 3]
?By their positions in the list, i.e. as
[1,2,3]
, why not?
Because it won't work. If you do that with [pi, 'a', Partition([3,2,1])]
and then [Partition([3,2,1]), 'a', pi]
for the same permutation, let say the transposition ('a', pi)
, you will obtain two different permutations on [1,2,3]
with different hashes.
You could also have read comment:13: the order of the elements in the domain should not matter.
If you think you have a better way of implementing a hash function, please provide an alternative branch. As far as I tested my branch does solve the issue.
There is no issue IMHO. It's an over-zealous wish to mimic Python numeric tower to the point where it does not make sense any more.
Replying to @videlec:
Replying to @dimpase:
Replying to @videlec:
Replying to @dimpase:
Replying to @videlec:
See comment:13 for why this will not work.
This seems to be an entirely different issue - one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.
How do you canonise
[pi, 'a', Partition([3,2,1])]
to[1, 2, 3]
?By their positions in the list, i.e. as
[1,2,3]
, why not?Because it won't work. If you do that with
[pi, 'a', Partition([3,2,1])]
and then[Partition([3,2,1]), 'a', pi]
for the same permutation, let say the transposition('a', pi)
, you will obtain two different permutations on[1,2,3]
with different hashes.You could also have read comment:13: the order of the elements in the domain should not matter.
And this only reinforces my suspision that hashing permutations of different domains should return different answers.
As you just demonstrated, there is no canonical way to identify the transposition ('a',pi)
with (1,2)
, why would the hashes be the same?
Replying to @dimpase:
Replying to @videlec:
Replying to @dimpase:
Replying to @videlec:
Replying to @dimpase:
Replying to @videlec:
See comment:13 for why this will not work.
This seems to be an entirely different issue - one can 1st canonise the permutation to the standard domain [1..n], and only then compute hash.
How do you canonise
[pi, 'a', Partition([3,2,1])]
to[1, 2, 3]
?By their positions in the list, i.e. as
[1,2,3]
, why not?Because it won't work. If you do that with
[pi, 'a', Partition([3,2,1])]
and then[Partition([3,2,1]), 'a', pi]
for the same permutation, let say the transposition('a', pi)
, you will obtain two different permutations on[1,2,3]
with different hashes.You could also have read comment:13: the order of the elements in the domain should not matter.
And this only reinforces my suspision that hashing permutations of different domains should return different answers.
There is a canonical map from the domain {1,2}
to the domain {1,2,3}
. Hence the transposition (1,2)
on both sets should just be equal. It is trivial to implement a compatible hash function. This is what this ticket is about.
As you just demonstrated, there is no canonical way to identify the transposition
('a',pi)
with(1,2)
, why would the hashes be the same?
The hashes are different.
It seems like you have incompatible goals for a hash function.
S_n c S_{n+1}
considered here), and it requires a precomputation for each permutation group (the base).sage: mod(1,3) == -2
True
sage: hash(mod(1,3)) == hash(-2)
False
but this compatibility is still a desirable feature of a hash function.
Vincent's solution seems like a pretty good way to achieve the second goal. Dima, if you'd like to argue that we should drop compatibility of hash functions across different permutation groups in the interest of speed, I think that discussion should be taken back to the sage-devel thread.
I wonder if it's reasonable to have a "hash mode" option for parents where a user can indicate that they want hashes to be as fast as possible, dropping compatibility across parents. Maybe this could disable implicit coercions and require explicit conversions?
Replying to @roed314:
It seems like you have incompatible goals for a hash function.
Thanks for your input. The branch I provided is a bit slower than the previous code in case the domain is the standard one, ie {1, 2, ..., n}
sage: S = SymmetricGroup(10)
sage: %timeit -r5 for s in S: h = hash(s)
1.02 s ± 24.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # old version
1.32 s ± 11.3 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version
The time difference is plausibly due to the extra call P._has_natural_domain()
which is here repeated 10 factorial times. I do not have any idea to improve the timing without Cythonizing the parent.
As expected the code is significatively slower in case the domain is non standard
sage: S = SymmetricGroup(['a','b','c','d','e','f','g','h','i'])
sage: %timeit -r5 for s in S: h = hash(s)
109 ms ± 511 µs per loop (mean ± std. dev. of 5 runs, 10 loops each) # old version
3.56 s ± 4.19 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version
If the latter is a problem, it is possible to store the hashes of elements in the domain in a C array inside the parent. Should I do that?
PermutationGroupElement hashes have to be compatible w.r.t. to subgroups of this group (and, by the way, using images of the base - which gets computed as soon as almost any kind of nontrivial computation has to be done - to compute the hash is obviously good here), but I fail to understand why this would be desirable for groups with different domains, and to me domains (1,2,3) and (2,1,3) are different, as different as (1,2,3) and (7,8,9).
Replying to @videlec:
Thanks for your input. The branch I provided is a bit slower than the previous code in case the domain is the standard one, ie
{1, 2, ..., n}
sage: S = SymmetricGroup(10) sage: %timeit -r5 for s in S: h = hash(s) 1.02 s ± 24.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # old version 1.32 s ± 11.3 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version
The time difference is plausibly due to the extra call
P._has_natural_domain()
which is here repeated 10 factorial times. I do not have any idea to improve the timing without Cythonizing the parent.
Since _has_natural_domain
is cached, it would surprise me a bit if this was the bottleneck. Maybe try running your code again with that just set to True
? I would suspect that the inside of the for loop is just more complicated, yielding a 30% slowdown. Obviously we'd prefer not to have a slowdown, but I think 30% is acceptable in order to fix the bug.
As expected the code is significatively slower in case the domain is non standard
sage: S = SymmetricGroup(['a','b','c','d','e','f','g','h','i']) sage: %timeit -r5 for s in S: h = hash(s) 109 ms ± 511 µs per loop (mean ± std. dev. of 5 runs, 10 loops each) # old version 3.56 s ± 4.19 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version
If the latter is a problem, it is possible to store the hashes of elements in the domain in a C array inside the parent. Should I do that?
Since the parent isn't Cythonized, where would that array go? I think speeding up this case would be worthwhile, and recording the domain's hashes seems like a good way to do it.
The speed comparison I was thinking about, though, wasn't either of these. Instead, I think Dima was suggesting that we can make a much faster hash function than the current one in the case where n
is large. I think speeding things up for large n
is worthwhile, but there are plenty of cases where Sage is pretty bad. I think the hash code is fine in comparison:
sage: S = SymmetricGroup(2000)
sage: %timeit a = S.random_element()
639 ms ± 15.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sage: S = SymmetricGroup(10^8)
sage: %time hash(a)
CPU times: user 120 ms, sys: 350 µs, total: 120 ms
Wall time: 120 ms
Replying to @dimpase:
PermutationGroupElement hashes have to be compatible w.r.t. to subgroups of this group (and, by the way, using images of the base - which gets computed as soon as almost any kind of nontrivial computation has to be done - to compute the hash is obviously good here), but I fail to understand why this would be desirable for groups with different domains, and to me domains (1,2,3) and (2,1,3) are different, as different as (1,2,3) and (7,8,9).
I agree that they are different, but Sage currently allows equality testing between them:
sage: S = SymmetricGroup([1,2,3])
sage: T = SymmetricGroup([2,3,1])
sage: S('(2,1)')
(1,2)
sage: T('(2,1)')
(2,1)
sage: T('(2,1)') == S('(2,1)')
True
I wouldn't object to prohibiting a coercion map between these parents (as you say, the domains are different since permutations are all about the ordering of domain, not just the labels), but I'd say that should also be discussed more broadly.
Replying to @roed314:
Replying to @videlec:
Thanks for your input. The branch I provided is a bit slower than the previous code in case the domain is the standard one, ie
{1, 2, ..., n}
sage: S = SymmetricGroup(10) sage: %timeit -r5 for s in S: h = hash(s) 1.02 s ± 24.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # old version 1.32 s ± 11.3 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version
The time difference is plausibly due to the extra call
P._has_natural_domain()
which is here repeated 10 factorial times. I do not have any idea to improve the timing without Cythonizing the parent.Since
_has_natural_domain
is cached, it would surprise me a bit if this was the bottleneck. Maybe try running your code again with that just set toTrue
?
sage: %timeit -r5 for s in S: h = hash(s) # new code without _has_natural_domain call
1.09 s ± 18.3 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
I guess I was wrong. I also don't see an easy way to make this faster.
Replying to @roed314:
Replying to @videlec:
As expected the code is significatively slower in case the domain is non standard
sage: S = SymmetricGroup(['a','b','c','d','e','f','g','h','i']) sage: %timeit -r5 for s in S: h = hash(s) 109 ms ± 511 µs per loop (mean ± std. dev. of 5 runs, 10 loops each) # old version 3.56 s ± 4.19 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) # new version
If the latter is a problem, it is possible to store the hashes of elements in the domain in a C array inside the parent. Should I do that?
Since the parent isn't Cythonized, where would that array go? I think speeding up this case would be worthwhile, and recording the domain's hashes seems like a good way to do it.
There is the Python array object that can store whatever C data you want. We will still suffer from the attribute access to the Python parent. But it will not be a factor x30 as it is currently
Branch pushed to git repo; I updated commit sha1. New commits:
e2e4d1d | 31236: use a hash array for non standard domains |
Better now
sage: S = SymmetricGroup(['a','b','c','d','e','f','g','h','i'])
sage: %timeit -r5 for s in S: h = hash(s)
159 ms ± 1.17 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)
Computing a single hash of an element triggers the creation of an array in the parent
sage: S._domain_hash_array
array('l', [-5034579376873000338, 5819885834672024805, -1342306816048582247, 4154873981948319473, -9080196639829811986, 3487762170233258304, -4059262801740968994, -1410151844174446032, -7454342532774124322])
This solution is not ideal as the hash array could be made available on the domain itself and not on the symmetric group. It could end up with many duplications of this array (ie when constructing subgroups).
See #31269
ping
I'm happy giving this ticket positive review. Since Dima originally reviewed it, I'll give him a couple days to respond with an objection and suggestions for how he'd like to proceed.
Changed reviewer from Dima Pasechnik to Dima Pasechnik, David Roe
My objection is only on the level of documentation. Where the hell these huge constants come from, and why? It's totally unclear.
As their names indicate mult1
, mult2
are multipliers (for bit shuffling) and mask1
, mask2
are masks so that the hasing of the pair (i, j)
is non commutative (ie different from the hash of (j, i)
). These numbers are not big, they are 64 bit integers.
ping
Sorry, I've been busy. This looks good to me.
I got this on Ubuntu 18.04 32 bit
**********************************************************************
File "src/sage/groups/perm_gps/permgroup_element.pyx", line 1485, in sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__
Failed example:
for n in range(1, 10):
G = SymmetricGroup(n)
assert hash(G.one()) == 1
assert len(set(map(hash, G))) == factorial(n)
Exception raised:
Traceback (most recent call last):
File "/var/lib/buildbot/slave/sage_git/build/local/lib/python3.9/site-packages/sage/doctest/forker.py", line 714, in _run
self.compile_and_execute(example, compiler, test.globs)
File "/var/lib/buildbot/slave/sage_git/build/local/lib/python3.9/site-packages/sage/doctest/forker.py", line 1133, in compile_and_execute
exec(compiled, globs)
File "<doctest sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__[3]>", line 4, in <module>
assert len(set(map(hash, G))) == factorial(n)
AssertionError
**********************************************************************
File "src/sage/groups/perm_gps/permgroup_element.pyx", line 1493, in sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__
Failed example:
assert len(set(map(hash, A))) == A.cardinality()
Exception raised:
Traceback (most recent call last):
File "/var/lib/buildbot/slave/sage_git/build/local/lib/python3.9/site-packages/sage/doctest/forker.py", line 714, in _run
self.compile_and_execute(example, compiler, test.globs)
File "/var/lib/buildbot/slave/sage_git/build/local/lib/python3.9/site-packages/sage/doctest/forker.py", line 1133, in compile_and_execute
exec(compiled, globs)
File "<doctest sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__[7]>", line 1, in <module>
assert len(set(map(hash, A))) == A.cardinality()
AssertionError
**********************************************************************
1 item had failures:
2 of 35 in sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__hash__
[426 tests, 2 failures, 3.64 s]
as a general bit of advice, consider using assert condition, message
so you learn whats wrong from tracebacks...
As noticed on this sage-devel thread the hash function of
PermutationGroupElement
does not behave coherently with respect to the natural inclusionsSymmetricGroup(n) c SymmetricGroup(n+1)
.We design a hash function:
Component: group theory
Author: Vincent Delecroix
Branch/Commit: u/vdelecroix/31236 @
e2e4d1d
Reviewer: Dima Pasechnik, David Roe
Issue created by migration from https://trac.sagemath.org/ticket/31236