sagemath / sage

Main repository of SageMath
https://www.sagemath.org
Other
1.33k stars 453 forks source link

hash of polynomials is very bad #21284

Open videlec opened 8 years ago

videlec commented 8 years ago

A lot of conflicts:

sage: S.<t> = QQ[]
sage: from itertools import product
sage: values = [S(t) for t in product((0,1/2,1,2), repeat=5)]
sage: len(set(map(hash,values)))
565
sage: len(values)
1024

Though hashing in this situation is delicate since we want to preserve the compatibility with:

CC: @jdemeyer

Component: algebra

Issue created by migration from https://trac.sagemath.org/ticket/21284

videlec commented 8 years ago

Dependencies: #21272

kedlaya commented 8 years ago
comment:2

I don't have product in my version of Sage, but by replacing that line with

sage: values = [S(list(t)) for t in cartesian_product([[0,1/2,1,2],] * 5)]

I was able to reproduce the issue. I need a fix for this in order to deal with hashes of ideals; see #21297.

videlec commented 8 years ago
comment:3

Right product comes from the itertools module.

videlec commented 8 years ago

Description changed:

--- 
+++ 
@@ -2,6 +2,7 @@

sage: S. = QQ[] +sage: from itertools import product sage: values = [S(t) for t in product((0,1/2,1,2), repeat=5)] sage: len(set(map(hash,values))) 565

kedlaya commented 7 years ago
comment:4

To make this more specific, in case this helps to chase this down:

sage: R.<t> = QQ[]
sage: u = 2*t^3 + t^2; v = t^3 + 2*t + 1
sage: hash(u), hash(v)
(29698519856292282, 29698519856292282)

The key bit of code is in the function hash_c (omitting some comments):

for i from 0<= i <= self.degree():
    if i == 1:
        # we delay the hashing until now to not waste it on a constant poly
        var_name_hash = hash(self._parent._names[0])
    c_hash = hash(self[i])
    if c_hash != 0:
        if i == 0:
            result += c_hash
        else:
            # Hash (self[i], generator, i) as a tuple according to the algorithm.
            result_mon = c_hash
            result_mon = (1000003 * result_mon) ^ var_name_hash
            result_mon = (1000003 * result_mon) ^ i
            result += result_mon

which seems just not to work at all. Probably a complete rethink is in order here.

videlec commented 7 years ago
comment:5

(related: I did something for nf elements at #23402)

videlec commented 7 years ago
comment:6

Once #23402 is positively reviewed I will port the corresponding code here

kedlaya commented 7 years ago
comment:7

Let's make #23402 a dependency then.

kedlaya commented 7 years ago

Changed dependencies from #21272 to #21272, #23402

videlec commented 7 years ago
comment:8

Actually what I did in #23402 has to be rethought since it does not apply to sparse polynomials... the hashing function needs to involve the degree of the monomials in some way.

jdemeyer commented 7 years ago
comment:9

So, is #23402 then a dependency or not?

videlec commented 7 years ago

Description changed:

--- 
+++ 
@@ -9,3 +9,8 @@
 sage: len(values)
 1024

+ +Though hashing in this situation is delicate since we want to preserve the compatibility with: + +- sparse polynomials +- multivariate polynomials

videlec commented 7 years ago

Changed dependencies from #21272, #23402 to #23402

videlec commented 7 years ago

Changed dependencies from #23402 to none

videlec commented 7 years ago
comment:11

Let us remove the dependency as there is no coercion/hash compatibility involved between polynomial rings and number fields.

kedlaya commented 4 years ago
comment:13

Ping. Something has changed in the interim, as my example from comment:4 is no longer applicable:

sage: R.<t> = QQ[]
....: u = 2*t^3 + t^2; v = t^3 + 2*t + 1
....: hash(u), hash(v)
(-7745487741690187698, -7745487741690187694)

but there is still an overall issue:

sage: S.<t> = QQ[]
sage: from itertools import product
sage: values = [S(t) for t in product((0,1/2,1,2), repeat=5)]
sage: len(set(map(hash,values)))
491
sage: len(values)
1024
mwageringel commented 4 years ago
comment:14

Replying to @kedlaya:

Ping. Something has changed in the interim, as my example from comment:4 is no longer applicable:

This is because of the switch to Python 3, as strings (of the variable names) are hashed differently. With a Python 2 based Sage, it still behaves as before.


Also note that this

            # Hash (self[i], generator, i) as a tuple according to the algorithm.
            result_mon = c_hash
            result_mon = (1000003 * result_mon) ^ var_name_hash
            result_mon = (1000003 * result_mon) ^ i
            result += result_mon

essentially copies the implementation of hashing of tuples in Python 2. This was found to have many collisions on simple input and was replaced by a better hash function (xxHash) in Python 3.8; see https://bugs.python.org/issue34751. I guess we could just adopt this new tuple hash here to get better collision resistance.

videlec commented 4 years ago
comment:15

Sounds reasonable. Does it also fix

sage: hash(-1) == hash(-2)
True
mwageringel commented 4 years ago
comment:16

No, that is still the same. IIRC, it was mentioned somewhere that this pattern is baked too deeply into Python as well as third-party code to change it at this point.

videlec commented 4 years ago
comment:17

Then using the tuple hash is probably not what you want to do (below with Python 3.8.0)

>>> from itertools import product
>>> for p in product([-1,-2], repeat=3): print(hash(p)) 
-6133384102531166807
-6133384102531166807
-6133384102531166807
-6133384102531166807
-6133384102531166807
-6133384102531166807
-6133384102531166807
-6133384102531166807
mwageringel commented 4 years ago
comment:18

This is probably difficult to avoid in general, since the hashing function for polynomials should incorporate the hash values of its coefficients. As the hash values of -1 and -2 are the same, all polynomials that differ only in terms of coefficients of -1 or -2 will necessarily have the same hash.

Cython automatically returns -2 if a hashing function returns -1, so to avoid this, one would have to add a separate hashing function (which Cython does not know about) to the elements of all coefficient rings – or find some other way of obtaining hash values of -1.

Nevertheless, the problem in the ticket description is independent from this -1/-2 issue, so hashing could still be improved a lot for polynomials not involving coefficients of -1.

mkoeppe commented 4 years ago
comment:19

Batch modifying tickets that will likely not be ready for 9.1, based on a review of the ticket title, branch/review status, and last modification date.

mkoeppe commented 3 years ago
comment:21

Setting new milestone based on a cursory review of ticket status, priority, and last modification date.

sheerluck commented 1 year ago
comment:26
--- a/sage/rings/polynomial/polynomial_element.pyx
+++ b/sage/rings/polynomial/polynomial_element.pyx
@@ -1246,7 +1246,7 @@
             TypeError: unhashable type: 'sage.rings.padics.qadic_flint_CR.qAdicCappedRelativeElement'

         """
-        cdef long result = 0 # store it in a c-int and just let the overflowing additions wrap
+        cdef long result = 0xb1f740b4 # store it in a c-int and just let the overflowing additions wrap
         cdef long result_mon
         cdef long c_hash
         cdef long var_name_hash
@@ -1265,12 +1265,10 @@
                     result += c_hash
                 else:
                     # Hash (self[i], generator, i) as a tuple according to the algorithm.
-                    result_mon = c_hash
+                    result_mon = c_hash ^ result
                     result_mon = (1000003 * result_mon) ^ var_name_hash
                     result_mon = (1000003 * result_mon) ^ i
                     result += result_mon
-        if result == -1:
-            return -2
         return result

     def _im_gens_(self, codomain, im_gens, base_map=None):
sage: S.<t> = QQ[]
sage: from itertools import product
sage: values = [S(t) for t in product((0,1/2,1,2), repeat=5)]
sage: len(set(map(hash,values)))
996
sage: len(values)
1024
sheerluck commented 1 year ago
--- a/sage/rings/polynomial/polynomial_element.pyx
+++ b/sage/rings/polynomial/polynomial_element.pyx
@@ -65,6 +65,8 @@
 import operator
 import copy
 import re
+import sys
+import xxhash
 from io import StringIO

 from sage.cpython.wrapperdescr cimport wrapperdescr_fastcall
@@ -1256,32 +1258,22 @@
             TypeError: unhashable type: 'sage.rings.padics.qadic_flint_CR.qAdicCappedRelativeElement'

         """
-        cdef long result = 0 # store it in a c-int and just let the overflowing additions wrap
-        cdef long result_mon
-        cdef long c_hash
-        cdef long var_name_hash
         cdef int i
+        xx = xxhash.xxh3_64()
         for i from 0<= i <= self.degree():
             if i == 1:
                 # we delay the hashing until now to not waste it on a constant poly
                 var_name_hash = hash(self._parent._names[0])
+                xx.update(var_name_hash.to_bytes(8, byteorder="big", signed=True))
             # I'm assuming (incorrectly) that hashes of zero indicate that the element is 0.
             # This assumption is not true, but I think it is true enough for the purposes and it
             # it allows us to write fast code that omits terms with 0 coefficients.  This is
             # important if we want to maintain the '==' relationship with sparse polys.
             c_hash = hash(self.get_unsafe(i))
             if c_hash != 0:
-                if i == 0:
-                    result += c_hash
-                else:
-                    # Hash (self[i], generator, i) as a tuple according to the algorithm.
-                    result_mon = c_hash
-                    result_mon = (1000003 * result_mon) ^ var_name_hash
-                    result_mon = (1000003 * result_mon) ^ i
-                    result += result_mon
-        if result == -1:
-            return -2
-        return result
+                xx.update(c_hash.to_bytes(8, byteorder="big", signed=True))
+                xx.update(hash(3*(i + 7)).to_bytes(8, byteorder="big"))
+        return int.from_bytes(xx.digest(), byteorder="big") % sys.maxsize

     def _im_gens_(self, codomain, im_gens, base_map=None):
         """
sage: S.<t> = QQ[]
sage: from itertools import product
sage: values = [S(t) for t in product((0, 1 / 2, 1, 2), repeat=5)]
sage: len(set(map(hash, values)))
1024
sage: len(values)
1024