matrix multiplication over ZZ sometimes gives incorrect results

bb51d1f9-3114-42b3-bf15-d218ca7784e2 commented 13 years ago

Something is wrong with the multi-modular matrix multiplication code for matrices over ZZ. At random, and infrequently, it gives incorrect results. For example, the following code chooses random 3x2 and 2x10 integer matrices and multiplies them together using the multi-modular algorithm. It then does the same multiplication 100 more times, checks that the answer is always the same, and if not raises an exception:

sage: for n in range(2000):
....:     A = MatrixSpace(ZZ,3,2).random_element()    
....:     B = MatrixSpace(ZZ,2,10).random_element()
....:     try_once = A._multiply_multi_modular(B)
....:     for k in range(100):
....:             try_again = A._multiply_multi_modular(B)
....:         if try_once != try_again:
....:                 print "="*60
....:             print "n = %s, k = %s"%(n,k)
....:             print "A = "
....:             print A
....:             print "B ="
....:             print B
....:             print "first attempt = "
....:             print try_once
....:             print "k-th retry = "
....:             print try_again
....:             raise RuntimeError
....:

This fails with very high probability (on Sage 4.6.2 under OS X, built from source) with output such as:

============================================================
n = 27, k = 43
A = 
[-1  0]
[ 1  0]
[ 0  1]
B =
[  -2    1   -1    0    0   -1   -1    3   -1   -1]
[   1    1 -116    3    0   -1   -1    0   -1    0]
first attempt = 
[   2   -1    1    0    0    1    1   -3    1    1]
[  -2    1   -1    0    0   -1   -1    3   -1   -1]
[   1    1 -116    3    0   -1   -1    0   -1    0]
k-th retry = 
[   2 1102    1    0    0    1    1 1100    1    1]
[1101    1 1102    0    0 1102 1102    3 1102 1102]
[   1    1  987    3    0 1102 1102    0 1102    0]
---------------------------------------------------------------------------
RuntimeError [...]

Note that the two candidates for the matrix product here are congruent modulo 1103, which is prime. If you rerun the code with verbose logging, using set_verbose(2), then every time it fails the two candidates for the matrix product are congruent modulo the prime being used in the multi-modular algorithm. Thus I suspect that the Chinese Remainder Theorem code in sage/ext/multi_modular.pyx is not handling a corner case properly.

SEE ALSO: #10281, which touches the sage/ext/multi_modular.pyx code in nontrivial ways.

CC: @burcin

Component: linear algebra

Keywords: matrix multiplication, multi-modular, integers, ZZ

Stopgaps: #12710

Author: William Stein

Reviewer: Douglas McNeil

Merged: sage-5.0.beta12

Issue created by migration from https://trac.sagemath.org/ticket/11358

bb51d1f9-3114-42b3-bf15-d218ca7784e2 commented 13 years ago

comment:1

This also fails on Sage 4.6.2 under Red Hat Enterprise Linux (binary install, x86_64 GNU/Linux)

bb51d1f9-3114-42b3-bf15-d218ca7784e2 commented 13 years ago

comment:2

Looking at the function mpz_crt_vec_tail() in sage/ext/multi_modular.pyx, it appears that three of the last five lines:

cdef Integer zz
zz = PY_NEW(Integer)
mpz_set(zz.value, self.half_product)

are redundant, because the variable zz is never used. (NB I am unfamiliar with both Cython and GMP, so I may be wrong here.) But why is this code there? Is it supposed to be doing something else?

1d7ec08f-60ae-4512-91a6-8324c06eab9f commented 13 years ago

comment:3

From #sagemath on IRC:

[07:11] <sagebot> New news from tractimeline: Ticket #11358 (matrix multiplication over ZZ sometimes gives incorrect results) created
[07:32] <logix> hm, i just tried to reproduce that bug (which i can), and it failed both times when B had a full zero column
[07:32] <logix> (which is the case in the bug report too)
[07:33] <logix> perhaps that's the condition triggering the bug?
[07:34] <logix> i mean one of the conditions triggering the bug (there must be something else, as the inner loop runs more than once and it fails only sometimes

bb51d1f9-3114-42b3-bf15-d218ca7784e2 commented 13 years ago

comment:4

Replying to @rbeezer:

I don't think that this is correct. By running the code repeatedly, I found examples without zero columns.

1d7ec08f-60ae-4512-91a6-8324c06eab9f commented 13 years ago

comment:5

Replying to @sagetrac-tomc:

I don't think that this is correct. By running the code repeatedly, I found examples without zero columns.

OK, just saw that in IRC and thought I'd capture it.

I'm curious to see what the root cause is here!

Rob

56048686-9665-4c5e-973e-6c3add3aa805 commented 13 years ago

comment:6

I think I've got it.

This seems to happen iff a random prime happens to be chosen twice. For example, if you modify _extend_moduli_to_height_c to always reuse a prime if more than one is needed, this always happens. The same bug seems to exist in _extend_moduli_to_count as well.

The problem seems to be that in _new_random_prime, there's a test which doesn't do what it's trying to do:

    cdef mod_int _new_random_prime(self):
        # choose a new random prime
        cdef Py_ssize_t i
        cdef mod_int p
        while True:
            p = random_prime(self._u_bound, lbound =self._l_bound)
            for i in range(self.n):
                if p == self.moduli[i]:
                    break
            else:
                return p

IIUC, the "for i in range(self.n)" loop is attempting to avoid the problem of repeated primes, but this only works if p is added to self.moduli immediately after it's returned. In the case of _extend_moduli_to_height_c, this isn't true; the addition is delayed until a later extend_with_primes call, so the above check decides that the prime isn't reused, we get repeated moduli, and the multiplication breaks.

Or even more obviously:

sage: from sage.ext.multi_modular import MultiModularBasis_base
sage: mm = MultiModularBasis_base(1)
sage: mm._extend_moduli_to_count(1000)
1000
sage: len(mm), len(set(mm))
(1000, 843)

ISTM either we can remove the check-for-duplicate code from _new_random_prime and do it externally, or we can add an argument and pass it the "accumulated" primes as we build them elsewhere so it can avoid duplicates there too. I prefer the first. Should I work up a patch?

bb51d1f9-3114-42b3-bf15-d218ca7784e2 commented 13 years ago

comment:8

Replying to @sagetrac-dsm:

Good catch! I agree.

ISTM either we can remove the check-for-duplicate code from _new_random_prime and do it externally, or we can add an argument and pass it the "accumulated" primes as we build them elsewhere so it can avoid duplicates there too. I prefer the first. Should I work up a patch?

I prefer the first too. Please write a patch and then I will review it.

56048686-9665-4c5e-973e-6c3add3aa805 commented 13 years ago

comment:9

Okay, here's my first attempt. Because another function calls _new_random_prime assuming that it correctly avoids duplication I had to change my intended approach a bit, but it should work.

Two other minor changes:

(1) as mentioned above, mpz_crt_vec_tail() has what looks to me like vestigial code, and I've removed it.

(2) the documentation for replace_prime_c claimed that it would replace a prime with a greater prime, but made no efforts to ensure the "greater than" part. I've changed the doc accordingly.

bb51d1f9-3114-42b3-bf15-d218ca7784e2 commented 13 years ago

comment:10

Thanks. I'll review this.

bb51d1f9-3114-42b3-bf15-d218ca7784e2 commented 13 years ago

comment:11

I can't get the patch to apply. When I apply the patch and then do

sage -b

I get

Error converting Pyrex file to C:
------------------------------------------------------------
...
        """
        cdef Py_ssize_t i
        cdef mod_int p
        while True:
            p = random_prime(self._u_bound, lbound =self._l_bound)
            if p not in self.moduli[:self.n]: return p
                                  ^
------------------------------------------------------------

/Users/tomc/sage-4.6.2/devel/sage-test-crt-fix/sage/ext/multi_modular.pyx:191:35: Cannot convert 'mod_int *' to Python object

Also, in the new implementation of _new_random_prime() there is the possibility of an infinite loop if self.moduli already contains all the primes between self._lbound and self._ubound. I suggest modifying _new_random_prime() so that it:

makes some number of tries (10? 100?) to find a new random prime p in the range self._lbound < p < self._ubound; and
if this fails then it increases self_ubound and tries again.

burcin commented 13 years ago

comment:12

Replying to @sagetrac-tomc:

Also, in the new implementation of _new_random_prime() there is the possibility of an infinite loop if self.moduli already contains all the primes between self._lbound and self._ubound. I suggest modifying _new_random_prime() so that it:

makes some number of tries (10? 100?) to find a new random prime p in the range self._lbound < p < self._ubound; and
if this fails then it increases self_ubound and tries again.

I suggest raising an error instead of increasing _ubound. The upper bound is usually set so that the primes fit in a single machine word. A lot of things would break if this changed without notice.

bb51d1f9-3114-42b3-bf15-d218ca7784e2 commented 13 years ago

comment:13

Replying to @burcin:

I suggest raising an error instead of increasing _ubound. The upper bound is usually set so that the primes fit in a single machine word. A lot of things would break if this changed without notice.

If we do raise an error then it should be caught before it propagates up to the user. For instance, it should not be that if A and B are integer matrices then evaluating AB can sometimes cause an exception because the multi-modular matrix multiplication algorithm ran out of primes: in this case we should catch the exception and evaluate AB using a different algorithm.

Is it really the case that raising self._ubound would cause things to break? I thought from looking at the multi-modular code that it all used mpz_t types from GMP, and that these would automatically increase their allocated memory as necessary.

The multi-modular code is used in five places: dense matrix multiplication over ZZ; dense matrix multiplication over cyclotomics; dense matrix multiplication over QQ (although only in the rational reconstruction routines _lift_crt_rr() and _lift_crt_rr_with_lcm(), which I think are dead code); echelon form calculations over cyclotomics; and characteristic polynomial calculations over cyclotomics. If we decide to raise an error if the multi-modular code runs out of primes then we will need to catch this error and handle it appropriately in all five places.

burcin commented 13 years ago

comment:14

Replying to @sagetrac-tomc:

Replying to @burcin:

I suggest raising an error instead of increasing _ubound. The upper bound is usually set so that the primes fit in a single machine word. A lot of things would break if this changed without notice.

If we do raise an error then it should be caught before it propagates up to the user. For instance, it should not be that if A and B are integer matrices then evaluating AB can sometimes cause an exception because the multi-modular matrix multiplication algorithm ran out of primes: in this case we should catch the exception and evaluate AB using a different algorithm.

This is better than trying to handle it magically in the MultiModularBasis class.

Is it really the case that raising self._ubound would cause things to break? I thought from looking at the multi-modular code that it all used mpz_t types from GMP, and that these would automatically increase their allocated memory as necessary.

Multi precision integers are needed for the lifting step. One of the reasons to use a multi-modular algorithm is to take advantage of fast arithmetic implemented in hardware. This requires that the modulus is small enough to fit in a word size. For linear algebra, the dimensions of the matrix also come into play. Ideally you should reduce only once for each row*column multiplication.

The multi-modular code is used in five places: dense matrix multiplication over ZZ; dense matrix multiplication over cyclotomics; dense matrix multiplication over QQ (although only in the rational reconstruction routines _lift_crt_rr() and _lift_crt_rr_with_lcm(), which I think are dead code); echelon form calculations over cyclotomics; and characteristic polynomial calculations over cyclotomics. If we decide to raise an error if the multi-modular code runs out of primes then we will need to catch this error and handle it appropriately in all five places.

You cannot assume that the only users of the code are in the Sage library. This class is designed to be an abstraction that makes it easy to implement multi modular algorithms.

Perhaps we can verify that this error will never occur in these places through some theoretical argument. It is highly unlikely that we will always keep choosing the same random prime.

bb51d1f9-3114-42b3-bf15-d218ca7784e2 commented 13 years ago

comment:15

Replying to @burcin:

Multi precision integers are needed for the lifting step. One of the reasons to use a multi-modular algorithm is to take advantage of fast arithmetic implemented in hardware. This requires that the modulus is small enough to fit in a word size.

I agree that things will slow down if we ever raise self._ubound to be larger than a machine word: my question was whether anything would actually break. But in any case, since the MultiModularBasis class is explicitly documented as:

"""
    This class stores a list of machine-sized prime numbers...
"""

we should not allow the code to raise self._ubound. Thus we should raise an error if we repeatedly fail to choose a new random prime. As discussed, we will need to handle this error in several other places in Sage.

Note that the situation we are discussing is very unlikely to ever occur in practice, as the default values of _lbound and _ubound are repectively 210 and 215 and there are many primes in between 210 and 215. But in theory someone could do:

sage: mm = MultiModularBasis(lbound=4,ubound=6,...)

or something equally silly, and then the multi-modular code would quickly run out of primes.

56048686-9665-4c5e-973e-6c3add3aa805 commented 13 years ago

comment:16

Replying to @sagetrac-tomc:

I can't get the patch to apply.

Try it against one of the 4.7 release candidates. I should've mentioned what I was basing it on.

As for the case of running out of primes -- which I considered the caller's fault, and so I simply warned about it in the docstring rather than handle it -- I'm fine with whatever. (Once cython 0.15 gets integrated and we have cythonic generators it may be worth collecting random prime functions into a fast module of their own.)

Given that the moduli have to be below MAX_MODULUS, it wouldn't cost much to check whether there really are any primes left and raise an exception only if we really are done and not merely unlucky. If we're doing that, though, then we should probably raise the same exception if you ask for more primes than there are available in extend_moduli_to_count as well.

williamstein commented 12 years ago

comment:17

ping why is this ticket dead? It seems very, very important, but everybody forgot about it a year ago?!

Replying to @sagetrac-tomc:

Note that the situation we are discussing is very unlikely to ever occur in practice, as the default values of _lbound and _ubound are repectively 210 and 215 and there are many primes in between 210 and 215.

This is completely false. There are hardly any primes in that range:

sage: len(prime_range(2^10,2^15))
3340

The linear algebra in Sage is currently very broken in practice, perhaps due to various optimizations and re-implementations that have fundamentally broken things. For example, I'm constantly hitting a problem of "not enough primes" (see #10281), even for matrices over QQ.

williamstein commented 12 years ago

Description changed:

--- 
+++ 
@@ -45,3 +45,5 @@
 RuntimeError [...]

Note that the two candidates for the matrix product here are congruent modulo 1103, which is prime. If you rerun the code with verbose logging, using set_verbose(2), then every time it fails the two candidates for the matrix product are congruent modulo the prime being used in the multi-modular algorithm. Thus I suspect that the Chinese Remainder Theorem code in sage/ext/multi_modular.pyx is not handling a corner case properly. + +SEE ALSO: #10281, which touches the sage/ext/multi_modular.pyx code in nontrivial ways.

jbalakrishnan commented 12 years ago

Stopgaps: #12710

williamstein commented 12 years ago

this is a usable version of the test program in the ticket description

williamstein commented 12 years ago

comment:20

Attachment: 11358.sage.gz

I've posted a patch trac_11358-wstein.patch that fixes the bug. The solution is definitely much better than what was proposed 10 years ago. You can run the attached 11358.sage to check that the exact bug cited here doesn't happen. But better is to see the new doctests which get to the heart of the matter.

56048686-9665-4c5e-973e-6c3add3aa805 commented 12 years ago

comment:21

The patch does seem to eliminate the original bug for me, but there's still some strangeness for different _extend methods:


sage: set_random_seed(0)
sage: from sage.ext.multi_modular import MultiModularBasis_base
sage: m = MultiModularBasis_base(0);
sage: m._extend_moduli(100)
sage: len(m), len(set(m))
(101, 99)
sage: [k for k in m if list(m).count(k) > 1]
[30859, 3217, 30859, 3217]


sage: set_random_seed(1)
sage: from sage.ext.multi_modular import MultiModularBasis_base
sage: m = MultiModularBasis_base(0);
sage: m._extend_moduli_to_count(100)
100
sage: len(m), len(set(m))
(100, 97)
sage: [k for k in m if list(m).count(k) > 1]
[4001, 5309, 23293, 4001, 23293, 5309]

So I think that two of the three _extend methods still don't give results that I'd expect.

williamstein commented 12 years ago

comment:22

So I think that two of the three _extend methods still don't give results that I'd expect.

I have fixed the patch. There was one case where I didn't update use of _new_random_prime correctly.

56048686-9665-4c5e-973e-6c3add3aa805 commented 12 years ago

comment:23

The new patch:

(1) Applies against 5.0.beta8 with minor fuzz. (2) Seems to eliminate the original bug. (3) Matches the original diagnosis about where the problem was (use of repeated primes in a multimodular basis), so I get why it should fix the bug. (4) Has doctests which verify that .._height now behaves. (5) The new version passes my separate tests confirming that the other _extend methods behave too, while the first version didn't.

Positive review.

11d1fc49-71a1-44e1-869f-76be013245a0 commented 12 years ago

Reviewer: Douglas McNeil

11d1fc49-71a1-44e1-869f-76be013245a0 commented 12 years ago

Author: William Stein

jdemeyer commented 12 years ago

Merged: sage-5.0.beta12

jdemeyer commented 12 years ago

apply ONLY this.

jdemeyer commented 12 years ago

comment:26

Attachment: trac_11358-wstein.patch.gz

Rebased patch by removing a hunk which just added a newline.

sagemath / sage

matrix multiplication over ZZ sometimes gives incorrect results #11358