sagemath / sage

Main repository of SageMath
https://www.sagemath.org

intermittent crash in bernmm (4.0.2.rc0) #6304

Closed e13df781-8644-42aa-9d66-1e8d332e25bb closed 15 years ago

e13df781-8644-42aa-9d66-1e8d332e25bb commented 15 years ago
bsd$ uname -a
Darwin bsd.local 9.7.0 Darwin Kernel Version 9.7.0: Tue Mar 31 22:52:17 PDT 2009; root:xnu-1228.12.14~1/RELEASE_I386 i386

~/sage-4.0.2.rc0
bsd$ ./sage
----------------------------------------------------------------------
| Sage Version 4.0.2.rc0, Release Date: 2009-06-15                   |
| Type notebook() for the GUI, and license() for information.        |
----------------------------------------------------------------------
sage: w = bernoulli(100000, algorithm="bernmm", num_threads=8)
sage: w = bernoulli(100000, algorithm="bernmm", num_threads=8)
/Users/dmharvey/sage-4.0.2.rc0/local/bin/sage-sage: line 198: 62412 Illegal instruction     sage-ipython "$@" -i

~/sage-4.0.2.rc0
bsd$ ./sage
----------------------------------------------------------------------
| Sage Version 4.0.2.rc0, Release Date: 2009-06-15                   |
| Type notebook() for the GUI, and license() for information.        |
----------------------------------------------------------------------
sage: w = bernoulli(100000, algorithm="bernmm", num_threads=8)
sage: w = bernoulli(100000, algorithm="bernmm", num_threads=8)
sage: w = bernoulli(100000, algorithm="bernmm", num_threads=8)
sage: w = bernoulli(100000, algorithm="bernmm", num_threads=8)
sage: w = bernoulli(100000, algorithm="bernmm", num_threads=8)
sage: w = bernoulli(100000, algorithm="bernmm", num_threads=8)
/Users/dmharvey/sage-4.0.2.rc0/local/bin/sage-sage: line 198: 62473 Illegal instruction     sage-ipython "$@" -i

Component: combinatorics

Author: David Harvey

Reviewer: Mike Hansen

Merged: sage-4.2.alpha0

Issue created by migration from https://trac.sagemath.org/ticket/6304

e13df781-8644-42aa-9d66-1e8d332e25bb commented 15 years ago
comment:1

I can reproduce this from outside Sage, on the same machine, building bernmm directly against GMP and NTL, using only two threads.

gdb says the crash happens somewhere inside GMP, during one of the large XGCD operations.

e13df781-8644-42aa-9d66-1e8d332e25bb commented 15 years ago
comment:2

I've been trying to debug this for almost three hours, and I have absolutely no idea what is going wrong.

I can't reproduce the error on any other systems. Only seems to happen on OSX 10.5.

williamstein commented 15 years ago
comment:3

> I can't reproduce the error on any other systems. Only seems to happen on OSX 10.5.

I would be OK with the following:

e13df781-8644-42aa-9d66-1e8d332e25bb commented 15 years ago
comment:4

William,

I got the feeling while trying to debug that it could be a compiler issue. The gcc version is 4.0.1 on that box. I've read online that newer versions of XCode for leopard also include gcc 4.2.1, but it's not switched on by default. I couldn't find it on that machine. Would it be possible to try installing apple's newer xcode/gcc to see if that helps?

david

williamstein commented 15 years ago
comment:5

> Would it be possible to try installing apple's newer xcode/gcc to see if that helps?

That's a very good idea. What happens on your laptop? (I assume you can't replicate the issue there.)

Anyway, I can't do anything admin-wise on that box until August, when I am back in Seattle.

e13df781-8644-42aa-9d66-1e8d332e25bb commented 15 years ago
comment:6

Hmmm, no. I can make it fail on my OS 10.4.11 laptop too.

e13df781-8644-42aa-9d66-1e8d332e25bb commented 15 years ago
comment:7

I tried on my wife's laptop which is OS 10.5.7. I switched over to apple's gcc 4.2.1, but I cannot build sage 4.0.2, I get

cc1: error: unrecognized command line option "-Wno-long-double"

while building python-2.5.4.p1.

williamstein commented 15 years ago
comment:8

On Sat, Aug 15, 2009 at 9:52 AM, William Stein<wstein@gmail.com> wrote:
> On Sat, Aug 15, 2009 at 9:42 AM, David Harvey<dmharvey@cims.nyu.edu> wrote:
>>
>> On Aug 15, 2009, at 12:40 PM, William Stein wrote:
>>
>>> On Sat, Aug 15, 2009 at 9:33 AM, David Harvey<dmharvey@cims.nyu.edu>
>>> wrote:
>>>>
>>>> On Aug 15, 2009, at 12:28 PM, William Stein wrote:
>>>>
>>>>> gcc version 4.0.1 (Apple Inc. build 5493)
>>>>
>>>> but still gcc 4.0.1?
>>>>
>>>> Try "man gcc_select"?
>>>
>>> Yes.  So just for clarification, the bug happens with all builds of
>>> GCC 4.0.1, but can be got around by switching to GCC 4.2.x?
>>
>> I don't know. My guess is that there is a bug in the threading support in
>> gcc 4.0.1, but of course it could also be a bug in my code. I spent several
>> hours debugging one day and found nothing. From memory I then tried to build
>> sage using gcc 4.2.x (?) on 10.5 but was not successful, and then I got
>> distracted by other things....
>>
>
> OK, thanks for the clarification.  You do in fact clearly explain this
> at https://github.com/sagemath/sage-prod/issues/6304/.  At least it
> crashes instead of giving wrong answers.
>
> There is no gcc_select command with that name on OS X.  I switched to
> gcc-4.2.1 just by changing two symlinks in /usr/bin/.   (For gcc and
> g++.)   I'll try building Sage on that box with that compiler now.
>

I completely built with the latest gcc-4.2.1, and the bernmm test still fails. I've updated the ticket accordingly. I think the right thing to do at this point is to make using bernmm off by default for OS X 10.5 intel, and put a remark in the docstring that it will sometimes crash sage with an illegal instruction error, and that using the latest XCode with either GCC 4.0.1 or 4.2.1 does not fix the problem. Robust multithreaded programming is hard.

e13df781-8644-42aa-9d66-1e8d332e25bb commented 15 years ago
comment:9

I am trying to debug again on my laptop (core 2 duo, 2 cores, mac os 10.4.11). If I build bernmm standalone using GMP 4.3.1 + NTL 5.4.2 with default configure options, I can get the test suite to fail quite regularly (bus error) with ./bernmm-test --rational 40000 8. Interestingly, if I configure GMP 4.3.1 with recommended "maximum debuggability options" (--disable-shared --enable-assert --enable-alloca=debug --build=none CFLAGS="-m64 -g"), I can't get it to crash any more.

e13df781-8644-42aa-9d66-1e8d332e25bb commented 15 years ago
comment:10

Now I tried compiling GMP with --disable-shared --enable-assert --enable-alloca=debug CFLAGS="-g -O2 -pedantic -m64 -mtune=k8" (the latter is the default CFLAGS plus "-g"), and there seem to be no crashes. This suggests the problem is not in the GMP assembly code.

e13df781-8644-42aa-9d66-1e8d332e25bb commented 15 years ago
comment:11

Tried again, this time removing --disable-shared. Still doesn't crash.

e13df781-8644-42aa-9d66-1e8d332e25bb commented 15 years ago
comment:12

Now making progress.... on sage.math, if I run bernmm-test under valgrind, even for n = 4 and one thread, I get all kinds of invalid read errors.

e13df781-8644-42aa-9d66-1e8d332e25bb commented 15 years ago
comment:13

Actually no progress at all. I discovered after another hour that valgrind even reports invalid read errors for a simple program that computes "2+2" using GMP. I have no idea what to make of this.

e13df781-8644-42aa-9d66-1e8d332e25bb commented 15 years ago
comment:14

Moving back to my laptop, if I compile GMP without the --enable-alloca=debug option, the crashes reappear.

e13df781-8644-42aa-9d66-1e8d332e25bb commented 15 years ago
comment:15

Finally got somewhere.

It appears to be a stack overflow issue. It occurs inside GMP's xgcd function. The default stack size for new threads is 8 MB on sage.math but apparently only 512 KB on OSX. If I increase the thread stack size inside bernmm, the crashes stop happening.

I wrote a test program (below) that calls mpz_invert for a given input size using a given thread stack size. (The mpz_invert call is what seems to be causing the problems in bernmm.) I found that for stack size = 512 KB, GMP doesn't have any problems, but if I bump it down to only 448 KB, it starts crashing for inputs of 2800 limbs and above. This is roughly the largest size that is used in bernmm for computing B(40000), which is the value of k where problems seem to start occurring. So if bernmm itself uses only a few tens of KB of stack, that could be enough to push GMP over the limit.

I haven't tried any of this with MPIR, but given that it uses a similar quasi-linear XGCD algorithm, it wouldn't surprise me that the cause is the same.

This is not so easy to address. A band-aid solution is to make bernmm use a bigger stack. The real issue is whether it is reasonable for GMP to require so much stack space for the XGCD operation (or conversely whether the default stack size on OSX is too small). I will ask on the GMP mailing list about this.

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <gmp.h>
#include <pthread.h>

/* thread body: performs one mpz_invert on an operand of n limbs */
void*
worker (void* arg)
{
  size_t n = * (size_t*) arg;

  mpz_t a, b;
  mpz_init (a);
  mpz_init (b);

  /* try to invert a random number modulo B^n + 1 */
  mpz_random (a, n);
  mpz_set_ui (b, 1);
  mpz_mul_2exp (b, b, n * GMP_NUMB_BITS);
  mpz_add_ui (b, b, 1);
  mpz_invert (a, a, b);

  mpz_clear (b);
  mpz_clear (a);

  return NULL;
}

int
main (int argc, char* argv[])
{
  if (argc < 3)
    {
      printf ("syntax: test <n> <stacksize>\n");
      return 0;
    }

  size_t n = atol (argv[1]);               /* operand size in limbs */
  size_t old_stacksize;
  size_t new_stacksize = atol (argv[2]);   /* thread stack size in bytes */

  pthread_attr_t attr;
  pthread_attr_init (&attr);

  /* report the platform's default stack size for new threads */
  pthread_attr_getstacksize (&attr, &old_stacksize);
  printf ("old stacksize = %zu\n", old_stacksize);

  int retval = pthread_attr_setstacksize (&attr, new_stacksize);
  if (retval != 0)
    {
      printf ("PTHREAD_STACK_MIN = %ld\n", (long) PTHREAD_STACK_MIN);
      printf ("pthread_attr_setstacksize call failed with size = %zu\n",
              new_stacksize);
      return 0;
    }

  /* run the inversion on a thread with the requested stack size */
  pthread_t thread;
  pthread_create (&thread, &attr, worker, &n);
  pthread_join (thread, NULL);

  pthread_attr_destroy (&attr);

  return 0;
}

e13df781-8644-42aa-9d66-1e8d332e25bb commented 15 years ago

Attachment: 6304.patch.gz

e13df781-8644-42aa-9d66-1e8d332e25bb commented 15 years ago
comment:16

I have released bernmm 1.1 which addresses this issue, by providing a THREAD_STACK_SIZE compile-time option. See attached patch.

This doesn't address the underlying issue (that in my opinion, GMP/MPIR uses too much stack space by default for XGCD), but it will have to do for the moment.
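
For readers who want to see the shape of this kind of workaround, here is a minimal sketch of starting a worker thread with a stack size fixed at compile time. It is not the code from the attached patch: the helper names spawn_with_stack and touch_stack, the choice of kilobytes as the unit, and the 4096 KB default are all assumptions made for illustration. Only the pthread calls themselves are standard.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Assumed default of 4096 KB; the value and unit used by bernmm 1.1 may
   differ. Override at build time with -DTHREAD_STACK_SIZE=<kilobytes>. */
#ifndef THREAD_STACK_SIZE
#define THREAD_STACK_SIZE 4096
#endif

/* Hypothetical helper: start fn(arg) on a new thread whose stack is
   THREAD_STACK_SIZE KB. Returns 0 on success or a pthread error code. */
int
spawn_with_stack (pthread_t* thread, void* (*fn) (void*), void* arg)
{
  pthread_attr_t attr;
  int err;

  assert (thread != NULL && fn != NULL);

  err = pthread_attr_init (&attr);
  if (err != 0)
    return err;

  /* fails with EINVAL if the size is below the platform minimum */
  err = pthread_attr_setstacksize (&attr, (size_t) THREAD_STACK_SIZE * 1024);
  if (err == 0)
    err = pthread_create (thread, &attr, fn, arg);

  pthread_attr_destroy (&attr);
  return err;
}

/* Demo worker: touches 1 MB of stack, which would not fit in a 512 KB
   thread stack, and stores the number of bytes touched in *arg. */
void*
touch_stack (void* arg)
{
  volatile char buf[1024 * 1024];
  long count = 0;
  size_t i;

  for (i = 0; i < sizeof buf; i++)
    {
      buf[i] = 1;
      count += buf[i];
    }

  * (long*) arg = count;
  return NULL;
}
```

A caller would pass touch_stack (or a real worker) to spawn_with_stack and then pthread_join the returned thread as usual. Setting the size per thread through pthread attributes keeps the change local to the library, rather than relying on each platform's default.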

mwhansen commented 15 years ago

Reviewer: Mike Hansen

mwhansen commented 15 years ago
comment:18

This change fixed things for me. I'm going to go ahead and give it a positive review.

mwhansen commented 15 years ago

Author: David Harvey

mwhansen commented 15 years ago

Merged: sage-4.2.alpha0