sagemath / sage

Main repository of SageMath
https://www.sagemath.org

openblas 0.3.6 vs. OS X on some processors #27961

Closed jhpalmieri closed 5 years ago

jhpalmieri commented 5 years ago

On some OS X machines but not all (I've seen it only on an iMac Pro):

sage -t src/doc/en/thematic_tutorials/explicit_methods_in_number_theory/birds_other.rst
    Bad exit: 1
**********************************************************************
Tests run before process (pid=66736) failed:
sage: from sage.schemes.hyperelliptic_curves.hypellfrob import hypellfrob ## line 16 ##
sage: R.<x> = PolynomialRing(ZZ) ## line 17 ##
sage: f = x^5 + 2*x^2 + x + 1; p = 101 ## line 18 ##
sage: M = hypellfrob(p, 1, f); M ## line 19 ##
[     O(101)      O(101) 93 + O(101) 62 + O(101)]
[     O(101)      O(101) 55 + O(101) 19 + O(101)]
[     O(101)      O(101) 65 + O(101) 42 + O(101)]
[     O(101)      O(101) 89 + O(101) 29 + O(101)]
sage: M = hypellfrob(p, 4, f)   # about 0.25 seconds ## line 35 ##
sage: M[0,0] ## line 36 ##
91844754 + O(101^4)
sage: M.charpoly() ## line 47 ##
(1 + O(101^4))*x^4 + (7 + O(101^4))*x^3 + (167 + O(101^4))*x^2 + (707 + O(101^4))*x + 10201 + O(101^4)
sage: sig_on_count() # check sig_on/off pairings (virtual doctest) ## line 50 ##
0
sage: M = ModularSymbols(GammaH(13,[3]), weight=4) ## line 86 ##
sage: M ## line 87 ##
Modular Symbols space of dimension 14 for Congruence Subgroup Gamma_H(13) with H generated by [3] of weight 4 with sign 0 and over Rational Field
sage: M.basis() ## line 91 ##
([X^2,(0,4)],
 [X^2,(0,7)],
 [X^2,(4,10)],
 [X^2,(4,11)],
 [X^2,(4,12)],
 [X^2,(7,3)],
 [X^2,(7,5)],
 [X^2,(7,6)],
 [X^2,(7,7)],
 [X^2,(7,8)],
 [X^2,(7,9)],
 [X^2,(7,10)],
 [X^2,(7,11)],
 [X^2,(7,12)])
sage: factor(charpoly(M.T(2))) ## line 106 ##
(x - 7) * (x + 7) * (x - 9)^2 * (x + 5)^2 * (x^2 - x - 4)^2 * (x^2 + 9)^2
sage: dimension(M.cuspidal_subspace()) ## line 109 ##
Fatal: Memory exhausted.

**********************************************************************
----------------------------------------------------------------------
sage -t src/doc/en/thematic_tutorials/explicit_methods_in_number_theory/birds_other.rst  # Bad exit: 1

This occurs when building from scratch, with 8.8.beta7 merged with #27847.

(What used to be issue 1 on this ticket is now addressed at #28008.)

Component: packages: standard

Keywords: days101 openblas

Author: Volker Braun

Branch/Commit: d00c34a

Reviewer: Erik Bray

Issue created by migration from https://trac.sagemath.org/ticket/27961

jhpalmieri commented 5 years ago

Attachment: Sage_crash_report.txt

kiwifb commented 5 years ago
comment:1

I note that it talks about libopenblas_haswellp-r0.3.5.dylib. Shouldn't that be 0.3.6 instead? Was this from building from scratch or from an upgrade from the last beta?

jhpalmieri commented 5 years ago
comment:2

Mine was an incremental build. I'll try from scratch next. David Coudert reported this failure on sage-release and there was a similar message. I don't know if his build was an upgrade or from scratch.

jhpalmieri commented 5 years ago

Description changed:

--- 
+++ 
@@ -1,4 +1,4 @@
-On at least some OS X machines, openblas 0.3.6 causes Sage to crash. From the end of Sage_crash_report.txt:
+On at least some OS X machines, after upgrading from openblas 0.3.5, openblas 0.3.6 causes Sage to crash. From the end of Sage_crash_report.txt:

ImportError: dlopen(/Users/palmieri/Desktop/Sage_stuff/git/sage/local/lib/python2.7/site-packages/sage/matrix/matrix_rational_dense.so, 2): Library not loaded: /Users/palmieri/Desktop/Sage_stuff/git/sage/local/lib/libopenblas_haswellp-r0.3.5.dylib

jhpalmieri commented 5 years ago

Description changed:

--- 
+++ 
@@ -1,7 +1,61 @@
-On at least some OS X machines, after upgrading from openblas 0.3.5, openblas 0.3.6 causes Sage to crash. From the end of Sage_crash_report.txt:
+Two issues with openblas 0.3.6.
+
+1. On at least some OS X machines, after upgrading from openblas 0.3.5, openblas 0.3.6 causes Sage to crash. From the end of Sage_crash_report.txt:

ImportError: dlopen(/Users/palmieri/Desktop/Sage_stuff/git/sage/local/lib/python2.7/site-packages/sage/matrix/matrix_rational_dense.so, 2): Library not loaded: /Users/palmieri/Desktop/Sage_stuff/git/sage/local/lib/libopenblas_haswellp-r0.3.5.dylib
  Referenced from: /Users/palmieri/Desktop/Sage_stuff/git/sage/local/lib/python2.7/site-packages/sage/matrix/matrix_rational_dense.so
  Reason: image not found

+This failure only occurs when upgrading Sage; it works when building from scratch.
+
+2. On some OS X machines but not all (I've seen it only on an iMac Pro):
+
+```
+sage -t src/doc/en/thematic_tutorials/explicit_methods_in_number_theory/birds_other.rst
+    Bad exit: 1
+**********************************************************************
+Tests run before process (pid=66736) failed:
+sage: from sage.schemes.hyperelliptic_curves.hypellfrob import hypellfrob ## line 16 ##
+sage: R.<x> = PolynomialRing(ZZ) ## line 17 ##
+sage: f = x^5 + 2*x^2 + x + 1; p = 101 ## line 18 ##
+sage: M = hypellfrob(p, 1, f); M ## line 19 ##
+[     O(101)      O(101) 93 + O(101) 62 + O(101)]
+[     O(101)      O(101) 55 + O(101) 19 + O(101)]
+[     O(101)      O(101) 65 + O(101) 42 + O(101)]
+[     O(101)      O(101) 89 + O(101) 29 + O(101)]
+sage: M = hypellfrob(p, 4, f)   # about 0.25 seconds ## line 35 ##
+sage: M[0,0] ## line 36 ##
+91844754 + O(101^4)
+sage: M.charpoly() ## line 47 ##
+(1 + O(101^4))*x^4 + (7 + O(101^4))*x^3 + (167 + O(101^4))*x^2 + (707 + O(101^4))*x + 10201 + O(101^4)
+sage: sig_on_count() # check sig_on/off pairings (virtual doctest) ## line 50 ##
+0
+sage: M = ModularSymbols(GammaH(13,[3]), weight=4) ## line 86 ##
+sage: M ## line 87 ##
+Modular Symbols space of dimension 14 for Congruence Subgroup Gamma_H(13) with H generated by [3] of weight 4 with sign 0 and over Rational Field
+sage: M.basis() ## line 91 ##
+([X^2,(0,4)],
+ [X^2,(0,7)],
+ [X^2,(4,10)],
+ [X^2,(4,11)],
+ [X^2,(4,12)],
+ [X^2,(7,3)],
+ [X^2,(7,5)],
+ [X^2,(7,6)],
+ [X^2,(7,7)],
+ [X^2,(7,8)],
+ [X^2,(7,9)],
+ [X^2,(7,10)],
+ [X^2,(7,11)],
+ [X^2,(7,12)])
+sage: factor(charpoly(M.T(2))) ## line 106 ##
+(x - 7) * (x + 7) * (x - 9)^2 * (x + 5)^2 * (x^2 - x - 4)^2 * (x^2 + 9)^2
+sage: dimension(M.cuspidal_subspace()) ## line 109 ##
+Fatal: Memory exhausted.
+
+**********************************************************************
+----------------------------------------------------------------------
+sage -t src/doc/en/thematic_tutorials/explicit_methods_in_number_theory/birds_other.rst  # Bad exit: 1
+```
+This occurs when building from scratch, with 8.8.beta7 merged with #27847.

jhpalmieri commented 5 years ago

Description changed:

--- 
+++ 
@@ -7,7 +7,7 @@
   Referenced from: /Users/palmieri/Desktop/Sage_stuff/git/sage/local/lib/python2.7/site-packages/sage/matrix/matrix_rational_dense.so
   Reason: image not found

-This failure only occurs when upgrading Sage; it works when building from scratch.

@@ -58,4 +58,4 @@

sage -t src/doc/en/thematic_tutorials/explicit_methods_in_number_theory/birds_other.rst # Bad exit: 1

-This occurs when building from scratch, with 8.8.beta7 merged with #27847.
+ This occurs when building from scratch, with 8.8.beta7 merged with #27847.

dimpase commented 5 years ago
comment:6

does this happen with clang-compiled Sage? And what Fortran?

jhpalmieri commented 5 years ago
comment:7

Replying to @dimpase:

does this happen with clang-compiled Sage?

Yes

And what Fortran?

A previously compiled gfortran-6.4.0, which I keep around just so I don't have to wait for Sage to build its own. I'll try with Sage's gfortran to see if that makes a difference, at least in issue 2.

dimpase commented 5 years ago
comment:8

I don't think issue 1 has much to do with the OpenBLAS update; it's just a build problem. For some reason sage/matrix/matrix_rational_dense.so didn't get rebuilt. I guess it might be the Unicode values of the strings involved in the description of this Extension in src/module_list.py. Indeed,

sage: import pkgconfig
sage: cblas_pc = pkgconfig.parse('cblas')
....: cblas_libs = cblas_pc['libraries']
....: cblas_library_dirs = cblas_pc['library_dirs']
....: cblas_include_dirs = cblas_pc['include_dirs']
....: 
sage: cblas_libs
[u'openblas']
sage: cblas_library_dirs
[u'/mnt/opt/Sage/sage-dev/local/lib']

and then one has

        libraries = ['iml', 'ntl', 'm'] + cblas_libs,

mixing str and unicode in Python 2, at least.
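The suspected str/unicode mixing can be reproduced and normalized in a few lines. This is only a sketch of the hypothesis: `as_native_str` is a hypothetical helper, not something that exists in src/module_list.py, and nothing in this ticket confirms that mixed string types are what kept the extension from being rebuilt.

```python
def as_native_str(values, encoding="utf-8"):
    """Return a list in which every entry is the interpreter's native str type."""
    result = []
    for value in values:
        if str is bytes and not isinstance(value, str):
            # Python 2: a unicode object such as u'openblas' becomes a byte str
            value = value.encode(encoding)
        elif str is not bytes and isinstance(value, bytes):
            # Python 3: a bytes object becomes a text str
            value = value.decode(encoding)
        result.append(value)
    return result


# Mimic the list built for the extension: plain literals plus pkgconfig output.
cblas_libs = [u'openblas']  # value reported by pkgconfig.parse('cblas') above
libraries = as_native_str(['iml', 'ntl', 'm'] + cblas_libs)
print([type(name).__name__ for name in libraries])
```

On Python 3 every entry is already a text `str`, so the helper changes nothing there.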

jhpalmieri commented 5 years ago
comment:9

The unicode observation is interesting, although both David Coudert and I saw it with Python 2 and Python 3. With Python 3:

sage: import pkgconfig
sage: cblas_pc = pkgconfig.parse('cblas')
sage: cblas_libs = cblas_pc['libraries']
sage: cblas_library_dirs = cblas_pc['library_dirs']
sage: cblas_libs
['openblas']
sage: cblas_library_dirs
['/Users/jpalmier/Desktop/Sage/sage_builds/PYTHON3/sage-8.8.beta7/local/lib']

jhpalmieri commented 5 years ago
comment:10

Oh, and I completely agree about issue 1 not being a problem with the openblas update, but instead being a Sage build problem.

jhpalmieri commented 5 years ago
comment:11

No change (back to issue 2) if I use Sage's gfortran, by the way.

jhpalmieri commented 5 years ago
comment:12

Let me know what else I can do to diagnose this.

dimpase commented 5 years ago
comment:13

it might be hardware-dependent. If you don't see this on an otherwise identical branch on another OSX machine with the same OS version, then that's the most obvious conclusion.

jhpalmieri commented 5 years ago
comment:14

I agree. Is it an OpenBLAS issue? For what it's worth, when I run the OpenBLAS test suite, I seem to get failures with 0.3.5 and also 0.3.6, although spkg-check exits successfully. None of the failures look immediately like "Fatal: Memory exhausted" to me, but I don't really know what I should be looking for.

vbraun commented 5 years ago
comment:15

Can you try to recompile openblas with different target arch, e.g.

OPENBLAS_CONFIGURE="TARGET=PRESCOTT" ./sage -p openblas

Also, are you sure you are not actually running out of memory, e.g. because of background processes or a ulimit?
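One way to rule out a ulimit is to ask the process itself which resource limits it sees. The following is a minimal sketch using Python's standard `resource` module; the helper names `memory_limits` and `describe` are made up for illustration, and which limits are defined varies by platform.

```python
import resource


def memory_limits():
    """Map limit names to (soft, hard) pairs for the limits this platform defines."""
    limits = {}
    for name in ("RLIMIT_AS", "RLIMIT_DATA", "RLIMIT_STACK"):
        key = getattr(resource, name, None)  # not every platform defines every limit
        if key is not None:
            limits[name] = resource.getrlimit(key)
    return limits


def describe(value):
    """Render a limit value, treating RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else "%d bytes" % value


for name, (soft, hard) in sorted(memory_limits().items()):
    print("%s: soft=%s hard=%s" % (name, describe(soft), describe(hard)))
```

Running this inside the same shell (or Sage session) that runs the doctests shows whether a finite address-space or data limit applies to that process.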

jhpalmieri commented 5 years ago
comment:16

I am pretty sure that I am not running out of memory. The machine has 32GB of RAM, and after running into this, I checked the available memory using the OS X "Activity Monitor" and everything was fine. I quit all potential memory hogs, I rebooted, and still got

sage: M = ModularSymbols(GammaH(13,[3]), weight=4)
sage: M.cuspidal_subspace()
Fatal: Memory exhausted.

I am away from the machine now, but I will try with a different target tomorrow.

vbraun commented 5 years ago
comment:17

Another datapoint would be running tests with multithreading disabled, e.g.

OPENBLAS_NUM_THREADS=1 sage -t src/doc/en/thematic_tutorials/explicit_methods_in_number_theory/birds_other.rst

jhpalmieri commented 5 years ago
comment:18

OPENBLAS_CONFIGURE="TARGET=PRESCOTT" ./sage -p openblas works: Sage passes all tests. Also, the openblas test suite succeeds this way.

On the other hand, I still get "Fatal: Memory exhausted" if I don't specify OPENBLAS_CONFIGURE but instead just disable multithreading with OPENBLAS_NUM_THREADS=1 sage -t ....

jhpalmieri commented 5 years ago
comment:19

By the way, as mentioned above, if I don't specify OPENBLAS_CONFIGURE, I see some apparent failures when I run ./sage -f -c openblas, but spkg-check exits successfully. Is make tests misconfigured by OpenBLAS?

embray commented 5 years ago
comment:20

Replying to @vbraun:

Another datapoint would be running tests with multithreading disabled, e.g.

OPENBLAS_NUM_THREADS=1 sage -t src/doc/en/thematic_tutorials/explicit_methods_in_number_theory/birds_other.rst

IIUC this is already the default set (somewhat to my chagrin) in sage-env:

# Multithreading in OpenBLAS does not seem to play well with Sage's attempts to
# spawn new processes, see #26118. Apparently, OpenBLAS sets the thread
# affinity and, e.g., parallel doctest jobs, remain on the same core.
# Disabling that thread-affinity with OPENBLAS_MAIN_FREE=1 leads to hangs in
# some computations.
# So we disable OpenBLAS' threading completely; we might loose some performance
# here but strangely the opposite seems to be the case. Note that callers such
# as LinBox use a single-threaded OpenBLAS anyway.
export OPENBLAS_NUM_THREADS=1

unless we unset that when running the tests?
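Whether the variable is still set in a given process can be checked directly from Python. This is a small sketch, not part of Sage; `openblas_thread_setting` is a hypothetical helper name, and it only reports what the environment contains, not how OpenBLAS interprets it.

```python
import os


def openblas_thread_setting(environ=None):
    """Describe the OPENBLAS_NUM_THREADS value found in the given environment."""
    environ = os.environ if environ is None else environ
    value = environ.get("OPENBLAS_NUM_THREADS")
    if value is None:
        return "unset"
    if value.strip() == "1":
        return "single-threaded"
    return "set to %r" % value


print("OPENBLAS_NUM_THREADS is", openblas_thread_setting())
```

Calling it from within a doctest run and from an interactive session would show whether the two environments differ.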

embray commented 5 years ago
comment:21

I wonder if the upgrade to 0.3.6 can be reverted until/unless the cause of this is rooted out. 0.3.5 was working fine, but there was a rush to upgrade for better gcc 9.0 support. However, the gcc problem already has a workaround: use an older compiler.

That or, if there is a specific change to OpenBLAS related to the gcc problem perhaps we could just patch that selectively.

embray commented 5 years ago
comment:22

I've been trawling through the existing issues opened against OpenBLAS and don't see anything obvious that matches this problem, though it would help to know exactly how this call is involving OpenBLAS.

vbraun commented 5 years ago
comment:23

Can you upload the openblas config.h and install log?

jhpalmieri commented 5 years ago
comment:24

config.h looks like this (without setting OPENBLAS_CONFIGURE):

#define OS_DARWIN   1
#define ARCH_X86_64 1
#define C_CLANG 1
#define __64BIT__   1
#define FUNDERSCORE _
#define PTHREAD_CREATE_FUNC pthread_create
#define BUNDERSCORE _
#define NEEDBUNDERSCORE 1
#define SKYLAKEX
#define L1_CODE_SIZE 32768
#define L1_CODE_ASSOCIATIVE 8
#define L1_CODE_LINESIZE 64
#define L1_DATA_SIZE 32768
#define L1_DATA_ASSOCIATIVE 8
#define L1_DATA_LINESIZE 64
#define L2_SIZE 262144
#define L2_ASSOCIATIVE 8
#define L2_LINESIZE 64
#define ITB_SIZE 2097152
#define ITB_ASSOCIATIVE 0
#define ITB_ENTRIES 8
#define DTB_SIZE 4096
#define DTB_ASSOCIATIVE 4
#define DTB_DEFAULT_ENTRIES 64
#define HAVE_CMOV
#define HAVE_MMX
#define HAVE_SSE
#define HAVE_SSE2
#define HAVE_SSE3
#define HAVE_SSSE3
#define HAVE_SSE4_1
#define HAVE_SSE4_2
#define HAVE_AVX
#define HAVE_AVX2
#define HAVE_AVX512VL
#define HAVE_FMA3
#define HAVE_CFLUSH
#define NUM_SHAREDCACHE 2
#define NUM_CORES 8
#define CORE_SKYLAKEX
#define CHAR_CORENAME "SKYLAKEX"
#define SLOCAL_BUFFER_SIZE  24576
#define DLOCAL_BUFFER_SIZE  32768
#define CLOCAL_BUFFER_SIZE  12288
#define ZLOCAL_BUFFER_SIZE  8192
#define GEMM_MULTITHREAD_THRESHOLD  4

If I set OPENBLAS_CONFIGURE="TARGET=PRESCOTT", it looks like

#define OS_DARWIN   1
#define ARCH_X86_64 1
#define C_CLANG 1
#define __64BIT__   1
#define FUNDERSCORE _
#define PTHREAD_CREATE_FUNC pthread_create
#define BUNDERSCORE _
#define NEEDBUNDERSCORE 1
#define PENTIUM4
#define L1_DATA_SIZE 16384
#define L1_DATA_LINESIZE 64
#define L2_SIZE 1048576
#define L2_LINESIZE 64
#define DTB_DEFAULT_ENTRIES 64
#define DTB_SIZE 4096
#define L2_ASSOCIATIVE 8
#define HAVE_CMOV
#define HAVE_MMX
#define HAVE_SSE
#define HAVE_SSE2
#define HAVE_SSE3
#define CORE_PRESCOTT
#define CHAR_CORENAME "PRESCOTT"
#define SLOCAL_BUFFER_SIZE  8192
#define DLOCAL_BUFFER_SIZE  8192
#define CLOCAL_BUFFER_SIZE  8192
#define ZLOCAL_BUFFER_SIZE  8192
#define GEMM_MULTITHREAD_THRESHOLD  4

jhpalmieri commented 5 years ago

Attachment: openblas.log

jhpalmieri commented 5 years ago

Attachment: openblas-PRESCOTT.log

vbraun commented 5 years ago
comment:25

My guess is that it's a problem with AVX512; not many CPUs have it, so it's not tested much. Also, part of the release notes is "the AVX512 DGEMM kernel has been disabled again due to unsolved problems", which does not exactly fill me with confidence that it all works as expected. Can you try:

OPENBLAS_CONFIGURE="NO_AVX512=1" ./sage -p openblas

jhpalmieri commented 5 years ago
comment:26

Replying to @vbraun:

Can you try:

OPENBLAS_CONFIGURE="NO_AVX512=1" ./sage -p openblas

That works, all tests pass. config.h, in case it's relevant:

#define OS_DARWIN   1
#define ARCH_X86_64 1
#define C_CLANG 1
#define __64BIT__   1
#define FUNDERSCORE _
#define PTHREAD_CREATE_FUNC pthread_create
#define BUNDERSCORE _
#define NEEDBUNDERSCORE 1
#define HASWELL
#define L1_CODE_SIZE 32768
#define L1_CODE_ASSOCIATIVE 8
#define L1_CODE_LINESIZE 64
#define L1_DATA_SIZE 32768
#define L1_DATA_ASSOCIATIVE 8
#define L1_DATA_LINESIZE 64
#define L2_SIZE 262144
#define L2_ASSOCIATIVE 8
#define L2_LINESIZE 64
#define ITB_SIZE 2097152
#define ITB_ASSOCIATIVE 0
#define ITB_ENTRIES 8
#define DTB_SIZE 4096
#define DTB_ASSOCIATIVE 4
#define DTB_DEFAULT_ENTRIES 64
#define HAVE_CMOV
#define HAVE_MMX
#define HAVE_SSE
#define HAVE_SSE2
#define HAVE_SSE3
#define HAVE_SSSE3
#define HAVE_SSE4_1
#define HAVE_SSE4_2
#define HAVE_AVX
#define HAVE_AVX2
#define HAVE_FMA3
#define HAVE_CFLUSH
#define NUM_SHAREDCACHE 2
#define NUM_CORES 8
#define CORE_HASWELL
#define CHAR_CORENAME "HASWELL"
#define SLOCAL_BUFFER_SIZE  24576
#define DLOCAL_BUFFER_SIZE  32768
#define CLOCAL_BUFFER_SIZE  12288
#define ZLOCAL_BUFFER_SIZE  8192
#define GEMM_MULTITHREAD_THRESHOLD  4

Now it's using "HASWELL" instead of "SKYLAKEX".
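The differences between two such config.h files can be listed mechanically rather than by eye. The sketch below parses `#define` lines and compares two configurations; the function names are made up for illustration, and the inputs are short excerpts of the listings posted in this ticket, not the full files.

```python
import re

# Matches "#define NAME" and "#define NAME VALUE"
DEFINE_RE = re.compile(r"^\s*#define\s+(\w+)(?:\s+(.*\S))?\s*$")


def parse_defines(text):
    """Return {name: value} for every #define line; bare defines map to None."""
    defines = {}
    for line in text.splitlines():
        match = DEFINE_RE.match(line)
        if match:
            defines[match.group(1)] = match.group(2)
    return defines


def compare_defines(old, new):
    """Return (removed, added, changed) macro names between two configurations."""
    removed = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    changed = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return removed, added, changed


# Excerpts from the default build and the NO_AVX512=1 build shown above.
default_cfg = parse_defines("""
#define SKYLAKEX
#define HAVE_AVX2
#define HAVE_AVX512VL
#define HAVE_FMA3
#define CORE_SKYLAKEX
#define CHAR_CORENAME "SKYLAKEX"
""")
no_avx512_cfg = parse_defines("""
#define HASWELL
#define HAVE_AVX2
#define HAVE_FMA3
#define CORE_HASWELL
#define CHAR_CORENAME "HASWELL"
""")

removed, added, changed = compare_defines(default_cfg, no_avx512_cfg)
print("removed:", removed)
print("added:  ", added)
print("changed:", changed)
```

On these excerpts the only feature macro that disappears is `HAVE_AVX512VL`; the remaining differences are the core name macros.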

By the way, what should the workflow be? I ran ... ./sage -p openblas and then make. I didn't see any obvious errors, but make fails, and in particular Sage doesn't start. Running ./sage -ba fixes it. This seems like issue 1 (from the ticket description): something in the Sage library is not getting rebuilt properly.

vbraun commented 5 years ago

Branch: u/vbraun/openblas_0_3_6_vs__os_x

vbraun commented 5 years ago
comment:28

I think "HASWELL" is correct, since AVX512 support is essentially the only new ISA extension.

BLAS doesn't have any arch-dependent headers, so it must be that some binary is linking against libopenblas_haswellp-r0.3.5.dylib instead of libopenblas.dylib. Then when you replace it with libopenblas-whatever-0.3.6.dylib it won't work any more. On Linux it's correct, so that's an OS X special. I've created #28008 to deal with that separate issue.
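To spot binaries that reference a versioned, architecture-specific OpenBLAS file name instead of the generic one, the names can be pulled out of any text that mentions them. The sketch below is illustrative only: `find_versioned_openblas` is a hypothetical helper, the regular expression is an assumption based on the single file name seen in this ticket, and the sample input is the ImportError path quoted in the description. On OS X the same function could be fed the output of `otool -L` for an extension module.

```python
import re

# Assumed naming pattern, modeled on libopenblas_haswellp-r0.3.5.dylib
VERSIONED_OPENBLAS = re.compile(r"libopenblas_\w+-r[0-9.]+\.dylib")


def find_versioned_openblas(text):
    """Return every versioned OpenBLAS dylib name mentioned in the text."""
    return VERSIONED_OPENBLAS.findall(text)


# Path taken from the ImportError in the ticket description.
message = (
    "Library not loaded: /Users/palmieri/Desktop/Sage_stuff/git/sage/"
    "local/lib/libopenblas_haswellp-r0.3.5.dylib"
)
print(find_versioned_openblas(message))
```

A reference to plain libopenblas.dylib does not match, so an empty result suggests the binary is not pinned to one specific OpenBLAS build.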


New commits:

d00c34a Disable OpenBLAS AVX512 support since it causes crashes

vbraun commented 5 years ago

Commit: d00c34a

vbraun commented 5 years ago

Author: Volker Braun

embray commented 5 years ago
comment:31

Seems like the safest bet for now. AVX512 support would be nice to have, but I don't think most people using Sage are explicitly dependent on such bleeding-edge features, and they are probably building their own openblas if they are.

embray commented 5 years ago

Reviewer: Erik Bray

embray commented 5 years ago

Changed keywords from none to days101 openblas

jhpalmieri commented 5 years ago

Description changed:

--- 
+++ 
@@ -1,15 +1,4 @@
-Two issues with openblas 0.3.6.
-
-1. On at least some OS X machines, after upgrading from openblas 0.3.5, openblas 0.3.6 causes Sage to crash. From the end of Sage_crash_report.txt:
-
-```
-ImportError: dlopen(/Users/palmieri/Desktop/Sage_stuff/git/sage/local/lib/python2.7/site-packages/sage/matrix/matrix_rational_dense.so, 2): Library not loaded: /Users/palmieri/Desktop/Sage_stuff/git/sage/local/lib/libopenblas_haswellp-r0.3.5.dylib
-  Referenced from: /Users/palmieri/Desktop/Sage_stuff/git/sage/local/lib/python2.7/site-packages/sage/matrix/matrix_rational_dense.so
-  Reason: image not found
-```
- This failure only occurs when upgrading Sage; it works when building from scratch.
-
-2. On some OS X machines but not all (I've seen it only on an iMac Pro):
+On some OS X machines but not all (I've seen it only on an iMac Pro):

sage -t src/doc/en/thematic_tutorials/explicit_methods_in_number_theory/birds_other.rst
@@ -58,4 +47,6 @@

sage -t src/doc/en/thematic_tutorials/explicit_methods_in_number_theory/birds_other.rst # Bad exit: 1

- This occurs when building from scratch, with 8.8.beta7 merged with #27847.
+This occurs when building from scratch, with 8.8.beta7 merged with #27847.
+
+(What used to be issue !#1 on this ticket is now addressed at #28008.)

vbraun commented 5 years ago

Changed branch from u/vbraun/openblas_0_3_6_vs__os_x to d00c34a