sagemath / sage

Main repository of SageMath
https://www.sagemath.org
Other
1.38k stars 469 forks source link

test failures in sage.schemes.curves.zariski_vankampen with sirocco built with -O2 #29167

Closed embray closed 4 years ago

embray commented 4 years ago

29149 fixes building Sage with the sirocco package installed on Cygwin. This required, for linking purposes, removing the forced -O0 flag (this could probably be worked around by other means--building with optimizations disabled should be possible).

However, as mmarco remarked below, there are some optimizations that cause numerical errors in sirocco such that it fails to converge or takes a longer path than necessary. This can be reproduced on Linux (although it was previously found on Cygwin).

In particular this causes test failures in sage.schemes.curves.zariski_vankampen. First the following test fails:

File "src/sage/schemes/curves/zariski_vankampen.py", line 268, in sage.schemes.curves.zariski_vankampen.followstrand
Failed example:
    followstrand(f, x0, x1, -1.0) # optional - sirocco # abs tol 1e-15
Expected:
    [(0.0, -1.0, 0.0),
     (0.7500000000000001, -1.015090921153253, -0.24752813818386948),
     (1.0, -1.026166099551513, -0.32768940253604323)]
Got:
    [(0.0, -1.0, 0.0),
     (0.04687500000000001, -1.0000610264662042, -0.015624364352899234),
     (0.09376287283569926, -1.0002440686687775, -0.031249206930640684),
     (0.14068935711887412, -1.00054911561723, -0.046879295645742315),
     (0.18768019150872053, -1.0009762160115023, -0.06251939538085316),
     (0.23476112890433726, -1.0015254781428102, -0.07817426805966325),
     (0.2819579506558838, -1.0021970697475961, -0.0938486727311868),
     (0.32683564123989106, -1.0029469806716729, -0.10873190673317454),
     (0.36890847616239786, -1.0037475631034418, -0.12266360371442413),
     (0.408351758902248, -1.0045828947575761, -0.13570358710012892),
     (0.4453298364708575, -1.0054396825086667, -0.14790834427708396),
     (0.5146637319120003, -1.007235298523974, -0.1707339873423019),
     (0.5753308904230002, -1.0090048273677124, -0.19063761622091374),
     (0.6284146541201252, -1.0107014101228857, -0.20799548368797782),
     (0.6748629473551095, -1.0122968271247523, -0.22313613838778654),
     (0.7155052039357208, -1.0137759106169235, -0.23634583220457633),
     (0.7866291529517906, -1.0165461226074612, -0.25937181657421426),
     (0.8399721147138429, -1.0187713227013186, -0.2765614548183696),
     (0.8799793360353821, -1.0205207374109584, -0.28940677345545085),
     (0.939990168017691, -1.0232704733075533, -0.30859664694102984),
     (1.0, -1.026166099551513, -0.3276894025360433)]

The first an last lines are correct, but none of the stuff in the middle.

Then a bit later it hangs, seemingly indefinitely at

Trying (line 339):    B = zvk.braid_in_segment(g,CC(p1),CC(p2)) # optional - sirocco

which on Linux passes very quickly. However, the CPU is still very active during this hang so it must be some kind of busy loop.

I don't understand anything about this package except that it's using some kind of linear approximation techniques, so there might be some small numerical glitches being invoked that are causing some solutions to diverge or something.

It might be possible to work around this by identifying exactly which optimizations result in the problem, rather than disabling all optimizations.

Upstream PR: https://github.com/miguelmarco/SIROCCO2/pull/3

Upstream: Fixed upstream, but not in a stable release.

CC: @miguelmarco

Component: algebraic topology

Keywords: cygwin sirocco

Author: Erik Bray

Branch/Commit: u/embray/ticket-29167 @ 35d7703

Issue created by migration from https://trac.sagemath.org/ticket/29167

miguelmarco commented 4 years ago
comment:1

Thanks for looking into this.

Sadly, I don't have access to any windows system, so I never tested Sirocco there.

As you say, Sirocco does numerical aproximations of the movements of roots of a polynomial, but carefully testing that the step is small enough to guarantee the correctness. So it is indeed very sensitive to all the small subtleties of floating point computing.

For example, we discovered when we moved to autotools that the optimization flag -O2 can actually go against performance (and maybe even correctness) , so we had to make sure that it was compiled with no optimization options. Maybe the problem you are detecting on windows is related to that?

miguelmarco commented 4 years ago
comment:2

I just tested under a linux box disabling the tweek we did to prevent the compiler optimization and I get the same error as you point, so it really looks that is the cuplrit.

In particular, the only line I had to modify is this one:

: ${CXXFLAGS=-O0 -g}

In the configure.ac file [1]

Maybe that kind of autotools behaviour is platform-dependent?

[1] https://github.com/miguelmarco/SIROCCO2/blob/master/configure.ac#L4

P.S. I noticed that you forked the old sirocco repo. The one we are using here is the version 2 one: https://github.com/miguelmarco/SIROCCO2

embray commented 4 years ago
comment:3

That's for the note about -O0. I have another ticket #29149 where I'm fixing building sirocco on Cygwin (though I haven't posted the patch yet). In particular I had to remove CXXFLAGS=-O0 to even get it to build, because otherwise I had problems during linking with some templates; I'm not exactly sure why but it seems like they were being overspecialized, in such a way that resulted in multiple definitions for some functions. Removing -O0 fixed it because then the compiler would optimize out unused specializations, but there is probably a better solution.

As you say, this is likely the culprit.

embray commented 4 years ago
comment:4

Replying to @miguelmarco:

P.S. I noticed that you forked the old sirocco repo. The one we are using here is the version 2 one: https://github.com/miguelmarco/SIROCCO2

I see. When I noticed that my copy was not 2.0 I updated my fork of the repo to 2.0 as well, so I can confirm that I am building the correct version at least.

embray commented 4 years ago
comment:5

Replying to @miguelmarco:

For example, we discovered when we moved to autotools that the optimization flag -O2 can actually go against performance (and maybe even correctness) , so we had to make sure that it was compiled with no optimization options. Maybe the problem you are detecting on windows is related to that?

Did you ever figure out what specific optimizations were hurting performance and/or correctness, and where those optimizations were occurring? Because disabling all optimizations is a blunt hammer and probably hurts performance in other cases. I know figuring that out can be tricky of course, but I have to ask.

embray commented 4 years ago
comment:6

I was able to reproduce this on Linux also by removing -O0 and setting -O2 instead.

embray commented 4 years ago

Description changed:

--- 
+++ 
@@ -1,6 +1,8 @@
-#29149 fixes building Sage with the sirocco package installed on Cygwin.
+#29149 fixes building Sage with the sirocco package installed on Cygwin. This required, for linking purposes, removing the forced `-O0` flag (this could probably be worked around by other means--building with optimizations disabled *should* be possible).

-However, this then has some test failures in `sage.schemes.curves.zariski_vankampen` that I can't reproduce on Linux.  First the following test fails:
+However, as mmarco remarked below, there are some optimizations that cause numerical errors in sirocco such that it fails to converge or takes a longer path than necessary.  This *can* be reproduced on Linux (although it was previously found on Cygwin).
+
+In particular this causes test failures in `sage.schemes.curves.zariski_vankampen`.  First the following test fails:

File "src/sage/schemes/curves/zariski_vankampen.py", line 268, in sage.schemes.curves.zariski_vankampen.followstrand @@ -45,3 +47,5 @@ which on Linux passes very quickly. However, the CPU is still very active during this hang so it must be some kind of busy loop.

I don't understand anything about this package except that it's using some kind of linear approximation techniques, so there might be some small numerical glitches being invoked that are causing some solutions to diverge or something. + +It might be possible to work around this by identifying exactly which optimizations result in the problem, rather than disabling all optimizations.

embray commented 4 years ago
comment:8

I can also reproduce the problem enabling basic optimizations with -O1, so it could be any combination of

-fauto-inc-dec 
-fbranch-count-reg 
-fcombine-stack-adjustments 
-fcompare-elim 
-fcprop-registers 
-fdce 
-fdefer-pop 
-fdelayed-branch 
-fdse 
-fforward-propagate 
-fguess-branch-probability 
-fif-conversion 
-fif-conversion2 
-finline-functions-called-once 
-fipa-profile 
-fipa-pure-const 
-fipa-reference 
-fipa-reference-addressable 
-fmerge-constants 
-fmove-loop-invariants 
-fomit-frame-pointer 
-freorder-blocks 
-fshrink-wrap 
-fshrink-wrap-separate 
-fsplit-wide-types 
-fssa-backprop 
-fssa-phiopt 
-ftree-bit-ccp 
-ftree-ccp 
-ftree-ch 
-ftree-coalesce-vars 
-ftree-copy-prop 
-ftree-dce 
-ftree-dominator-opts 
-ftree-dse 
-ftree-forwprop 
-ftree-fre 
-ftree-phiprop 
-ftree-pta 
-ftree-scev-cprop 
-ftree-sink 
-ftree-slsr 
-ftree-sra 
-ftree-ter 
-funit-at-a-time
miguelmarco commented 4 years ago
comment:9

I didn't dig into the details of what specific optimization was the problem. Sorry.

embray commented 4 years ago
comment:10

One problem I found (if not the problem) is that although there is a template specialization for evaluation polynomials on complex intervals:

template <>
IComplex Polynomial<IComplex>::operator () (const IComplex &x, const IComplex &y) const{
    IComplex rop, ropx, ropy;

#ifdef DEVELOPER
    std::cerr << "using specialized polynomial eval for " << __PRETTY_FUNCTION__ << std::endl;
#endif

    ropx = this->evalIPolHornerXY (x, y);
    ropy = this->evalIPolHornerYX (x, y);
    rop  = this->evalPolClassic (x, y);

    rop.r.a = MAX(MAX(ropx.r.a, ropy.r.a), rop.r.a);
    rop.r.b = MIN(MIN(ropx.r.b, ropy.r.b), rop.r.b);
    rop.i.a = MAX(MAX(ropx.i.a, ropy.i.a), rop.i.a);
    rop.i.b = MIN(MIN(ropx.i.b, ropy.i.b), rop.i.b);

    return rop;
}

when building with -O1 that specialization seems to be being ignored, and instead it's using the unspecialized template:

template <class T>
T Polynomial<T>::operator() (const T &x, const T &y) const {
#ifdef DEVELOPER
    std::cerr << "using unspecialized polynomial eval for " << __PRETTY_FUNCTION__ << std::endl;
#endif
    return this->evalPolClassic (x,y);
}

I don't know why that would be but I'll keep looking...

embray commented 4 years ago
comment:11

This seems to fix it for me:

diff --git a/include/polynomial.hpp b/include/polynomial.hpp
index 88d1299..4596579 100644
--- a/include/polynomial.hpp
+++ b/include/polynomial.hpp
@@ -230,6 +230,10 @@ T Polynomial<T>::operator() (const T &x, const T &y) const {
        return this->evalPolClassic (x,y);
 }

+// Defined in lib/polynomial.cpp
+template <> IComplex Polynomial<IComplex>::operator() (const IComplex &x, const IComplex &y) const;
+template <> MPIComplex Polynomial<MPIComplex>::operator () (const MPIComplex &x, const MPIComplex &y) const;
+
 template <class T>
 T Polynomial<T>::diffX (const T &x, const T &y) const {
                // coef[(i*(i+1))/2 + j] is coeficient of monomial of degree 'i',
@@ -311,6 +315,10 @@ T Polynomial<T>::diffY (const T &x, const T &y) const {
        return this->evalPolYClassic (x,y);
 }

+// Defined in lib/polynomial.cpp
+template<> IComplex Polynomial<IComplex>::diffY (const IComplex &x, const IComplex &y) const;
+template<> MPIComplex Polynomial<MPIComplex>::diffY (const MPIComplex &x, const MPIComplex &y) const;
+

 template <class T>
 T Polynomial<T>::diffXX (const T &x, const T &y) const {

The problem is that sirocco.cpp includes "polynomial.hpp" which only has the unspecialized operator() defined inline, but no declaration for the specialized version, so it just use the unspecialized one. Meanwhile, when compiling polynomial.cpp the unused specializations are just optimized out.

This would possibly explain the link errors I was having on Cygwin as well--I don't know why the problem only occurs on Windows but it must be generating an explicit specialization for Polynomial<IComplex>::operator() in sirocco.o and then conflicting with the one in polynomial.o. Going to test that theory out now.

embray commented 4 years ago
comment:12

Yep, this is exactly the link error that I got which motivated me to try removing the -O0. This explains it exactly:

libtool: link: g++ -std=gnu++11 -std=gnu++11 -shared -nostdlib /usr/lib/gcc/x86_64-pc-cygwin/7.4.0/crtbeginS.o  .libs/icomplex.o .libs/interval.o .libs/mp_complex.o .libs/mp_icomplex.o .libs/mp_interval.o .libs/polynomial.o .libs/sirocco.o   -lmpfr -lgmp -L/home/embray/src/sagemath/sage/local/lib -L/home/embray/src/sagemath/sage/local/lib/../lib -L/usr/lib/gcc/x86_64-pc-cygwin/7.4.0 -L/usr/lib/gcc/x86_64-pc-cygwin/7.4.0/../../../../x86_64-pc-cygwin/lib/../lib -L/usr/lib/gcc/x86_64-pc-cygwin/7.4.0/../../../../lib -L/lib/../lib -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-pc-cygwin/7.4.0/../../../../x86_64-pc-cygwin/lib -L/usr/lib/gcc/x86_64-pc-cygwin/7.4.0/../../.. -lstdc++ -lgcc_s -lgcc -lcygwin -ladvapi32 -lshell32 -luser32 -lkernel32 -lgcc_s -lgcc /usr/lib/gcc/x86_64-pc-cygwin/7.4.0/crtend.o  -O0 -Wl,-rpath -Wl,/home/embray/src/sagemath/sage/local/lib   -o .libs/cygsirocco-0.dll -Wl,--enable-auto-image-base -Xlinker --out-implib -Xlinker .libs/libsirocco.dll.a
.libs/sirocco.o:sirocco.cpp:(.text$_ZNK10PolynomialI8IComplexEclERKS0_S3_[_ZNK10PolynomialI8IComplexEclERKS0_S3_]+0x0): multiple definition of `Polynomial<IComplex>::operator()(IComplex const&, IComplex const&) const'
.libs/polynomial.o:polynomial.cpp:(.text+0x642): first defined here
.libs/sirocco.o:sirocco.cpp:(.text$_ZNK10PolynomialI8IComplexE5diffYERKS0_S3_[_ZNK10PolynomialI8IComplexE5diffYERKS0_S3_]+0x0): multiple definition of `Polynomial<IComplex>::diffY(IComplex const&, IComplex const&) const'
.libs/polynomial.o:polynomial.cpp:(.text+0xfbe): first defined here
.libs/sirocco.o:sirocco.cpp:(.text$_ZNK10PolynomialI10MPIComplexEclERKS0_S3_[_ZNK10PolynomialI10MPIComplexEclERKS0_S3_]+0x0): multiple definition of `Polynomial<MPIComplex>::operator()(MPIComplex const&, MPIComplex const&) const'
.libs/polynomial.o:polynomial.cpp:(.text+0x188a): first defined here
.libs/sirocco.o:sirocco.cpp:(.text$_ZNK10PolynomialI10MPIComplexE5diffYERKS0_S3_[_ZNK10PolynomialI10MPIComplexE5diffYERKS0_S3_]+0x0): multiple definition of `Polynomial<MPIComplex>::diffY(MPIComplex const&, MPIComplex const&) const'
.libs/polynomial.o:polynomial.cpp:(.text+0x22f8): first defined here
collect2: error: ld returned 1 exit status
embray commented 4 years ago

Author: Erik Bray

embray commented 4 years ago

New commits:

35d7703Trac #29167: Fix incorrect results that occur when sirocco is build with
embray commented 4 years ago

Description changed:

--- 
+++ 
@@ -49,3 +49,5 @@
 I don't understand anything about this package except that it's using some kind of linear approximation techniques, so there might be some small numerical glitches being invoked that are causing some solutions to diverge or something.

 It might be possible to work around this by identifying exactly which optimizations result in the problem, rather than disabling all optimizations.
+
+*Upstream PR:* https://github.com/miguelmarco/SIROCCO2/pull/3
embray commented 4 years ago

Branch: u/embray/ticket-29167

embray commented 4 years ago

Upstream: Reported upstream. Developers acknowledge bug.

embray commented 4 years ago

Commit: 35d7703

miguelmarco commented 4 years ago
comment:14

Thanks a lot. Great work.

I might need some time to review and test it though.

Did you check that all test pass with this patch?

Also, what would you consider the best way to go: just patch sirocco at install time inside Sage, or release a new version of Sirocco and using that in Sage?

embray commented 4 years ago
comment:15

Replying to @miguelmarco:

Thanks a lot. Great work.

I might need some time to review and test it though.

Thank you for looking at it! There's no hurry.

Did you check that all test pass with this patch?

Per our discussion on GitHub, yes. I'm just reiterating here for the record.

Also, what would you consider the best way to go: just patch sirocco at install time inside Sage, or release a new version of Sirocco and using that in Sage?

It's up to you. There's no urgency on this either way, so whatever's easiest for you.

embray commented 4 years ago

Changed upstream from Reported upstream. Developers acknowledge bug. to Fixed upstream, but not in a stable release.

miguelmarco commented 4 years ago
comment:16

I just created a new release in github. The source code is available at https://github.com/miguelmarco/SIROCCO2/releases/download/v2.0.1/libsirocco-2.0.1.tar.gz

embray commented 4 years ago
comment:17

Oh shoot, before you made the release I should have pointed out to you the fixes I made for Cygwin. I forgot to make a PR for them yet. I will do that now...

miguelmarco commented 4 years ago
comment:18

No problem, I can make another release (that is the good thing about numbers: we won't run out of them).

miguelmarco commented 4 years ago
comment:19

I made a new release on github.

The tarball is available at https://github.com/miguelmarco/SIROCCO2/releases/download/2.0.2/libsirocco-2.0.2.tar.gz

tscrim commented 4 years ago
comment:20

Erik, are you going to rebase this ticket based on the new release?

embray commented 4 years ago
comment:21

Yes, I've been busy with other things, but I plan to replace this ticket with one to update the sirocco spkg to the new release.

embray commented 4 years ago
comment:22

Superseded by #29284 which upgrades the SIROCCO SPKG to the new 2.0.2 release, rather than patching.