addmul_2/3/4/5 call in mulmid_basecase.c is invalid

dlaugt commented 6 years ago

I've upgraded mpir from 2.1.1 to 3.0.0 on our compilation servers (Linux and Solaris). All tests passed on Linux. However, on Solaris, some tests are failing:

PASS: t-constants
PASS: t-count_zeros
PASS: t-gmpmax
PASS: t-hightomask
PASS: t-modlinv
PASS: t-parity
PASS: t-popc
PASS: t-sub

============================================================================
Testsuite summary for MPIR 3.0.0
============================================================================
# TOTAL: 9
# PASS:  9
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0
============================================================================

PASS: t-addadd_n
PASS: t-addsub_n
PASS: t-aors_1
PASS: t-asmtype
../../../test-driver[107]: 5502 Abort(coredump)
FAIL: t-dc_bdiv_q
../../../test-driver[107]: 5515 Abort(coredump)
FAIL: t-dc_bdiv_q_n
PASS: t-dc_bdiv_qr
PASS: t-dc_bdiv_qr_n
../../../test-driver[107]: 5555 Abort(coredump)
FAIL: t-dc_div_q
PASS: t-dc_div_qr
PASS: t-dc_div_qr_n
../../../test-driver[107]: 5594 Abort(coredump)
FAIL: t-dc_divappr_q
PASS: t-divebyff
PASS: t-divebyfobm1
PASS: t-divrem_1
PASS: t-fat
../../../test-driver[107]: 5659 Abort(coredump)
FAIL: t-gcdext
PASS: t-get_d
PASS: t-hgcd
PASS: t-instrument
PASS: t-inv_div_q
PASS: t-inv_div_qr
PASS: t-inv_div_qr_n
PASS: t-inv_divappr_q
PASS: t-inv_divappr_q_n
PASS: t-invert
PASS: t-iord_u
PASS: t-lorrshift1
PASS: t-matrix22
PASS: t-mp_bases
PASS: t-mullow_basecase
PASS: t-mullowhigh
../../../test-driver[107]: 5867 Abort(coredump)
FAIL: t-mulmid
PASS: t-mulmod_2expm1
PASS: t-mulmod_2expp1
PASS: t-neg
PASS: t-perfsqr
PASS: t-redc_1
PASS: t-sb_bdiv_q
PASS: t-sb_bdiv_qr
PASS: t-sb_div_q
PASS: t-sb_div_qr
PASS: t-sb_divappr_q
PASS: t-scan
PASS: t-sizeinbase
PASS: t-subadd_n
PASS: t-tdiv_q
PASS: t-tdiv_qr

============================================================================
Testsuite summary for MPIR 3.0.0
============================================================================
# TOTAL: 48
# PASS:  42
# SKIP:  0
# XFAIL: 0
# FAIL:  6
# XPASS: 0
# ERROR: 0

I don't know how to dig into those errors. Any suggestions? test-suite.log

wbhart commented 6 years ago

The log indicates the division code is failing. Given that the other tests pass, there's likely not a configuration issue. My guess would be faulty hardware or compiler bug. I don't think there is much we can do about it. We have no Solaris experts in the project any more.

dlaugt commented 6 years ago

I've just compiled mpir 2.7.2 on the same Solaris machine and on the same compiler (gcc 4.8.1). All tests have passed with this version. This seems to be caused by a change on the code of mpir 3.0.0...

dlaugt commented 6 years ago

One more indication. On Linux with the same compiler (gcc 4.8.1), all tests have passed.

dlaugt commented 6 years ago

CPU is a SPARC-T4:

$ psrinfo -pv
The physical processor has 64 virtual processors (0-63)
  SPARC-T4 (chipid 0, clock 2848 MHz)
The physical processor has 64 virtual processors (64-127)
  SPARC-T4 (chipid 1, clock 2848 MHz)

wbhart commented 6 years ago

Sure, the division code changed in MPIR 3.0.0. I still think it's most likely to be a miscompilation of that new code (or a hardware fault triggered by that new code).

dlaugt commented 6 years ago

The regression is coming from the following change: #193

All tests are passing on Solaris when I replace mpn/generic/mulmid_basecase.c with the one in 2.7.2. I suspect a bug on this code when mpn_addmul_2, mpn_addmul_3, mpn_addmul_4, mpn_addmul_5 or mpn_addmul_6 are native.

Anyway, from my side, I will take mulmid_basecase.c of 2.7.2 which is in synch with the master branch of GMP.

wbhart commented 6 years ago

The issue is probably in this commit then:

https://github.com/wbhart/mpir/pull/193/commits/b78c923dcb66773508cd98c120ec4744fc92abdb

But the code is directly from GMP. If the bug exists in MPIR it probably exists there too. Or, as I say, it's just a miscompilation.

I'm not sure there is any special mpn_addmul2/3/4/5 assembly code. So, short of a bug in the code in that commit, I can't really think of anything else. I suppose it is possible that the mpn_addmul2 generic code in MPIR makes different assumptions to the code in MPIR. But that's a stretch. Lots of other things would be expected to break, if so.

wbhart commented 6 years ago

By the way, what is the result of ./config.guess on your machine, so we can rule out an issue with assembly code.

dlaugt commented 6 years ago

Which gmp repo are you referring to? In this repo, I don't see the optimization with mpn_addmul2/3/4/5/6 in mpn/generic/mulmid_basecase.c. The result of ./config.guess is ultrasparc-sun-solaris2.10.

wbhart commented 6 years ago

Ok, so I didn't write this code, and don't know where that addmul2/3/4/5/6 comes from. I presume it has been moved from somewhere else in the GMP repository into this file as we have a slightly different arrangement of files.

This was all part of a speedup of the powm function, from memory, which had been done in GMP, so we moved over the improvements to MPIR. Alternatively, the changes might have been taken from one of our other files, e.g. one of the plain multiplication functions.

I don't know to what extent the addmul2 code is critical to get these performance improvements. We have an addmul2 in the ultrasparc directory, so it's understandable why this might be being picked up on your machine. Perhaps it is faulty.

We do have addmul2 on a number of other architectures, so it is understandable why the person who did this work might have wanted to make use of it. On the machines where it is available, it is faster than addmul1.

Simply removing that code might make it work, but it will also slow everything down on other arches. So unfortunately we need to get to the bottom of the failure before fixing it.

wbhart / mpir

addmul_2/3/4/5 call in mulmid_basecase.c is invalid #259