Closed mhoemmen closed 7 years ago
I've now built Teuchos just using OpenBLAS and gcc. I then modified the standalone test program to use the Teuchos LAPACK interface. The compile line is:
/home/projects/pwr8-rhel73-lsf/gcc/5.4.0/bin/gcc -I/ascldap/users/kbeckwi/Trilinos/packages/teuchos/numerics/src -I/ascldap/users/kbeckwi/Trilinos/packages/teuchos/comm/src -I/ascldap/users/kbeckwi/Trilinos/packages/teuchos/parameterlist/src -I/ascldap/users/kbeckwi/buildTeuchos/packages/teuchos/core/src -I/ascldap/users/kbeckwi/Trilinos/packages/teuchos/core/src teuchosMain.cpp -L/home/projects/pwr8-rhel73-lsf/openblas/0.2.19/gcc/5.3.0/lib/ -lopenblas -lstdc++ buildTeuchos/packages/teuchos/numerics/src/libteuchosnumerics.a
This results in the failed case that we're seeing above:
STEQR test ... ( Lambda min: expected 1; Lambda max: expected 1031) ** On entry to DSTEQR parameter number 1 had an illegal value STEQR: compute symmetric tridiagonal eigenvalues: LAPACK's _STEQR failed with info = -1 < 0.( After the call, Lambda min: expected 1; Lambda max: expected 1031) FAILED ( Lambda min: expected 1, computed 1031; Lambda max: expected 1031, computed 1)
This rules out that the issue is due to a combination of ESSL/OpenBLAS. That this same combination of OpenBLAS library and compiler works when the Teuchos LAPACK interface is not involved suggests some kind of inconsistency in Teuchos. More to follow.
I'm now able to reproduce this issue independent of Teuchos. If I define the interface into LAPACK as follows (this matches the pattern used in Teuchos):
# define PREFIX
# define My_fcd const char *
# define DSTEQR_F77 dsteqr_
#define CHAR_MACRO(char_var) &char_var
extern "C" {
void PREFIX DSTEQR_F77(My_fcd, const int* n, double* D, double* E, double* Z, const int* ldz, double* work, int* info);
}
void STEQR(const char COMPZ, const int n, double* D, double* E, double* Z, const int ldz, double* WORK, int* info)
{
DSTEQR_F77(CHAR_MACRO(COMPZ), &n, D, E, Z, &ldz, WORK, info);
}
and then call into this as:
typedef double ScalarType;
std::vector<ScalarType> diagonal(DIAG_SZ,0.0);
std::vector<ScalarType> subdiagonal(DIAG_SZ-1,0.0);
int dont_call_me_info = 0;
const int dummy_ldz = 1;
ScalarType scalar_dummy = -1.0;
char char_N = 'N';
const int N = DIAG_SZ;
std::vector<MagnitudeType> mag_dummy(N,0.0);
STEQR(char_N, N, &diagonal[0], &subdiagonal[0],
&scalar_dummy, dummy_ldz, &mag_dummy[0], &dont_call_me_info);
then this reproduces the failure above. To gain some more insight, I removed the call into LAPACK and examine what happened to the char when we take a reference to it:
#define CHAR_MACRO(char_var) &char_var
void STEQR(const char COMPZ, const int n, double* D, double* E, double* Z, const int ldz, double* WORK, int* info)
{
const std::string myString = CHAR_MACRO(COMPZ);
std::cout << "myString = " << myString << "; myString.c_str() " << myString.c_str() << "; COMPZ = " << COMPZ << std::endl;
}
The output from this is insightful:
myString = N???; myString.c_str() N???; COMPZ = N
My suspicion is that char is not null-terminated and so the CHAR_MACRO(COMPZ) operation is resulting in a corrupted string. There are two things however. First, the OpenBLAS API is expecting a char* here, so we have to perform &COMPZ. Second, if we do the following:
void STEQR(const char COMPZ, const int n, double* D, double* E, double* Z, const int ldz, double* WORK, int* info)
{
const std::string myString = CHAR_MACRO(COMPZ);
DSTEQR_F77(myString.c_str(), &n, D, E, Z, &ldz, WORK, info);
}
the test passes. This suggests another approach:
void STEQR(const std::string COMPZ, const int n, double* D, double* E, double* Z, const int ldz, double* WORK, int* info)
{
DSTEQR_F77(COMPZ.c_str(), &n, D, E, Z, &ldz, WORK, info);
}
This, however, would require reworking the Teuchos LAPACK API. Are there any insights as to why this was not done in the original implementation as it seems like the more portable approach?
Note that I did try out the initial suggestion on this ticket (redefining the CHAR_MACRO and Teuchos_fcd macros), but this also resulted in the test failing.
@krcb wrote:
Are there any insights as to why this was not done in the original implementation as it seems like the more portable approach?
The Fortran reference implementation of the BLAS and LAPACK only ever reads the first character of any of those strings (you can see it in the code), so it's a little bit unusual for OpenBLAS to expect a zero-terminated character array.
Teuchos::BLAS and Teuchos::LAPACK actually evolved from what was called Tpetra at the time (2003). You can see this in, e.g., the comments at the top of Teuchos_BLAS.hpp by "Kris" (Kampshoff). My guess is that nobody thought very hard about it. Tpetra was just a side research project until 2010. What's weird is that Teuchos::BLAS uses enums (ETransp, EUplo) in place of character arguments, while Teuchos::LAPACK exclusively uses char
arguments.
@bartlettroscoe and I discussed this, with a follow up meeting between @jwillenbring and myself in order to determine how to proceed. As a next step, we will determine whether using a Teuchos::BLAS type interface to LAPACK will resolve the corruption observed in passing a char into the Teuchos::LAPACK::STEQR interface, through the use of a lightweight prototype. I anticipate being able to report back within the next couple of days.
I was able to get the test proposed above to pass by implementing the following code inside the Teuchos::LAPACK interface:
enum CompZENum {
STEQR_EIGENVALUES,
STEQR_BOTH_ORIGINAL,
STEQR_BOTH_TRIDIAG
};
const char* CompZChar[] = {"N","V","Z"};
void LAPACK<int,double>::STEQR(CompZENum compz, const int n, double* D, double* E, double* Z, const int ldz, double* WORK, int* info) const
DSTEQR_F77(CompZChar[compz], &n, D, E, Z, &ldz, WORK, info);
}
This is then called from the test as:
Teuchos::LAPACK<int,ScalarType> lapack;
lapack.STEQR(Teuchos::STEQR_EIGENVALUES, N, &diagonal[0], &subdiagonal[0],
&scalar_dummy[0], dummy_ldz, &mag_dummy[0], &dont_call_me_info);
As a next step, I'm going to verify that this approach fixes at least one test in a downstream package.
I built Belos against the modified LAPACK STEQR interface and then ran:
./Belos_Tpetra_PseudoBlockCG_hb_test.exe --verbose
i.e. the test reported on previously. This test reported passed with the output:
Teuchos::GlobalMPISession::GlobalMPISession(): started serial run
Input parameter list:
Block Size = 1 [unused]
Maximum Iterations = 1805
Convergence Tolerance = 1e-05
Verbosity = 57
Estimate Condition Number = 1
Belos parameter list:
Maximum Iterations = 1805 [unused]
Verbosity = 57 [unused]
Output Frequency = -1 [unused]
Convergence Tolerance = 1e-05 [unused]
Dimension of matrix: 1806
Number of right-hand sides: 1
Block size used by solver: 1
Max number of CG iterations: 1805
Relative residual tolerance: 1e-05
Belos::StatusTestGeneralOutput: Passed
(Num calls,Mod test,State test): (914, 1, Passed)
Passed.......OR Combination ->
OK...........Number of Iterations = 913 < 1805
Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 9.20765e-06 < 1e-05
Passed.......OR Combination ->
OK...........Number of Iterations = 913 < 1805
Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
residual [ 0 ] = 9.20765e-06 < 1e-05
================================================================================
TimeMonitor results over 1 processor
Timer Name Global time (num calls)
--------------------------------------------------------------------------------
Belos: Operation Op*x 0.09915 (914)
Belos: Operation Prec*x 0 (0)
Belos: PseudoBlockCGSolMgr total solve time 0.134 (1)
================================================================================
Proc 0: Finished solve
Input parameter list after solve:
Block Size = 1 [unused]
Maximum Iterations = 1805
Convergence Tolerance = 1e-05
Verbosity = 57
Estimate Condition Number = 1
Belos parameter list after solve:
Maximum Iterations = 1805 [unused]
Verbosity = 57 [unused]
Output Frequency = -1 [unused]
Convergence Tolerance = 1e-05 [unused]
---------- Actual Residuals (normalized) ----------
Problem 0 : 9.20765e-06
Condition Estimate = 275399
End Result: TEST PASSED
This appears to be reasonable. The next step is to scope out the level of effort required to propagate this type of change through the API.
@jwillenbring and I are discussing scope of this effort.
Kris,
Would users of STEQR():
Teuchos::LAPACK<int,ScalarType> lapack; lapack.STEQR()
be required to change their code in order to adapt to the change? It is not clear to me. We cannot change the API in a way that forces source changes for users of Teuchos::LAPACK.
@bartlettroscoe and I discussed an approach where we would implement a new Teuchos::LAPACK interface that implemented the API proposed here, while keeping the existing API in place for backwards compatibility (possibly with a deprecated warning). Users of Teuchos::LAPACK would then have a choice of which API to make use of; Tech-X would also refactor usage of Teuchos::LAPACK in Trillinos to use the new API.
Couldn't we just overload Teuchos::LAPACK
functions that take strings, to take the existing Teuchos::E*
enums used in Teuchos::BLAS
instead? That would neither break backwards compatibility, nor create new enums? The implementation of the existing Teuchos enums already uses the strategy suggested by @krcb.
Couldn't we just overload Teuchos::LAPACK functions that take strings, to take the existing Teuchos::E* enums used in Teuchos::BLAS instead? That would neither break backwards compatibility, nor create new enums? The implementation of the existing Teuchos enums already uses the strategy suggested by @krcb.
If the same emums cover all of the LAPACK use cases, I agree. If it does not, then add the new ones that are needed but don't duplicate the enums that already exist.
Given major superiority of enums over strings in static checking, the existing Teuchos::LAPACK interfaces using the raw char strings should be deprecated.
Talking with @nmhamster, it seems that we could bypass this issue if we can get the OpenBLAS implementation to restrict its string check to a single character. This would be in line with other BLAS versions and the NETLIB reference implementation. The standard itself asks for a single character.
Before doing anything ourselves, it would be worth finding out if a change to OpenBLAS is possible. Alternatively, could we use the LAPACK Fortran reference code instead? When combined with ESSL, which provides optimized versions of the most important functions, the performance of reference LAPACK functions shouldn't matter very much.
I am not at all inclined to pursue a full-scale refactoring of this capability in Teuchos unless it is absolutely necessary. This would be a disproportionate response to an issue that has never been a real problem in more than a decade, until now.
@bartlettroscoe
If it does not, then add the new ones that are needed but don't duplicate the enums that already exist.
LAPACK would need a few more enums for things like _GESVD's JOBU and JOBVT arguments:
http://www.netlib.org/lapack/explore-html/d8/d2d/dgesvd_8f.html
I'm personally indifferent to the solution path, as long as there is some working BLAS and LAPACK for IBM POWER + GCC.
I am not at all inclined to pursue a full-scale refactoring of this capability in Teuchos unless it is absolutely necessary. This would be a disproportionate response to an issue that has never been a real problem in more than a decade, until now.
But the current Teuchos::LAPACK that takes raw chars is just a bad interface. It has zero compile-time checking for those arguments. This is not a complex refactoring at all. I suspect some simple sed-like scripts could do the refactoring automatically for 95% of the cases in Trilinos. This should not take more than a day of work and it is 100% guaranteed to solve this problem now and forever more.
Besides, fixing OpenBLAS may not fix the problem since, according to @krcb, that compiler is giving garbage when you take the address of the single char.
The other issue is that fixing our copy of OpenBLAS does not help anyone else and we may keep hearing about this for years to come on different platforms where people are trying to use OpenBLAS with Trilinos and having it fail like this. Then we will hear about how Trilinos is sloppy with how it calls LAPACK, etc. (whether that is true or not).
OpenBLAS is used outside the context of IBM Power8, whatever changes made need to not break OpenBLAS on x86. Is this really an issue with OpenBLAS? Or is it an issue with GNU GCC/Power8. If taking the address of a char is the problem, that is not an OpenBLAS' problem.
@bartlettroscoe If I understood what @maherou and @nmhamster discussed, the idea was to try to get OpenBLAS fixed through a pull-request at the project level, not just our copy.
@jwillenbring:
If I understood what @maherou and @nmhamster discussed, the idea was to try to get OpenBLAS fixed through a pull-request at the project level, not just our copy.
That is better but it may not solve the problem if I understand what @krcb told me (which may not be clearly stated above or I may have misunderstood). I believe what @krcb said is that it may be the case with this compiler that taking the address of a const char
(e.g. &'N'
) may return garbage such that if you dereference the const char*
that you don't get the right value (e.g. you don't get 'N'
).
@krcb, am I stating that correctly? If not, is it just that OpenBLAS is looking for null termination of the const char[]
array pointed to by the const char*
pointer?
This is an important detail that can't be overlooked. If I understood @krcb correctly, no changes to OpenBLAS will fix this code for this compiler.
@bartlettroscoe and I just spent 30 minutes reviewing this. We observed what is best described as inexplicable behavior in the gcc compiler (both for v4.9.3 and v5.4.0) on this machine. To try and get a better handle on this, the next step is to create a skeleton version of DSTEQR and see if that reveals what the compiler is actually doing to the variables that are passed to it.
@krcb, @nmhamster probably knows this better than I, but it is my understanding that IBM Power8 deviated from prior IBM Power machines by supporting both little and big endian memory ordering schemes. By default, it seems Power8's big change was making big endian the default, where little had been the only ordering available on prior machines.
I wonder if changing the target endianness of the compiler to little would resolve this.
gcc --help=target | grep endian
-maltivec=be Generate Altivec instructions using big-endian element order
-maltivec=le Generate Altivec instructions using little-endian element order
-mbig Produce big endian code
-mbig-endian Produce big endian code
-mlittle Produce little endian code
-mlittle-endian Produce little endian code
I checked IBM XL. By default it uses little endian. Also, GCC (5.4) requires little endian as well. Attempting to use -mbig will result in a link error because the standard libraries are compiled for little endian.
A helpful command for see which flags gcc is using:
gcc -Q --help=target | less
I created a skeleton version of DSTEQR in order to examine what is being passed in from the CXX calls:
SUBROUTINE DSTEQR( COMPZ, N, D, E, Z, LDZ, WORK, INFO )
* .. Scalar Arguments ..
CHARACTER COMPZ
INTEGER INFO, LDZ, N
* ..
* .. Array Arguments ..
DOUBLE PRECISION D( * ), E( * ), WORK( * ), Z( LDZ, * )
* ..
*
* =====================================================================
*
* .. Parameters ..
DOUBLE PRECISION ZERO, ONE, TWO, THREE
PARAMETER ( ZERO = 0.0D0, ONE = 1.0D0, TWO = 2.0D0,
$ THREE = 3.0D0 )
INTEGER MAXIT
PARAMETER ( MAXIT = 30 )
* ..
* .. Local Scalars ..
INTEGER I, ICOMPZ, II, ISCALE, J, JTOT, K, L, L1, LEND,
$ LENDM1, LENDP1, LENDSV, LM1, LSV, M, MM, MM1,
$ NM1, NMAXIT
DOUBLE PRECISION ANORM, B, C, EPS, EPS2, F, G, P, R, RT1, RT2,
$ S, SAFMAX, SAFMIN, SSFMAX, SSFMIN, TST
* ..
* .. External Functions ..
LOGICAL LSAME
DOUBLE PRECISION DLAMCH, DLANST, DLAPY2
* ..
* .. Intrinsic Functions ..
* INTRINSIC ABS, MAX, SIGN, SQRT
* Test the input parameters.
*
INFO = 0
*
write (*,*) "COMPZ = ",COMPZ,";"
write (*,*) "LDZ = ",LDZ,";"
write (*,*) "N = ",N,";"
END
Calling into this using a Teuchos::LAPACK-style API:
# define PREFIX
# define My_fcd const char*
# define DSTEQR_F77 dsteqr_
#define CHAR_MACRO(char_var) &char_var
extern "C" {
void PREFIX DSTEQR_F77(My_fcd COMPZ, const int* n, double* D, double* E, double* Z, const int* ldz, double* work, int* info);
}
void STEQR(const char COMPZ, const int n, double* D, double* E, double* Z, const int ldz, double* WORK, int* info)
{
DSTEQR_F77(CHAR_MACRO(COMPZ), &n, D, E, Z, &ldz, WORK, info);
}
int main(int argc, char* argv[])
{
<code removed for clarity>
int dont_call_me_info = 0;
const int dummy_ldz = 1;
ScalarType scalar_dummy = -1.0;
char char_N = 'N';
const int N = DIAG_SZ;
std::vector<MagnitudeType> mag_dummy(N,0.0);
ScalarType lambda_min = 0.0;
ScalarType lambda_max = 0.0;
if( N > 2 ) {
STEQR(char_N, N, &diagonal[0], &subdiagonal[0],
&scalar_dummy, dummy_ldz, &mag_dummy[0], &dont_call_me_info);
<code removed for clarity>
}
yields the following output:
COMPZ = ?;
LDZ = -16722024 ;
N = 15 ;
which is clearly non-sensical. Replacing the STEQR call with:
DSTEQR_F77(&char_N, &N, &diagonal[0], &subdiagonal[0],
&scalar_dummy, &dummy_ldz, &mag_dummy[0], &dont_call_me_info);
(i.e. calling the Fortran function directly) results in sensible output:
COMPZ = N;
LDZ = 1 ;
N = 1031 ;
Note that both of these approaches valgrind reports clean.
@jjellio If I compile both the fortran skeleton code with -mbig, as:
gfortran dsteqr.f -c -mbig
gcc interMain.cpp dsteqr.o -lstdc++ -std=c++11 -lgfortran -mbig
then I get the error:
/home/projects/pwr8-rhel73-lsf/gcc/4.9.3/lib64/../lib64/libstdc++.so: error adding symbols: File in wrong format
collect2: error: ld returned 1 exit status
If I use -mlittle, then the code compiles and links and I get the behavior described above.
@krcb, wow, those are interesting results you show above. I can't understand how changing passing the first argument is corrupting the later arguments. It is almost like the arguments are getting shifted on stack or something? Very strange.
This provides pretty strong evidence that this problem has nothing to do with OpenBLAS and the problem is likely due to how GCC is passing arguments from C++ to the extern-C interface function. This looks like it might be a compiler bug on this architecture. Do we know anyone associated with the 'ride' system who would be interested and willing to look into this strange behavior? These are very small example programs that @krcb shows above so it should be easy for anyone to reproduce on that system.
So where do we go from here? It seems that if we change to the the Teuchos::BLAS calling approach, this avoids this problem (why, we don't fully understand). Otherwise, someone needs to start digging deeper into the GCC compiler behavior on this platform. But if we find that it is a compiler bug then what good does that do to solve our problem?
I have my opinion (i.e. just change the Teuchos::LAPACK interface to use enums which seems to avoid this behavior and is just a superior interface anyway) but what do other people think?
It's probably because the size of the first argument is different. When its a char, it is 1 byte whereas when its a pointer it is 8 bytes. If you pass the wrong sized argument, it will mess up the memory address of the subsequent arguments.
It's probably because the size of the first argument is different. When its a char, it is 1 byte whereas when its a pointer it is 8 bytes. If you pass the wrong sized argument, it will mess up the memory address of the subsequent arguments.
The first argument should be getting passed to Fortran as a char*
pointer in both cases as you can see above.
Further insights are obtained by instrumenting test program to examine the addresses and contents of the variables at each stage in the program:
SUBROUTINE DSTEQR( COMPZ, N, D, E, Z, LDZ, WORK, INFO )
IMPLICIT NONE
* .. Scalar Arguments ..
CHARACTER COMPZ
INTEGER INFO, LDZ, N
* ..
* .. Array Arguments ..
DOUBLE PRECISION D( * ), E( * ), WORK( * ), Z( LDZ, * )
*
INFO = 0
*
write (*,*) "DSTEQR: COMPZ = ",COMPZ,"; LOC(COMPZ) = ",LOC(COMPZ)
write (*,*) "DSTEQR: LDZ = ",LDZ,"; LOC(LDZ) = ",LOC(LDZ)
write (*,*) "DSTEQR: N = ",N,"; LOC(N) = ",LOC(N)
END
with the C++ calls instrumented as:
void STEQR(const char COMPZ, const int n, double* D, double* E, double* Z, const int ldz, double* WORK, int* info)
{
std::cout << "STEQR: Address of COMPZ = " << reinterpret_cast<std::size_t>((void*) &COMPZ) << "; value at address = " << *((&COMPZ)) << std::endl;
std::cout << "STEQR: Address of ldz = " << reinterpret_cast<std::size_t>(&ldz) << "; value at address = " << *((&ldz)) << std::endl;
std::cout << "STEQR: Address of n = " << reinterpret_cast<std::size_t>(&n) << "; value at address = " << *((&n)) << std::endl;
DSTEQR_F77(CHAR_MACRO(COMPZ), &n, D, E, Z, &ldz, WORK, info);
std::cout << "STEQR: Address of COMPZ = " << reinterpret_cast<std::size_t>((void*) &COMPZ) << "; value at address = " << *((&COMPZ)) << std::endl;
std::cout << "STEQR: Address of ldz = " << reinterpret_cast<std::size_t>(&ldz) << "; value at address = " << *((&ldz)) << std::endl;
std::cout << "STEQR: Address of n = " << reinterpret_cast<std::size_t>(&n) << "; value at address = " << *((&n)) << std::end;
}
int main(int argc, char* argv[])
{
<code remove for clarity>
std::cout << "main: Address of char_N = " << reinterpret_cast<std::size_t>((void*) &char_N) << "; value at address = " << *((&char_N)) << std::endl;
std::cout << "main: Address of dummy_ldz = " << reinterpret_cast<std::size_t>(&dummy_ldz) << "; value at address = " << *((&dummy_ldz)) << std::endl;
std::cout << "main: Address of N = " << reinterpret_cast<std::size_t>(&N) << "; value at address = " << *((&N)) << std::endl;
STEQR(char_N, N, &diagonal[0], &subdiagonal[0],
&scalar_dummy, dummy_ldz, &mag_dummy[0], &dont_call_me_info);
std::cout << "main: Address of char_N = " << reinterpret_cast<std::size_t>((void*) &char_N) << "; value at address = " << *((&char_N)) << std::endl;
std::cout << "main: Address of dummy_ldz = " << reinterpret_cast<std::size_t>(&dummy_ldz) << "; value at address = " << *((&dummy_ldz)) << std::endl;
std::cout << "main: Address of N = " << reinterpret_cast<std::size_t>(&N) << "; value at address = " << *((&N)) << std::end;
<code remove for clarity>
}
then this produces the output:
main: Address of char_N = 70368304299456; value at address = N
main: Address of dummy_ldz = 70368304299444; value at address = 1
main: Address of N = 70368304299460; value at address = 1031
STEQR: Address of COMPZ = 70368304299184; value at address = N
STEQR: Address of ldz = 70368304299216; value at address = 1
STEQR: Address of n = 70368304299188; value at address = 1031
DSTEQR: COMPZ = ?; LOC(COMPZ) = 70368304299184
DSTEQR: LDZ = -439878216 ; LOC(LDZ) = 70368304299216
DSTEQR: N = 16383 ; LOC(N) = 70368304299188
STEQR: Address of COMPZ = 70368304299184; value at address = ?
STEQR: Address of ldz = 70368304299216; value at address = -439878216
STEQR: Address of n = 70368304299188; value at address = 16383
main: Address of char_N = 70368304299456; value at address = N
main: Address of dummy_ldz = 70368304299444; value at address = 1
main: Address of N = 70368304299460; value at address = 1031
what can be seen from this is that the CXX compiler is creating temporaries for COMPZ, ldz and n when we call STEQR (as is expected) and the values of these variables are initially sensible. The Fortran DSTEQR routine is receiving the right addresses, but the values at these addresses are corrupted. When we return back into the CXX STEQR routine, the values at these addresses are still corrupted. Now, we change the API on the STEQR function to pass the char variable by reference, rather than value, i.e.
STEQR(const char& COMPZ, const int n, double* D, double* E, double* Z, const int ldz, double* WORK, int* info)
which results in the following output:
main: Address of char_N = 70368098970864; value at address = N
main: Address of dummy_ldz = 70368098970852; value at address = 1
main: Address of N = 70368098970868; value at address = 1031
STEQR: Address of COMPZ = 70368098970864; value at address = N
STEQR: Address of ldz = 70368098970604; value at address = 1
STEQR: Address of n = 70368098970600; value at address = 1031
DSTEQR: COMPZ = N; LOC(COMPZ) = 70368098970864
DSTEQR: LDZ = 16383 ; LOC(LDZ) = 70368098970604
DSTEQR: N = -645207064 ; LOC(N) = 70368098970600
STEQR: Address of COMPZ = 70368098970864; value at address = N
STEQR: Address of ldz = 70368098970604; value at address = 16383
STEQR: Address of n = 70368098970600; value at address = -645207064
main: Address of char_N = 70368098970864; value at address = N
main: Address of dummy_ldz = 70368098970852; value at address = 1
main: Address of N = 70368098970868; value at address = 1031
Here, there is no longer a temporary created for the COMPZ variable (instead the address of COMPZ corresponds to that of char_N, as expected). The Fortran DSTEQR function receives this same address for COMPZ and the value is not corrupted. However, the temporaries that are still being created for LDZ and N in the STEQR function continue to be corrupted. To resolve this latter issue, we have to replace each of the pass by value with pass by reference, i.e.
STEQR(const char& COMPZ, const int& n, double* D, double* E, double* Z, const int& ldz, double* WORK, int* info)
Now no temporaries are created in the call to the STEQR function and the output is:
main: Address of char_N = 70368326238304; value at address = N
main: Address of dummy_ldz = 70368326238292; value at address = 1
main: Address of N = 70368326238308; value at address = 1031
STEQR: Address of COMPZ = 70368326238304; value at address = N
STEQR: Address of ldz = 70368326238292; value at address = 1
STEQR: Address of n = 70368326238308; value at address = 1031
DSTEQR: COMPZ = N; LOC(COMPZ) = 70368326238304
DSTEQR: LDZ = 1 ; LOC(LDZ) = 70368326238292
DSTEQR: N = 1031 ; LOC(N) = 70368326238308
STEQR: Address of COMPZ = 70368326238304; value at address = N
STEQR: Address of ldz = 70368326238292; value at address = 1
STEQR: Address of n = 70368326238308; value at address = 1031
main: Address of char_N = 70368326238304; value at address = N
main: Address of dummy_ldz = 70368326238292; value at address = 1
main: Address of N = 70368326238308; value at address = 1031
i.e. the correct addresses are received by the Fortran function and the value at these addresses are stored correctly.
@jwillenbring, @bartlettroscoe and I met this morning and discussed these results. Our proposed solution is to replace the pass by value components of the Teuchos::LAPACK API with pass by reference, ensuring that the affected variables are declared const (note that this may have to be done for the Teuchos::BLAS API's as well). This appears to be a minimally intrusive way to solve the problem that we're observing on this platform. One further test before doing so is to verify that this solves the issue observed in at least one downstream package (i.e. the Belos test above). Once that is done, we should be able to rapidly refactor the API's as described here without impacting downstream users. However, packages that use custom LAPACK API's within Trillinos may have to address this issue as well.
I updated the Teuchos::LAPACK::STEQR API to pass by reference, rebuilt Belos against this and reran Belos_Tpetra_PseudoBlockCG_hb_test.exe. The test passes. Next: refactor the Teuchos::LAPACK API to replace all pass by value with pass by reference.
The PR #1284 is merged. Therefore, the LAPACK calls on 'ride' should be working now. Therefore, I will not put this into review.
@krcb, @kruger, and @MicheldeMessieres,
Looks like there is a build failure associated with the PR code #1284 for C++98 with GCC 4.4.7 as shown at:
which shows:
/data/jenkins/slave/workspace/Trilinos_XYCE_MPI/MPI_OPT_DEV_XYCE/Trilinos/packages/teuchos/numerics/test/LAPACK/cxx_main.cpp: In function ‘int main(int, char**)’:
/data/jenkins/slave/workspace/Trilinos_XYCE_MPI/MPI_OPT_DEV_XYCE/Trilinos/packages/teuchos/numerics/test/LAPACK/cxx_main.cpp:119: error: using ‘typename’ outside of template
Looks easy to fix. Can one of you take a look at fix this? A non-C++11 build with GCC 4.4.7 is important for the internal SNL customer Xyce.
@MicheldeMessieres is fixing this. We should have a PR shortly.
@bartlettroscoe Fix is up in PR #1322
Fix is up in PR #1322
I reviewed, merged, tested, and pushed with the checkin-test-sems.sh script. I did not test wtih GCC 4.4.7 without C++11 but the change is trivial so I can't imagine it will not fix this but we will see.
@kruger, you are saying that if you build and run the above example code with with -O0 -g
and -O2
, then it fails? This would seem to indicate that we might have gotten lucky with the change pushed in #1284 fixing some failures but not others. The full problem would appear not to be fixed :-(
@kruger and I just talked over the phone to discuss this more. He is going to see if the Teuchos:BLAS approach using enums and static char arrays fixes the problems with his simple example programs. We will wait to see what the outcome of his experimentation shows before deciding on the next step ...
Summary: The available g++/gfortran executables available on ride do not work for the C++ and fortran interoperability required for invoking STEQDR. At present, we do not understand why the BLAS GEMM tests pass, but LAPACK STEQDR tests fail, but the behavior does point to compiler problems.
On ride:~sekruge/BuildAndTest
, I have 4 C++ drivers, a simple makefile to build them, and a script (runme.sh
) that compiles and runs the drivers with different compiler options (runme.sh gcc5
, runme.sh gcc6
, runme.sh pgi
). Attached are different typescript files from the runs.
The different drivers are:
cxx_main_wrappers_1
: Direct call to the fortran using the Teuchos style direct callcxx_main_wrappers_2
: Call a function which calls does the Teuchos-style direct callcxx_main_wrappers_3
: Call a method that belongs to a LAPACK class, similar to Teuchoscxx_main_wrappers_4
: Direct call to the fortran using an enum.While some combinations of compilations flags seem to have some things work with gcc, they in general do not work. All of the tests work for all combinations of flags for PGI. Things also work on other platforms, so this points to the GNU compilers on this system not working.
One possible fix would be to implement a C-version of these calls as part of Teuchos that we could then turn on as a compilation for the flag, but it seems worthwhile to try and fix the compilers.
Below is an email exchange alerting the representative (i.e. @nmhamster) that maintains 'ride' and the response that they will look into the issue.
From: Hammond, Simon David Sent: Monday, May 22, 2017 10:57 PM To: Bartlett, Roscoe A Cc: Pawlowski, Roger P ; Bettencourt, Matthew ; Scott Kruger ; Glass, Micheal W ; Heroux, Michael A ; Willenbring, James M ; Kris Beckwith ; ride-admin Subject: RE: g++/gfortran broken on 'ride'
Ross,
We will look into. Thanks for generating the report.
S
From: Bartlett, Roscoe A Sent: Monday, May 22, 2017 8:50:59 PM To: Hammond, Simon David Cc: Pawlowski, Roger P; Bettencourt, Matthew; Scott Kruger; Glass, Micheal W; Heroux, Michael A; Willenbring, James M; Kris Beckwith Subject: g++/gfortran broken on 'ride'
Hello Si,
Scott Kruger (our newest Tech-X contractor) had done a detailed study of the Trilinos failures with GCC on ‘ride’ described in:
and from his analysis (summarized in the GitHub comment below), it looks like g++ and gfortran can’t be used to create working mixed-language programs on ‘ride’. Scott has a pretty good set of small example programs that demonstrate the problem (see below).
I am contacting you as the Test Bed representative for ATDM because it seems there is not much we can do to fix things if we are stuck using these compilers with these BLAS and LAPACK libraries on ‘ride’.
Here are a few options we might consider to move forward:
It seems like option-3 is the easiest as we could just use clapack but would provide the worst performance. Given that this is an IBM machine, is the XL compiler available? Otherwise, is the PGI compiler an option on that machine? In any case, if someone cares about the GCC/gfortran compilers on this system, then someone should dig into what is happening (but I suspect that could be a big time-sink). But all three of those options realistically requires support from the Test Bed team for ‘ride’.
What do you think we should do?
Thanks,
P.S. I will archive this in the GitHub issue #1208 but I just wanted to make sure that I CCed a few key people who should be alerted to this development.
I built clapack to see if we can use it with gcc to overcome some of the problems. I am now getting a whole set of new tests that I did not have previously:
47 - TeuchosCore_TypeConversions_UnitTest (Failed)
65 - TeuchosParameterList_ObjectBuilder_UnitTests (OTHER_FAULT)
98 - TeuchosNumerics_BLAS_test (SEGFAULT)
100 - TeuchosNumerics_BLAS_tmpl_comp_test (SEGFAULT)
102 - TeuchosNumerics_DenseSolver_test (SEGFAULT)
103 - TeuchosNumerics_BandDenseSolver_test (SEGFAULT)
104 - TeuchosNumerics_SpdDenseSolver_test (SEGFAULT)
105 - TeuchosNumerics_QRDenseSolver_test (SEGFAULT)
106 - TeuchosNumerics_LAPACK_test (SEGFAULT)
109 - TeuchosNumerics_BLAS_example (SEGFAULT)
110 - TeuchosNumerics_DenseMatrix_example (SEGFAULT)
111 - TeuchosNumerics_SymDenseMatrix_example (SEGFAULT)
112 - TeuchosNumerics_hilbert (SEGFAULT)
Looking into what's going on.
Going through the first 3 errors
Test that code {valF = asSafe<float> (os.str ());} throws std::range_error: passed
Exception message for expected exception:
/ascldap/users/sekruge/Trilinos-develop/packages/teuchos/core/src/Teuchos_as.hpp:558:
Throw number = 1
Throw test that evaluated to true: errno == ERANGE && (val != 0)
Teuchos::ValueTypeConversionTraits<float, std::string>::convert: The value in the given string "-1.79769313e+308" overflows float.
Test that code {valF = asSafe<float> (os.str ());} throws std::range_error: passed
Exception message for expected exception:
/ascldap/users/sekruge/Trilinos-develop/packages/teuchos/core/src/Teuchos_as.hpp:558:
Throw number = 2
Throw test that evaluated to true: errno == ERANGE && (val != 0)
Teuchos::ValueTypeConversionTraits<float, std::string>::convert: The value in the given string "1.79769313e+308" overflows float.
5. Teuchos_ObjectBuilder_create_UnitTest ... [Passed] (0.00083 sec)
terminate called after throwing an instance of 'Teuchos::Exceptions::InvalidParameterName'
what(): Error, the parameter {name="Hello",type="string",value="World"}
in the parameter (sub)list "ANONYMOUS"
was not found in the list of valid parameters!
The valid parameters and types are:
{
"Object Type" : string = Foo A
"Foo A" : ParameterList = ...
}
Throw number = 12
6. Teuchos_ObjectBuilder_setParameterList_UnitTest ... Abort (core dumped)
This fails in srot which is part of BLAS.
Need to get the compilers fixed.
@kruger, the test TeuchosParameterList_ObjectBuilder_UnitTests
is a known problematic test (see #831 and #449) soon to be fixed with the merge of the branch associated with #229.
As for the other BLAS and LAPACK-related tests, I might dig a little deeper. If you link against a mock C version of srot, for example, are the arguments passed correctly or not?
@kruger, can you please attach the full output for the test TeuchosCore_TypeConversions_UnitTest
? The output how show above suggests that those unit tests actually passed (because they thrown an expected exception).
@nmhamster, Please update this issue when you get news from the vendor.
TeuchosCore_TypeConversions_UnitTest.txt
Sorry for the delay. Here is the output.
The srot is a C version.
The full build of the clapack can be seen here: ~sekruge/builds-develop/clapack_cmake-3.2.1
@kruger,
Thanks for the unit test output. By my count, what is failing are 17th and 18th checks in this unit test:
Test that code {valD = asSafe<double> (os.str ());} throws std::range_error: failed (code did not throw an exception at all)
Test that code {valD = asSafe<double> (os.str ());} throws std::range_error: failed (code did not throw an exception at all)
...
[FAILED] (0.000282 sec) asSafe_stringToReal_UnitTest
Location: /ascldap/users/sekruge/Trilinos-develop/packages/teuchos/core/test/TypeConversions/TypeConversions_UnitTest.cpp:274
@mhoemmen, by my count, the failing checks in that unit test are coming from the unit test code starting on line 407:
//
// Test string -> double conversions that should throw,
// if sizeof(long double) > sizeof(double).
//
if (sizeof (long double) > sizeof (double)) {
{
std::ostringstream os;
os.precision (36);
os << minLD;
TEST_THROW(valD = asSafe<double> (os.str ()), std::range_error);
}
{
std::ostringstream os;
os.precision (36);
os << maxLD;
TEST_THROW(valD = asSafe<double> (os.str ()), std::range_error);
}
}
So something about these asSafe()
checks must be using LAPACK and the CLAPACK functions must not be returning the right values used in these checks?
@kruger, do any Belos test failures originally reported in #1191 now pass with your build of CLAPACK on 'ride'? If the answer is "yes", then it is worth diving more into these other failing tests. If the answer is "no", then we can consider this approach a dead-end and then we just need to wait on IBM to fix these compilers on this machine (unless someone has a better idea).
@bartlettroscoe Are those the only checks that are failing? My guess is that the compiler implements long double
as a "double-double" (a.k.a. ddreal), not as a true 80-bit or 128-bit IEEE 754 floating-point value. That could affect both the min and the max representable positive long double
values.
I'm all for commenting out the test for now until we have time to research what the min and max values should be for whatever this fake long double
is actually doing.
Are those the only checks that are failing? My guess is that the compiler implements long double as a "double-double" (a.k.a. ddreal), not as a true 80-bit or 128-bit IEEE 754 floating-point value. That could affect both the min and the max representable positive long double values.
I'm all for commenting out the test for now until we have time to research what the min and max values should be for whatever this fake long double is actually doing.
I think these tests passed before with the same compiler. So these new failures don't appear to be related to the compiler itself.
@kruger, can you comment on this since you are on that machine and ran these tests?
Attached are various log files. test.sh.txt is the one that shows all of the tests failing. There are a lot of them. The compiler is gcc 5.4.0. Is there a reason that it would do a long double on power architecture different than Intel? Is there a way of checking from a trilinos build?
test.sh.txt TeuchosCore_TypeConversions_UnitTest.txt TeuchosNumerics_BLAS.txt TeuchosParameterList_ObjectBuilder_UnitTests.exe.txt
@kruger IBM historically implemented 128-bit (hex) floating-point arithmetic (System/370). POWER9 will have hardware 128-bit floating-point support. Thus, it's not surprising that the compiler does something different on POWER.
so what's going on here? do we just need to fix the long double
tests or what?
Next Action Status:
Looks like it was a problem with BLAS and LAPACK after all (see https://github.com/trilinos/Trilinos/issues/2454#issuecomment-386271621).
Blocks: #1191 Duplicates: #347 Analogous to: #1210 CC: @trilinos/teuchos @trilinos/framework
Description:
NOTE: The below original description does not pin-point the actual problem that was causing the failures on 'ride'. The real problem is what appears to be a mixed language calling defect summarized below.
Original Description:
BLAS and LAPACK libraries offer a Fortran 77 interface, but we want to call them from C++ code. This matters both in terms of Fortran function name mangling, and in terms of how to pass arguments into Fortran functions (the "application binary interface" or ABI). Most arguments just go into Fortran from the C or C++ world by pointer. However, some Fortran compilers have a different ABI for passing strings (
CHARACTER(*)
). Some compilers just takechar*
, but others may takechar*, unsigned int
or evenchar*, unsigned int*
(or notunsigned
). Fortran programmers don't see this; only programmers working in other languages (like C or C++), who need to know the Fortran ABI, may see this.Teuchos' BLAS and LAPACK wrappers have two macros that govern the Fortran string ABI:
CHAR_MACRO(c)
andTeuchos_fcd
.CHAR_MACRO(c)
, defined redundantly inteuchos/numerics/src/Teuchos_LAPACK.cpp
andTeuchos_BLAS.cpp
, takes aconst char c
and does the right thing with it to pass it into a Fortran function. Teuchos offers two definitions ofCHAR_MACRO
:Per discussion in #1191, it looks like we need another definition for OpenBLAS on POWER, that passes in the constant
one
by address rather than by value.Teuchos_fcd
, defined redundantly inteuchos/numerics/src/Teuchos_LAPACK_wrappers.hpp
andTeuchos_BLAS_wrappers.hpp
, has the following definition:In the comments,
CRAY_T3X
refers to the Cray T3D and T3E, two mid-'90s distributed-memory computer architectures. This should give you an idea of the age of this code. In any case, we would need to add a new definition of Teuchos_fcd:(See comments below for why it's an
int
and notunsigned int
. It would be better to useint32_t
, but Teuchos cannot assume C++11 / C99 types.)