trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/
Other
1.22k stars 570 forks source link

Teuchos: Address issus with calling Fortran LAPACK rountines from C++ on Power8 with GCC/gfortran #1208

Closed mhoemmen closed 7 years ago

mhoemmen commented 7 years ago

Next Action Status:

Looks like it was a problem with BLAS and LAPACK after all (see https://github.com/trilinos/Trilinos/issues/2454#issuecomment-386271621).

Blocks: #1191 Duplicates: #347 Analogous to: #1210 CC: @trilinos/teuchos @trilinos/framework

Description:

NOTE: The below original description does not pin-point the actual problem that was causing the failures on 'ride'. The real problem is what appears to be a mixed language calling defect summarized below.

Original Description:

BLAS and LAPACK libraries offer a Fortran 77 interface, but we want to call them from C++ code. This matters both in terms of Fortran function name mangling, and in terms of how to pass arguments into Fortran functions (the "application binary interface" or ABI). Most arguments just go into Fortran from the C or C++ world by pointer. However, some Fortran compilers have a different ABI for passing strings (CHARACTER(*)). Some compilers just take char*, but others may take char*, unsigned int or even char*, unsigned int* (or not unsigned). Fortran programmers don't see this; only programmers working in other languages (like C or C++), who need to know the Fortran ABI, may see this.

Teuchos' BLAS and LAPACK wrappers have two macros that govern the Fortran string ABI: CHAR_MACRO(c) and Teuchos_fcd.

CHAR_MACRO(c), defined redundantly in teuchos/numerics/src/Teuchos_LAPACK.cpp and Teuchos_BLAS.cpp, takes a const char c and does the right thing with it to pass it into a Fortran function. Teuchos offers two definitions of CHAR_MACRO:

#if defined (INTEL_CXML)
#define CHAR_MACRO(char_var) &char_var, one
#else
#define CHAR_MACRO(char_var) &char_var
#endif

namespace {
#if defined (INTEL_CXML)
        unsigned int one=1;
#endif
// ...
}

Per discussion in #1191, it looks like we need another definition for OpenBLAS on POWER, that passes in the constant one by address rather than by value.

Teuchos_fcd, defined redundantly in teuchos/numerics/src/Teuchos_LAPACK_wrappers.hpp and Teuchos_BLAS_wrappers.hpp, has the following definition:

#if defined(INTEL_CXML)
#  define PREFIX __stdcall
#  define Teuchos_fcd const char *, unsigned int
#elif defined(INTEL_MKL)
#  define PREFIX
#  define Teuchos_fcd const char *
#else /* Not CRAY_T3X or INTEL_CXML or INTEL_MKL */
#  define PREFIX
#  define Teuchos_fcd const char *
#endif

In the comments, CRAY_T3X refers to the Cray T3D and T3E, two mid-'90s distributed-memory computer architectures. This should give you an idea of the age of this code. In any case, we would need to add a new definition of Teuchos_fcd:

#  define Teuchos_fcd const char*, const int*

(See comments below for why it's an int and not unsigned int. It would be better to use int32_t, but Teuchos cannot assume C++11 / C99 types.)

krcb commented 7 years ago

I've now built Teuchos just using OpenBLAS and gcc. I then modified the standalone test program to use the Teuchos LAPACK interface. The compile line is:

/home/projects/pwr8-rhel73-lsf/gcc/5.4.0/bin/gcc -I/ascldap/users/kbeckwi/Trilinos/packages/teuchos/numerics/src -I/ascldap/users/kbeckwi/Trilinos/packages/teuchos/comm/src -I/ascldap/users/kbeckwi/Trilinos/packages/teuchos/parameterlist/src -I/ascldap/users/kbeckwi/buildTeuchos/packages/teuchos/core/src -I/ascldap/users/kbeckwi/Trilinos/packages/teuchos/core/src teuchosMain.cpp -L/home/projects/pwr8-rhel73-lsf/openblas/0.2.19/gcc/5.3.0/lib/ -lopenblas -lstdc++ buildTeuchos/packages/teuchos/numerics/src/libteuchosnumerics.a

This results in the failed case that we're seeing above:

STEQR test ... ( Lambda min: expected 1; Lambda max: expected 1031) ** On entry to DSTEQR parameter number 1 had an illegal value STEQR: compute symmetric tridiagonal eigenvalues: LAPACK's _STEQR failed with info = -1 < 0.( After the call, Lambda min: expected 1; Lambda max: expected 1031) FAILED ( Lambda min: expected 1, computed 1031; Lambda max: expected 1031, computed 1)

This rules out that the issue is due to a combination of ESSL/OpenBLAS. That this same combination of OpenBLAS library and compiler works when the Teuchos LAPACK interface is not involved suggests some kind of inconsistency in Teuchos. More to follow.

krcb commented 7 years ago

I'm now able to reproduce this issue independent of Teuchos. If I define the interface into LAPACK as follows (this matches the pattern used in Teuchos):

#  define PREFIX
#  define My_fcd const char *
#  define DSTEQR_F77  dsteqr_
#define CHAR_MACRO(char_var) &char_var

extern "C" {
void PREFIX DSTEQR_F77(My_fcd, const int* n, double* D, double* E, double* Z, const int* ldz, double* work, int* info);
}

void STEQR(const char COMPZ, const int n, double* D, double* E, double* Z, const int ldz, double* WORK, int* info)
{
    DSTEQR_F77(CHAR_MACRO(COMPZ), &n, D, E, Z, &ldz, WORK, info);
}

and then call into this as:

typedef double ScalarType;
std::vector<ScalarType> diagonal(DIAG_SZ,0.0);
std::vector<ScalarType> subdiagonal(DIAG_SZ-1,0.0);
int dont_call_me_info = 0;
const int dummy_ldz = 1;
ScalarType scalar_dummy = -1.0;
char char_N = 'N';
const int N = DIAG_SZ;
std::vector<MagnitudeType> mag_dummy(N,0.0);
STEQR(char_N, N, &diagonal[0], &subdiagonal[0],
                   &scalar_dummy, dummy_ldz, &mag_dummy[0], &dont_call_me_info);

then this reproduces the failure above. To gain some more insight, I removed the call into LAPACK and examine what happened to the char when we take a reference to it:

#define CHAR_MACRO(char_var) &char_var

void STEQR(const char COMPZ, const int n, double* D, double* E, double* Z, const int ldz, double* WORK, int* info)
{
    const std::string myString = CHAR_MACRO(COMPZ);
    std::cout << "myString = " << myString << "; myString.c_str() " << myString.c_str() << "; COMPZ = " << COMPZ << std::endl;
}

The output from this is insightful:

myString = N???; myString.c_str() N???; COMPZ = N

My suspicion is that char is not null-terminated and so the CHAR_MACRO(COMPZ) operation is resulting in a corrupted string. There are two things however. First, the OpenBLAS API is expecting a char* here, so we have to perform &COMPZ. Second, if we do the following:

void STEQR(const char COMPZ, const int n, double* D, double* E, double* Z, const int ldz, double* WORK, int* info)
{
    const std::string myString = CHAR_MACRO(COMPZ);
    DSTEQR_F77(myString.c_str(), &n, D, E, Z, &ldz, WORK, info);
}

the test passes. This suggests another approach:

void STEQR(const std::string COMPZ, const int n, double* D, double* E, double* Z, const int ldz, double* WORK, int* info)
{
    DSTEQR_F77(COMPZ.c_str(), &n, D, E, Z, &ldz, WORK, info);
}

This, however, would require reworking the Teuchos LAPACK API. Are there any insights as to why this was not done in the original implementation as it seems like the more portable approach?

Note that I did try out the initial suggestion on this ticket (redefining the CHAR_MACRO and Teuchos_fcd macros), but this also resulted in the test failing.

mhoemmen commented 7 years ago

@krcb wrote:

Are there any insights as to why this was not done in the original implementation as it seems like the more portable approach?

The Fortran reference implementation of the BLAS and LAPACK only ever reads the first character of any of those strings (you can see it in the code), so it's a little bit unusual for OpenBLAS to expect a zero-terminated character array.

Teuchos::BLAS and Teuchos::LAPACK actually evolved from what was called Tpetra at the time (2003). You can see this in, e.g., the comments at the top of Teuchos_BLAS.hpp by "Kris" (Kampshoff). My guess is that nobody thought very hard about it. Tpetra was just a side research project until 2010. What's weird is that Teuchos::BLAS uses enums (ETransp, EUplo) in place of character arguments, while Teuchos::LAPACK exclusively uses char arguments.

krcb commented 7 years ago

@bartlettroscoe and I discussed this, with a follow up meeting between @jwillenbring and myself in order to determine how to proceed. As a next step, we will determine whether using a Teuchos::BLAS type interface to LAPACK will resolve the corruption observed in passing a char into the Teuchos::LAPACK::STEQR interface, through the use of a lightweight prototype. I anticipate being able to report back within the next couple of days.

krcb commented 7 years ago

I was able to get the test proposed above to pass by implementing the following code inside the Teuchos::LAPACK interface:

enum CompZENum {
  STEQR_EIGENVALUES,
  STEQR_BOTH_ORIGINAL,
  STEQR_BOTH_TRIDIAG
};

const char* CompZChar[] = {"N","V","Z"};

void LAPACK<int,double>::STEQR(CompZENum compz, const int n, double* D, double* E, double* Z, const int ldz, double* WORK, int* info) const
    DSTEQR_F77(CompZChar[compz], &n, D, E, Z, &ldz, WORK, info);
  }

This is then called from the test as:

Teuchos::LAPACK<int,ScalarType> lapack;
lapack.STEQR(Teuchos::STEQR_EIGENVALUES, N, &diagonal[0], &subdiagonal[0],
                         &scalar_dummy[0], dummy_ldz, &mag_dummy[0], &dont_call_me_info);

As a next step, I'm going to verify that this approach fixes at least one test in a downstream package.

krcb commented 7 years ago

I built Belos against the modified LAPACK STEQR interface and then ran:

./Belos_Tpetra_PseudoBlockCG_hb_test.exe --verbose

i.e. the test reported on previously. This test reported passed with the output:

Teuchos::GlobalMPISession::GlobalMPISession(): started serial run
Input parameter list: 
Block Size = 1   [unused]
Maximum Iterations = 1805
Convergence Tolerance = 1e-05
Verbosity = 57
Estimate Condition Number = 1

Belos parameter list: 
Maximum Iterations = 1805   [unused]
Verbosity = 57   [unused]
Output Frequency = -1   [unused]
Convergence Tolerance = 1e-05   [unused]

Dimension of matrix: 1806
Number of right-hand sides: 1
Block size used by solver: 1
Max number of CG iterations: 1805
Relative residual tolerance: 1e-05

Belos::StatusTestGeneralOutput: Passed
  (Num calls,Mod test,State test): (914, 1, Passed)
   Passed.......OR Combination -> 
     OK...........Number of Iterations = 913 < 1805
     Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
                  residual [ 0 ] = 9.20765e-06 < 1e-05

Passed.......OR Combination -> 
  OK...........Number of Iterations = 913 < 1805
  Converged....(2-Norm Imp Res Vec) / (2-Norm Res0)
               residual [ 0 ] = 9.20765e-06 < 1e-05

================================================================================

                      TimeMonitor results over 1 processor

Timer Name                                     Global time (num calls)    
--------------------------------------------------------------------------------
Belos: Operation Op*x                          0.09915 (914)              
Belos: Operation Prec*x                        0 (0)                      
Belos: PseudoBlockCGSolMgr total solve time    0.134 (1)                  
================================================================================
Proc 0: Finished solve
Input parameter list after solve: 
Block Size = 1   [unused]
Maximum Iterations = 1805
Convergence Tolerance = 1e-05
Verbosity = 57
Estimate Condition Number = 1

Belos parameter list after solve: 
Maximum Iterations = 1805   [unused]
Verbosity = 57   [unused]
Output Frequency = -1   [unused]
Convergence Tolerance = 1e-05   [unused]

---------- Actual Residuals (normalized) ----------

Problem 0 :     9.20765e-06
Condition Estimate = 275399

End Result: TEST PASSED

This appears to be reasonable. The next step is to scope out the level of effort required to propagate this type of change through the API.

krcb commented 7 years ago

@jwillenbring and I are discussing scope of this effort.

maherou commented 7 years ago

Kris,

Would users of STEQR():

Teuchos::LAPACK<int,ScalarType> lapack; lapack.STEQR()

be required to change their code in order to adapt to the change? It is not clear to me. We cannot change the API in a way that forces source changes for users of Teuchos::LAPACK.

krcb commented 7 years ago

@bartlettroscoe and I discussed an approach where we would implement a new Teuchos::LAPACK interface that implemented the API proposed here, while keeping the existing API in place for backwards compatibility (possibly with a deprecated warning). Users of Teuchos::LAPACK would then have a choice of which API to make use of; Tech-X would also refactor usage of Teuchos::LAPACK in Trillinos to use the new API.

mhoemmen commented 7 years ago

Couldn't we just overload Teuchos::LAPACK functions that take strings, to take the existing Teuchos::E* enums used in Teuchos::BLAS instead? That would neither break backwards compatibility, nor create new enums? The implementation of the existing Teuchos enums already uses the strategy suggested by @krcb.

bartlettroscoe commented 7 years ago

Couldn't we just overload Teuchos::LAPACK functions that take strings, to take the existing Teuchos::E* enums used in Teuchos::BLAS instead? That would neither break backwards compatibility, nor create new enums? The implementation of the existing Teuchos enums already uses the strategy suggested by @krcb.

If the same emums cover all of the LAPACK use cases, I agree. If it does not, then add the new ones that are needed but don't duplicate the enums that already exist.

Given major superiority of enums over strings in static checking, the existing Teuchos::LAPACK interfaces using the raw char strings should be deprecated.

maherou commented 7 years ago

Talking with @nmhamster, it seems that we could bypass this issue if we can get the OpenBLAS implementation to restrict its string check to a single character. This would be in line with other BLAS versions and the NETLIB reference implementation. The standard itself asks for a single character.

Before doing anything ourselves, it would be worth finding out if a change to OpenBLAS is possible. Alternatively, could we use the LAPACK Fortran reference code instead? When combined with ESSL, which provides optimized versions of the most important functions, the performance of reference LAPACK functions shouldn't matter very much.

I am not at all inclined to pursue a full-scale refactoring of this capability in Teuchos unless it is absolutely necessary. This would be a disproportionate response to an issue that has never been a real problem in more than a decade, until now.

mhoemmen commented 7 years ago

@bartlettroscoe

If it does not, then add the new ones that are needed but don't duplicate the enums that already exist.

LAPACK would need a few more enums for things like _GESVD's JOBU and JOBVT arguments:

http://www.netlib.org/lapack/explore-html/d8/d2d/dgesvd_8f.html

I'm personally indifferent to the solution path, as long as there is some working BLAS and LAPACK for IBM POWER + GCC.

bartlettroscoe commented 7 years ago

I am not at all inclined to pursue a full-scale refactoring of this capability in Teuchos unless it is absolutely necessary. This would be a disproportionate response to an issue that has never been a real problem in more than a decade, until now.

But the current Teuchos::LAPACK that takes raw chars is just a bad interface. It has zero compile-time checking for those arguments. This is not a complex refactoring at all. I suspect some simple sed-like scripts could do the refactoring automatically for 95% of the cases in Trilinos. This should not take more than a day of work and it is 100% guaranteed to solve this problem now and forever more.

Besides, fixing OpenBLAS may not fix the problem since, according to @krcb, that compiler is giving garbage when you take the address of the single char.

The other issue is that fixing our copy of OpenBLAS does not help anyone else and we may keep hearing about this for years to come on different platforms where people are trying to use OpenBLAS with Trilinos and having it fail like this. Then we will hear about how Trilinos is sloppy with how it calls LAPACK, etc. (whether that is true or not).

jjellio commented 7 years ago

OpenBLAS is used outside the context of IBM Power8, whatever changes made need to not break OpenBLAS on x86. Is this really an issue with OpenBLAS? Or is it an issue with GNU GCC/Power8. If taking the address of a char is the problem, that is not an OpenBLAS' problem.

jwillenbring commented 7 years ago

@bartlettroscoe If I understood what @maherou and @nmhamster discussed, the idea was to try to get OpenBLAS fixed through a pull-request at the project level, not just our copy.

bartlettroscoe commented 7 years ago

@jwillenbring:

If I understood what @maherou and @nmhamster discussed, the idea was to try to get OpenBLAS fixed through a pull-request at the project level, not just our copy.

That is better but it may not solve the problem if I understand what @krcb told me (which may not be clearly stated above or I may have misunderstood). I believe what @krcb said is that it may be the case with this compiler that taking the address of a const char (e.g. &'N') may return garbage such that if you dereference the const char* that you don't get the right value (e.g. you don't get 'N').

@krcb, am I stating that correctly? If not, is it just that OpenBLAS is looking for null termination of the const char[] array pointed to by the const char* pointer?

This is an important detail that can't be overlooked. If I understood @krcb correctly, no changes to OpenBLAS will fix this code for this compiler.

krcb commented 7 years ago

@bartlettroscoe and I just spent 30 minutes reviewing this. We observed what is best described as inexplicable behavior in the gcc compiler (both for v4.9.3 and v5.4.0) on this machine. To try and get a better handle on this, the next step is to create a skeleton version of DSTEQR and see if that reveals what the compiler is actually doing to the variables that are passed to it.

jjellio commented 7 years ago

@krcb, @nmhamster probably knows this better than I, but it is my understanding that IBM Power8 deviated from prior IBM Power machines by supporting both little and big endian memory ordering schemes. By default, it seems Power8's big change was making big endian the default, where little had been the only ordering available on prior machines.

I wonder if changing the target endianness of the compiler to little would resolve this.

gcc --help=target | grep endian
  -maltivec=be                Generate Altivec instructions using big-endian element order
  -maltivec=le                Generate Altivec instructions using little-endian element order
  -mbig                       Produce big endian code
  -mbig-endian                Produce big endian code
  -mlittle                    Produce little endian code
  -mlittle-endian             Produce little endian code
jjellio commented 7 years ago

I checked IBM XL. By default it uses little endian. Also, GCC (5.4) requires little endian as well. Attempting to use -mbig will result in a link error because the standard libraries are compiled for little endian.

A helpful command for see which flags gcc is using: gcc -Q --help=target | less

krcb commented 7 years ago

I created a skeleton version of DSTEQR in order to examine what is being passed in from the CXX calls:

      SUBROUTINE DSTEQR( COMPZ, N, D, E, Z, LDZ, WORK, INFO )
*     .. Scalar Arguments ..
      CHARACTER          COMPZ
      INTEGER            INFO, LDZ, N
*     ..
*     .. Array Arguments ..
      DOUBLE PRECISION   D( * ), E( * ), WORK( * ), Z( LDZ, * )
*     ..
*
*  =====================================================================
*
*     .. Parameters ..
      DOUBLE PRECISION   ZERO, ONE, TWO, THREE
      PARAMETER          ( ZERO = 0.0D0, ONE = 1.0D0, TWO = 2.0D0,
     $                   THREE = 3.0D0 )
      INTEGER            MAXIT
      PARAMETER          ( MAXIT = 30 )
*     ..
*     .. Local Scalars ..
      INTEGER            I, ICOMPZ, II, ISCALE, J, JTOT, K, L, L1, LEND,
     $                   LENDM1, LENDP1, LENDSV, LM1, LSV, M, MM, MM1,
     $                   NM1, NMAXIT
      DOUBLE PRECISION   ANORM, B, C, EPS, EPS2, F, G, P, R, RT1, RT2,
     $                   S, SAFMAX, SAFMIN, SSFMAX, SSFMIN, TST
*     ..
*     .. External Functions ..
      LOGICAL            LSAME
      DOUBLE PRECISION   DLAMCH, DLANST, DLAPY2
*     ..
*     .. Intrinsic Functions ..
*      INTRINSIC          ABS, MAX, SIGN, SQRT
*     Test the input parameters.
*
      INFO = 0
*
      write (*,*) "COMPZ = ",COMPZ,";"
      write (*,*) "LDZ = ",LDZ,";"
      write (*,*) "N = ",N,";"

      END

Calling into this using a Teuchos::LAPACK-style API:

#  define PREFIX
#  define My_fcd const char*
#  define DSTEQR_F77  dsteqr_
#define CHAR_MACRO(char_var) &char_var

extern "C" {
void PREFIX DSTEQR_F77(My_fcd COMPZ, const int* n, double* D, double* E, double* Z, const int* ldz, double* work, int* info);
}

void STEQR(const char COMPZ, const int n, double* D, double* E, double* Z, const int ldz, double* WORK, int* info)
{
    DSTEQR_F77(CHAR_MACRO(COMPZ), &n, D, E, Z, &ldz, WORK, info);
}

int main(int argc, char* argv[])
{
  <code removed for clarity>

   int dont_call_me_info = 0;
   const int dummy_ldz = 1;
   ScalarType scalar_dummy = -1.0;
   char char_N = 'N';
   const int N = DIAG_SZ;
   std::vector<MagnitudeType> mag_dummy(N,0.0); 

   ScalarType lambda_min = 0.0;
   ScalarType lambda_max = 0.0;

   if( N > 2 ) {
     STEQR(char_N, N, &diagonal[0], &subdiagonal[0],
                   &scalar_dummy, dummy_ldz, &mag_dummy[0], &dont_call_me_info);

  <code removed for clarity>
}

yields the following output:

 COMPZ = ?;
 LDZ =    -16722024 ;
 N =           15 ;

which is clearly non-sensical. Replacing the STEQR call with:

     DSTEQR_F77(&char_N, &N, &diagonal[0], &subdiagonal[0],
                   &scalar_dummy, &dummy_ldz, &mag_dummy[0], &dont_call_me_info);

(i.e. calling the Fortran function directly) results in sensible output:

 COMPZ = N;
 LDZ =            1 ;
 N =         1031 ;

Note that both of these approaches valgrind reports clean.

krcb commented 7 years ago

@jjellio If I compile both the fortran skeleton code with -mbig, as:

gfortran dsteqr.f -c -mbig
gcc interMain.cpp dsteqr.o -lstdc++ -std=c++11 -lgfortran -mbig

then I get the error:

/home/projects/pwr8-rhel73-lsf/gcc/4.9.3/lib64/../lib64/libstdc++.so: error adding symbols: File in wrong format
collect2: error: ld returned 1 exit status

If I use -mlittle, then the code compiles and links and I get the behavior described above.

bartlettroscoe commented 7 years ago

@krcb, wow, those are interesting results you show above. I can't understand how changing passing the first argument is corrupting the later arguments. It is almost like the arguments are getting shifted on stack or something? Very strange.

This provides pretty strong evidence that this problem has nothing to do with OpenBLAS and the problem is likely due to how GCC is passing arguments from C++ to the extern-C interface function. This looks like it might be a compiler bug on this architecture. Do we know anyone associated with the 'ride' system who would be interested and willing to look into this strange behavior? These are very small example programs that @krcb shows above so it should be easy for anyone to reproduce on that system.

So where do we go from here? It seems that if we change to the the Teuchos::BLAS calling approach, this avoids this problem (why, we don't fully understand). Otherwise, someone needs to start digging deeper into the GCC compiler behavior on this platform. But if we find that it is a compiler bug then what good does that do to solve our problem?

I have my opinion (i.e. just change the Teuchos::LAPACK interface to use enums which seems to avoid this behavior and is just a superior interface anyway) but what do other people think?

etphipp commented 7 years ago

It's probably because the size of the first argument is different. When its a char, it is 1 byte whereas when its a pointer it is 8 bytes. If you pass the wrong sized argument, it will mess up the memory address of the subsequent arguments.

bartlettroscoe commented 7 years ago

It's probably because the size of the first argument is different. When its a char, it is 1 byte whereas when its a pointer it is 8 bytes. If you pass the wrong sized argument, it will mess up the memory address of the subsequent arguments.

The first argument should be getting passed to Fortran as a char* pointer in both cases as you can see above.

krcb commented 7 years ago

Further insights are obtained by instrumenting test program to examine the addresses and contents of the variables at each stage in the program:

      SUBROUTINE DSTEQR( COMPZ, N, D, E, Z, LDZ, WORK, INFO )
      IMPLICIT NONE
*     .. Scalar Arguments ..
      CHARACTER          COMPZ
      INTEGER            INFO, LDZ, N
*     ..
*     .. Array Arguments ..
      DOUBLE PRECISION   D( * ), E( * ), WORK( * ), Z( LDZ, * )
*
      INFO = 0
*
      write (*,*) "DSTEQR: COMPZ = ",COMPZ,"; LOC(COMPZ) = ",LOC(COMPZ)
      write (*,*) "DSTEQR: LDZ = ",LDZ,"; LOC(LDZ) = ",LOC(LDZ)
      write (*,*) "DSTEQR: N = ",N,"; LOC(N) = ",LOC(N)

      END

with the C++ calls instrumented as:

void STEQR(const char COMPZ, const int n, double* D, double* E, double* Z, const int ldz, double* WORK, int* info)
{
    std::cout << "STEQR: Address of COMPZ = " << reinterpret_cast<std::size_t>((void*) &COMPZ) << "; value at address = " << *((&COMPZ)) << std::endl;
    std::cout << "STEQR: Address of ldz = " << reinterpret_cast<std::size_t>(&ldz) << "; value at address = " << *((&ldz)) << std::endl;
    std::cout << "STEQR: Address of n = " << reinterpret_cast<std::size_t>(&n) << "; value at address = " << *((&n)) << std::endl;
    DSTEQR_F77(CHAR_MACRO(COMPZ), &n, D, E, Z, &ldz, WORK, info);
    std::cout << "STEQR: Address of COMPZ = " << reinterpret_cast<std::size_t>((void*) &COMPZ) << "; value at address = " << *((&COMPZ)) << std::endl;
    std::cout << "STEQR: Address of ldz = " << reinterpret_cast<std::size_t>(&ldz) << "; value at address = " << *((&ldz)) << std::endl;
    std::cout << "STEQR: Address of n = " << reinterpret_cast<std::size_t>(&n) << "; value at address = " << *((&n)) << std::end;
}

int main(int argc, char* argv[])
{
   <code remove for clarity>

     std::cout << "main: Address of char_N = " << reinterpret_cast<std::size_t>((void*) &char_N) << "; value at address = " << *((&char_N)) << std::endl;
     std::cout << "main: Address of dummy_ldz = " << reinterpret_cast<std::size_t>(&dummy_ldz) << "; value at address = " << *((&dummy_ldz)) << std::endl;
     std::cout << "main: Address of N = " << reinterpret_cast<std::size_t>(&N) << "; value at address = " << *((&N)) << std::endl;
     STEQR(char_N, N, &diagonal[0], &subdiagonal[0],
                   &scalar_dummy, dummy_ldz, &mag_dummy[0], &dont_call_me_info);
     std::cout << "main: Address of char_N = " << reinterpret_cast<std::size_t>((void*) &char_N) << "; value at address = " << *((&char_N)) << std::endl;
     std::cout << "main: Address of dummy_ldz = " << reinterpret_cast<std::size_t>(&dummy_ldz) << "; value at address = " << *((&dummy_ldz)) << std::endl;
     std::cout << "main: Address of N = " << reinterpret_cast<std::size_t>(&N) << "; value at address = " << *((&N)) << std::end;

   <code remove for clarity>
}

then this produces the output:

main: Address of char_N = 70368304299456; value at address = N
main: Address of dummy_ldz = 70368304299444; value at address = 1
main: Address of N = 70368304299460; value at address = 1031
STEQR: Address of COMPZ = 70368304299184; value at address = N
STEQR: Address of ldz = 70368304299216; value at address = 1
STEQR: Address of n = 70368304299188; value at address = 1031
 DSTEQR: COMPZ = ?; LOC(COMPZ) =        70368304299184
 DSTEQR: LDZ =   -439878216 ; LOC(LDZ) =        70368304299216
 DSTEQR: N =        16383 ; LOC(N) =        70368304299188
STEQR: Address of COMPZ = 70368304299184; value at address = ?
STEQR: Address of ldz = 70368304299216; value at address = -439878216
STEQR: Address of n = 70368304299188; value at address = 16383
main: Address of char_N = 70368304299456; value at address = N
main: Address of dummy_ldz = 70368304299444; value at address = 1
main: Address of N = 70368304299460; value at address = 1031

what can be seen from this is that the CXX compiler is creating temporaries for COMPZ, ldz and n when we call STEQR (as is expected) and the values of these variables are initially sensible. The Fortran DSTEQR routine is receiving the right addresses, but the values at these addresses are corrupted. When we return back into the CXX STEQR routine, the values at these addresses are still corrupted. Now, we change the API on the STEQR function to pass the char variable by reference, rather than value, i.e.

STEQR(const char& COMPZ, const int n, double* D, double* E, double* Z, const int ldz, double* WORK, int* info)

which results in the following output:

main: Address of char_N = 70368098970864; value at address = N
main: Address of dummy_ldz = 70368098970852; value at address = 1
main: Address of N = 70368098970868; value at address = 1031
STEQR: Address of COMPZ = 70368098970864; value at address = N
STEQR: Address of ldz = 70368098970604; value at address = 1
STEQR: Address of n = 70368098970600; value at address = 1031
 DSTEQR: COMPZ = N; LOC(COMPZ) =        70368098970864
 DSTEQR: LDZ =        16383 ; LOC(LDZ) =        70368098970604
 DSTEQR: N =   -645207064 ; LOC(N) =        70368098970600
STEQR: Address of COMPZ = 70368098970864; value at address = N
STEQR: Address of ldz = 70368098970604; value at address = 16383
STEQR: Address of n = 70368098970600; value at address = -645207064
main: Address of char_N = 70368098970864; value at address = N
main: Address of dummy_ldz = 70368098970852; value at address = 1
main: Address of N = 70368098970868; value at address = 1031

Here, there is no longer a temporary created for the COMPZ variable (instead the address of COMPZ corresponds to that of char_N, as expected). The Fortran DSTEQR function receives this same address for COMPZ and the value is not corrupted. However, the temporaries that are still being created for LDZ and N in the STEQR function continue to be corrupted. To resolve this latter issue, we have to replace each of the pass by value with pass by reference, i.e.

STEQR(const char& COMPZ, const int& n, double* D, double* E, double* Z, const int& ldz, double* WORK, int* info)

Now no temporaries are created in the call to the STEQR function and the output is:

main: Address of char_N = 70368326238304; value at address = N
main: Address of dummy_ldz = 70368326238292; value at address = 1
main: Address of N = 70368326238308; value at address = 1031
STEQR: Address of COMPZ = 70368326238304; value at address = N
STEQR: Address of ldz = 70368326238292; value at address = 1
STEQR: Address of n = 70368326238308; value at address = 1031
 DSTEQR: COMPZ = N; LOC(COMPZ) =        70368326238304
 DSTEQR: LDZ =            1 ; LOC(LDZ) =        70368326238292
 DSTEQR: N =         1031 ; LOC(N) =        70368326238308
STEQR: Address of COMPZ = 70368326238304; value at address = N
STEQR: Address of ldz = 70368326238292; value at address = 1
STEQR: Address of n = 70368326238308; value at address = 1031
main: Address of char_N = 70368326238304; value at address = N
main: Address of dummy_ldz = 70368326238292; value at address = 1
main: Address of N = 70368326238308; value at address = 1031

i.e. the correct addresses are received by the Fortran function and the value at these addresses are stored correctly.

krcb commented 7 years ago

@jwillenbring, @bartlettroscoe and I met this morning and discussed these results. Our proposed solution is to replace the pass by value components of the Teuchos::LAPACK API with pass by reference, ensuring that the affected variables are declared const (note that this may have to be done for the Teuchos::BLAS API's as well). This appears to be a minimally intrusive way to solve the problem that we're observing on this platform. One further test before doing so is to verify that this solves the issue observed in at least one downstream package (i.e. the Belos test above). Once that is done, we should be able to rapidly refactor the API's as described here without impacting downstream users. However, packages that use custom LAPACK API's within Trillinos may have to address this issue as well.

krcb commented 7 years ago

I updated the Teuchos::LAPACK::STEQR API to pass by reference, rebuilt Belos against this and reran Belos_Tpetra_PseudoBlockCG_hb_test.exe. The test passes. Next: refactor the Teuchos::LAPACK API to replace all pass by value with pass by reference.

bartlettroscoe commented 7 years ago

The PR #1284 is merged. Therefore, the LAPACK calls on 'ride' should be working now. Therefore, I will not put this into review.

bartlettroscoe commented 7 years ago

@krcb, @kruger, and @MicheldeMessieres,

Looks like there is a build failure associated with the PR code #1284 for C++98 with GCC 4.4.7 as shown at:

which shows:

/data/jenkins/slave/workspace/Trilinos_XYCE_MPI/MPI_OPT_DEV_XYCE/Trilinos/packages/teuchos/numerics/test/LAPACK/cxx_main.cpp: In function ‘int main(int, char**)’:
/data/jenkins/slave/workspace/Trilinos_XYCE_MPI/MPI_OPT_DEV_XYCE/Trilinos/packages/teuchos/numerics/test/LAPACK/cxx_main.cpp:119: error: using ‘typename’ outside of template

Looks easy to fix. Can one of you take a look at fix this? A non-C++11 build with GCC 4.4.7 is important for the internal SNL customer Xyce.

krcb commented 7 years ago

@MicheldeMessieres is fixing this. We should have a PR shortly.

MicheldeMessieres commented 7 years ago

@bartlettroscoe Fix is up in PR #1322

bartlettroscoe commented 7 years ago

Fix is up in PR #1322

I reviewed, merged, tested, and pushed with the checkin-test-sems.sh script. I did not test wtih GCC 4.4.7 without C++11 but the change is trivial so I can't imagine it will not fix this but we will see.

bartlettroscoe commented 7 years ago

@kruger, you are saying that if you build and run the above example code with with -O0 -g and -O2, then it fails? This would seem to indicate that we might have gotten lucky with the change pushed in #1284 fixing some failures but not others. The full problem would appear not to be fixed :-(

bartlettroscoe commented 7 years ago

@kruger and I just talked over the phone to discuss this more. He is going to see if the Teuchos:BLAS approach using enums and static char arrays fixes the problems with his simple example programs. We will wait to see what the outcome of his experimentation shows before deciding on the next step ...

kruger commented 7 years ago

Summary: The available g++/gfortran executables available on ride do not work for the C++ and fortran interoperability required for invoking STEQDR. At present, we do not understand why the BLAS GEMM tests pass, but LAPACK STEQDR tests fail, but the behavior does point to compiler problems.

On ride:~sekruge/BuildAndTest, I have 4 C++ drivers, a simple makefile to build them, and a script (runme.sh) that compiles and runs the drivers with different compiler options (runme.sh gcc5, runme.sh gcc6, runme.sh pgi). Attached are different typescript files from the runs.

The different drivers are:

  1. cxx_main_wrappers_1: Direct call to the fortran using the Teuchos style direct call
  2. cxx_main_wrappers_2: Call a function which calls does the Teuchos-style direct call
  3. cxx_main_wrappers_3: Call a method that belongs to a LAPACK class, similar to Teuchos
  4. cxx_main_wrappers_4: Direct call to the fortran using an enum.

While some combinations of compilations flags seem to have some things work with gcc, they in general do not work. All of the tests work for all combinations of flags for PGI. Things also work on other platforms, so this points to the GNU compilers on this system not working.

One possible fix would be to implement a C-version of these calls as part of Teuchos that we could then turn on as a compilation for the flag, but it seems worthwhile to try and fix the compilers.

bartlettroscoe commented 7 years ago

Below is an email exchange alerting the representative (i.e. @nmhamster) that maintains 'ride' and the response that they will look into the issue.


From: Hammond, Simon David Sent: Monday, May 22, 2017 10:57 PM To: Bartlett, Roscoe A Cc: Pawlowski, Roger P ; Bettencourt, Matthew ; Scott Kruger ; Glass, Micheal W ; Heroux, Michael A ; Willenbring, James M ; Kris Beckwith ; ride-admin Subject: RE: g++/gfortran broken on 'ride'

Ross,

We will look into. Thanks for generating the report.

S


From: Bartlett, Roscoe A Sent: Monday, May 22, 2017 8:50:59 PM To: Hammond, Simon David Cc: Pawlowski, Roger P; Bettencourt, Matthew; Scott Kruger; Glass, Micheal W; Heroux, Michael A; Willenbring, James M; Kris Beckwith Subject: g++/gfortran broken on 'ride'

Hello Si,

Scott Kruger (our newest Tech-X contractor) had done a detailed study of the Trilinos failures with GCC on ‘ride’ described in:

and from his analysis (summarized in the GitHub comment below), it looks like g++ and gfortran can’t be used to create working mixed-language programs on ‘ride’. Scott has a pretty good set of small example programs that demonstrate the problem (see below).

I am contacting you as the Test Bed representative for ATDM because it seems there is not much we can do to fix things if we are stuck using these compilers with these BLAS and LAPACK libraries on ‘ride’.

Here are a few options we might consider to move forward:

  1. Have someone dig into what is happening with g++ and gfortran that is causing these problems and perhaps fix them
  2. Change to a compiler that works (e.g. PGI)
  3. Use a C-only implementation of LAPACK (and perhaps BLAS using CLAPACK?)

It seems like option-3 is the easiest as we could just use clapack but would provide the worst performance. Given that this is an IBM machine, is the XL compiler available? Otherwise, is the PGI compiler an option on that machine? In any case, if someone cares about the GCC/gfortran compilers on this system, then someone should dig into what is happening (but I suspect that could be a big time-sink). But all three of those options realistically requires support from the Test Bed team for ‘ride’.

What do you think we should do?

Thanks,

P.S. I will archive this in the GitHub issue #1208 but I just wanted to make sure that I CCed a few key people who should be alerted to this development.

kruger commented 7 years ago

I built clapack to see if we can use it with gcc to overcome some of the problems. I am now getting a whole set of new tests that I did not have previously:

 47 - TeuchosCore_TypeConversions_UnitTest (Failed)
 65 - TeuchosParameterList_ObjectBuilder_UnitTests (OTHER_FAULT)
 98 - TeuchosNumerics_BLAS_test (SEGFAULT)
100 - TeuchosNumerics_BLAS_tmpl_comp_test (SEGFAULT)
102 - TeuchosNumerics_DenseSolver_test (SEGFAULT)
103 - TeuchosNumerics_BandDenseSolver_test (SEGFAULT)
104 - TeuchosNumerics_SpdDenseSolver_test (SEGFAULT)
105 - TeuchosNumerics_QRDenseSolver_test (SEGFAULT)
106 - TeuchosNumerics_LAPACK_test (SEGFAULT)
109 - TeuchosNumerics_BLAS_example (SEGFAULT)
110 - TeuchosNumerics_DenseMatrix_example (SEGFAULT)
111 - TeuchosNumerics_SymDenseMatrix_example (SEGFAULT)
112 - TeuchosNumerics_hilbert (SEGFAULT)

Looking into what's going on.

kruger commented 7 years ago

Going through the first 3 errors

47 - TeuchosCore_TypeConversions_UnitTest (Failed)

Test that code {valF = asSafe<float> (os.str ());} throws std::range_error: passed

 Exception message for expected exception:

  /ascldap/users/sekruge/Trilinos-develop/packages/teuchos/core/src/Teuchos_as.hpp:558:

  Throw number = 1

  Throw test that evaluated to true: errno == ERANGE && (val != 0)

  Teuchos::ValueTypeConversionTraits<float, std::string>::convert: The value in the given string "-1.79769313e+308" overflows float.

 Test that code {valF = asSafe<float> (os.str ());} throws std::range_error: passed

 Exception message for expected exception:

  /ascldap/users/sekruge/Trilinos-develop/packages/teuchos/core/src/Teuchos_as.hpp:558:

  Throw number = 2

  Throw test that evaluated to true: errno == ERANGE && (val != 0)

  Teuchos::ValueTypeConversionTraits<float, std::string>::convert: The value in the given string "1.79769313e+308" overflows float.

65 - TeuchosParameterList_ObjectBuilder_UnitTests (OTHER_FAULT)

5. Teuchos_ObjectBuilder_create_UnitTest ... [Passed] (0.00083 sec)
terminate called after throwing an instance of 'Teuchos::Exceptions::InvalidParameterName'
  what():  Error, the parameter {name="Hello",type="string",value="World"}
in the parameter (sub)list "ANONYMOUS"
was not found in the list of valid parameters!

The valid parameters and types are:
  {
    "Object Type" : string = Foo A
    "Foo A" : ParameterList = ...
  }

Throw number = 12

6. Teuchos_ObjectBuilder_setParameterList_UnitTest ... Abort (core dumped)

98 - TeuchosNumerics_BLAS_test (SEGFAULT)

This fails in srot which is part of BLAS.

Conclusions

Need to get the compilers fixed.

bartlettroscoe commented 7 years ago

@kruger, the test TeuchosParameterList_ObjectBuilder_UnitTests is a known problematic test (see #831 and #449) soon to be fixed with the merge of the branch associated with #229.

As for the other BLAS and LAPACK-related tests, I might dig a little deeper. If you link against a mock C version of srot, for example, are the arguments passed correctly or not?

bartlettroscoe commented 7 years ago

@kruger, can you please attach the full output for the test TeuchosCore_TypeConversions_UnitTest? The output how show above suggests that those unit tests actually passed (because they thrown an expected exception).

maherou commented 7 years ago

@nmhamster, Please update this issue when you get news from the vendor.

kruger commented 7 years ago

TeuchosCore_TypeConversions_UnitTest.txt

Sorry for the delay. Here is the output.

kruger commented 7 years ago

The srot is a C version.

The full build of the clapack can be seen here: ~sekruge/builds-develop/clapack_cmake-3.2.1

bartlettroscoe commented 7 years ago

@kruger,

Thanks for the unit test output. By my count, what is failing are 17th and 18th checks in this unit test:

Test that code {valD = asSafe<double> (os.str ());} throws std::range_error: failed (code did not throw an exception at all) 
 Test that code {valD = asSafe<double> (os.str ());} throws std::range_error: failed (code did not throw an exception at all) 
...
 [FAILED]  (0.000282 sec) asSafe_stringToReal_UnitTest 
 Location: /ascldap/users/sekruge/Trilinos-develop/packages/teuchos/core/test/TypeConversions/TypeConversions_UnitTest.cpp:274 

@mhoemmen, by my count, the failing checks in that unit test are coming from the unit test code starting on line 407:

  //
  // Test string -> double conversions that should throw,
  // if sizeof(long double) > sizeof(double).
  //
  if (sizeof (long double) > sizeof (double)) {
    {
      std::ostringstream os;
      os.precision (36);
      os << minLD;
      TEST_THROW(valD = asSafe<double> (os.str ()), std::range_error);
    }
    {
      std::ostringstream os;
      os.precision (36);
      os << maxLD;
      TEST_THROW(valD = asSafe<double> (os.str ()), std::range_error);
    }
  }

So something about these asSafe() checks must be using LAPACK and the CLAPACK functions must not be returning the right values used in these checks?

@kruger, do any Belos test failures originally reported in #1191 now pass with your build of CLAPACK on 'ride'? If the answer is "yes", then it is worth diving more into these other failing tests. If the answer is "no", then we can consider this approach a dead-end and then we just need to wait on IBM to fix these compilers on this machine (unless someone has a better idea).

mhoemmen commented 7 years ago

@bartlettroscoe Are those the only checks that are failing? My guess is that the compiler implements long double as a "double-double" (a.k.a. ddreal), not as a true 80-bit or 128-bit IEEE 754 floating-point value. That could affect both the min and the max representable positive long double values.

I'm all for commenting out the test for now until we have time to research what the min and max values should be for whatever this fake long double is actually doing.

bartlettroscoe commented 7 years ago

Are those the only checks that are failing? My guess is that the compiler implements long double as a "double-double" (a.k.a. ddreal), not as a true 80-bit or 128-bit IEEE 754 floating-point value. That could affect both the min and the max representable positive long double values.

I'm all for commenting out the test for now until we have time to research what the min and max values should be for whatever this fake long double is actually doing.

I think these tests passed before with the same compiler. So these new failures don't appear to be related to the compiler itself.

@kruger, can you comment on this since you are on that machine and ran these tests?

kruger commented 7 years ago

Attached are various log files. test.sh.txt is the one that shows all of the tests failing. There are a lot of them. The compiler is gcc 5.4.0. Is there a reason that it would do a long double on power architecture different than Intel? Is there a way of checking from a trilinos build?

test.sh.txt TeuchosCore_TypeConversions_UnitTest.txt TeuchosNumerics_BLAS.txt TeuchosParameterList_ObjectBuilder_UnitTests.exe.txt

mhoemmen commented 7 years ago

@kruger IBM historically implemented 128-bit (hex) floating-point arithmetic (System/370). POWER9 will have hardware 128-bit floating-point support. Thus, it's not surprising that the compiler does something different on POWER.

mhoemmen commented 7 years ago

so what's going on here? do we just need to fix the long double tests or what?