trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Zoltan Test Failures on Knights Landing with OpenMPI 1.10.4 and Intel 17.0.098 #600

Closed by nmhamster 7 years ago

nmhamster commented 8 years ago

Zoltan team, I am seeing some issues with the latest builds on Knights Landing with OpenMPI 1.10.4 and the Intel 17.0.098 compilers. The failing cases report an insufficient memory error. The node has 16GB + 96GB of memory, so I think this should be sufficient?

$ ctest -V -R Zoltan_ch_drake_zoltan_parallel
UpdateCTestConfiguration  from :/home/sdhammo/git/trilinos-github-repo/build-knl-170098/DartConfiguration.tcl
Parse Config file:/home/sdhammo/git/trilinos-github-repo/build-knl-170098/DartConfiguration.tcl
 Add coverage exclude regular expressions.
SetCTestConfiguration:CMakeCommand:/home/projects/x86-64-knl/cmake/3.5.2/bin/cmake
UpdateCTestConfiguration  from :/home/sdhammo/git/trilinos-github-repo/build-knl-170098/DartConfiguration.tcl
Parse Config file:/home/sdhammo/git/trilinos-github-repo/build-knl-170098/DartConfiguration.tcl
Test project /home/sdhammo/git/trilinos-github-repo/build-knl-170098
Constructing a list of tests
Done constructing a list of tests
Checking test dependency graph...
Checking test dependency graph end
test 199
    Start 199: Zoltan_ch_drake_zoltan_parallel

199: Test command: /home/projects/x86-64-knl/cmake/3.5.2/bin/cmake "-DTEST_CONFIG=" "-P" "/home/sdhammo/git/trilinos-github-repo/build-knl-170098/packages/zoltan/test/ch_drake/Zoltan_ch_drake_zoltan_parallel.cmake"
199: Test timeout computed to be: 1500
199:
199: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
199:
199: Advanced Test: Zoltan_ch_drake_zoltan_parallel
199:
199: Selected Test/CTest Propeties:
199:   CATEGORIES = NIGHTLYPERFORMANCE
199:   PROCESSORS = 3
199:   TIMEOUT    = DEFAULT
199:
199: Running test commands: TEST_0
199:
199: ================================================================================
199:
199: TEST_0
199:
199: Running: "/usr/bin/perl" "../ctest_zoltan.pl" "--np" "3" "--debug" "--mpiexec" "/home/projects/x86-64-knl/openmpi/1.10.4/intel/17.0.098/bin/mpiexec" "--mpiexecarg" "-np" "--pkg" "Zoltan"
199:
199: --------------------------------------------------------------------------------
199:
199: CTEST_FULL_OUTPUT
199: --np3--debug--mpiexec/home/projects/x86-64-knl/openmpi/1.10.4/intel/17.0.098/bin/mpiexec--mpiexecarg-np--pkgZoltan
199: DEBUG HOSTNAME node02.bowman.sandia.gov node0
199: DEBUG:  package Zoltan
199:  08:21:00 up 58 days, 22:32,  0 users,  load average: 3.69, 2.34, 1.08
199: DEBUG:  mpiexec /home/projects/x86-64-knl/openmpi/1.10.4/intel/17.0.098/bin/mpiexec --mca mpi_yield_when_idle 1 -np
199: DEBUG  Dir /home/sdhammo/git/trilinos-github-repo/build-knl-170098/packages/zoltan/test/ch_drake dirname drake
199: DEBUG  Outfilebase: ;  Dropbase:
199: DEBUG  Running test 0 on zdrive.inp.rcb
199: DEBUG  Test name:  rcb
199: DEBUG  Archfilebase: drake.rcb.3.; Dropbase: drake.rcb.drops.3.
199: DEBUG Executing now:  /home/projects/x86-64-knl/openmpi/1.10.4/intel/17.0.098/bin/mpiexec --mca mpi_yield_when_idle 1 -np 3 ../zdrive.exe zdrive.inp.rcb 2>&1 | tee drake.rcb.3.outerr
199:
199:
199:
199: Reading the command file, zdrive.inp.rcb
199: Input values:
199:   Zoltan version 3.83
199:   zdrive version 1.0
199:   Total number of Processors = 3
199:
199:   Performing load balance using rcb.
199:    Parameters:
199:        remap 0
199:        obj_weight_dim 1
199:        keep_cuts 1
199:        debug_level 3
199:        timer user
199:
199:   Initially distribute input objects according to assignments in file.
199: ##########################################################
199: ZOLTAN Load balancing method = 3 (RCB)
199: Starting iteration 1
199: =========================messages from Proc 0=========================
199: Proc 0:    fatal: insufficient memory
199: Proc 0:        in file /home/sdhammo/git/trilinos-github-repo/packages/zoltan/src/ch/ch_dist_graph.c
199: Proc 0:            at line 407
199: Proc 0:    fatal: Error returned from chaco_dist_graph
199: Proc 0:        in file /home/sdhammo/git/trilinos-github-repo/packages/zoltan/src/driver/dr_chaco_io.c
199: Proc 0:            at line 248
199: Proc 0:    fatal: Error returned from read_chaco_mesh
199:
199: Proc 0:        in file /home/sdhammo/git/trilinos-github-repo/packages/zoltan/src/driver/dr_main.c
199: Proc 0:            at line 571
199: Proc 0:    fatal: Error returned from read_mesh
199:
199: Proc 0:        in file /home/sdhammo/git/trilinos-github-repo/packages/zoltan/src/driver/dr_main.c
199: Proc 0:            at line 334
199: --------------------------------------------------------------------------
199: MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
199: with errorcode -1.
199:
199: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
199: You may or may not see output from other processes, depending on
199: exactly when Open MPI kills them.
199: --------------------------------------------------------------------------
199: DEBUG system results 0
199: Using default indextype
199: DEBUG moving files:  drake.out.3.0 output/drake.rcb.3.0
199: Test drake:rcb FAILED (Missing output files)
199: DEBUG  Running test 1 on zdrive.inp.rcb-ts
199: DEBUG  Test name:  rcb-ts
199: DEBUG  Archfilebase: drake.rcb-ts.3.; Dropbase: drake.rcb-ts.drops.3.
199: DEBUG Executing now:  /home/projects/x86-64-knl/openmpi/1.10.4/intel/17.0.098/bin/mpiexec --mca mpi_yield_when_idle 1 -np 3 ../zdrive.exe zdrive.inp.rcb-ts 2>&1 | tee drake.rcb-ts.3.outerr
199:
199:
199:
199: Reading the command file, zdrive.inp.rcb-ts
199: Input values:
199:   Zoltan version 3.83
199:   zdrive version 1.0
199:   Total number of Processors = 3
199:
199:   Performing load balance using rcb.
199:    Parameters:
199:        remap 0
199:        obj_weight_dim 1
199:        tflops_special 1
199:        debug_level 3
199:        timer user
199:
199:   Initially distribute input objects according to assignments in file.
199: ##########################################################
199: ZOLTAN Load balancing method = 3 (RCB)
199: Starting iteration 1
199: =========================messages from Proc 0=========================
199: Proc 0:    fatal: insufficient memory
199: Proc 0:        in file /home/sdhammo/git/trilinos-github-repo/packages/zoltan/src/ch/ch_dist_graph.c
199: Proc 0:            at line 407
199: Proc 0:    fatal: Error returned from chaco_dist_graph
199: Proc 0:        in file /home/sdhammo/git/trilinos-github-repo/packages/zoltan/src/driver/dr_chaco_io.c
199: Proc 0:            at line 248
199: Proc 0:    fatal: Error returned from read_chaco_mesh
199:
199: Proc 0:        in file /home/sdhammo/git/trilinos-github-repo/packages/zoltan/src/driver/dr_main.c
199: Proc 0:            at line 571
199: Proc 0:    fatal: Error returned from read_mesh
199:
199: Proc 0:        in file /home/sdhammo/git/trilinos-github-repo/packages/zoltan/src/driver/dr_main.c
199: Proc 0:            at line 334
199: --------------------------------------------------------------------------
199: MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
199: with errorcode -1.
199:
199: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
199: You may or may not see output from other processes, depending on
199: exactly when Open MPI kills them.
199: --------------------------------------------------------------------------
199: DEBUG system results 0
199: Using default indextype
199: DEBUG moving files:  drake.out.3.0 output/drake.rcb-ts.3.0
199: Test drake:rcb-ts FAILED (Missing output files)
199: DEBUG  Running test 2 on zdrive.inp.rib
199: DEBUG  Test name:  rib
199: DEBUG  Archfilebase: drake.rib.3.; Dropbase: drake.rib.drops.3.
199: DEBUG Executing now:  /home/projects/x86-64-knl/openmpi/1.10.4/intel/17.0.098/bin/mpiexec --mca mpi_yield_when_idle 1 -np 3 ../zdrive.exe zdrive.inp.rib 2>&1 | tee drake.rib.3.outerr
199:
199:
199:
199: Reading the command file, zdrive.inp.rib
199: Input values:
199:   Zoltan version 3.83
199:   zdrive version 1.0
199:   Total number of Processors = 3
199:
199:   Performing load balance using rib.
199:    Parameters:
199:        remap 0
199:        obj_weight_dim 1
199:        keep_cuts 1
199:        debug_level 3
199:        timer user
199:
199:   Initially distribute input objects according to assignments in file.
199: ##########################################################
199: ZOLTAN Load balancing method = 7 (RIB)
199: Starting iteration 1
199: =========================messages from Proc 0=========================
199: Proc 0:    fatal: insufficient memory
199: Proc 0:        in file /home/sdhammo/git/trilinos-github-repo/packages/zoltan/src/ch/ch_dist_graph.c
199: Proc 0:            at line 407
199: Proc 0:    fatal: Error returned from chaco_dist_graph
199: Proc 0:        in file /home/sdhammo/git/trilinos-github-repo/packages/zoltan/src/driver/dr_chaco_io.c
199: Proc 0:            at line 248
199: Proc 0:    fatal: Error returned from read_chaco_mesh
199:
199: Proc 0:        in file /home/sdhammo/git/trilinos-github-repo/packages/zoltan/src/driver/dr_main.c
199: Proc 0:            at line 571
199: Proc 0:    fatal: Error returned from read_mesh
199:
199: Proc 0:        in file /home/sdhammo/git/trilinos-github-repo/packages/zoltan/src/driver/dr_main.c
199: Proc 0:            at line 334
199: --------------------------------------------------------------------------
199: MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
199: with errorcode -1.
199:
199: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
199: You may or may not see output from other processes, depending on
199: exactly when Open MPI kills them.
199: --------------------------------------------------------------------------
199: DEBUG system results 0
199: Using default indextype
199: DEBUG moving files:  drake.out.3.0 output/drake.rib.3.0
199: Test drake:rib FAILED (Missing output files)
199: DEBUG  Running test 3 on zdrive.inp.rib-ts
199: DEBUG  Test name:  rib-ts
199: DEBUG  Archfilebase: drake.rib-ts.3.; Dropbase: drake.rib-ts.drops.3.
199: DEBUG Executing now:  /home/projects/x86-64-knl/openmpi/1.10.4/intel/17.0.098/bin/mpiexec --mca mpi_yield_when_idle 1 -np 3 ../zdrive.exe zdrive.inp.rib-ts 2>&1 | tee drake.rib-ts.3.outerr
199:
199:
199:
199: Reading the command file, zdrive.inp.rib-ts
199: Input values:
199:   Zoltan version 3.83
199:   zdrive version 1.0
199:   Total number of Processors = 3
199:
199:   Performing load balance using rib.
199:    Parameters:
199:        remap 0
199:        obj_weight_dim 1
199:        tflops_special 1
199:        debug_level 3
199:        timer user
199:
199:   Initially distribute input objects according to assignments in file.
199: ##########################################################
199: ZOLTAN Load balancing method = 7 (RIB)
199: Starting iteration 1
199: =========================messages from Proc 0=========================
199: Proc 0:    fatal: insufficient memory
199: Proc 0:        in file /home/sdhammo/git/trilinos-github-repo/packages/zoltan/src/ch/ch_dist_graph.c
199: Proc 0:            at line 407
199: Proc 0:    fatal: Error returned from chaco_dist_graph
199: Proc 0:        in file /home/sdhammo/git/trilinos-github-repo/packages/zoltan/src/driver/dr_chaco_io.c
199: Proc 0:            at line 248
199: Proc 0:    fatal: Error returned from read_chaco_mesh
199:
199: Proc 0:        in file /home/sdhammo/git/trilinos-github-repo/packages/zoltan/src/driver/dr_main.c
199: Proc 0:            at line 571
199: Proc 0:    fatal: Error returned from read_mesh
199:
199: Proc 0:        in file /home/sdhammo/git/trilinos-github-repo/packages/zoltan/src/driver/dr_main.c
199: Proc 0:            at line 334
199: --------------------------------------------------------------------------
199: MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
199: with errorcode -1.
199:
199: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
199: You may or may not see output from other processes, depending on
199: exactly when Open MPI kills them.
199: --------------------------------------------------------------------------
199: DEBUG system results 0
199: Using default indextype
199: DEBUG moving files:  drake.out.3.0 output/drake.rib-ts.3.0
199: Test drake:rib-ts FAILED (Missing output files)
199: DEBUG  Running test 4 on zdrive.inp.hsfc
199: DEBUG  Test name:  hsfc
199: DEBUG  Archfilebase: drake.hsfc.3.; Dropbase: drake.hsfc.drops.3.
199: DEBUG Executing now:  /home/projects/x86-64-knl/openmpi/1.10.4/intel/17.0.098/bin/mpiexec --mca mpi_yield_when_idle 1 -np 3 ../zdrive.exe zdrive.inp.hsfc 2>&1 | tee drake.hsfc.3.outerr
199:
199:
199:
199: Reading the command file, zdrive.inp.hsfc
199: Input values:
199:   Zoltan version 3.83
199:   zdrive version 1.0
199:   Total number of Processors = 3
199:
199:   Performing load balance using hsfc.
199:    Parameters:
199:        remap 0
199:        obj_weight_dim 1
199:        keep_cuts 1
199:        debug_level 3
199:        timer user
199:
199:   Initially distribute input objects according to assignments in file.
199: ##########################################################
199: ZOLTAN Load balancing method = 8 (HSFC)
199: Starting iteration 1
199: =========================messages from Proc 0=========================
199: Proc 0:    fatal: insufficient memory
199: Proc 0:        in file /home/sdhammo/git/trilinos-github-repo/packages/zoltan/src/ch/ch_dist_graph.c
199: Proc 0:            at line 407
199: Proc 0:    fatal: Error returned from chaco_dist_graph
199: Proc 0:        in file /home/sdhammo/git/trilinos-github-repo/packages/zoltan/src/driver/dr_chaco_io.c
199: Proc 0:            at line 248
199: Proc 0:    fatal: Error returned from read_chaco_mesh
199:
199: Proc 0:        in file /home/sdhammo/git/trilinos-github-repo/packages/zoltan/src/driver/dr_main.c
199: Proc 0:            at line 571
199: Proc 0:    fatal: Error returned from read_mesh
199:
199: Proc 0:        in file /home/sdhammo/git/trilinos-github-repo/packages/zoltan/src/driver/dr_main.c
199: Proc 0:            at line 334
199: --------------------------------------------------------------------------
199: MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
199: with errorcode -1.
199:
199: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
199: You may or may not see output from other processes, depending on
199: exactly when Open MPI kills them.
199: --------------------------------------------------------------------------
199: DEBUG system results 0
199: Using default indextype
199: DEBUG moving files:  drake.out.3.0 output/drake.hsfc.3.0
199: Test drake:hsfc FAILED (Missing output files)
199: DEBUG  Running test 5 on zdrive.inp.phg
199: DEBUG  Test name:  phg
199: DEBUG  Archfilebase: drake.phg.3.; Dropbase: drake.phg.drops.3.
199: DEBUG Executing now:  /home/projects/x86-64-knl/openmpi/1.10.4/intel/17.0.098/bin/mpiexec --mca mpi_yield_when_idle 1 -np 3 ../zdrive.exe zdrive.inp.phg 2>&1 | tee drake.phg.3.outerr
199:
199:
199:
199: Reading the command file, zdrive.inp.phg
199: Input values:
199:   Zoltan version 3.83
199:   zdrive version 1.0
199:   Total number of Processors = 3
199:
199:   Performing load balance using hypergraph.
199:    Parameters:
199:        remap 0
199:        obj_weight_dim 1
199:        phg_edge_size_threshold 1.0
199:
199:   Initially distribute input objects according to assignments in file.
199: ##########################################################
199: ZOLTAN Load balancing method = 10 (HYPERGRAPH)
199: Starting iteration 1
199: =========================messages from Proc 0=========================
199: Proc 0:    fatal: insufficient memory
199: Proc 0:        in file /home/sdhammo/git/trilinos-github-repo/packages/zoltan/src/ch/ch_dist_graph.c
199: Proc 0:            at line 407
199: Proc 0:    fatal: Error returned from chaco_dist_graph
199: Proc 0:        in file /home/sdhammo/git/trilinos-github-repo/packages/zoltan/src/driver/dr_chaco_io.c
199: Proc 0:            at line 248
199: Proc 0:    fatal: Error returned from read_chaco_mesh
199:
199: Proc 0:        in file /home/sdhammo/git/trilinos-github-repo/packages/zoltan/src/driver/dr_main.c
199: Proc 0:            at line 571
199: Proc 0:    fatal: Error returned from read_mesh
199:
199: Proc 0:        in file /home/sdhammo/git/trilinos-github-repo/packages/zoltan/src/driver/dr_main.c
199: Proc 0:            at line 334
199: --------------------------------------------------------------------------
199: MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
199: with errorcode -1.
199:
199: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
199: You may or may not see output from other processes, depending on
199: exactly when Open MPI kills them.
199: --------------------------------------------------------------------------
199: DEBUG system results 0
199: Using default indextype
199: DEBUG moving files:  drake.out.3.0 output/drake.phg.3.0
199: Test drake:phg FAILED (Missing output files)
199: Test drake:  0 out of 6 tests PASSED.
199: Test drake:  6 out of 6 tests FAILED.
199:
199: --------------------------------------------------------------------------------
199:
199: TEST_0: Return code = 0
199: TEST_0: Pass criteria = Return code
199: TEST_0: Result = PASSED
199:
199: ================================================================================
199:
199: OVERALL FINAL RESULT: TEST PASSED (Zoltan_ch_drake_zoltan_parallel)
199:
199: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
199:
1/1 Test #199: Zoltan_ch_drake_zoltan_parallel ...***Failed  Error regular expression found in output. Regex=[FAILED]  9.20 sec

0% tests passed, 1 tests failed out of 1

Label Time Summary:
Zoltan    =   9.20 sec (1 test)

Total Test time (real) =  10.39 sec

The following tests FAILED:
    199 - Zoltan_ch_drake_zoltan_parallel (Failed)
Errors while running CTest
kddevin commented 8 years ago

Indeed, this test graph has only four vertices, so we shouldn't exhaust memory while reading the graph.

Unfortunately, I am unable to reproduce this problem in my environment. I don't have access to Intel 17.0. The tests build and run fine on my workstations with Intel 16.0 as well as with clang and gcc.

@nmhamster Do you see this problem with other versions of the Intel compiler? May I access your build to do some debugging (and, if so, how)?

nmhamster commented 8 years ago

@kddevin do you have access to the Bowman test machine?

kddevin commented 8 years ago

@nmhamster I just requested it...stay tuned.

vjleung commented 8 years ago

I do not.

nmhamster commented 8 years ago

@kddevin - thanks, I am hoping we can use the system to reproduce the same results. @vjleung - can you request access?

kddevin commented 8 years ago

@vjleung No worries, Vitus; I will handle this issue with Si. No need to request a new account.

vjleung commented 8 years ago

@kddevin Okay, thanks.

mndevec commented 8 years ago

@kddevin @nmhamster Si, Karen, I ran this with both the -O3 and -O0 flags. I did not track down the bug, but it fails only with the -O3 flag, as I have experienced previously in my own code. It is a bold claim, but I suspect something is wrong with the compiler. The earlier test example that produced wrong results with the -O3 flag is below.

#include <iostream>

int main(){
  // Version 1: accumulate the bit count directly in the inner loop's increment.
  int el = 0;
  for (int i = 0; i < 5; ++i){
    int k = i;
    for (;k; ++el){
      k &= k-1;
    }
  }
  std::cout << el << std::endl;

  // Version 2: accumulate into a per-iteration counter el2, then add it to el.
  el = 0;
  for (int i = 0; i < 5; ++i){
    int el2 = 0;
    int k = i;
    for (;k; ++el2){
      k &= k-1;
    }
    el += el2;
  }
  std::cout << el << std::endl;

}
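
For reference (this explanation is not part of the original report): both loops count the total number of set bits in the integers 0 through 4 with the k &= k-1 trick, which is 0 + 1 + 1 + 2 + 1 = 5, so both lines should print 5. A minimal independent check of that expected value, using a plain shift-and-mask count instead of the loop shape under suspicion, might look like this:

#include <iostream>

int main(){
  // Count the set bits of 0..4 by shifting and masking, avoiding the
  // k &= k-1 loop shape exercised by the test case above.
  int total = 0;
  for (int i = 0; i < 5; ++i){
    for (int k = i; k != 0; k >>= 1){
      total += k & 1;
    }
  }
  std::cout << total << std::endl;  // expected: 5
}

If the Intel 17.0 -O3 build prints something other than 5 for either loop of the test case while this check still prints 5, that would support the miscompilation theory.
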
mndevec commented 8 years ago

Similar to using the -O0 flag, it seems that changing the loop below (around line ~400 of ch_dist_graph.c)

          for (i = 0; i < nsend; i++) {
            v = vtx_list[i];
            nvtx_edges += old_xadj[v+1] - old_xadj[v];
          } 

to

          for (i = 0; i < nsend; i++) {
            v = vtx_list[i];
            volatile int tmp = old_xadj[v+1] - old_xadj[v];
            nvtx_edges += tmp;
          }

fixes the problem, but I don't see why volatile would be necessary.
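
For anyone who wants to poke at this outside of zdrive, here is a hypothetical, self-contained sketch of the same loop pattern with made-up CSR-style data (the names old_xadj, vtx_list, and nsend follow the Zoltan source; the values and the surrounding program are invented for illustration, not taken from the drake test). Building it with -O0 and -O3 and comparing the printed sums would show whether this loop shape alone is enough to trigger the problem:

#include <iostream>
#include <vector>

int main(){
  // Made-up CSR-style offsets for a 4-vertex graph and a list of vertices
  // whose edge counts are being summed, mimicking the loop in ch_dist_graph.c.
  std::vector<int> old_xadj = {0, 2, 5, 7, 10};
  std::vector<int> vtx_list = {0, 2, 3};
  int nsend = static_cast<int>(vtx_list.size());

  // Original form of the loop.
  int nvtx_edges = 0;
  for (int i = 0; i < nsend; i++){
    int v = vtx_list[i];
    nvtx_edges += old_xadj[v+1] - old_xadj[v];
  }

  // Workaround form with the difference forced through a volatile temporary.
  int nvtx_edges_wa = 0;
  for (int i = 0; i < nsend; i++){
    int v = vtx_list[i];
    volatile int tmp = old_xadj[v+1] - old_xadj[v];
    nvtx_edges_wa += tmp;
  }

  // Both sums should be (2 - 0) + (7 - 5) + (10 - 7) = 7.
  std::cout << nvtx_edges << " " << nvtx_edges_wa << std::endl;
}
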

kddevin commented 8 years ago

As a sanity check, I ran these tests through Purify (with gcc 4.7.2); no memory misbehavior was reported.

kddevin commented 7 years ago

@nmhamster Does this bug persist?
Similar issue #1010 was closed recently.

kddevin commented 7 years ago

Hi, @nmhamster. Can we close this issue, or does it persist? Thanks.

nmhamster commented 7 years ago

@kddevin - I think we can close it. If I find the problem again, I will reopen. Thank you!

kddevin commented 7 years ago

Thanks, @nmhamster