pnnl / ExaGO

High-performance power grid optimization for stochastic, security-constrained, and multi-period ACOPF problems.

Incline Test Failures #92

Open jaelynlitz opened 11 months ago

jaelynlitz commented 11 months ago

Issue type

Relates to

Summary

There are two isolated test failures on Incline: one seg fault and one timeout. Neither occurs on Deception or Newell; other AMD platforms are still to be checked. These failures were potentially introduced with hiop@1.0.0.

Creating a separate issue to isolate these failures from #3 and #43, and to let #84 continue without these tests blocking it.

Exact commands to reproduce, if applicable

Relevant logs and/or screenshots, if applicable

  1. FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_SPARSE_GPU_TOML_TESTSUITE

    • seg fault
    • https://gitlab.pnnl.gov/exasgd/frameworks/exago-github-mirror/-/jobs/138118#L108
      20/57 Test #20: FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_SPARSE_GPU_TOML_TESTSUITE .................***Failed    2.76 sec
      [ExaGO] Creating OPFlow Functionality Test
      Test Description: datafiles/case9/case9mod.m base case
      [Warning] Hiop does not understand option 'dualsInitialization' and will ignore its value 'zero'.
      [Warning] Detected 1 fixed variables out of a total of 24.
      ===============
      Hiop SOLVER
      ===============
      Using 1 MPI ranks.
      ---------------
      Problem Summary
      ---------------
      Total number of variables: 24
      lower/upper/lower_and_upper bounds: 16 / 16 / 16
      Total number of equality constraints: 18
      Total number of inequality constraints: 18
      lower/upper/lower_and_upper bounds: 18 / 18 / 18
      iter    objective     inf_pr     inf_du   lg(mu)  alpha_du   alpha_pr linesrch
      0  1.0318125e+04 1.800e+00  4.460e+03  -1.00  0.000e+00  0.000e+00  -(-)
      [0]PETSC ERROR: ------------------------------------------------------------------------
      [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
      [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
      [0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
      [0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
      [0]PETSC ERROR: to get more information on the crash.
      [0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
      --------------------------------------------------------------------------
      MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
      with errorcode 59.
      NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
      You may or may not see output from other processes, depending on
      exactly when Open MPI kills them.
      --------------------------------------------------------------------------
  2. FUNCTIONALITY_TEST_SOPFLOW_SCENARIO_RAJA_GPU_TOML
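
For triaging CI logs like the one above, here is a minimal sketch of pulling the failing tests out of CTest output. It is not part of ExaGO: `failed_tests` and the regular expression are hypothetical helpers, and the pattern assumes the result-line layout seen in the log above (`N/M Test #K: NAME ....***Status  T sec`).

```python
import re

# Matches CTest result lines such as
#   "20/57 Test #20: SOME_TEST_NAME .........***Failed    2.76 sec"
# Assumption: the layout follows the log excerpt quoted in this issue.
CTEST_LINE = re.compile(
    r"^\s*(\d+)/(\d+)\s+Test\s+#(\d+):\s+(\S+)\s+\.*\s*"
    r"(\*\*\*.+?|Passed)\s+([\d.]+)\s+sec"
)


def failed_tests(log_text):
    """Return (name, status, seconds) for every non-passing test line."""
    failures = []
    for line in log_text.splitlines():
        match = CTEST_LINE.match(line)
        if match is None:
            continue  # solver output, PETSc errors, etc.
        status = match.group(5)
        if status != "Passed":
            # Drop the leading "***" that CTest prints before the status.
            failures.append(
                (match.group(4), status.lstrip("*"), float(match.group(6)))
            )
    return failures
```

Calling `failed_tests` on the text of a saved job log returns one tuple per failing test, which makes it easier to compare the set of failures across platforms such as Incline, Deception, and Newell.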

abhyshr commented 11 months ago

Is this issue only on Ascent OR does this happen on other platforms too?

jaelynlitz commented 11 months ago

> Is this issue only on Ascent OR does this happen on other platforms too?

This behavior is only happening on Incline (not Deception, Newell, or Ascent). @nkoukpaizan was also seeing similar failures on Frontier in #89, so it is likely AMD-related.