icl-utk-edu / heffte


Some tests fail on s390x #59

Open · junghans opened this issue 2 weeks ago

junghans commented 2 weeks ago

From https://koji.fedoraproject.org/koji/taskinfo?taskID=125093717:

25/25 Test #25: heffte_longlong_np4 ..............***Failed  375.02 sec
--------------------------------------------------------------------------------
                               heffte::fft class
--------------------------------------------------------------------------------
     float                  -np 4  test int/long long<stock>              pass
24% tests passed, 19 tests failed out of 25
Total Test time (real) = 1354.20 sec
The following tests FAILED:
      4 - heffte_reshape3d_np4 (Failed)
      5 - heffte_reshape3d_np7 (Failed)
      6 - heffte_reshape3d_np12 (Failed)
      8 - heffte_fft3d_np2 (Failed)
      9 - heffte_fft3d_np4 (Failed)
     10 - heffte_fft3d_np6 (Failed)
     11 - heffte_fft3d_np8 (Failed)
     12 - heffte_fft3d_np12 (Failed)
     13 - heffte_streams_np6 (Failed)
     14 - test_subcomm_np8 (Failed)
     15 - test_subcomm_np12 (Failed)
     17 - heffte_fft3d_r2c_np2 (Failed)
     18 - heffte_fft2d_r2c_np4 (Failed)
     19 - heffte_fft3d_r2c_np6 (Failed)
     20 - heffte_fft3d_r2c_np8 (Failed)
     21 - heffte_fft3d_r2c_np12 (Failed)
     23 - test_cos_np2 (Failed)
     24 - test_cos_np4 (Failed)
     25 - heffte_longlong_np4 (Failed)
Errors while running CTest
error: Bad exit status from /var/tmp/rpm-tmp.kyrg4o (%check)
RPM build errors:
    Bad exit status from /var/tmp/rpm-tmp.kyrg4o (%check)
Child return code was: 1

Full build log: build_s390x.log.txt.zip

The log says v2.4.0, but the build is actually commit c7c8f69ce78395040a2690bcb6984299449176ce.

aarch64, ppc64le and x86_64 work.

mkstoyanov commented 2 weeks ago

I've never tested on s390, but looking at the log I suspect this is an MPI issue. The tests pass when they don't use MPI or run on only a single rank; as soon as a test uses two or more ranks, it fails.

Is MPI configured correctly in the test environment?

junghans commented 2 weeks ago

The test environment is just the Fedora package, @keszybz would know the details.

keszybz commented 2 weeks ago

The "mpi environment" is just what the test sets up. The build is done in a dedicated VM, the hw_info.log file linked from the build describes the machine.

The test does this:

%check
# allow openmpi to oversubscribe, i.e. runs test with more
# cores than the builder has
export PRTE_MCA_rmaps_default_mapping_policy=:oversubscribe

for mpi in mpich openmpi; do
  test -n "${mpi}" && module load mpi/${mpi}-%{_arch}
  %ctest
  test -n "${mpi}" && module unload mpi/${mpi}-%{_arch}
done

i.e.

export PRTE_MCA_rmaps_default_mapping_policy=:oversubscribe

for mpi in mpich openmpi; do
  test -n "${mpi}" && module load mpi/${mpi}-x86_64  
  /usr/bin/ctest --test-dir "redhat-linux-build" \
           --output-on-failure \
           --force-new-ctest-process \
            -j${RPM_BUILD_NCPUS} 
  test -n "${mpi}" && module unload mpi/${mpi}-x86_64
done

I can try to answer some general questions, but I know nothing about this package and about as much about s390x ;)

```
CPU info:
Architecture:          s390x
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Big Endian
CPU(s):                3
On-line CPU(s) list:   0-2
Vendor ID:             IBM/S390
Model name:            -
Machine type:          3931
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s) per book:    1
Book(s) per drawer:    1
Drawer(s):             3
CPU dynamic MHz:       5200
CPU static MHz:        5200
BogoMIPS:              3331.00
Dispatching mode:      horizontal
Flags:                 esan3 zarch stfle msa ldisp eimm dfp edat etf3eh highgprs te vx vxd vxe gs vxe2 vxp sort dflt vxp2 nnpa sie
Hypervisor:            KVM/Linux
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             384 KiB (3 instances)
L1i cache:             384 KiB (3 instances)
L2 cache:              96 MiB (3 instances)
L3 cache:              256 MiB
NUMA node(s):          1
NUMA node0 CPU(s):     0-2
```
mkstoyanov commented 2 weeks ago

All MPI tests are failing and the non-MPI tests are passing. The log does not contain details, e.g., the output of ctest -V.

Also, for some reason, the tests are running in parallel, which further messes up the ctest output. CMake tells ctest to run everything in series; otherwise we can get really nasty oversubscription of resources. There are multiple tests that take 12 MPI ranks, and each rank may or may not be using multiple threads.
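
For reference, one standard way for a CMake project to enforce this (shown here only as an illustration; the target name is hypothetical and this is not copied from heffte's CMakeLists) is the CTest RUN_SERIAL property:

```
# Hypothetical example: mark an MPI test so ctest never runs it
# concurrently with other tests, even when ctest is invoked with -j.
add_test(NAME heffte_fft3d_np12
         COMMAND mpiexec -np 12 $<TARGET_FILE:test_fft3d>)
set_tests_properties(heffte_fft3d_np12 PROPERTIES RUN_SERIAL ON)
```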

How can we get the details from at least one failing test, e.g., test_reshape3d running with 4 ranks?
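
For instance, rerunning only that test verbosely should capture the full error. Roughly like this, where the directory and module names are taken from the invocation quoted above and assumed to carry over to s390x:

```
module load mpi/mpich-s390x
ctest --test-dir redhat-linux-build -R heffte_reshape3d_np4 -V
```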

junghans commented 2 weeks ago

--output-on-failure should give the same output as -V on failure, but let me add a -j1 to the ctest call.
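
i.e., the ctest invocation above would become roughly:

```
/usr/bin/ctest --test-dir "redhat-linux-build" \
         --output-on-failure \
         --force-new-ctest-process \
         -j1
```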

junghans commented 2 weeks ago

#67 should help with the oversubscription issue.

mkstoyanov commented 2 weeks ago

@junghans Let me know if the PR helped or if you need to use -D Heffte_SEQUENTIAL_TESTING.

I would like to close this issue before tagging the release.
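
For reference, that would be a configure-time option, roughly like this (the =ON value and the placeholder for the remaining options are assumptions, not taken from the Fedora spec):

```
cmake -D Heffte_SEQUENTIAL_TESTING=ON <other configure options> ..
```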

junghans commented 2 weeks ago

I did another build, https://koji.fedoraproject.org/koji/taskinfo?taskID=125159590:

Test project /builddir/build/BUILD/heffte-2.4.0-build/heffte-master/s390x-redhat-linux-gnu-mpich
      Start  1: heffte_fortran_fftw
      Start  2: unit_tests_nompi
 1/25 Test  #1: heffte_fortran_fftw ..............   Passed    0.08 sec
      Start  3: unit_tests_stock
      Start  7: heffte_fft3d_np1
 2/25 Test  #3: unit_tests_stock .................   Passed    0.00 sec
      Start 16: heffte_fft3d_r2c_np1
 3/25 Test #16: heffte_fft3d_r2c_np1 .............   Passed    0.08 sec
      Start 22: test_cos_np1
 4/25 Test  #7: heffte_fft3d_np1 .................   Passed    0.13 sec
 5/25 Test #22: test_cos_np1 .....................   Passed    0.08 sec
      Start  8: heffte_fft3d_np2
 6/25 Test  #8: heffte_fft3d_np2 .................***Failed    0.08 sec
--------------------------------------------------------------------------------
                               heffte::fft class
--------------------------------------------------------------------------------
                            constructor heffte::fft3d<stock>              pass
Abort(676410127) on node 1 (rank 1 in comm 0): Fatal error in internal_Barrier: Other MPI error, error stack:
internal_Barrier(7454).....................: MPI_Barrier(MPI_COMM_WORLD) failed
MPID_Barrier(167)..........................: 
MPIDI_Barrier_allcomm_composition_json(132): 
MPIDI_POSIX_mpi_bcast(238).................: 
MPIDI_POSIX_mpi_release_gather_release(218): message sizes do not match across processes in the collective routine: Received 0 but expected 1
      Start 17: heffte_fft3d_r2c_np2
 7/25 Test  #2: unit_tests_nompi .................   Passed    0.38 sec
 8/25 Test #17: heffte_fft3d_r2c_np2 .............***Failed    0.14 sec
--------------------------------------------------------------------------------
                             heffte::fft_r2c class
--------------------------------------------------------------------------------
                        constructor heffte::fft3d_r2c<stock>              pass
Abort(676410127) on node 1 (rank 1 in comm 0): Fatal error in internal_Barrier: Other MPI error, error stack:
internal_Barrier(7454).....................: MPI_Barrier(MPI_COMM_WORLD) failed
MPID_Barrier(167)..........................: 
MPIDI_Barrier_allcomm_composition_json(132): 
MPIDI_POSIX_mpi_bcast(238).................: 
MPIDI_POSIX_mpi_release_gather_release(218): message sizes do not match across processes in the collective routine: Received 0 but expected 1
      Start  4: heffte_reshape3d_np4
 9/25 Test  #4: heffte_reshape3d_np4 .............***Failed    0.75 sec
--------------------------------------------------------------------------------
                             heffte reshape methods
--------------------------------------------------------------------------------
                                   heffte::mpi::gather_boxes              pass
      Start  5: heffte_reshape3d_np7
10/25 Test  #5: heffte_reshape3d_np7 .............***Failed    1.70 sec
--------------------------------------------------------------------------------
                             heffte reshape methods
--------------------------------------------------------------------------------
                                   heffte::mpi::gather_boxes              pass
     float         -np 7  heffte::reshape3d_alltoall all-2-1              pass
      Start  6: heffte_reshape3d_np12
11/25 Test  #6: heffte_reshape3d_np12 ............***Failed    3.13 sec
--------------------------------------------------------------------------------
                             heffte reshape methods
--------------------------------------------------------------------------------
                                   heffte::mpi::gather_boxes              pass
      Start  9: heffte_fft3d_np4
12/25 Test  #9: heffte_fft3d_np4 .................***Failed    8.16 sec
--------------------------------------------------------------------------------
                               heffte::fft class
--------------------------------------------------------------------------------
  ccomplex                  -np 4  test heffte::fft2d<stock>              pass
      Start 10: heffte_fft3d_np6
13/25 Test #10: heffte_fft3d_np6 .................***Failed   25.74 sec
--------------------------------------------------------------------------------
                               heffte::fft class
--------------------------------------------------------------------------------
     float                  -np 6  test heffte::fft3d<stock>              pass
      Start 11: heffte_fft3d_np8
14/25 Test #11: heffte_fft3d_np8 .................***Failed   45.90 sec
--------------------------------------------------------------------------------
                               heffte::fft class
--------------------------------------------------------------------------------
     float                  -np 8  test heffte::fft3d<stock>              pass
      Start 12: heffte_fft3d_np12
15/25 Test #12: heffte_fft3d_np12 ................***Failed   58.30 sec
--------------------------------------------------------------------------------
                               heffte::fft class
--------------------------------------------------------------------------------
     float                 -np 12  test heffte::fft3d<stock>              pass
      Start 13: heffte_streams_np6
16/25 Test #13: heffte_streams_np6 ...............***Failed   17.86 sec
--------------------------------------------------------------------------------
                              heffte::fft streams
--------------------------------------------------------------------------------
  ccomplex         -np 6  test heffte::fft3d (stream)<stock>              pass
Abort(676410127) on node 3 (rank 3 in comm 0): Fatal error in internal_Barrier: Other MPI error, error stack:
internal_Barrier(7454).....................: MPI_Barrier(MPI_COMM_WORLD) failed
MPID_Barrier(167)..........................: 
MPIDI_Barrier_allcomm_composition_json(132): 
MPIDI_POSIX_mpi_bcast(238).................: 
MPIDI_POSIX_mpi_release_gather_release(218): message sizes do not match across processes in the collective routine: Received 0 but expected 1
      Start 14: test_subcomm_np8
17/25 Test #14: test_subcomm_np8 .................***Failed   76.87 sec
--------------------------------------------------------------------------------
                          heffte::fft subcommunicators
--------------------------------------------------------------------------------
    double                -np 8  test subcommunicator<stock>              pass
      Start 15: test_subcomm_np12
18/25 Test #15: test_subcomm_np12 ................***Failed  116.44 sec
--------------------------------------------------------------------------------
                          heffte::fft subcommunicators
--------------------------------------------------------------------------------
    double               -np 12  test subcommunicator<stock>              pass
      Start 18: heffte_fft2d_r2c_np4
19/25 Test #18: heffte_fft2d_r2c_np4 .............***Failed   31.25 sec
--------------------------------------------------------------------------------
                             heffte::fft_r2c class
--------------------------------------------------------------------------------
     float              -np 4  test heffte::fft2d_r2c<stock>              pass
      Start 19: heffte_fft3d_r2c_np6
20/25 Test #19: heffte_fft3d_r2c_np6 .............***Failed   63.12 sec
--------------------------------------------------------------------------------
                             heffte::fft_r2c class
--------------------------------------------------------------------------------
     float              -np 6  test heffte::fft3d_r2c<stock>              pass
      Start 20: heffte_fft3d_r2c_np8
21/25 Test #20: heffte_fft3d_r2c_np8 .............***Failed  104.04 sec
--------------------------------------------------------------------------------
                             heffte::fft_r2c class
--------------------------------------------------------------------------------
     float              -np 8  test heffte::fft3d_r2c<stock>              pass
      Start 21: heffte_fft3d_r2c_np12
22/25 Test #21: heffte_fft3d_r2c_np12 ............***Failed  155.48 sec
--------------------------------------------------------------------------------
                             heffte::fft_r2c class
--------------------------------------------------------------------------------
     float             -np 12  test heffte::fft3d_r2c<stock>              pass
      Start 23: test_cos_np2
23/25 Test #23: test_cos_np2 .....................***Failed    1.13 sec
      Start 24: test_cos_np4
24/25 Test #24: test_cos_np4 .....................***Failed    6.16 sec
--------------------------------------------------------------------------------
                               cosine transforms
--------------------------------------------------------------------------------
     float             -np 4  test cosine<stock-cos-type-II>              pass
      Start 25: heffte_longlong_np4
25/25 Test #25: heffte_longlong_np4 ..............***Failed  105.14 sec
--------------------------------------------------------------------------------
                               heffte::fft class
--------------------------------------------------------------------------------
     float                  -np 4  test int/long long<stock>              pass
24% tests passed, 19 tests failed out of 25
Total Test time (real) = 821.67 sec
The following tests FAILED:
      4 - heffte_reshape3d_np4 (Failed)
      5 - heffte_reshape3d_np7 (Failed)
      6 - heffte_reshape3d_np12 (Failed)
      8 - heffte_fft3d_np2 (Failed)
      9 - heffte_fft3d_np4 (Failed)
     10 - heffte_fft3d_np6 (Failed)
     11 - heffte_fft3d_np8 (Failed)
     12 - heffte_fft3d_np12 (Failed)
     13 - heffte_streams_np6 (Failed)
     14 - test_subcomm_np8 (Failed)
     15 - test_subcomm_np12 (Failed)
     17 - heffte_fft3d_r2c_np2 (Failed)
     18 - heffte_fft2d_r2c_np4 (Failed)
     19 - heffte_fft3d_r2c_np6 (Failed)
     20 - heffte_fft3d_r2c_np8 (Failed)
     21 - heffte_fft3d_r2c_np12 (Failed)
     23 - test_cos_np2 (Failed)
     24 - test_cos_np4 (Failed)
     25 - heffte_longlong_np4 (Failed)
Errors while running CTest
mkstoyanov commented 2 weeks ago

Something is wrong with MPI; MPI_Barrier(MPI_COMM_WORLD) should always work, and it is the first MPI call the tests make after MPI_Init().

Hard to figure this out without hands on the hardware.
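
For what it's worth, the failing pattern amounts to something as small as this standalone sketch (not part of the heffte sources, just a barrier right after initialization):

```
/* barrier_check.c -- minimal standalone sketch.
 * Build/run (assuming the mpich module is loaded):
 *   mpicc barrier_check.c -o barrier_check && mpiexec -np 4 ./barrier_check
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
    int me;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    /* the call that aborts in the logs above: a plain barrier on MPI_COMM_WORLD */
    MPI_Barrier(MPI_COMM_WORLD);
    printf("rank %d passed the barrier\n", me);
    MPI_Finalize();
    return 0;
}
```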

junghans commented 2 weeks ago

The sequential one, https://koji.fedoraproject.org/koji/taskinfo?taskID=125161367, fails as well:

4/25 Test  #4: heffte_reshape3d_np4 .............***Failed    0.51 sec
--------------------------------------------------------------------------------
                             heffte reshape methods
--------------------------------------------------------------------------------
                                   heffte::mpi::gather_boxes              pass
Abort(676410127) on node 1 (rank 1 in comm 0): Fatal error in internal_Barrier: Other MPI error, error stack:
internal_Barrier(7454).....................: MPI_Barrier(MPI_COMM_WORLD) failed
MPID_Barrier(167)..........................: 
MPIDI_Barrier_allcomm_composition_json(132): 
MPIDI_POSIX_mpi_bcast(238).................: 
MPIDI_POSIX_mpi_release_gather_release(218): message sizes do not match across processes in the collective routine: Received 0 but expected 1
Abort(676410127) on node 3 (rank 3 in comm 0): Fatal error in internal_Barrier: Other MPI error, error stack:
internal_Barrier(7454).....................: MPI_Barrier(MPI_COMM_WORLD) failed
MPID_Barrier(167)..........................: 
MPIDI_Barrier_allcomm_composition_json(132): 
MPIDI_POSIX_mpi_bcast(238).................: 
MPIDI_POSIX_mpi_release_gather_release(218): message sizes do not match across processes in the collective routine: Received 0 but expected 1

But I think @mkstoyanov is right; this looks very much like a more fundamental issue in Fedora's mpich package. @keszybz, who is Fedora's mpich package maintainer?

mkstoyanov commented 2 weeks ago

In my experience, the Red Hat family is rather paranoid. There are a bunch of flags about "hardened" and "secure" builds that I have not used and whose meaning I don't know. I wouldn't put it past them to have something in the environment that blocks processes from communicating over MPI. The error message says that MPI_Barrier failed to send/recv even a single byte.

I don't think I can help here.

junghans commented 2 weeks ago

Let me add an MPI hello world to the build and see if that fails, too!

junghans commented 2 weeks ago

Hmm, https://koji.fedoraproject.org/koji/taskinfo?taskID=125191181, hello world worked:

+ mpicc /builddir/build/SOURCES/mpi_hello_world.c -o mpi_hello_world
+ mpiexec -np 12 ./mpi_hello_world
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 0 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 2 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 1 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 3 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 8 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 4 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 6 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 10 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 9 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 7 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 11 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 5 out of 12 processors
keszybz commented 2 weeks ago

> @keszybz, who is Fedora's mpich package maintainer?

That'd be me. But I only picked up mpich because nobody else wanted it. I'm not qualified to fix real issues.

junghans commented 2 weeks ago

@opoplawski any ideas? Otherwise, I will ping the developer mailing list.

mkstoyanov commented 1 week ago

> Hmm, https://koji.fedoraproject.org/koji/taskinfo?taskID=125191181, hello world worked:

I can't find the source code within the build logs. In the hello-world example, do you have any code other than MPI init and the print statement? You should add at least an MPI_Barrier, i.e.,

  int me, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  for (int i=0; i<nranks; i++) {
    if (me == i)
      std::cout << "hello from rank: " << me << std::endl;
    MPI_Barrier(MPI_COMM_WORLD);
  }

That will call the method that failed in the heffte logs, and the ranks should print in order, i.e., 0, 1, 2, 3, ...

junghans commented 1 week ago

Sorry, it was https://github.com/mpitutorial/mpitutorial/blob/gh-pages/tutorials/mpi-hello-world/code/mpi_hello_world.c

junghans commented 1 week ago

Ok, I made it print the source and added the suggested loop as well in https://koji.fedoraproject.org/koji/taskinfo?taskID=125198447

junghans commented 1 week ago

Hmm

Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 0 out of 12 processors
hello from rank: 0
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 2 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 4 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 1 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 5 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 8 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 10 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 6 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 11 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 3 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 9 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 7 out of 12 processors
hello from rank: 1
hello from rank: 2
error: Bad exit status from /var/tmp/rpm-tmp.ttBmu8 (%build)
    Bad exit status from /var/tmp/rpm-tmp.ttBmu8 (%build)
RPM build errors:
Child return code was: 1
mkstoyanov commented 1 week ago

The calls that read the process info, rank, comm size, etc. do not require actual communication, only on-node work. The log shows a crash on the second call to MPI_Barrier(), so the hello world is failing here while it works on the other systems.

You can play around with send/recv to see how those behave and whether they work properly, but there's something wrong with MPI in this environment.
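
For example, a minimal point-to-point check along these lines (a hypothetical standalone file, not part of heffte) would show whether a basic send/recv between two ranks works at all:

```
/* sendrecv_check.c -- hypothetical sketch: rank 0 sends one int to rank 1 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
    int me, nranks, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    if (nranks < 2) {
        if (me == 0) printf("run with at least 2 ranks\n");
    } else if (me == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (me == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d (expected 42)\n", value);
    }
    MPI_Finalize();
    return 0;
}
```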

junghans commented 1 week ago

Yeah, that will need some deeper investigation.

I would just go ahead with v2.4.1 and not wait for this issue.