intel / mpi-benchmarks


EXT and RMA accumulate aggregate mode issues #35

Closed aingerson closed 3 years ago

aingerson commented 3 years ago

I'm working on getting all IMB tests running with MPI+OFI and am getting various errors (not limited to the two addressed in this issue). While trying to figure out why verbs;ofi_rxm and verbs;ofi_rxd are not working with the EXT accumulate test (an assert error and a hang), I managed to get rid of the issue by changing something in the test. A similar change seems to fix a similar issue with the RMA accumulate test (a consistent hang). I am not very familiar with the setup of the test suite, so I'm hoping someone can explain why this fixes the problem and what the proper fix for these issues is. I'm seeing the same behavior with both Intel MPI and MPICH.

Here's the change that I made to fix the EXT accumulate failure:

index 39b86b5..2bfb2c0 100644
--- a/src_c/IMB_ones_accu.c
+++ b/src_c/IMB_ones_accu.c
@@ -188,7 +188,8 @@ Output variables:
 #ifdef CHECK
             for (i = 0; i < ITERATIONS->r_cache_iter; i++)
 #else
-            for (i = 0; i < ITERATIONS->n_sample; i++)
+//            for (i = 0; i < ITERATIONS->n_sample; i++)
+            for (i = 0; i < ITERATIONS->r_cache_iter; i++)
 #endif
             {
                 MPI_ERRHAND(MPI_Accumulate((char*)c_info->s_buffer + i%ITERATIONS->s_cache_iter*ITERATIONS->s_offs,

and here's the one to fix the RMA accumulate failure:

index c3052a9..0c93fb5 100644
--- a/src_c/IMB_rma_atomic.c
+++ b/src_c/IMB_rma_atomic.c
@@ -103,7 +103,8 @@ void IMB_rma_accumulate(struct comm_info* c_info, int size,
         MPI_Win_lock(MPI_LOCK_SHARED, root, 0, c_info->WIN);
         if (run_mode->AGGREGATE) {
             res_time = MPI_Wtime();
-            for (i = 0; i < iterations->n_sample; i++) {
+     //       for (i = 0; i < iterations->n_sample; i++) {
+            for (i = 0; i < iterations->r_cache_iter; i++) {
                 MPI_ERRHAND(MPI_Accumulate((char*)c_info->s_buffer + i%iterations->s_cache_iter*iterations->s_offs,
                                            s_num, c_info->red_data_type, root,
                                            i%iterations->r_cache_iter*r_off, r_num,
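
For readers less familiar with the loop, here is a minimal sketch of what the two loop bounds mean for the target index expression `i % r_cache_iter` that appears in both diffs. It is not IMB code and does not explain the hang. The helper names `count_repeated_targets` and `target_displacement` are hypothetical, and any concrete values passed to them are made up for illustration.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, not part of IMB.
 * Counts how many loop iterations target a window slot that an earlier
 * iteration already targeted, using the index expression from the diffs:
 * slot = i % r_cache_iter.
 * Returns -1 if the scratch allocation fails. */
int count_repeated_targets(int n_iter, int r_cache_iter)
{
    if (n_iter <= 0 || r_cache_iter <= 0)
        return 0;

    unsigned char *seen = calloc((size_t)r_cache_iter, 1);
    if (seen == NULL)
        return -1;

    int repeats = 0;
    for (int i = 0; i < n_iter; i++) {
        int slot = i % r_cache_iter;   /* same wrap-around as in the test */
        if (seen[slot])
            repeats++;                 /* slot was already hit in this loop */
        else
            seen[slot] = 1;
    }

    free(seen);
    return repeats;
}

/* Hypothetical helper: byte displacement for iteration i, mirroring
 * i % r_cache_iter * r_off (in C, % and * associate left to right). */
long target_displacement(int i, int r_cache_iter, long r_off)
{
    return (long)(i % r_cache_iter) * r_off;
}
```

With the original bound (`n_sample`), the loop can wrap around and issue several accumulates to the same target slot inside one lock epoch whenever `n_sample` exceeds `r_cache_iter`; with the modified bound (`r_cache_iter`), every slot is targeted at most once. The change therefore also alters how many operations are issued and timed, so it is more of a diagnostic than a drop-in fix.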

aingerson commented 3 years ago

@ooststep

VinnitskiV commented 3 years ago

Hi @aingerson, thank you for your interest! Could you please provide a little more information:

  1. Which version of IMB are you using? Please check it in the IMB output.
  2. Please provide your command line.

aingerson commented 3 years ago

@VinnitskiV Thanks for your quick response! I just pulled the latest from this repo; the latest commit is this one:

Author: vladimir.vinnitski <vvinnits@nnlmpihsw02.inn.intel.com>
Date:   Tue Mar 31 17:19:17 2020 +0300
    Intel(R) MPI Benchmarks 2019 Update 6 release

And this is the version shown in the output: Intel(R) MPI Benchmarks 2018, MPI-2 part

aingerson commented 3 years ago

@tatyana-en

VinnitskiV commented 3 years ago

@aingerson Thank you! First of all, you need to recompile the IMB benchmarks from the root directory (to get IMB 2019u6) and try again with and without your fix. It would also be great if you could share your OFI/MPI/IMB options. Also, could you please check this problem (without your fix) with the additional IMB option -iter 1, and run one more test with -iter 10?