intel / mpi-benchmarks

145 stars 63 forks source link

Questions on mem_manager.c code #24

Closed Akshay-Venkatesh closed 3 years ago

Akshay-Venkatesh commented 4 years ago

Hi,

I'm seeing some failures with -DCHECK enabled and I was looking at buffer value assignment logic. I have a couple of questions:

  1. When is this (L280) path taken if there's already this (L273) check ahead?

    273    if (pos2 >= pos1) {                                              
    274        size_t a_pos1, a_pos2, i, j;
    275        a_pos1 = pos1 / asize;
    276
    277        if (pos2 >= pos1)                                           
    278            a_pos2 = pos2 / asize;
    279        else
    280            a_pos2 = a_pos1 - 1;                                    
    281
    282        if (value)
    283            for (i = a_pos1, j = 0; i <= a_pos2; i++, j++)
    284                ((assign_type *)buf)[j] = BUF_VALUE(rank, i);
    285        else
    286            for (i = a_pos1, j = 0; i <= a_pos2; i++, j++)
    287                ((assign_type *)buf)[j] = 0.;
    288
    289        if (a_pos1*asize != pos1) {                                 
    290            void* xx = (void*)(((char*)buf) + pos1 - a_pos1*asize);
    291            memmove(buf, xx, pos2 - pos1 + 1);
    292        }
    293    } /*if( pos2>= pos1 )*/

    If I'm not mistaken if line 273 is a true statement then line 277 is always true and so 280 is never taken. Correct me if I'm wrong.

  2. Definition of IMB_ass_buf seems to indicate that assignment occurs over byte positions pos1 till pos2 as indicated here:

    void IMB_ass_buf(void* buf, int rank, size_t pos1, 
                 size_t pos2, int value) {
    /*
                      Assigns values to a buffer
    Input variables:
    -rank                 (type int)
                      Rank of calling process
    -pos1                 (type int)
    -pos2                 (type int)
                      Assignment between byte positions pos1, pos2

    But lines 284 and 287 seem to touch the buffer starting from offset 0 (buf[j] for j = 0 ... apos2). Can you explain if this is correct?

  3. Can you explain the logic behind this (L289) path? For instance, if pos1 and pos2 are 15 and 32 respectively, then if size(assign_type) == 4 then a_pos1 = 3 and a_pos2 = 8. If this is the case then there there's a memmove of 18 bytes (pos2 - pos1 + 1) when there should 17 bytes changed.

VinnitskiV commented 4 years ago

Hi @Akshay-Venkatesh This is really good questions.

  1. You are right, it's extra check, we will remove it.
  2. In this function we using pos1 and pos2 just to calculate count of elements and:
  3. This path needs for remove first extra bytes.
yosefe commented 4 years ago

@VinnitskiV @Akshay-Venkatesh it seems to me that the allocation size is not correct: it allocates r_len/s_len bytes, but the communication routines send s_len/r_len elements. This results in access to invalid memory locations.

diff --git a/src_c/IMB_mem_manager.c b/src_c/IMB_mem_manager.c
index af35538..f4d1df7 100644
--- a/src_c/IMB_mem_manager.c
+++ b/src_c/IMB_mem_manager.c
@@ -164,7 +164,7 @@ In/out variables:
     if (c_info->s_alloc < s_len) {
         size_t size;
         IMB_v_free((void**)&c_info->s_buffer);
-        size = s_len * ((size_t)c_info->size_scale);
+        size = s_len * ((size_t)c_info->size_scale) * sizeof(assign_type);
         c_info->s_buffer = IMB_v_alloc(size, where);
         c_info->s_alloc = size / ((size_t)c_info->size_scale);
         c_info->s_data = (assign_type*)c_info->s_buffer;
@@ -173,7 +173,7 @@ In/out variables:
     if (c_info->r_alloc < r_len) {
         size_t size;
         IMB_v_free((void**)&c_info->r_buffer);
-        size = r_len * ((size_t)c_info->size_scale);
+        size = r_len * ((size_t)c_info->size_scale) * sizeof(assign_type);
         c_info->r_buffer = IMB_v_alloc(size, where);
         c_info->r_alloc = size / ((size_t)c_info->size_scale);
         c_info->r_data = (assign_type*)c_info->r_buffer;
dmitrygx commented 4 years ago

Also, in addition to @yosefe's fix, I'd suggest to not calculate c_info->s_alloc and c_info->r_alloc, just assign s_len and r_len directly:

diff --git a/src_c/IMB_mem_manager.c b/src_c/IMB_mem_manager.c
index af35538..937222d 100644
--- a/src_c/IMB_mem_manager.c
+++ b/src_c/IMB_mem_manager.c
@@ -164,18 +164,18 @@ In/out variables:
     if (c_info->s_alloc < s_len) {
         size_t size;
         IMB_v_free((void**)&c_info->s_buffer);
-        size = s_len * ((size_t)c_info->size_scale);
+        size = s_len * ((size_t)c_info->size_scale) * asize;
         c_info->s_buffer = IMB_v_alloc(size, where);
-        c_info->s_alloc = size / ((size_t)c_info->size_scale);
+        c_info->s_alloc = s_len;
         c_info->s_data = (assign_type*)c_info->s_buffer;
     }

     if (c_info->r_alloc < r_len) {
         size_t size;
         IMB_v_free((void**)&c_info->r_buffer);
-        size = r_len * ((size_t)c_info->size_scale);
+        size = r_len * ((size_t)c_info->size_scale) * asize;
         c_info->r_buffer = IMB_v_alloc(size, where);
-        c_info->r_alloc = size / ((size_t)c_info->size_scale);
+        c_info->r_alloc = r_len;
         c_info->r_data = (assign_type*)c_info->r_buffer;
     }
 }
nikitaxgusev commented 4 years ago

Hello @yosefe, let't specify. Please, do make in ./src_cpp direcoty: example: make TARGET=MPI1

Could you provide the reproducer, please? Because your version of fix might be applies not for current place. In logic, s_len should already consider the sizeof(assign_type). We need to understand, what brought you to find this issue.

Thanks.

yosefe commented 4 years ago

@nikitaxgusev i've noticed the test can sometime fail with memory registration - SB Perhaps the fix is not in right place, but i've traced the code by debug prints and it seems like s_len does NOT consider sizeof(assign_type).

/usr/mpi/gcc/openmpi-4.0.2rc3/bin/mpirun --np 8 --use-hwthread-cpus --map-by node --allow-run-as-root --hostfile <hostfile-3-hosts> --rank-by node --mca patcher ^overwrite -x SEED=1 --mca btl ^vader,tcp,openib,uct --mca pml ucx -x UCX_NET_DEVICES=mlx5_1:1 -x UCX_TLS=rc_x,self,sm <path>/mpi-benchmarks//IMB-NBC Ireduce_scatter -iter 1000 -npmin 10000 -time 1200

#-----------------------------------------------------------------------------
# Benchmarking Ireduce_scatter 
# #processes = 8 
#-----------------------------------------------------------------------------
       #bytes #repetitions t_ovrl[usec] t_pure[usec]  t_CPU[usec]   overlap[%]
            0         1000         1.24         0.14         1.00         0.00
            4         1000        18.10         9.30         9.14         3.58
            8         1000        19.58        10.41        10.01         8.08
           16         1000        20.05        10.87         9.85         6.13
           32         1000        24.13        13.05        12.55        11.24
           64         1000        26.39        14.54        14.16        15.86
          128         1000        35.18        19.05        18.30        11.41
          256         1000        49.47        28.59        27.55        23.33
          512         1000        66.18        38.43        37.10        24.34
         1024         1000        80.36        45.18        46.18        23.81
         2048         1000       129.47        73.15        70.96        20.01
         4096         1000       246.00       139.34       134.84        20.22
         8192         1000       324.82       222.77       214.25        50.37
        16384         1000       597.89       419.87       401.51        53.23
        32768         1000      1153.29       811.17       774.29        53.28
        65536          640      2194.00      1554.23      1494.43        54.99
       131072          320      3732.21      2712.41      2604.81        58.44
       262144          160      4856.38      2683.59      2611.91        16.36
       524288           80      9922.41      5487.63      5306.16        15.88
      1048576           40     19130.30     10764.41     10351.29        18.44
      2097152           20     52138.26     28795.94     10036.47         0.00
[1577264313.774982] [clx-zeus-26:36370:0]          ib_md.c:285  UCX  ERROR ibv_exp_reg_mr(address=0x227a9b0, length=33554432, access=0xf) failed: Cannot allocate memory
[1577264313.775042] [clx-zeus-26:36370:0]         ucp_mm.c:116  UCX  ERROR failed to register address 0x227a9b0 length 33554432 on md[4]=mlx5_1: Input/output error
[1577264313.775070] [clx-zeus-26:36370:0]    ucp_request.c:268  UCX  ERROR failed to register user buffer datatype 0x20 address 0x227a9b0 len 33554432: Input/output error
[1577264313.774719] [clx-zeus-28:34514:0]          ib_md.c:285  UCX  ERROR ibv_exp_reg_mr(address=0x25e36e0, length=33554432, access=0xf) failed: Cannot allocate memory
[1577264313.774751] [clx-zeus-28:34514:0]         ucp_mm.c:116  UCX  ERROR failed to register address 0x25e36e0 length 33554432 on md[4]=mlx5_1: Input/output error
[1577264313.774767] [clx-zeus-28:34514:0]    ucp_request.c:268  UCX  ERROR failed to register user buffer datatype 0x20 address 0x25e36e0 len 33554432: Input/output error
[1577264313.774731] [clx-zeus-27:36348:0]          ib_md.c:285  UCX  ERROR ibv_exp_reg_mr(address=0x21807e0, length=33554432, access=0xf) failed: Cannot allocate memory
[1577264313.774764] [clx-zeus-27:36348:0]         ucp_mm.c:116  UCX  ERROR failed to register address 0x21807e0 length 33554432 on md[4]=mlx5_1: Input/output error
[1577264313.774768] [clx-zeus-27:36348:0]    ucp_request.c:268  UCX  ERROR failed to register user buffer datatype 0x20 address 0x21807e0 len 33554432: Input/output error
yosefe commented 4 years ago

Mellanox internal issue ref: https://redmine.mellanox.com/issues/2036975

nikitaxgusev commented 4 years ago

@yosefe Thanks for example, we will try to reproduce the issue locally. Your link is internal, I can't reach it.

yosefe commented 4 years ago

@nikitaxgusev yes, the link is internal, i've added it just for reference. the failure details are copied to https://github.com/intel/mpi-benchmarks/issues/24#issuecomment-582318799. Thank you for checking this!

VinnitskiV commented 4 years ago

@yosefe Could you please check this problem with 2019u5 - https://github.com/intel/mpi-benchmarks/releases