intel / mpi-benchmarks


Non-aggregate Accumulate data validation error with Aggregate warm-up #29

Closed zhngaj closed 1 month ago

zhngaj commented 3 years ago

Hello,

I ran into a data validation issue with the IMB-EXT Accumulate benchmark in non-aggregate mode with 2 processes.

IMB: IMB-v2019.6
Open MPI: v4.1.x (c71e1fa1db v4.1.x: schizo/jsm: Disable binding when direct launched)

[ec2-user@ip-172-31-9-184 ompi]$ mpirun --prefix /fsx/ompi/install -n 2 --mca btl ofi --mca osc rdma --mca btl_ofi_provider_include efa --hostfile /fsx/hosts -x PATH -x LD_LIBRARY_PATH /fsx/mpi-benchmarks/IMB-EXT Accumulate -npmin 2 -iter 1 -aggregate_mode non_aggregate -warm_up 1
Warning: Permanently added 'ip-172-31-13-230,172.31.13.230' (ECDSA) to the list of known hosts.
#------------------------------------------------------------
#    Intel(R) MPI Benchmarks 2019 Update 6, MPI-2 part
#------------------------------------------------------------
# Date                  : Thu Jul 23 23:21:51 2020
# Machine               : x86_64
# System                : Linux
# Release               : 4.14.165-103.209.amzn1.x86_64
# Version               : #1 SMP Sun Feb 9 00:23:26 UTC 2020
# MPI Version           : 3.1
# MPI Thread Environment:

# Calling sequence was:

# /fsx/mpi-benchmarks/IMB-EXT Accumulate -npmin 2 -iter 1 -aggregate_mode non_aggregate -warm_up 1

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# Accumulate

#-----------------------------------------------------------------------------
# Benchmarking Accumulate
# #processes = 2
#-----------------------------------------------------------------------------
#
#    MODE: NON-AGGREGATE
#
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]      defects
            0            1         9.04         9.23         9.13         0.00
            4            1        17.77        18.18        17.97         0.00
            8            1        18.73        19.88        19.30         0.00
           16            1        20.24        21.11        20.68         0.00
           32            1        20.54        20.58        20.56         0.00
0: Error Accumulate,size = 64,sample #0
Process 0: Got invalid buffer:
Buffer entry: 0.600000
pos: 0
Process 0: Expected    buffer:
Buffer entry: 0.300000
           64            1        42.08        43.74        42.91         1.00

I found that the validation failure in IMB-EXT non-aggregate Accumulate is caused by its warm-up procedure (see line), which runs in aggregate mode (see line).

My theory is that rank 1 finishes the warm-up first and fetches element values (accumulated during warm-up) that rank 0 has not yet reset. As a result, we get 0.6 where 0.3 is expected.

After using non-aggregate mode for both the warm-up and the measured run, the benchmark runs fine for me. Could you please take a look and let me know whether this makes sense?

zhngaj commented 3 years ago

Any comments?

VinnitskiV commented 3 years ago

Hi @zhngaj
Yes, you are right. You can find the fix here: https://github.com/intel/mpi-benchmarks/pull/30

zhngaj commented 3 years ago

> Hi @zhngaj Yes, you are right. You can find fix - #30

Thanks, can this fix be merged into master?

JuliaRS commented 1 month ago

This is already fixed in master; please see https://github.com/intel/mpi-benchmarks/blob/master/src_c/IMB_ones_accu.c#L166