intel / mpi-benchmarks


Error observed while running IMB with -DCHECK option #22

Closed jyoti2306 closed 4 years ago

jyoti2306 commented 5 years ago

I am using ‘Intel MPI Benchmarks 2019 Update 2’ with the -DCHECK option enabled, building only the C source files. The benchmark fails with a data check error (sample below) when run over shared memory, sockets, psm2 and dapl.

==================start of error======================================

-----------------------------------------------------------------------------

- Benchmarking Reduce_scatter

- #processes = 16

-----------------------------------------------------------------------------

   #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]      defects
        0         1000         0.15         0.27         0.19         0.00

15: Error Reduce_scatter,size = 4,sample #0
Process 15: Got invalid buffer:
Buffer entry: 13.600000
pos: 0
Process 15: Expected buffer:
Buffer entry: 253.600006
        4         1000         1.57         4.66         2.41         0.00
Application error code 1 occurred
application called MPI_Abort(MPI_COMM_WORLD, 16) - process 15

===================end of error=====================================
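A quick arithmetic check on the two values in the Reduce_scatter error above, offered as a sketch and not as a description of IMB's checker. It assumes a fill pattern that is a guess, not taken from the IMB sources: rank r contributes 0.1 * (r + 1) + pos at buffer position pos, and the operation is a sum over the 16 ranks. Under that assumption the "got" value equals the reduced sum at position 0 and the "expected" value equals the reduced sum at position 15, which would be consistent with a position mismatch in the comparison and not with corrupted data.

```python
import struct

NPROCS = 16  # from "#processes = 16" in the log


def to_float32(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def contribution(rank, pos):
    # Hypothetical fill pattern (an assumption, not read from the IMB code).
    return 0.1 * (rank + 1) + pos


def reduced_value(pos, nprocs=NPROCS):
    # Sum of every rank's contribution at one buffer position.
    return sum(contribution(r, pos) for r in range(nprocs))


# Value the log reports as "Got" (pos: 0).
print(f"{to_float32(reduced_value(0)):.6f}")   # 13.600000
# Value the log reports as "Expected" for process 15.
print(f"{to_float32(reduced_value(15)):.6f}")  # 253.600006
```

Both printed values match the log to six decimals, but the match only shows that the assumed pattern is compatible with the numbers; it does not confirm how the checker computes its reference buffer.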

Following are the steps I used to install IMB.

  1. Downloaded mpi-benchmarks-master.zip from GitHub and extracted it using the unzip command.
  2. cd imb/src_c
  3. export CFLAGS=-DCHECK
  4. make

Following are the errors in detail.

1) When running it with ‘MPICH-3.3’ over shared memory, it fails at ‘Reduce_scatter’ for sample size 4. When running it over TCP, it fails at the same place. OS version ‘CentOS Linux release 7.6.1810 (Core)’.

The same happens with ‘Intel MPI Library 2017 Update 3 for Linux’ over shared memory (default), ofi (I_MPI_FABRICS=ofi) and dapl (I_MPI_FABRICS=dapl). OS version ‘CentOS Linux release 7.3.1611 (Core)’.

2) In the file ‘IMB_settings.h’, I changed the ‘#define BUFFERS_FLOAT’ to ‘#define BUFFERS_INT’ to check for integer type values and compiled it again.

Keeping the environment and test cases the same, it fails at ‘Allreduce’ for sample size 4.

Also, even when the benchmark fails, the ‘defects’ column shows 0.00, which suggests the data check passed when it did not.

If I use it without the -DCHECK option enabled, the benchmark completes successfully.

Can someone comment on these observations?

VinnitskiV commented 5 years ago

Hi @jyoti2306, this is a problem in the IMB2018 checker; we are working on a fix. As a workaround you can use IMB2019 (run make from the root directory).

jyoti2306 commented 5 years ago

I am using the latest IMB in the master branch. I checked IMB v2019.0 and v2019.1, but there are syntax errors in the code; make reports them.

Can you specify which IMB2019 you are referring to? And what do you mean by make from the root directory?

VinnitskiV commented 5 years ago

@jyoti2306 Basically, when you run make inside the src_c folder you compile IMB2018, so to use IMB2019 you must use the Makefile in the root directory: https://github.com/intel/mpi-benchmarks/blob/master/Makefile

jyoti2306 commented 5 years ago

@VinnitskiV I tried as you suggested. I compiled it with the -DCHECK option and ran the benchmark over shared memory, psm2 and dapl. It now fails at 'Reduce_local' (part of the error shown below).

=======================start of error snippet===========================================

-----------------------------------------------------------------------------

- Benchmarking Reduce_local

- #processes = 16

-----------------------------------------------------------------------------

   #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]      defects
        0         1000         0.04         0.21         0.06         0.00

0: Error Reduce_local,size = 4,sample #0
Process 0: Got invalid buffer:
Buffer entry: 0.100000
pos: 0
Process 0: Expected buffer:
Buffer entry: 13.599999
1: Error Reduce_local,size = 4,sample #0
Process 1: Got invalid buffer:
Buffer entry: 0.200000
pos: 0
Process 1: Expected buffer:
Buffer entry: 13.599999

=======================end of error snippet==========================================
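The numbers in the Reduce_local snippet above can be checked the same way. MPI_Reduce_local combines two buffers on the calling process without any communication, so a reference value built from all ranks would not be expected to match. The sketch below assumes (again a guess, not taken from the IMB sources) that rank r fills position 0 with 0.1 * (r + 1); under that assumption each "got" value is simply the rank's own fill value, while the "expected" value is the sum over all 16 ranks.

```python
import math


def own_value(rank):
    # Hypothetical fill value at position 0 (an assumption, not from IMB).
    return 0.1 * (rank + 1)


def sum_over_ranks(nprocs):
    # What a reduction across all ranks would produce at position 0.
    return sum(own_value(r) for r in range(nprocs))


# "Got" values for processes 0 and 1, copied from the log.
got = {0: 0.100000, 1: 0.200000}
# "Expected" value, copied from the log.
expected_in_log = 13.599999

# Each rank's "got" value equals its own fill value.
for rank, value in got.items():
    assert math.isclose(value, own_value(rank), abs_tol=1e-6)

# The "expected" value equals the sum over 16 ranks, up to float rounding.
assert math.isclose(expected_in_log, sum_over_ranks(16), abs_tol=1e-5)
print("got matches own value; expected matches the 16-rank sum")
```

If the assumption holds, the mismatch comes from the reference value and not from the reduction result.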

Is this a known error? Can you please tell me a version of IMB which does not fail with the -DCHECK option?

jyoti2306 commented 5 years ago

Hi, it would be really helpful if you could let me know whether this is a fault in the application, as I have a task to complete!

dong0321 commented 5 years ago

I got the same problem with the -DCHECK option. I am very sure I compiled the right version, the one tagged IMB-v2019.2.

0: Error Reduce_local,size = 32768,sample #0
Process 0: Got invalid buffer:
1: Error Reduce_local,size = 32768,sample #0
Buffer entry: 0.100000
Process 1: Got invalid buffer:
pos: 0
Process 0: Expected buffer:
Buffer entry: 0.200000
pos: 0
Buffer entry: 0.300000
Process 1: Expected buffer:
Buffer entry: 0.300000

VinnitskiV commented 5 years ago

Hi @jyoti2306 and @dong0321, sorry for the long delay; we are working on this problem. Note that this is a problem in the verification algorithm only.

rajachan commented 4 years ago

@VinnitskiV has this issue with the verification algorithm been resolved? I am seeing verification issues with the Reduce_scatter tests (running master, 2019 Update 6, MPI-1 part) and I am trying to figure out whether they come from the validation or from a bug elsewhere in the network stack.

VinnitskiV commented 4 years ago

Hi @rajachan, could you please provide a log? We have fixed the Reduce_scatter problems from this thread. Thank you.