Closed jyoti2306 closed 4 years ago
Hi @jyoti2306, this is a problem in the IMB2018 checker, and we are working on a fix. As a workaround, you can use IMB2019 (run make from the root directory).
I am using the latest IMB in the master branch. I checked IMB v2019.0 and v2019.1 but there are syntax errors in the code; make shows the error.
Can you specify which IMB2019 you are referring to? And what do you mean by make from the root directory?
@jyoti2306 Basically, when you run make inside the src_c folder, you compile IMB2018. To build IMB2019 you must use the Makefile in the root directory - https://github.com/intel/mpi-benchmarks/blob/master/Makefile
@VinnitskiV I tried as you suggested. I compiled it with the -DCHECK option and ran the benchmark over shared memory, psm2 and dapl. It gives an error at 'Reduce_local' (part of the error is shown below).
=======================start of error snippet===========================================
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] defects
0 1000 0.04 0.21 0.06 0.00
0: Error Reduce_local,size = 4,sample #0
Process 0: Got invalid buffer: Buffer entry: 0.100000 pos: 0
Process 0: Expected buffer: Buffer entry: 13.599999
1: Error Reduce_local,size = 4,sample #0
Process 1: Got invalid buffer: Buffer entry: 0.200000 pos: 0
Process 1: Expected buffer: Buffer entry: 13.599999
=======================end of error snippet==========================================
Is this a known error? Can you please tell me a version of IMB which does not fail with the -DCHECK option?
Hi, it would be really helpful if you could let me know whether this is a fault in the application, as I have a task to complete!
I got the same problem with -DCHECK option. I am very sure I compiled the right version which is tagged IMB-v2019.2.
0: Error Reduce_local,size = 32768,sample #0 Process 0: Got invalid buffer: 1: Error Reduce_local,size = 32768,sample #0 Buffer entry: 0.100000 Process 1: Got invalid buffer: pos: 0 Process 0: Expected buffer: Buffer entry: 0.200000 pos: 0 Buffer entry: 0.300000 Process 1: Expected buffer: Buffer entry: 0.300000
Hi @jyoti2306 and @dong0321, sorry for the long delay; we are working on this problem. Note that this is a problem in the verification algorithm only.
@VinnitskiV has this issue with the verification algorithm been resolved? I am seeing verification issues with Reduce_scatter (when running with master, 2019 Update 6, MPI-1 part) tests and I am trying to figure out if it has to do with the validation or a bug elsewhere in the network stack.
Hi @rajachan, could you please provide a log? We have fixed the Reduce_scatter problems reported in this thread. Thank you.
I am using ‘Intel MPI Benchmarks 2019 update 2’ with the -DCHECK option enabled only for the C source files. The benchmark fails with a data check error (sample error given below) when run over shared memory, sockets, psm2 and dapl.
==================start of error======================================
-----------------------------------------------------------------------------
- Benchmarking Reduce_scatter
- #processes = 16
-----------------------------------------------------------------------------
15: Error Reduce_scatter,size = 4,sample #0
Process 15: Got invalid buffer: Buffer entry: 13.600000 pos: 0
Process 15: Expected buffer: Buffer entry: 253.600006
4 1000 1.57 4.66 2.41 0.00
Application error code 1 occurred
application called MPI_Abort(MPI_COMM_WORLD, 16) - process 15
===================end of error=====================================
Following are the steps I used to install IMB.
Following are the errors in detail.
1) When running it with ‘MPICH-3.3’ over shared memory, it fails at ‘Reduce_scatter’ for sample size 4. When running it over TCP, it fails at the same place. OS version ‘CentOS Linux release 7.6.1810 (Core)’.
Same is the case with ‘Intel MPI Library 2017 Update 3 for Linux’ over shared memory (default), ofi (I_MPI_FABRICS=ofi) and dapl (I_MPI_FABRICS=dapl). OS version ‘CentOS Linux release 7.3.1611 (Core)’.
2) In the file ‘IMB_settings.h’, I changed the ‘#define BUFFERS_FLOAT’ to ‘#define BUFFERS_INT’ to check for integer type values and compiled it again.
Keeping the environment and test cases same, it fails at ‘Allreduce’ for sample size 4.
Also, even when the benchmark fails, the ‘defects’ column shows 0.00, which suggests that the run was successful when it was not.
If I use it without the -DCHECK option enabled, the benchmark completes successfully.
Can someone comment on these observations?