intel / mpi-benchmarks


testing over hfi1 fails with "mca_sharedfp_lockedfile_file_open: Error during file open" #13

Closed jarodwilson closed 5 years ago

jarodwilson commented 5 years ago

I'm attempting to run some basic tests over a pair of hfi1-equipped hosts using openmpi, and quite a few of them are failing with similar output:

[root@rdma-dev-15 ~]$ mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl_openib_if_include hfi1_0 -mca pml cm -mca mtl psm2 -x PSM2_PKEY=0x8020 mpitests-IMB-IO C_Read_Shared -time 1.5
#------------------------------------------------------------
#    Intel(R) MPI Benchmarks 2019 Update 1, MPI-IO part
#------------------------------------------------------------
# Date                  : Thu Dec  6 15:52:19 2018
# Machine               : x86_64
# System                : Linux
# Release               : 4.18.0-47.el8.x86_64
# Version               : #1 SMP Thu Nov 29 19:43:32 UTC 2018
# MPI Version           : 3.1
# MPI Thread Environment:

# Calling sequence was:

# mpitests-IMB-IO C_Read_Shared -time 1.5

# Minimum io portion in bytes:   0
# Maximum io portion in bytes:   4194304
#
#
#

# List of Benchmarks to run:

# C_Read_Shared

#-----------------------------------------------------------------------------
# Benchmarking C_Read_Shared
# #processes = 1
# ( 1 additional process waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#
#    MODE: AGGREGATE
#
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000         0.13         0.13         0.13         0.00
            1         1000         0.94         0.94         0.94         1.06
            2         1000         0.94         0.94         0.94         2.14
            4         1000         0.94         0.94         0.94         4.25
            8         1000         0.95         0.95         0.95         8.46
           16         1000         0.94         0.94         0.94        17.00
           32         1000         0.94         0.94         0.94        33.98
           64         1000         0.95         0.95         0.95        67.44
          128         1000         0.96         0.96         0.96       133.60
          256         1000         0.96         0.96         0.96       267.88
          512         1000         0.97         0.97         0.97       525.62
         1024         1000         0.99         0.99         0.99      1033.96
         2048         1000         1.09         1.09         1.09      1886.29
         4096         1000         1.22         1.22         1.22      3351.03
         8192         1000         1.66         1.66         1.66      4944.17
        16384         1000         2.70         2.70         2.70      6061.39
        32768         1000         4.55         4.55         4.55      7195.70
        65536          640         8.36         8.36         8.36      7840.92
       131072          320        16.37        16.37        16.37      8008.80
       262144          160        33.23        33.23        33.23      7889.39
       524288           80        65.20        65.20        65.20      8041.71
      1048576           40       128.37       128.37       128.37      8168.12
      2097152           20       256.37       256.37       256.37      8180.12
      4194304           10       532.93       532.93       532.93      7870.25
[rdma-dev-16:20799] [1]mca_sharedfp_lockedfile_file_open: Error during file open

From a quick bit of debugging, I know this comes from the second instance of this error message in mca_sharedfp_lockedfile_file_open in the openmpi code, not the first one. I haven't gotten any further than that. I'm not sure whether the bug is in openmpi or in the tests, or where to look next.

jarodwilson commented 5 years ago

Never mind, the openmpi folks straightened me out: this was a configuration issue with the underlying storage and had nothing to do with the tests.
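
For anyone landing here with the same message: the thread does not say what the storage misconfiguration actually was, so the following is only a generic sanity check, not the fix that was applied. It assumes (this is an assumption, not something stated above) that the lockedfile sharedfp component needs to create a small helper file and take a POSIX advisory lock on it in the directory where the benchmark runs. The function name `can_create_and_lock` and the probe file name are made up for illustration. Running it on each host against the directory used for the test files shows whether that directory allows file creation and locking at all.

```python
import fcntl
import os
import sys
import tempfile


def can_create_and_lock(directory):
    """Return True if a file can be created and exclusively locked in `directory`.

    Hypothetical helper: it only probes create + POSIX advisory locking,
    which is an assumed requirement, not a confirmed root cause.
    """
    path = os.path.join(directory, f".lock_probe_{os.getpid()}")
    try:
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    except OSError:
        # Directory missing, read-only, or otherwise not writable.
        return False
    try:
        # Non-blocking exclusive lock, released again immediately.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    except OSError:
        # The filesystem refused the lock (for example, locking is unsupported).
        return False
    finally:
        os.close(fd)
        try:
            os.unlink(path)
        except OSError:
            pass


if __name__ == "__main__":
    # Probe the directory given on the command line, else the temp directory.
    target = sys.argv[1] if len(sys.argv) > 1 else tempfile.gettempdir()
    print(target, can_create_and_lock(target))
```

A `False` result on one host but not the other would point at the storage or mount setup on that host rather than at the benchmark itself.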