mercury-hpc / mercury

Mercury is a C library for implementing RPC, optimized for HPC.
http://www.mcs.anl.gov/projects/mercury/
BSD 3-Clause "New" or "Revised" License

mercury does not appear to understand 'mrail' protocol #326

Open roblatham00 opened 4 years ago

roblatham00 commented 4 years ago

Describe the bug

I am unable to request the 'mrail' libfabric provider from mercury (master)

To Reproduce

I have tried the margo-p2p-bw test with the following network strings:

mpiexec -f hostfile -launcher ssh -ppn 1 -n 2 ./margo-p2p-bw -x 13072 -n 'mrail://' -c 4 -D 10
# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/na/na_ofi.c:2878
 # na_ofi_check_protocol(): Protocol mrail not supported
# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/na/na.c:276
 # NA_Initialize_opt(): Specified class name does not support requested protocol
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury_core.c:1130
 # hg_core_init(): Could not initialize NA class
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury_core.c:3628
 # HG_Core_init_opt(): Cannot initialize HG core layer
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury.c:1093
 # HG_Init_opt(): Could not create HG core class
# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/na/na_ofi.c:2878
 # na_ofi_check_protocol(): Protocol mrail not supported
# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/na/na.c:276
 # NA_Initialize_opt(): Specified class name does not support requested protocol
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury_core.c:1130
 # hg_core_init(): Could not initialize NA class
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury_core.c:3628
 # HG_Core_init_opt(): Cannot initialize HG core layer
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury.c:1093
 # HG_Init_opt(): Could not create HG core class

Ok, let's try explicitly requesting OFI:

mpiexec -f hostfile -launcher ssh -ppn 1 -n 2 ./margo-p2p-bw -x 13072 -n 'ofi+mrail://' -c 4 -D 10
# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/na/na_ofi.c:2878
 # na_ofi_check_protocol(): Protocol mrail not supported
# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/na/na.c:276
 # NA_Initialize_opt(): Specified class name does not support requested protocol
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury_core.c:1130
 # hg_core_init(): Could not initialize NA class
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury_core.c:3628
 # HG_Core_init_opt(): Cannot initialize HG core layer
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury.c:1093
 # HG_Init_opt(): Could not create HG core class
# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/na/na_ofi.c:2878
 # na_ofi_check_protocol(): Protocol mrail not supported
# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/na/na.c:276
 # NA_Initialize_opt(): Specified class name does not support requested protocol
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury_core.c:1130
 # hg_core_init(): Could not initialize NA class
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury_core.c:3628
 # HG_Core_init_opt(): Cannot initialize HG core layer
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-master-57ozcr44yhv2a2f5522zb5tipbakp6ka/spack-src/src/mercury.c:1093
 # HG_Init_opt(): Could not create HG core class
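
For reference, both strings above carry the same protocol name, which is why they fail at the same check. Here is a rough sketch of splitting a `[class+]protocol://` string into its two parts. The function name `demo_split_info` and the buffer handling are made up for illustration; this is not Mercury's actual parser.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split "[class+]protocol://..." into class and protocol.
 * An absent class yields an empty string.
 * Returns 0 on success, -1 if a part does not fit in its buffer. */
static int demo_split_info(const char *info, char *cls, size_t cls_len,
                           char *proto, size_t proto_len)
{
    const char *end = strstr(info, "://");
    const char *plus;
    size_t n;

    if (end == NULL)
        end = info + strlen(info); /* no "://": treat the whole string as the prefix */

    /* Only look for '+' in the part before "://" */
    plus = memchr(info, '+', (size_t) (end - info));
    if (plus != NULL) {
        n = (size_t) (plus - info);
        if (n >= cls_len)
            return -1;
        memcpy(cls, info, n);
        cls[n] = '\0';
        info = plus + 1;
    } else {
        if (cls_len == 0)
            return -1;
        cls[0] = '\0';
    }

    n = (size_t) (end - info);
    if (n >= proto_len)
        return -1;
    memcpy(proto, info, n);
    proto[n] = '\0';

    return 0;
}
```

With this sketch, `'mrail://'` gives an empty class and protocol `mrail`, while `'ofi+mrail://'` gives class `ofi` and the same protocol `mrail`.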

Expected behavior

Does Mercury need to know about every possible libfabric provider? I see configuration for verbs and gni, but hard-coding a list of providers seems like a pretty major abstraction violation.


carns commented 4 years ago

There is a big x-macro that enumerates all of the OFI providers that Mercury supports in the code here:

https://github.com/mercury-hpc/mercury/blob/master/src/na/na_ofi.c#L114

... and yes, as it stands right now Mercury will only run atop things that it can find in the array of config structs that macro generates.
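
To illustrate the pattern (not the real table), here is a stripped-down sketch of an x-macro that generates both an enum and an array of config structs, plus a lookup that fails for anything not listed. All `demo_`/`DEMO_` names and the `wait_fd` field are made up for this example; the actual macro in `na_ofi.c` has different entries and more fields.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical provider list: one X(...) entry per supported provider.
 * The capability flag is illustrative only. */
#define DEMO_PROV_TYPES                                                        \
    X(DEMO_SOCKETS, "sockets", 1)                                              \
    X(DEMO_VERBS, "verbs", 0)                                                  \
    X(DEMO_GNI, "gni", 0)

/* First expansion: an enum with one value per provider */
#define X(id, name, wait_fd) id,
enum demo_prov_type { DEMO_PROV_TYPES DEMO_PROV_MAX };
#undef X

struct demo_prov_cfg {
    const char *name; /* provider name as given in the address string */
    int wait_fd;      /* made-up capability flag */
};

/* Second expansion: the array of config structs, in the same order */
#define X(id, name, wait_fd) {name, wait_fd},
static const struct demo_prov_cfg demo_prov_table[] = {DEMO_PROV_TYPES};
#undef X

/* Return the config for a provider name, or NULL if it is not in the table */
static const struct demo_prov_cfg *demo_prov_lookup(const char *name)
{
    size_t i;

    for (i = 0; i < (size_t) DEMO_PROV_MAX; i++)
        if (strcmp(demo_prov_table[i].name, name) == 0)
            return &demo_prov_table[i];

    return NULL;
}
```

In this sketch `demo_prov_lookup("mrail")` returns NULL because there is no `X(...)` entry for it, which mirrors the "Protocol mrail not supported" failure above.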

Philosophically it would be nice if Mercury would run atop any provider transparently, but Mercury uses a different strategy for each provider, depending on which capabilities are likely to work there.

Maybe we could have a fall-back that just tries its best if it's given an ofi+ provider that's not in the table? Or maybe there is a more clever way to differentiate settings between providers than a hard-coded table?

roblatham00 commented 4 years ago

I opened this issue for the philosophical point, but in this specific case it looks like mrail requires a lot of legwork to use.

soumagne commented 4 years ago

I think it should be feasible to simply default to whatever OFI returns and just have a warning printed in that case with some information so that we have a chance to know what we are using at least :)
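
Roughly what I have in mind, as a sketch only: look the name up in the known table, and if it is missing, fall back to default settings and print a warning instead of failing. The `demo_fb_` names, the `tuned` field, and the warning text are all hypothetical and do not correspond to existing Mercury code.

```c
#include <stdio.h>
#include <string.h>

struct demo_fb_cfg {
    const char *name; /* provider name */
    int tuned;        /* 1 if settings were chosen for this provider */
};

/* Hypothetical table of providers with hand-picked settings */
static const struct demo_fb_cfg demo_fb_known[] = {
    {"verbs", 1},
    {"gni", 1},
};

/* Hypothetical defaults used for any provider that is not in the table */
static const struct demo_fb_cfg demo_fb_default = {"(default)", 0};

/* Resolve a provider name; unknown names get the defaults plus a warning */
static const struct demo_fb_cfg *demo_fb_resolve(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(demo_fb_known) / sizeof(demo_fb_known[0]); i++)
        if (strcmp(demo_fb_known[i].name, name) == 0)
            return &demo_fb_known[i];

    fprintf(stderr,
        "warning: provider \"%s\" is not in the known table, "
        "using default settings\n",
        name);

    return &demo_fb_default;
}
```

With something like this, requesting `mrail` would no longer be rejected up front, and the warning tells us which provider name ended up on the default path.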