ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

prov/shm: shm not used for single-node MPI runs #5465

Closed MatthAlex closed 2 years ago

MatthAlex commented 4 years ago

Forcing shm to be opened on an MPI single-node run returns the following error:

[4] libfabric:11572:core:core:fi_fabric_():1152<info> Opened fabric: shm
[4] libfabric:11572:shm:cntr:smr_cntr_open():42<info> cntr wait not yet supported
[4] Abort(1091471) on node 4 (rank 4 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
[4] MPIR_Init_thread(703)........: 
[4] MPID_Init(923)...............: 
[4] MPIDI_OFI_mpi_init_hook(1057): OFI event counter create failed (ofi_init.c:1057:MPIDI_OFI_mpi_init_hook:Function not implemented)

The shm man page does mention:

No support for counters.

However, Intel MPI 2018.3 works correctly with shm when forced to use it. By default it uses verbs instead, if that provider is enabled.

Tested with libfabric v1.9.0 (and rc1, rc2, rc3), and Intel MPI 2018.3 and 2019.5.

Configuration used:

./configure --prefix=$install_dir --enable-verbs=yes --enable-shm=yes --enable-rxm=yes \
--disable-psm --disable-psm2 --disable-sockets --disable-usnic --disable-gni \
--disable-xpmem --disable-udp --disable-tcp --disable-mrail --disable-bgq \
--disable-rstream --disable-perf --disable-hook_debug --disable-efa --disable-rxd

Variables used:

export I_MPI_FABRICS="shm:ofi"
export I_MPI_OFI_LIBRARY_INTERNAL=0
export FI_LOG_LEVEL=debug
export FI_VERBS_IFACE=ib0
export FI_PROVIDER="^verbs"
export I_MPI_DEBUG=6
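
As a quick sanity check before launching (a minimal sketch, not the exact commands used; $install_dir is the prefix from the configure line above, and the binary name and rank count are placeholders):

export LD_LIBRARY_PATH=$install_dir/lib:$LD_LIBRARY_PATH
$install_dir/bin/fi_info -p shm    # should list the shm provider entries from the external build
mpirun -n 4 ./my_app               # single-node run using the variables exported above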
shefty commented 4 years ago

Thanks for the report. This sounds like it might be an MPI issue. I'll forward this report to the Intel MPI team for discussion.

ddurnov commented 4 years ago

@MatthAlex You may try disabling the RMA path, which uses the counters, at the IMPI 2019 level: MPIR_CVAR_CH4_OFI_ENABLE_RMA=0

BTW, IMPI 2019 U6 is publicly available.

MatthAlex commented 4 years ago

Tried exporting MPIR_CVAR_CH4_OFI_ENABLE_RMA=0 at runtime, but unfortunately:

[0] [0] MPI startup(): libfabric provider: shm
[0] libfabric:24107:core:core:fi_fabric_():1152<info> Opened fabric: shm
[0] libfabric:24107:core:core:fi_param_get_():280<info> variable universe_size=<not set>
[0] libfabric:24107:shm:av:util_av_init():455<info> AV size 1024
[0] libfabric:24107:shm:av:smr_map_to_region():174<warn> shm_open error
[0] libfabric:24107:shm:av:smr_map_to_region():174<warn> shm_open error
[0] Abort(1091471) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
[0] MPIR_Init_thread(703)........: 
[0] MPID_Init(923)...............: 
[0] MPIDI_OFI_mpi_init_hook(1335): OFI get address vector map failed
[1] libfabric:26209:shm:av:smr_map_to_region():174<warn> shm_open error
[1] libfabric:26209:shm:av:smr_map_to_region():174<warn> shm_open error

shm-2ranks1node.txt

Additionally, running two ranks over two nodes with round-robin pinning failed in the same way: shm-2ranks2nodes-rr.txt

Do you think update 6 will make any difference? Is it a hard suggestion?

ddurnov commented 4 years ago

@MatthAlex Thanks! We are analyzing the issue you have run into (the "shm_open error"). It looks like IMPI 2019 U6 won't help here. We will let you know as soon as we have something. Please note that Intel MPI relies on its own shm transport as the primary one.

aingerson commented 4 years ago

@MatthAlex I'm still chasing down one bug to get this working properly but could you try running this with master and see if your issues are mostly fixed?

MatthAlex commented 4 years ago

@aingerson I can confirm that the master branch resolved the shm error, checked with 2 and 16 ranks.

On a tangent: I erroneously ran a test across two nodes. Without the logs it might be difficult to quickly discern that shm isn't supposed to run on more than one node. Contrast this with the Intel MPI 2018 behaviour, where the failure is explicit without needing I_MPI_DEBUG, I believe.

[19] Abort(1615759) on node 19 (rank 19 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
[19] MPIR_Init_thread(703)........: 
[19] MPID_Init(923)...............: 
[19] MPIDI_OFI_mpi_init_hook(1287): 
[19] MPIDU_bc_table_create(344)...: Missing hostname or invalid host/port description in business card
MatthAlex commented 4 years ago

Additional testing has produced another issue. I'm logging it here for the time being, since the test works with verbs but fails with shm.

During testing of Fortran coarrays, when the second rank happened to be pinned on the same package, the program was killed. If the two ranks are on separate packages there is no issue. That means it doesn't work for >2 ranks, since a 3rd rank will always be placed on an already occupied package. Attached are the logs from both runs and the Fortran program: shm-coarray-test.txt coarray.txt (change .txt to .f90). Edit: If you think this warrants a new Issue I could move it.

aingerson commented 4 years ago

@MatthAlex Thanks for the update. Could you try testing again with the changes in #5503 ? I finished chasing down a couple of bugs, and the provider is now working with the Intel MPI benchmarks up to 32 ranks.

MatthAlex commented 4 years ago

@aingerson Testing has been successful. shm ran on 2 and 16 ranks (our higher-core nodes still lack IPoIB, so I can't test there). When FI_PROVIDERS="shm,verbs", verbs is chosen instead. Performance seems margin-of-error close. Is that the intended or expected outcome? Should shm be forced on single nodes?

As for the coarray issue, there was no progress with #5503 . I feel that it might be a separate issue.

aingerson commented 4 years ago

Thanks for testing again.

I don't think FI_PROVIDERS will do anything. It should be FI_PROVIDER. Right now there is no way to layer the shm provider with another provider (such as verbs); that is a work in progress. What it sounds like is happening when you set FI_PROVIDERS="shm,verbs" is that the setting is not registered, and it instead just picks verbs for all communication, which is why you're seeing similar performance.

As for the coarray issue, yeah it seems separate. I can look into it. To help in isolating the issue, could you run it again with I_MPI_DEBUG=1 MPIR_CVAR_CH4_OFI_ENABLE_RMA=0 MPIR_CVAR_CH4_OFI_ENABLE_ATOMICS=0 and attach the debug log?
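
Concretely, the request above amounts to something like this (a sketch; the binary and log file names are placeholders):

FI_PROVIDER=shm I_MPI_DEBUG=1 MPIR_CVAR_CH4_OFI_ENABLE_RMA=0 MPIR_CVAR_CH4_OFI_ENABLE_ATOMICS=0 \
    mpirun -n 2 ./coarray_test 2>&1 | tee shm-coarray-debug.txt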

MatthAlex commented 4 years ago

Of course you are right. I meant FI_PROVIDER="shm,verbs", which is what I've used for testing. So, no layering between the two providers means that verbs will be picked first if at all available, then shm?

On the coarray tests, I'm attaching a log with the variables you mentioned and one with I_MPI_DEBUG=6 and FI_LOG_LEVEL=debug added, for the same exact run. It runs to completion successfully. shm-test.txt shm-test-lite.txt

aingerson commented 4 years ago

So, no layering between the two providers means that verbs will be picked first if at all available, then shm?

Right, if you include verbs in the FI_PROVIDER list, it will pick up verbs since it is the highest ranking provider in that list. The order it will pick from should be the same order as the output of fi_info. Really the only way to run shm is to explicitly request it using FI_PROVIDER=shm or set hints->fabric_attr->prov_name="shm" in hints. Once it is able to be layered with core providers, this will change.
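
A rough illustration of that selection behaviour (a sketch; the binary and rank count are placeholders):

fi_info | grep provider                          # providers are listed in the order the core will pick them
FI_PROVIDER="shm,verbs" mpirun -n 2 ./my_app     # verbs outranks shm in this list, so verbs is used
FI_PROVIDER=shm mpirun -n 2 ./my_app             # explicitly requesting shm is the only way to get it today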

Thanks for the updated test runs. This looks like an issue with atomics. I'll look into it and keep you updated!

MatthAlex commented 4 years ago

Hello, I'm resurrecting this issue with new details. Testing Intel MPI 2020 Update 1 with gcc 6.3, 7.1, and 8.2, and libfabric v1.9.1, the exact same error resurfaced.

libfabric:4443:core:core:fi_fabric_():1163<info> Opened fabric: shm
libfabric:4443:shm:cntr:smr_cntr_open():42<info> cntr wait not yet supported
Abort(1091215) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........: 
MPID_Init(904)...............: 
MPIDI_OFI_mpi_init_hook(1210): OFI event counter create failed (ofi_init.c:1210:MPIDI_OFI_mpi_init_hook:Function not implemented)

On top of that, passing MPIR_CVAR_CH4_OFI_ENABLE_RMA=0 and MPIR_CVAR_CH4_OFI_ENABLE_ATOMICS=0 still works around the issue.
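
For reference, the full single-node workaround ends up being something like this (a sketch; the rank count and binary are placeholders):

export FI_PROVIDER=shm
export MPIR_CVAR_CH4_OFI_ENABLE_RMA=0
export MPIR_CVAR_CH4_OFI_ENABLE_ATOMICS=0
mpirun -n 16 ./my_app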

aingerson commented 4 years ago

@MatthAlex Thanks for resurrecting the issue!

The 1.9 branch does not have the new wait object support in shm that Intel MPI needs to enable RMA and atomics. 1.10 should have it and works for me. Could you give that a try?

MatthAlex commented 4 years ago

@aingerson I can confirm that it does work!

I can also confirm that disabling ATOMICS and RMA is still needed for Coarray Fortran to work as expected. That hasn't changed with Intel Parallel Studio XE 2020 Update 1.

aingerson commented 4 years ago

Great news that it can at least run! Can you provide a reproducer for the issue? I can't seem to recreate it, which I need in order to debug the atomics/RMA problem.

MatthAlex commented 4 years ago

Reproducing the issue proved to be a bit more involved than I thought.

CAF_verbs_failure.txt CAF_shm_pass.txt

The issue could only be reproduced with verbs, not shm. shm does produce some odd "info" messages for atomics; however, the exit code is 0 nonetheless.

aingerson commented 4 years ago

The shm warning I think is ok. I believe MPI queries/tests the different atomic op/datatype combinations to see what it can support, and OFI produces a warning whenever it doesn't support a certain combination. If the exit code is 0 then I think it's ok. Not sure about the verbs error. Could you point me to the error line in the CAF_verbs_failure.txt log? I don't see the problem there. If you are seeing a problem with verbs, I would suggest trying it with the MR cache turned off, since we have found some issues with it. To do this, set FI_MR_CACHE_MAX_COUNT=0.
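
Concretely, something like this (a sketch; the binary is a placeholder):

FI_MR_CACHE_MAX_COUNT=0 mpirun -n 2 ./coarray_test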

MatthAlex commented 4 years ago

Oh, my bad. I bundled the successful run's log together with the failed one in CAF_verbs_failure. The failure took place on lines 389-393. Edit: As an update, turning the MR cache off still fails on verbs.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 360 days with no activity. Remove stale label or comment, otherwise it will be closed in 7 days.