UCX iWARP support does not provide the same performance as the openib counterpart?

pllopis commented 4 years ago

Background information

I am trying to build OpenMPI in a way that works with good performance on two different clusters. One cluster is iWARP, the other is Infiniband.

With OpenMPI 3, since both Infiniband and iWARP were supported by ob1, the same OpenMPI build worked for both clusters. With OpenMPI4, things change and Infiniband is no longer supported by pml/ob1, while iWARP is still supported by pml/ob1.

Since both iWARP and Infiniband are supported on UCX, I was trying to make an OpenMPI build that defaults to using UCX for both networks. However, the performance when using UCX on iWARP is equivalent to that of TCP.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

4.0.3 installed from a tarball.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

./configure --prefix=%{install_prefix} \
            --libdir=%{install_prefix}/lib64 \
            --enable-mpi-cxx --enable-cxx-exceptions --enable-mpi-thread-multiple --enable-orterun-prefix-by-default \
            --with-slurm --with-pmi=/usr \
            CFLAGS='-m64 -O2 -pipe -Wall -Wshadow' \
            CXXFLAGS='-m64 -O2 -pipe -Wall -Weffc++ -Wshadow' \
            FCFLAGS='-m64 -O2 -pipe -Wall' \
            FFLAGS='-m64 -O2 -pipe -Wall'

Please describe the system on which you are running

Operating system/version: CentOS 7.8
Computer hardware: x86 and Chelsio T520-LL-CR
Network type: Infiniband, iWARP (Chelsio)

Details of the problem

Latency between two nodes over iWARP using OpenMPI pml/ucx:

mpirun -mca pml ucx --map-by node --bind-to core -n 2 -host hpc002,hpc003  $OSUTESTS/pt2pt/osu_latency -m1                                                  
# OSU MPI Latency Test v5.4.1
# Size          Latency (us)
0                      16.78
1                      16.75

Latency between the same two nodes using pml/ob1:

mpirun -mca pml ob1 --map-by node --bind-to core -n 2 -host hpc002,hpc003 $OSUTESTS/pt2pt/osu_latency -m1
# OSU MPI Latency Test v5.4.1
# Size          Latency (us)
0                       3.75
1                       3.66

Is there a way to make UCX work well with iWARP?

ucx_info reports the following:

# Memory domain: ib/cxgb4_0
#            component: ib
#             register: unlimited, cost: 90 nsec
#           remote key: 16 bytes
#           local memory handle is required for zcopy
#   < no supported devices found >
#
# Memory domain: rdmacm
#            component: rdmacm
#           supports client-server connection establishment via sockaddr
#   < no supported devices found >

Is it possible that my ucx setup isn't properly supporting my iWARP card, do I need to build OpenMPI differently, or is something else going on?

According to: https://www.open-mpi.org/faq/?category=openfabrics#iwarp-support it should be possible to have UCX+iWARP, but I'm missing something here :)

Thanks in advance for the help, Pablo

hjelmn commented 4 years ago

You seem to have the two reversed. Swap "Latency between two nodes over iWARP using OpenMPI pml/ob1:" with "Latency between the same two nodes using UCX:".

I agree that is bad. Someone from the OpenUCX community needs to address this since the openib BTL is gone in later releases.

Can you try with --mca ob1 --mca btl vader,self,uct --mca btl_uct_memory_domains ib/cxgb4_0,rdmacm and see if that is any better? I doubt it but it is a good sanity check.

jsquyres commented 4 years ago

Is UCX falling back to TCP?

hjelmn commented 4 years ago

@jsquyres That is what I was thinking. By eliminating the top layer or UCX it helps to isolate what is going wrong.

jsquyres commented 4 years ago

Does UCX have controls like Open MPI to limit which devices/plugins/endpoints/whatever it uses?

yosefe commented 4 years ago

yes, set UCX_NET_DEVICES=cxgb4_0:1 for example However, UCX does NOT support iWarp for the time being (probably need to update OpenMPI FAQ)

hjelmn commented 4 years ago

Wait, but wasn't the removal of the openib btl based on OpenUCX supporting iWarp? It doesn't matter to me but that is what I remember from the discussion.

yosefe commented 4 years ago

Wait, but wasn't the removal of the openib btl based on OpenUCX supporting iWarp? It doesn't matter to me but that is what I remember from the discussion.

Seems like a miscommunication. UCX can work over iWarp in TCP mode, but with RDMA (hence the worse performance)

jsquyres commented 4 years ago

We discussed this last week on the Tuesday teleconf: no one could remember precisely, but we all had the feeling that iWARP was supposed to be supported somehow. We thought it was UCX, but perhaps it was Libfabric...?

In any case, the original report on this issue may well be that UCX fell back to TCP, thereby giving un-offloaded-TCP levels of performance.

Seems like a miscommunication. UCX can work over iWarp in TCP mode, but with RDMA (hence the worse performance)

@yosefe Can you explain what that statement means? iWARP is TCP, so I'm not quite sure how to parse your statement...

yosefe commented 4 years ago

@yosefe Can you explain what that statement means? iWARP is TCP, so I'm not quite sure how to parse your statement...

Sorry. let me rephrase that: UCX can work over iWarp in using TCP sockets, but not in RDMA-over-TCP mode (hence the worse performance reported in this issue)

pllopis commented 4 years ago

@yosefe Can you explain what that statement means? iWARP is TCP, so I'm not quite sure how to parse your statement...

Sorry. let me rephrase that: UCX can work over iWarp in using TCP sockets, but not in RDMA-over-TCP mode (hence the worse performance reported in this issue)

So UCX provides a (sort of) SoftiWARP implementation, if I understand this correctly? It does rdma over tcp, but without using any offloading capabilities?

jsquyres commented 4 years ago

@yosefe Sorry, I'm still confused. The iWARP protocol is message passing and RDMA over TCP sockets.

I think the question is: does UCX support iWARP devices via the IB verbs API? Because if so, then UCX should support both regular messaging and RDMA over TCP sockets (because the iWARP devices will utilize the iWARP wire protocol, which -- for both regular messaging and for RDMA -- is fundamentally based on TCP sockets).

Instead, are you saying that UCX does not support iWARP devices via the IB verbs stack? I.e., UCX supports all Ethernet devices with not-OS-bypass / not-offloaded / plain POSIX sockets?

yosefe commented 4 years ago

does UCX support iWARP devices via the IB verbs API?

No. UCX does not support RDMA over TCP socket.

Instead, are you saying that UCX does not support iWARP devices via the IB verbs stack? I.e., UCX supports all Ethernet devices with not-OS-bypass / not-offloaded / plain POSIX sockets?

Yes, iWarp is working in not-OS-bypass / not-offloaded / plain POSIX sockets mode

jsquyres commented 4 years ago

Yes, iWarp is working in not-OS-bypass / not-offloaded / plain POSIX sockets mode

@yosefe You are not making things any clearer. ☹️ iWARP is a protocol. There are many types of Ethernet devices; some natively support the iWARP protocol in hardware, others do not.

Are you saying that UCX has a userspace software-based iWARP implementation (that assumedly works on all Ethernet devices, not just iWARP hardware devices)?

Or are you saying that UCX has a TCP mode that has nothing to do with the iWARP protocol that works on all Ethernet devices (not just iWARP hardware devices)?

yosefe commented 4 years ago

Or are you saying that UCX has a TCP mode that has nothing to do with the iWARP protocol that works on all Ethernet devices (not just iWARP hardware devices)?

Yes

jsquyres commented 4 years ago

Ok. So just to be 1000% clear for those who end up on this issue in the future:

As of today (23 June 2020 / UCX v1.8.), UCX does not support iWARP at all.

jsquyres commented 4 years ago

@pllopis I believe we wrote that FAQ entry back when we understood that UCX supported iWARP. Apparently, that is incorrect. Could you try the OFI MTL (i.e., libfabric)?

mpirun --mca pml cm --mca mtl ofi ...

pllopis commented 4 years ago

Thanks to all for the awesome support! Indeed I'll try out what was suggested. The issue is clearly understood now.

I'm trying to find a configuration that works well for both Infiniband and iWARP, is there any way to achieve that on OpenMPI 4? (i.e. one config that gets the low latency on both networks, without changing component configurations for each case) If you prefer, I can open a new issue about this and you can consider this one closed.

jsquyres commented 4 years ago

No need for a new issue, this is all kinda the same thing...

UCX is definitely Mellanox's preferred mechanism for IB support these days. So mpirun --mca pml ucx ... is the best way to go for IB.

IB does work with the openib BTL, but it's not preferred (and will be going away in the Open MPI v5.x series). It's mostly unmaintained at this point, so if you find bugs in the v4.x series, it may be a struggle for us to get them fixed.

I'm hopeful that iWARP support works in recent versions of Libfabric, and therefore mpirun --mca pml cm --mca mtl ofi ... will work for you.

Since UCX (at least at the moment) doesn't support iWARP, I don't think you're going to find a common mpirun command line that supports both IB and iWARP, sorry.

hjelmn commented 4 years ago

For the BTL route the UCT btl has replaced openib for IB but requires extra configuration until I can find time to ensure that enabling btl/uct has no negative impacts on pml/ucx.

jsquyres commented 4 years ago

I have confirmed that Libfabric supports iWARP. Specifically:

The verbs provider supports iWARP devices.
The rxm provider supports the tagged message support that Open MPI needs.

Meaning: there is likely some way to get mpirun --mca pml cm --mca mtl ofi ... to work. If that command line doesn't work out of the box (i.e., if Open MPI doesn't automatically select rxm+verbs for your iWARP device), you may need to specify some additional parameters to select to use rxm and verbs inside Libfabric, but there should be a way to do this.

I'm afraid that I have no iWARP hardware with which to test this (as you can probably surmise from this discussion, there probably aren't very many Open MPI iWARP users left).

pllopis commented 4 years ago

I'm hopeful that iWARP support works in recent versions of Libfabric, and therefore mpirun --mca pml cm --mca mtl ofi ... will work for you.

It does work, albeit with slightly higher latency than ob1:

mpirun  -mca pml cm -mca mtl ofi --map-by node --bind-to core -n 2 -host hpc002,hpc003  $OSUTESTS/pt2pt/osu_latency -m1
# OSU MPI Latency Test v5.4.1
# Size          Latency (us)
0                       4.70
1                       4.62

In my build it picks it up automatically even if I just use -mca pml cm.

Thanks for the support.

pllopis commented 4 years ago

Can you try with --mca ob1 --mca btl vader,self,uct --mca btl_uct_memory_domains ib/cxgb4_0,rdmacm and see if that is any better? I doubt it but it is a good sanity check.

I think this is the expected outcome, but just to re-confirm, I tested this:

mpirun -mca pml ob1 --mca btl vader,self,uct --mca btl_uct_memory_domains ib/cxgb4_0,rdmacm --map-by node --bind-to core -n 2 -host hpc002,hpc003  $OSUTESTS
/pt2pt/osu_latency -m1                                                                                                                                        
# OSU MPI Latency Test v5.4.1
# Size          Latency (us)
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[2540,1],0]) is on host: hpc002
  Process 2 ([[2540,1],1]) is on host: hpc003
  BTLs attempted: self

Your MPI job is now going to abort; sorry.

Cheers

pllopis commented 4 years ago

Sorry to not have added this before, but I just wanted to provide a bit of background/clarification on why I am getting this increased latency that I reported above when I using pml/cm:

I suspect this is related to what I reported in #7784 where using pml/cm and ofi resulted in high intra-node latencies (I get 3us inter-process latencies, on the same node, only when using pml/cm. With ob1 or ucx they're always sub-microsecond). I am not sure how to force pml/cm with shared memory. My fi_info lists, amongst all the others:

provider: shm
    fabric: shm
    domain: shm
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_SHM

Using -mca mtl_ofi_provider_include 'verbs,shm' yields the same as -mca mtl_ofi_provider_include 'verbs' and the same as leaving the defaults. Using only -mca mtl_ofi_provider_include 'shm' fails with: No components were able to be opened in the pml framework..

hjelmn commented 4 years ago

What is the latency with the ofi btl? --mca pml ob1 ---mca btl self,vader,ofi. I don't know if any other settings are needed.

pllopis commented 4 years ago

In every case where I don't use pml/cm, the intra-node latency is <1us

mpirun --mca pml ob1 --mca mtl ofi --mca btl self,vader --map-by node --bind-to core -n 2 -host hpc002:2  $OSUTESTS/pt2pt/osu_latency -m1
# OSU MPI Latency Test v5.4.1
# Size          Latency (us)
0                       0.16
1                       0.18

I'm assuming you meant the openib btl or the ofi mtl, as I don't see any ofi btl as per ompi_info (and trying to enable "ofi" as btl gives errors).

Any of the following provide sub-microsecond intra-node latency:

-mca pml ucx,
-mca pml ob1
-mca pml ob1 -mca btl self,vader,openib
-mca pml ob1 -mca btl self,vader
-mca pml ob1 -mca mtl ofi

The trigger for getting the ~3us intra-node latency is selecting -mca pml cm (even if I use --mca pml cm --mca btl self,vader,openib --mca mtl ofi, I still get >3us).

Should there be a way to get this low latency using -mca pml cm, since this is the one suggested to be used for iWARP?

hjelmn commented 4 years ago

Thats why I ask about btl/ofi. cm will never be the fast path for shared memory. It is easily beat by ob1 with btl/vader (now btl/sm). If btl/ofi works well for iWarp then maybe the right answer is ob1.

jsquyres commented 4 years ago

It's probably not entirely surprising that the latencies are a bit higher with CM and Libfabric for iWARP. The iWARP support is at least somewhat emulated.

Specifically, the "CM" PML is intended for networks with MPI-style "matching" built in to the fabric itself. By definition, MPI-style matching requires allowing matching across both shared memory and the network. iWARP does not support MPI-style "matching" natively, so the "rxm" provider in Libfabric is a software emulation of that functionality that is layered on top of the raw iWARP Library functionality. This kind of emulation, paired with shared memory messaging in Libfabric, results in a bit less efficiency than can be achieved via either an MPI implementation that is doing all of matching marshaling itself, or a "real" MPI-style matching fabric via Libfabric.

Meaning: I'm guessing that raw Libfabric (without rxm) inter-node iWARP latencies would be what you expect, but with layered rxm-style software emulated matching and shared memory support, some efficiencies are being lost, resulting in what you see as higher point-to-point latencies in microbenchmarks.

This unfortunately reflects the reality that the iWARP vendors haven't been involved in the Open MPI community for quite a while (I'm not sure how much they're still involved in the overall HPC community; all I can definitively state is that they're not involved in the Open MPI community). Sorry. 😦 If you have an ongoing relationship with your iWARP vendor, you might want to, ...er... "encourage" them. 😄

All that being said, try running some real apps and see what happens to the actual wall-clock execution times. Microbenchmarks are useful, but the only thing that matters at the end of the day is wall-clock execution times of your real applications.

Sidenote: Ralph Castain and myself are giving a seminar about Open MPI and its ecosystem (e.g., all the dependencies and what all the of acronyms mean, etc. -- not necessarily an in-depth guide to tuning, but at least providing a basis for knowing what all of these individual pieces are) in about 45 minutes: i.e., 11am US Eastern, 24 June, 2020. It's definitely going to be at least 2 parts -- there's almost zero chance that we'll get to OB1 vs. CM / Libfabric vs. UCX in part 1 today. Here's more information, if you're interested: https://www.mail-archive.com/users@lists.open-mpi.org/msg33931.html

jsquyres commented 4 years ago

Thats why I ask about btl/ofi. cm will never be the fast path for shared memory. It is easily beat by ob1 with btl/vader (now btl/sm).

The "now" in Nathan's statement refers to the upcoming Open MPI v5.x series (i.e., vader was renamed to "sm" in Open MPI v5.x -- but that's still months away).

If btl/ofi works well for iWarp then maybe the right answer is ob1.

To clarify this point: there is a BTL OFI, but I do not believe it exists in the v4.x series. I think it's only on master (i.e., what will become the v5.x series).

hjelmn commented 4 years ago

@jsquyres Might be worth a try with master. If it gives better performance it would be relatively simple to PR btl/ofi back to 4.1.x at least.

jsquyres commented 4 years ago

@hjelmn You have a very, very short timeframe (~a week) to PR that back to v4.1.x. 😄

hjelmn commented 4 years ago

@jsquyres I might do that proactively. There is no harm in including it and it could help with iWarp so why not.

pllopis commented 4 years ago

Thanks for all the extra input. I tried to join the webinar yesterday but I had issues connecting to Webex, so I will watch the uploaded version, thanks for the heads up.

All that being said, try running some real apps and see what happens to the actual wall-clock execution times. Microbenchmarks are useful, but the only thing that matters at the end of the day is wall-clock execution times of your real applications.

I agree. I am running the latency benchmark as I know that's one of the indicators for how well some of the applications will run.

I have built OpenMPI from the master branch and so far I'm getting the same behaviour. For the nodes on the eth/iWARP network: ucx provides a higher inter-node latency compared to ob1, cm (16us vs 3us, 4us respectively). ob1 and ucx provide a lower intra-node latency compared to cm (<1us vs 3us).

Which is a bit unfortunate, since on OpenMPI3 I could just use ob1 for everything and get good performance no matter what. ~~Whereas now it looks like it isn't just one component for IB and one for iWARP, but within eth/iWARP intra-node and inter-node performance is different as well depending on the component.~~ In OpenMPI 4, I can only use pml/ob1 to get good performance for both intra-node and inter-node (iWARP) latency.

Is there any specific configuration you'd like me to give a try?

jsquyres commented 4 years ago

@hjelmn @pllopis Guess what? @bwbarrett already back-ported the OFI BTL to the v4.1.x branch.

So @pllopis, you might want to try mpirun --mca pml ob1 --mca btl ofi,vader,self ... on a v4.1.x nightly build (v4.1.0 hasn't been released yet): https://www.open-mpi.org/nightly/v4.1.x/ (any build starting from last night should be ok)

Just to clarify some things:

UCX is the PML that only uses UCX.
- A rule of thumb is that UCX is only used for IB and RoCE.
- You can force UCX to use TCP, but that will be plain vanilla POSIX TCP sockets -- not iWARP.
CM is the PML that is for "MPI-style matching fabrics".
- It uses MTLs.
- In your case, that means using the OFI MTL, which uses the "MPI-style matching" functionality in Libfabric. That functionality is emulated for iWARP devices.
OB1 is the "multi-device" PML.
- It uses BTLs.
- You'll always want to use the vader and self BTL to MPI communicate via shared memory and process-loopback.
- There's also an OFI BTL that you can use with the OB1 PML.
  - This will use raw iWARP functionality in Libfabric without the emulation of "MPI-style matching" and Libfabric shared memory support. This is why @hjelmn thinks that it might perform a bit better.
  - That's (mostly) different than using the CM PML + OFI MTL.

pllopis commented 4 years ago

Thank you for the clarifications, this helps understand some things, especially the part about CM using MTLs vs OB1 using BTLs.

I edited my previous comment above. I can already get good performance on iWARP-enabled nodes for both intra and inter-nodes if I use --mca pml ob1. Sorry for the confusion.

In light of this, I my case I think I wouldn't want to use CM at all? Compared to ob1, it has slightly higher latency than ob1 between nodes, and much higher latency for local inter-process communication.

So now I am confused now as to why the backport of BTL OFI :) will this enable IB support in ob1? In any case, I will definitely givet his a try tomorrow. Thanks again

pllopis commented 4 years ago

@hjelmn @pllopis Guess what? @bwbarrett already back-ported the OFI BTL to the v4.1.x branch.

So @pllopis, you might want to try mpirun --mca pml ob1 --mca btl ofi,vader,self ... on a v4.1.x nightly build (v4.1.0 hasn't been released yet): https://www.open-mpi.org/nightly/v4.1.x/ (any build starting from last night should be ok)

I tried the nightly at https://download.open-mpi.org/nightly/open-mpi/v4.1.x/openmpi-v4.1.x-202006260337-4654149.tar.bz2, but I'm getting a crash when using mpirun --mca pml ob1 --mca btl ofi,vader,self ...

mpirun --mca pml ob1 --mca btl ofi,vader,self --mca btl_ofi_provider_include verbs --map-by node --bind-to core -n 2 --host hpc002,hpc003  $OSUTESTS/pt2pt/o
su_latency -m1                                                                                                                                                
[hpc003:1723082:0:1723082] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xc8)
==== backtrace ====
    0  /lib64/libucs.so.0(+0x17970) [0x154f2352b970]
    1  /lib64/libucs.so.0(+0x17b22) [0x154f2352bb22]
    2  /usr/local/mpi/openmpi/4.1.x/lib64/openmpi/mca_pml_ob1.so(+0xb8f5) [0x154f29bb58f5]
    3  /usr/local/mpi/openmpi/4.1.x/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x540) [0x154f29bb76b0]
# OSU MPI Latency Test v5.4.1
# Size          Latency (us)
    4  /usr/local/mpi/openmpi/4.1.x/lib64/libmpi.so.40(ompi_coll_base_barrier_intra_two_procs+0xe1) [0x154f3e83c401]
    5  /usr/local/mpi/openmpi/4.1.x/lib64/libmpi.so.40(MPI_Barrier+0xa7) [0x154f3e7f9157]
    6  /usr/local/mpi/osu-micro-benchmarks-openmpi403/5.4.1/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency() [0x401631]
    7  /lib64/libc.so.6(__libc_start_main+0xf5) [0x154f3d9af555]
    8  /usr/local/mpi/osu-micro-benchmarks-openmpi403/5.4.1/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency() [0x401934]
===================
[hpc002:1821031:0:1821031] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xc8)
==== backtrace ====
    0  /lib64/libucs.so.0(+0x17970) [0x1541c1ea7970]
    1  /lib64/libucs.so.0(+0x17b22) [0x1541c1ea7b22]
    2  /usr/local/mpi/openmpi/4.1.x/lib64/openmpi/mca_pml_ob1.so(+0xb8f5) [0x1541c803b8f5]
    3  /usr/local/mpi/openmpi/4.1.x/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x540) [0x1541c803d6b0]
    4  /usr/local/mpi/openmpi/4.1.x/lib64/libmpi.so.40(ompi_coll_base_barrier_intra_two_procs+0xe1) [0x1541dcf78401]
    5  /usr/local/mpi/openmpi/4.1.x/lib64/libmpi.so.40(MPI_Barrier+0xa7) [0x1541dcf35157]
    6  /usr/local/mpi/osu-micro-benchmarks-openmpi403/5.4.1/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency() [0x401631]
    7  /lib64/libc.so.6(__libc_start_main+0xf5) [0x1541dc0eb555]
    8  /usr/local/mpi/osu-micro-benchmarks-openmpi403/5.4.1/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency() [0x401934]
===================

The above is for the nodes that do iWARP, but I tried for both IB and iWARP, with the same outcome.

I welcome these improvements :) but as I mentioend in my previous comment: if I'm understanding the current situation correctly, for these nodes that talk iWARP I should just stick to OB1, and for using IB I should stick to UCX, correct? Since pml/cm does not seem to perform as well, at least for this particular scenario.

jsquyres commented 4 years ago

In light of this, I my case I think I wouldn't want to use CM at all? Compared to ob1, it has slightly higher latency than ob1 between nodes, and much higher latency for local inter-process communication.

Yes, the way it is currently implemented, CM PML + OFI MTL will give you higher shared memory latencies because the OFI MTL is using the "MPI-style matching" functionality in Libfabric (which, because "MPI-style matching" must encompass all peers, will take over all communication to all peers, both local [shared memory] and remote [iWARP]). Put simply:

Libfabric's shared memory functionality isn't (yet?) as efficient as Open MPI's OB1 PML + vader BTL shared memory functionality.
Libfabric's software emulation of "MPI-style matching" isn't as efficient as Open MPI's OB1 PML MPI matching functionality. I'm guessing that it never will be (because the focus is for networks that actually support MPI-style matching and don't require software emulation).

I.e., in Open MPI v4.x, you have two choices for iWARP functionality (actually, 3, but see below):

OB1 PML + openib BTL (and vader/self BTL for shared memory/loopback): good performance in all cases
CM PML + OFI MTL: software emulation of "MPI-style matching" fabrics + less-efficient shared memory functionality

In Open MPI v5.x, the openib BTL is no longer available (because it's effectively unmaintained code).

Additionally, the iWARP vendors have unfortunately not been a meaningful part of the Open MPI community for quite some time, so it has not been a priority.

So now I am confused now as to why the backport of BTL OFI :) will this enable IB support in ob1? In any case, I will definitely givet his a try tomorrow.

This is the 3rd option for iWARP support in Open MPI v4.x (and probably 5.x).

The OFI BTL does not use the "MPI-style matching" functionality in Libfabric, and does not use the shared memory functionality in Libfabric. Rather, the OFI BTL just acts like a dumb bit-pusher to the underlying raw network interface. It will use iWARP's OS bypass and hardware offload, and will use iWARP's RDMA. Hence, you'll use the OB1-optimized MPI matching and the vader BTL for shared memory (just like the openib BTL does). The hope is that your performance will be on par with openib's iWARP performance.

The OFI BTL hasn't had extensive testing, though, combined with the fact that I'm guessing that you're the first person ever to try it on iWARP. Hence, it's probably not entirely surprising that you're getting a segv. ☹️

To sum up:

In Open MPI 4.x, you're probably ok with OB1 PML + openib/self/vader BTLs.
In Open MPI 5.x, openib is going away. Currently, your only option will be the CM PML + OFI MTL (which is using software emulation over iWARP, resulting in slightly lower performance).
That being said, if an iWARP vendor (cough cough) would step up and help finish / harden the OFI BTL, things might be a little better in Open MPI v5.x for iWARP customers.

hjelmn commented 4 years ago

Yup. And I will take a look at ofi/btl and see what is going on. With was tested with highly threaded workloads for a paper so it probably is a simple bug.

open-mpi / ompi