mercury-hpc / mercury

Mercury is a C library for implementing RPC, optimized for HPC.
http://www.mcs.anl.gov/projects/mercury/
BSD 3-Clause "New" or "Revised" License

NA OFI: Bad address error with psm2 during dma #356

Closed: marcvef closed this issue 4 years ago

marcvef commented 4 years ago

Describe the bug
We are facing issues using ofi+psm2 with GekkoFS when running multiple servers and clients (with multiple client processes per node), which stochastically produces the following error on some of the client processes:

z0263.mogonii.39693Unexpected error in writev(): Bad address (errno=14) (fd=19,iovec=0x2ba199c370e0,len=3) (err=23)

Experimental setup
We first observed this behavior in GekkoFS, but we were able to reproduce it in a smaller Margo testing program (which I am happy to share if required). For example, we use eight servers (one per node). The client is an MPI program that sends a number of RPCs that include bulk transfers (64 MiB per RPC). However, the error also occurs with smaller transfer sizes, e.g., 1 MiB. Clients run on a separate set of nodes from the servers. For each RPC, the client pseudo-randomly decides which server is the recipient. The more client processes run on each node, the more likely the error occurs.

Detailed observations
The error originates from the opa-psm2 library and occurs only during bulk transfers (see source here), specifically during a writev() to a TCP socket.

Regarding Mercury, this error first appeared with commit https://github.com/mercury-hpc/mercury/commit/5b1de1b7432bcde16bedd99e1fbd5c5561504d2f and has occurred ever since.

Additional context

Thanks!

soumagne commented 4 years ago

@marcvef thanks for letting me know about it. I have actually also seen it today with DAOS over PSM2. Yes, if you have a smaller margo example that reproduces it, that would be very helpful, thanks!

marcvef commented 4 years ago

@soumagne Thanks for the quick reply. Please find the testing program attached: rpctester.tar.gz.

I've included a README that explains how to start the servers and clients and describes their arguments.

soumagne commented 4 years ago

I had issues running your reproducer (C++ issues, and then some bad regex behavior when parsing the hostfile). Anyway, after running the Mercury tests more extensively, I think I have an idea of what the problem is. There are multiple data races coming from psm2, which is obviously not thread-safe at all. Can you please try the following on both client and server:

struct hg_init_info init_info = HG_INIT_INFO_INITIALIZER;

init_info.na_init_info.progress_mode = NA_NO_BLOCK;

and pass that to HG_Init_opt() or margo_init_opt(). Let me know if that helps.
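For reference, here is a minimal self-contained sketch of how that init info could be passed to HG_Init_opt(); the "ofi+psm2" target string and the HG_TRUE listen flag are placeholders for your actual configuration, and error handling is trimmed:

#include <mercury.h>

int
main(void)
{
    struct hg_init_info init_info = HG_INIT_INFO_INITIALIZER;

    /* Request busy-spin (non-blocking) progress. */
    init_info.na_init_info.progress_mode = NA_NO_BLOCK;

    /* Placeholder transport string and listen flag; adjust for your setup. */
    hg_class_t *hg_class = HG_Init_opt("ofi+psm2", HG_TRUE, &init_info);
    if (hg_class == NULL)
        return 1;

    /* ... register RPCs, create a context, run the progress loop ... */

    HG_Finalize(hg_class);
    return 0;
}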

marcvef commented 4 years ago

Thanks a lot! I've tried it, and the issue no longer occurs even with larger process counts (I tested up to 512 processes with 32 processes per node). I have yet to run with very large node counts. Could you please elaborate a bit on what NA_NO_BLOCK does and whether there are any side effects when using this mode?

Also, are there other environment variables or Mercury configurations that you suggest setting when using ofi+psm2?

carns commented 4 years ago

I think PSM2 is understood not to be thread-safe on its own, but I was assuming that libfabric would serialize as needed.

See FI_PSM2_LOCK_LEVEL in https://ofiwg.github.io/libfabric/master/man/fi_psm2.7.html, for example. That makes it sound like the provider should be using a conservative locking model by default, but I haven't looked at the code to see what this means exactly.

From skimming that man page I see that they also discourage FI_PROGRESS_AUTO (at least in part because it exacerbates the thread-safety problem?). Mercury is using FI_PROGRESS_AUTO, though (see https://github.com/mercury-hpc/mercury/blob/master/src/na/na_ofi.c#L146). Maybe we should try toggling that to FI_PROGRESS_MANUAL to see if it helps matters?
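For context, here is a rough sketch of how a libfabric consumer would request manual progress through fi_getinfo() hints; this only illustrates the knob being discussed (the API version number is arbitrary) and is not the actual na_ofi.c code path:

#include <stddef.h>
#include <rdma/fabric.h>

static struct fi_info *
get_manual_progress_info(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;

    if (hints == NULL)
        return NULL;

    /* Ask the provider for application-driven progress instead of auto
     * progress, which the psm2 man page discourages. */
    hints->domain_attr->data_progress = FI_PROGRESS_MANUAL;
    hints->domain_attr->control_progress = FI_PROGRESS_MANUAL;

    if (fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, &info) != 0)
        info = NULL;

    fi_freeinfo(hints);
    return info;
}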

soumagne commented 4 years ago

Right, that is also what I was assuming; it is most likely attempting that level of locking, but it is obviously not succeeding and needs more fixes, given the number of data races that the thread sanitizer reports... Yes, it's always the same problem: signaling is not done correctly when FI_PROGRESS_MANUAL is used, which is why we had to set FI_PROGRESS_AUTO by default. FI_PROGRESS_MANUAL is used when NA_NO_BLOCK is set, which is why the problem disappears, but it forces internal busy-spinning since we no longer have the ability to signal completion in that case. @marcvef setting that progress mode will consume more CPU resources: there is no more blocking progress, and Mercury will keep checking completion queues in a busy loop without sleeping.
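To make the trade-off concrete, here is a minimal sketch of a single-threaded Mercury progress loop (shutdown_flag is a placeholder for whatever termination condition the application uses). With blocking progress, HG_Progress() may sleep up to the given timeout; with NA_NO_BLOCK it polls the completion queues instead of sleeping, hence the higher CPU usage:

#include <mercury.h>

static void
progress_loop(hg_context_t *context, const volatile int *shutdown_flag)
{
    while (!*shutdown_flag) {
        unsigned int count = 0;
        hg_return_t ret;

        /* Run callbacks for all operations that have completed. */
        do {
            ret = HG_Trigger(context, 0, 1, &count);
        } while (ret == HG_SUCCESS && count > 0);

        /* May sleep up to 100 ms with blocking progress; busy-polls and
         * returns quickly when NA_NO_BLOCK is set. */
        (void) HG_Progress(context, 100);
    }
}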

Unfortunately, since PSM2 seems to be only minimally supported, I really don't see this being fixed in the near future.

carns commented 4 years ago

What exactly is the signaling problem when FI_PROGRESS_MANUAL is used with the psm2 provider? (If nothing else, to document it for posterity in this GitHub issue.) I assume it must be something psm2-specific; we are using MANUAL mode with verbs in Mercury.

marcvef commented 4 years ago

To give some feedback: we have been running longer tests with up to 64 nodes and 1024 processes in total for three hours. We did not see this particular issue appear again, albeit, as you expected, with much higher CPU utilization.

carns commented 4 years ago

Thanks for the update @marcvef. @soumagne, what can/should we do in Mercury to accommodate this? Can we make this the default setting somehow when the psm2 provider is used? That's too bad about the CPU utilization, but it might be the safest option for now.

soumagne commented 4 years ago

@carns Yes you're right, I should definitely do that.
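For posterity, a purely hypothetical sketch (not the actual Mercury change) of what forcing busy-spin progress whenever psm2 is selected could look like at init time; the helper name and the substring check are illustrative only:

#include <string.h>
#include <mercury.h>

/* Hypothetical helper: force NA_NO_BLOCK for the psm2 OFI provider while
 * leaving other transports untouched. */
static void
maybe_force_no_block(const char *na_info_string, struct hg_init_info *init_info)
{
    if (na_info_string != NULL && strstr(na_info_string, "psm2") != NULL)
        init_info->na_init_info.progress_mode = NA_NO_BLOCK;
}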