ofi-cray / libfabric-cray

Open Fabric Interfaces
http://ofiwg.github.io/libfabric/

Sandia SHMEM-OFI atomic performance #1038

Open kseager opened 7 years ago

kseager commented 7 years ago

Message rate performance needs some tuning for the atomics pathway, notably int add/increment.

Refer to the following test from Sandia SHMEM-OFI for investigation purposes (you will need to configure SOS with --enable-lengthy-tests to build the test suite):

https://github.com/Sandia-OpenSHMEM/SOS/blob/master/test/performance/shmem_perf_suite/shmem_bw_atomics_perf.c

jswaro commented 7 years ago

Does SOS catch signals such as SIGPROF (the profiling signal)? It appears so, since it seems to crash at the first signal. If so, is there an easy way to have SOS ignore that signal?

kseager commented 7 years ago

We don't register any signal handlers in SOS. It might be the process manager...

jswaro commented 7 years ago

Some notes for myself here:

I rewrote part of the random access test to use fetch atomics, same as SOS, but with an extended emphasis on millions of loops in order to profile the behavior of the application.

Using the gperftools profiler and the modified code, it seems that ~30-35% of application time was spent in the fi_fetch_atomic section, and another ~50% was spent waiting on the completions from those atomic transactions. Of the ~30-35% spent in fi_fetch_atomic, ~80% (~24-28% of total time) was spent in the GNI library under the GNI_PostFMA calls. This implies that less than ~7% of application time in a tight loop is spent in provider code relevant to posting the transaction.

In a manner similar to SOS, the code waits for completions after each atomic transaction is posted. Approximately 65% of the time spent per transaction can be attributed to waiting for the completion. Of that 65%, approximately ~20% was spent in GNI code (~13% of total application time), which leaves approximately ~52% of application time consumed by provider code while waiting on completion events.

FWIW, the GNIX provider by itself appears to consume nearly 60% of application time in a busy-spin environment, with ~37-41% of time consumed by the lower-layer driver (GNI).
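For reference, the profiled hot loop has roughly the following shape (a minimal sketch with placeholder names and assumed setup, not the actual test code; MR descriptors and error handling are omitted):

```c
/*
 * Sketch only: one fi_fetch_atomic followed by a busy spin on fi_cq_read
 * for its completion. Endpoint/CQ/MR setup is assumed to have happened
 * elsewhere; all names here are placeholders.
 */
#include <rdma/fabric.h>
#include <rdma/fi_atomic.h>
#include <rdma/fi_eq.h>

static void fetch_add_loop(struct fid_ep *ep, struct fid_cq *cq,
                           fi_addr_t dest, uint64_t remote_addr,
                           uint64_t rkey, long loops)
{
    uint64_t operand = 1, result;
    struct fi_cq_entry entry;
    ssize_t ret;

    for (long i = 0; i < loops; i++) {
        /* Post a fetching atomic add (FI_SUM on a 64-bit integer). */
        do {
            ret = fi_fetch_atomic(ep, &operand, 1, NULL, &result, NULL,
                                  dest, remote_addr, rkey,
                                  FI_UINT64, FI_SUM, NULL);
        } while (ret == -FI_EAGAIN);

        /* Busy spin until the completion for this transaction arrives;
         * this is where most of the time shows up in the profile. */
        do {
            ret = fi_cq_read(cq, &entry, 1);
        } while (ret == -FI_EAGAIN);
    }
}
```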

I'll modify the test to utilize the counter system in a similar manner as the SOS test for quick comparison.

kseager commented 7 years ago

Interesting. Thanks for the update.

jswaro commented 7 years ago

There is some performance hit to using counters as compared with a busy spin on fi_cq_read. I'm seeing an ~8-10% difference in run time.

jswaro commented 7 years ago

Posting my branch here for Howard and Sung. I might be bouncing between issues.

https://github.com/jswaro/cray-tests/tree/test/profiling-atomics

jswaro commented 7 years ago

@kseager

One issue you might be running into is memory consumption and allocation time. Unless you are suppressing the generation of completion events, you may be running into issues where you never consume CQ entries, which causes the gni provider to continually allocate resources via malloc. Since you are more familiar with your code than I am, could you comment whether you consume the CQ entries, or if you suppress them?
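For illustration, consuming the entries would amount to something like the following minimal drain loop (a sketch with placeholder names, not code from either project):

```c
/*
 * Sketch only: if completion events are not suppressed, something like
 * this has to run periodically, or the provider keeps allocating event
 * storage for the unread entries.
 */
#include <rdma/fi_eq.h>

static void drain_cq(struct fid_cq *cq)
{
    struct fi_cq_entry entries[16];
    ssize_t n;

    /* Reap everything currently queued; -FI_EAGAIN means the CQ is empty. */
    while ((n = fi_cq_read(cq, entries, 16)) > 0) {
        /* Entries are discarded here; a real consumer would inspect them. */
    }
}
```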

kseager commented 7 years ago

On our cntr_epfd, the endpoint we use to issue atomic operations, the queue is bound with selective completion. Thus it should not generate entries for successful completions, since we do not set the FI_COMPLETION flag on our atomic operations. It shares this queue with our other stx/"shared" ep, which issues CQ completions by default; however, this shouldn't change the behavior.

So yes, we should be suppressing CQ events for the atomic operations.
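Roughly, the binding looks like this (a simplified sketch with placeholder names, not the actual SOS code):

```c
/*
 * Sketch only: the CQ is bound with FI_SELECTIVE_COMPLETION, so only
 * operations posted with FI_COMPLETION generate CQ entries; atomics
 * issued without that flag are suppressed, while the bound counter
 * still counts their completions for quiet.
 */
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_domain.h>

static int bind_selective(struct fid_ep *cntr_ep, struct fid_cq *cq,
                          struct fid_cntr *cntr)
{
    int ret;

    /* CQ bound for transmit completions, but only when FI_COMPLETION is
     * set on the individual operation. */
    ret = fi_ep_bind(cntr_ep, &cq->fid, FI_TRANSMIT | FI_SELECTIVE_COMPLETION);
    if (ret)
        return ret;

    /* Counter still counts every completed write/fetch atomic, so quiet
     * can wait on it without consuming CQ entries. */
    return fi_ep_bind(cntr_ep, &cntr->fid, FI_WRITE | FI_READ);
}
```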

jswaro commented 7 years ago

Ok, that eliminates that possibility. Thanks Kayla.

jswaro commented 7 years ago

Kayla,

I’ve run the numbers from SHMEM-OFI-uGNI on an internal machine, but my OPS/sec are not even close. Can you confirm that you are/were seeing 5.87 million OPS/sec, or was the value of 5.87 measured in MBps?

kseager commented 7 years ago

Could you be more specific about which atomic operation (op x size)? That said, I did a CORI run this morning, and I'd say that must've been MBps.

jswaro commented 7 years ago

I was referring to the Atomic add INT numbers from your prior email.

jswaro commented 7 years ago

Running pure uGNI seems to give me approximately 2.5M OPS/sec, around ~10 MBps. This differs greatly from our numbers, which are about 1.5M OPS/sec and ~6 MBps.

kseager commented 7 years ago

I agree, that's the range I'm seeing for our numbers. Did you compare this to OFI-uGNI? Which test are you using for pure uGNI, and is it using a queue for completion? I'm just trying to see if the two tests are asking for completion at similar intervals; the SOS perf add_int should issue non-blocking operations (via inject) for a "window_size" amount, then do a quiet, which waits on the associated counter.
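Roughly, that add_int pattern looks like the following (a simplified sketch with placeholder names, not the actual SOS source):

```c
/*
 * Sketch only: issue window_size non-fetching atomic adds via inject,
 * then quiet by spinning until the write counter catches up to the
 * number of operations issued.
 */
#include <stdint.h>
#include <rdma/fi_atomic.h>
#include <rdma/fi_domain.h>

static void window_add_int(struct fid_ep *ep, struct fid_cntr *put_cntr,
                           fi_addr_t dest, uint64_t remote_addr,
                           uint64_t rkey, int window_size, uint64_t *issued)
{
    int32_t one = 1;
    ssize_t ret;

    for (int i = 0; i < window_size; i++) {
        do {
            /* Inject: no completion entry, data buffered at post time. */
            ret = fi_inject_atomic(ep, &one, 1, dest, remote_addr, rkey,
                                   FI_INT32, FI_SUM);
        } while (ret == -FI_EAGAIN);
        (*issued)++;
    }

    /* Quiet: spin until the counter reaches the number of issued ops. */
    while (fi_cntr_read(put_cntr) < *issued)
        ;
}
```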

jswaro commented 7 years ago

I have been attempting a comparison with OFI-uGNI and pure uGNI. I am using an internal test for pure uGNI, and it uses a queue for completions. Both implementations are window based.

sungeunchoi commented 7 years ago

@kseager Do you know if this test still works with the head of ofi-cray master? I'm getting the following errors with fcfeeff:

[000] ERROR: transport_ofi.h:297 return code 0
aborting job:
Sandia OpenSHMEM exited in error
Warning: shutting down without a call to shmem_finalize()
[000] ERROR: transport_ofi.h:297 return code 0
aborting job:
Sandia OpenSHMEM exited in error
[001] ERROR: transport_ofi.h:297 return code 0
aborting job:
Sandia OpenSHMEM exited in error
[001] ERROR: transport_ofi.h:297 return code 0
aborting job:
Sandia OpenSHMEM exited in error

jdinan commented 7 years ago

It's surprising to see this in the log:

Warning: shutting down without a call to shmem_finalize()

This means that some process called shmem_init() (rather than start_pes()) and had not called finalize by the time it reached the atexit handler. I suppose this can happen if the process manager kills the PE in a way that allows the atexit handlers to still run. What process manager are you guys using?

Also, another note: the perf tests should always be built by make check. Adding --enable-lengthy-tests enables running them (rather than just building them) during the check. If you want to build the tests but not run them, run make check TESTS=.

sungeunchoi commented 7 years ago

FYI, @jswaro built this, so he can answer. FWIW, it works with the libfabric library he linked against, but the above happens when I change LD_LIBRARY_PATH to pick up my library.

jswaro commented 7 years ago

I did build with make check and --enable-lengthy-tests. It works with my version of the library, but I haven't updated in a while, so as to keep new changes coming down the pipe from affecting my measurements.

jswaro commented 7 years ago

ofi-cray/libfabric@4da597a5cb6941a981a2316136706cad3847989e was my last pull

kseager commented 7 years ago

@sungeunchoi Yes, I'm seeing this error on the head commit today, even on the "hello" unit test... It looks like it's coming from shmem_transport_fini calling quiet. Could it be a provider regression? The cntr_wait in quiet is returning a non-zero value in a case where there is nothing outstanding to complete... at least that is what I gather from poking around.

Additionally, are you seeing ERROR: runtime init failed: 7 when you configure with CFLAGS='-g' on CORI?

jswaro commented 7 years ago

On the bright side, we can bisect the git tree from my commit hash to head and find out where the regression was introduced.

jshimek commented 7 years ago

@kseager The non-zero status is probably due to the lack of a wait_object on the cntr; using fi_cntr_wait requires that the passed-in cntr have an attached wait_object. See https://ofiwg.github.io/libfabric/v1.4.0/man/fi_cntr.3.html.

Looking at https://github.com/Sandia-OpenSHMEM/SOS/blob/master/src/transport_ofi.h#L290, I don't see any checking of what the return code from fi_cntr_wait is, just that it is non-zero. If a cntr is passed in without a wait_object, the gni provider returns -FI_EINVAL, so there wouldn't be an error on the CQ, just a return code stating that the call had an invalid parameter. If you go back to before the fi_wait commit was put in, the gni provider just used a busy spin internally, whereas now it follows the man page.

If you wish to do a busy spin on the cntr, this would need to be changed to a loop on fi_cntr_read instead of using fi_cntr_wait. Since the fi_cntr_wait call is no longer a busy spin, the performance of using fi_cntr_wait will probably be different than before.
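For illustration, the two options look roughly like this (a sketch with placeholder names; not provider or SOS code):

```c
/*
 * Sketch only: either open the counter with a wait object so that
 * fi_cntr_wait is legal, or skip the wait object and busy-spin on
 * fi_cntr_read.
 */
#include <rdma/fi_domain.h>

/* Option 1: counter opened with a wait object; fi_cntr_wait is allowed. */
static int open_waitable_cntr(struct fid_domain *domain, struct fid_cntr **cntr)
{
    struct fi_cntr_attr attr = {
        .events   = FI_CNTR_EVENTS_COMP,
        .wait_obj = FI_WAIT_UNSPEC,   /* let the provider pick a wait object */
    };

    return fi_cntr_open(domain, &attr, cntr, NULL);
}

/* Option 2: no wait object; quiet becomes a busy spin on fi_cntr_read. */
static void cntr_spin_until(struct fid_cntr *cntr, uint64_t target)
{
    while (fi_cntr_read(cntr) < target)
        ;
}
```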

kseager commented 7 years ago

@jshimek thanks for the feedback. We will fix the cntr_wait usage on the SHMEM-OFI end.

kseager commented 7 years ago

The tests should work again now that PR Sandia-OpenSHMEM/SOS#290 is merged.

jswaro commented 7 years ago

I'll pull down the newest stuff and work from there. Thanks Kayla.

jswaro commented 7 years ago

I'm seeing a failure in one of the tests. Can you tell me which version of libfabric they tested the latest PR against?

PASS: shmem_latency_put_perf
PASS: shmem_latency_get_perf
PASS: shmem_bw_put_perf
PASS: shmem_bibw_put_perf
PASS: shmem_bw_get_perf
PASS: shmem_bibw_get_perf
PASS: shmem_latency_nb_put_perf
PASS: shmem_latency_nb_get_perf
PASS: shmem_bw_nb_put_perf
PASS: shmem_bibw_nb_put_perf
PASS: shmem_bw_nb_get_perf
PASS: shmem_bibw_nb_get_perf
PASS: shmem_bw_atomics_perf
FAIL: shmem_bibw_atomics_perf

kseager commented 7 years ago

That might've been from the binding bug I found a few days ago. Can you pull master again and let me know if that fixes it on your end?

jswaro commented 7 years ago

I just pulled it and the bibw_atomics is fixed, but now latency_put_perf fails.

libfabric: f1546f08c187530e6c9c7e6bcabbfdb5c3234e80 shmem: 388aba8e4ff4abe54479d2a231ac7c2eef77b648

jswaro commented 7 years ago

I haven't seen the failure since the first failure after recompiling.

jswaro commented 7 years ago

@kseager Can you attempt to get the benchmark to work without wait objects? A busy spin is best for performance. This would at least give us a chance to optimize the performance of the atomics without the wait object interference.

kseager commented 7 years ago

It might take me a while to get to this. Is there a mini atomic test over ofi-gni to investigate and turn knobs on? In terms of the benchmark, for the atomic writes you can avoid wait objects in SOS by using a hard-polling wait_until instead of the fence. The atomic_get style operations (swaps) have a wait implicitly within the call itself. I do want to eventually look at polling on a counter_read for fence/quiet; I have an issue logged for it on SOS. I'll keep you posted on my end.
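Roughly, the hard-polling alternative looks like this (a hypothetical sketch with placeholder names, not the actual benchmark code):

```c
/*
 * Sketch only: the target PE polls the destination word directly with
 * shmem_long_wait_until, so the initiator never needs a counter wait
 * object for this measurement.
 */
#include <shmem.h>

static long dest = 0;    /* symmetric destination word */

void initiator_side(int target_pe, long niters)
{
    for (long i = 0; i < niters; i++)
        shmem_long_inc(&dest, target_pe);    /* atomic increments to target */
}

void target_side(long niters)
{
    /* Hard poll until all increments have landed. */
    shmem_long_wait_until(&dest, SHMEM_CMP_GE, niters);
}
```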

jswaro commented 7 years ago

Sounds good. I'll keep looking at it from my end.

kseager commented 7 years ago

Actually, Jim just put up a PR so that you can avoid using wait objects: Sandia-OpenSHMEM/SOS#312. Just configure with --enable-completion-polling. Let us know how it goes on gni.

jswaro commented 7 years ago

Wonderful, thanks! I'll take a look at it and see what I can find.