openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Potential memory leak coming from ucs_mpool_hugetlb_malloc during ucp_atomic_op_nbx #6439

Open · krakowski opened this issue 3 years ago

krakowski commented 3 years ago

Describe the bug

I use Oracle's Project Panama (which will replace JNI in the near future) to generate Java bindings for ucp. Right now I am in the process of benchmarking my implementation using JMH. Sending messages using ucp_tag_send_nbx and ucp_stream_send_nbx, as well as accessing remote memory using ucp_put_nbx and ucp_get_nbx, works and gives decent results. Continuously calling ucp_atomic_op_nbx, on the other hand, fills the local RAM within a matter of seconds and therefore crashes the process.

I used jemalloc to see if I could identify which function is responsible for the huge amount of allocated memory and found that ucs_mpool_hugetlb_malloc could be it.

Here is the top of the report jemalloc generated (before the process was killed).

Total: 62596688095 B
60482653466  96.6%  96.6% 60482653466  96.6% ucs_mpool_hugetlb_malloc
1337982976   2.1%  98.8% 1337982976   2.1% uct_mem_alloc
601234012   1.0%  99.7% 601234012   1.0% AllocateHeap@3ad0d0 (inline)
71788578   0.1%  99.8% 71788578   0.1% MLX5_1.0
66201113   0.1%  99.9% 66201113   0.1% ChunkPool::allocate
 4552007   0.0%  99.9%  4683591   0.0% Arena::Arena (inline)
 4552007   0.0% 100.0% 550811466   0.9% ucp_worker_create
 3733020   0.0% 100.0%  3733020   0.0% AllocateHeap@3ad070
 3053808   0.0% 100.0%  3053808   0.0% _verbs_init_and_alloc_context
 2928874   0.0% 100.0% 74717453   0.1% mlx5dv_open_device
 2783848   0.0% 100.0%  2783848   0.0% init
 2525931   0.0% 100.0%  2525931   0.0% uct_rc_iface_fc_handler
 1938982   0.0% 100.0%  1938982   0.0% _GLOBAL__sub_I_eh_alloc.cc
 1678786   0.0% 100.0% 10888470   0.0% uct_rc_verbs_iface_post_recv_always
 1629517   0.0% 100.0%  1629517   0.0% inflatePrime
 1533780   0.0% 100.0%  1533780   0.0% ucs_twheel_init
 1498845   0.0% 100.0% 11188275   0.0% uct_ud_verbs_qp_max_send_sge
 1487317   0.0% 100.0%  1487317   0.0% Unsafe_AllocateMemory0
 1487317   0.0% 100.0%  1487317   0.0% numa_node_to_cpus
 1443905   0.0% 100.0%  1443905   0.0% allocate_dtv

If you are interested in the Java code, here are the relevant parts called during the Benchmark.

  1. LatencyBenchmark#atomicAdd
  2. AtomicIntegerContext#blockingAtomicAdd
  3. BenchmarkConnection#blockingAtomicAdd32
  4. Endpoint#atomic

All parameters passed to ucp_tag_send_nbx are allocated exactly once during the benchmark.

Steps to Reproduce

Perform the following steps in a loop (a minimal C sketch follows the list):

  1. Call ucp_atomic_op_nbx with UCP_ATOMIC_OP_ADD
  2. Wait for the request to finish by polling ucp_request_check_status
  3. Release the request with ucp_request_free
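
Here is a minimal C sketch of that loop, assuming an already-created ucp_worker_h/ucp_ep_h and a remote address with an unpacked ucp_rkey_h (not shown); blocking_atomic_add32 is just an illustrative helper name mirroring the Java code above:

```c
#include <ucp/api/ucp.h>
#include <stdint.h>

/* Illustrative helper: issue one 32-bit atomic ADD and wait for completion.
 * ep/worker setup and the remote_addr/rkey exchange are assumed to have
 * happened elsewhere. */
static ucs_status_t blocking_atomic_add32(ucp_worker_h worker, ucp_ep_h ep,
                                          uint32_t *operand,
                                          uint64_t remote_addr, ucp_rkey_h rkey)
{
    /* Step 1: issue the atomic ADD (the operand datatype must be given) */
    ucp_request_param_t param = {
        .op_attr_mask = UCP_OP_ATTR_FIELD_DATATYPE,
        .datatype     = ucp_dt_make_contig(sizeof(uint32_t))
    };
    ucs_status_ptr_t req = ucp_atomic_op_nbx(ep, UCP_ATOMIC_OP_ADD, operand, 1,
                                             remote_addr, rkey, &param);

    if (req == NULL) {
        return UCS_OK;              /* completed immediately, nothing to free */
    }
    if (UCS_PTR_IS_ERR(req)) {
        return UCS_PTR_STATUS(req); /* failed immediately, nothing to free */
    }

    /* Step 2: poll until the request completes */
    ucs_status_t status;
    while ((status = ucp_request_check_status(req)) == UCS_INPROGRESS) {
        ucp_worker_progress(worker);
    }

    /* Step 3: release the request */
    ucp_request_free(req);
    return status;
}
```

Note that a NULL return means the operation completed in place and there is no request object to release.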

Operating System

Packages

Hardware

Additional information

yosefe commented 3 years ago

@krakowski is it possible there are many unexpected tagged messages which are not being received? It's possible to check this by probing with tag_mask=0: call it repeatedly to drain the unexpected queue.
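
For illustration, a sketch of such a check under the assumption that draining (rather than just inspecting) the queue is acceptable; drain_unexpected is a made-up helper that uses ucp_tag_probe_nb with remove=1 followed by ucp_tag_msg_recv_nbx:

```c
#include <ucp/api/ucp.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative helper: drain the worker's unexpected-message queue.
 * tag_mask = 0 matches any tag; remove = 1 takes the matched message out of
 * the queue, so the next probe returns the next one. Each matched message
 * still has to be received with ucp_tag_msg_recv_nbx(). */
static size_t drain_unexpected(ucp_worker_h worker)
{
    ucp_tag_recv_info_t info;
    ucp_tag_message_h   msg;
    size_t              n = 0;

    while ((msg = ucp_tag_probe_nb(worker, 0, 0, 1, &info)) != NULL) {
        void                *buf   = malloc(info.length);
        ucp_request_param_t  param = { .op_attr_mask = 0 };
        ucs_status_ptr_t     req   = ucp_tag_msg_recv_nbx(worker, buf,
                                                          info.length, msg,
                                                          &param);

        if (UCS_PTR_IS_PTR(req)) {
            /* wait for the receive to finish, then release the request */
            while (ucp_request_check_status(req) == UCS_INPROGRESS) {
                ucp_worker_progress(worker);
            }
            ucp_request_free(req);
        }
        printf("drained unexpected message: tag=0x%lx length=%zu\n",
               (unsigned long)info.sender_tag, info.length);
        free(buf);
        n++;
    }
    return n;
}
```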

yosefe commented 3 years ago

Also, is it possible that some send/recv requests are not being released? Is there any warning about leaked objects during ucp_worker_destroy()?

krakowski commented 3 years ago

Hi @yosefe,

Running the benchmark for a short amount of time works, and no warnings about leaked resources are logged. Before running the benchmark I exchange some messages to set up the environment and synchronize, but everything runs in a single thread, so these calls should not be reached a second time once setup (connection establishment and exchange of remote keys) is finished.

Does ucp_atomic_op_nbx also send tagged messages I have to poll, or are there any other resources (apart from the request, if there is one) I have to release after this function call?

yosefe commented 3 years ago

ucp_atomic_op_nbx does not send a tagged message. Can you double-check that ucp_request_free is actually called the expected number of times? Another way: build UCX in debug mode (./contrib/configure-devel) and upload a log file produced with UCX_LOG_LEVEL=data. Then we can see the lifetime of all objects and why they are not destroyed.

krakowski commented 3 years ago

Just checked this. The pointers I get back from ucp_atomic_op_nbx are all UCS_OK (null pointers). From what I read, this means the request finished immediately. Do I have to call ucp_request_free on them, too? At the moment I check whether the returned value/pointer is within the range of a status code and only call ucp_request_free on it if it isn't.

Edit

Just tried it. Calling ucp_request_free on them results in a segfault.

yosefe commented 3 years ago

I think the issue is that too many requests are being queued internally. I will try to provide a fix which creates some backpressure (so at some point it will return a request instead of UCS_OK).

krakowski commented 3 years ago

I can check this by waiting some amount of time between calls to ucp_atomic_op_nbx and seeing if the memory consumption goes down. I will report my results soon.

Do you still need the log output with debug mode? I could try to build the RPM packages in debug mode and run my benchmark with UCX_LOG_LEVEL=data.

yosefe commented 3 years ago

> I can check this by waiting some amount of time between calls to ucp_atomic_op_nbx and seeing if the memory consumption goes down. I will report my results soon.
>
> Do you still need the log output with debug mode? I could try to build the RPM packages in debug mode and run my benchmark with UCX_LOG_LEVEL=data.

No need for now. As another experiment, I'd suggest trying a blocking {ucp_ep_flush_nbx + poll/wait} every 1k iterations or so.
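
A hedged sketch of such a blocking flush, assuming the same worker/endpoint handles as in the earlier sketch; flush_blocking is an illustrative name:

```c
#include <ucp/api/ucp.h>

/* Illustrative helper: block until all operations issued on the endpoint so
 * far have completed. Calling this every ~1000 ucp_atomic_op_nbx() calls
 * bounds the number of internally queued requests. */
static ucs_status_t flush_blocking(ucp_worker_h worker, ucp_ep_h ep)
{
    ucp_request_param_t param = { .op_attr_mask = 0 };
    ucs_status_ptr_t    req   = ucp_ep_flush_nbx(ep, &param);

    if (req == NULL) {
        return UCS_OK;              /* nothing was in flight */
    }
    if (UCS_PTR_IS_ERR(req)) {
        return UCS_PTR_STATUS(req);
    }

    ucs_status_t status;
    while ((status = ucp_request_check_status(req)) == UCS_INPROGRESS) {
        ucp_worker_progress(worker);
    }
    ucp_request_free(req);
    return status;
}
```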

krakowski commented 3 years ago

@yosefe Your suggestion is working! :)

Calling ucp_ep_flush_nbx and waiting for it to complete after every 1000 invocations of ucp_atomic_op_nbx keeps the memory consumption steady. I also got better performance results (~0.95 us per atomic add) presumably by preventing new memory allocations.

Since I am already talking to someone who knows how UCX works, I would like to ask another off-topic question, if that is OK. Is it normal that RDMA reads are a lot slower than RDMA writes? With a buffer size of 64 bytes I get ~0.2 us for writes, while reads take ~2.0 us (until ucp_request_check_status reports UCS_OK).

shamisp commented 3 years ago

Writes complete locally the moment the user data is copied to the device or to a local bounce buffer. For a read to complete (i.e., for the data to be delivered to the user) you have to wait for a full round trip.

krakowski commented 3 years ago

@shamisp Thanks for the explanation! :)

I already suspected something along those lines. Is it somehow possible to measure write latency on the initiator's side without adding extra overhead? Since the receiver is not aware of the RDMA write, I can only think of polling the memory region for changes and sending a message back once a change is detected. Unfortunately, that would add a considerable amount of overhead to the measurement.

shamisp commented 3 years ago

@krakowski We already have such a test implemented (for both the uct and ucp layers) in ucx_perftest. The UCT/UCP put latency test measures half-round-trip latency using a ping-pong protocol, which is similar to what you describe.

krakowski commented 3 years ago

@shamisp Thanks, I will take a look at it and try to port it. I should have mentioned that I am using a custom Java binding for UCX, which I want to benchmark. That is why I am looking for a way to reliably measure the latency :slightly_smiling_face:

shamisp commented 3 years ago

Adding @petro-rudenko who implemented the Java binding. Maybe he has some benchmarks to share.