ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

Memory Leak Detected in TCP Provider with unbound Event Queue after fi_shutdown #10545

Open piotrchmiel opened 1 week ago

piotrchmiel commented 1 week ago

Describe the bug
AddressSanitizer (via LeakSanitizer) reports a memory leak when operations are performed on an endpoint after it has been shut down with fi_shutdown. The issue occurs specifically with the TCP provider in RDM mode when no event queue (ep->util_ep.eq) is bound to the domain.

To Reproduce
Steps to reproduce the behavior:

  1. Use the TCP provider in RDM mode.
  2. Create an endpoint without binding an event queue to the domain (ep->util_ep.eq remains NULL).
  3. Perform fi_shutdown on the endpoint.
  4. Perform any additional operation on the endpoint after fi_shutdown (in the trace below, the leak is reached through fi_cq_readfrom).

Expected behavior
The memory allocated in xnet_ep_disable (specifically err_entry.err_data = mem_dup(err_data, err_data_size);) should be released properly so that no leak occurs.

Output
AddressSanitizer reports the following memory leak:

2024-11-14T12:56:28.7675977Z ==73885==ERROR: LeakSanitizer: detected memory leaks
2024-11-14T12:56:28.7676337Z 
2024-11-14T12:56:28.7676557Z Direct leak of 8 byte(s) in 1 object(s) allocated from:
2024-11-14T12:56:28.7677701Z     #0 0x55c2874a72c3 in malloc (/test/test+0x6d52c3) (BuildId: 10d8cef421d2609343e1feb371ea248a68039137)
2024-11-14T12:56:28.7679278Z     #1 0x7f2e491dea68 in mem_dup /test/third_party/libfabric/./include/ofi_mem.h:81:15
2024-11-14T12:56:28.7680923Z     #2 0x7f2e491de493 in xnet_ep_disable /test/third_party/libfabric/prov/tcp/src/xnet_ep.c:458:25
2024-11-14T12:56:28.7682293Z     #3 0x7f2e491d5819 in xnet_req_done /test/third_party/libfabric/prov/tcp/src/xnet_cm.c:209:2
2024-11-14T12:56:28.7683669Z     #4 0x7f2e491f30d5 in xnet_run_ep /test/third_party/libfabric/prov/tcp/src/xnet_progress.c:1468:3
2024-11-14T12:56:28.7685215Z     #5 0x7f2e491ee15a in xnet_handle_events /test/third_party/libfabric/prov/tcp/src/xnet_progress.c:1505:4
2024-11-14T12:56:28.7686681Z     #6 0x7f2e491edf8a in xnet_run_progress /test/third_party/libfabric/prov/tcp/src/xnet_progress.c:1562:3
2024-11-14T12:56:28.7688089Z     #7 0x7f2e491e96c6 in xnet_cq_progress /test/third_party/libfabric/prov/tcp/src/xnet_cq.c:84:2
2024-11-14T12:56:28.7689621Z     #8 0x7f2e49129be0 in ofi_cq_readfrom /test/third_party/libfabric/prov/util/src/util_cq.c:270:2
2024-11-14T12:56:28.7690989Z     #9 0x7f2e491e9d89 in xnet_cq_readfrom /test/third_party/libfabric/prov/tcp/src/xnet_cq.c:50:8
2024-11-14T12:56:28.7692541Z     #10 0x55c287ad9c7f in fi_cq_readfrom(fid_cq*, void*, unsigned long, unsigned long*) /test/third_party/libfabric/include/rdma/fi_eq.h:402:9

Environment:
  - OS: Ubuntu 22.04
  - Provider: TCP
  - Mode: RDM
  - Libfabric: 1.22.0

Additional context
The memory leak originates in xnet_ep_disable, at the line:
err_entry.err_data = mem_dup(err_data, err_data_size);
The issue occurs only when no event queue is bound to the domain (ep->util_ep.eq is NULL) and operations are performed on the endpoint after it has been shut down with fi_shutdown.
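
For illustration only, here is a minimal, self-contained sketch of the ownership rule the report asks for: a duplicated error buffer must either be handed to a queue that takes ownership or be freed on the spot when there is no queue. Every name below (sketch_err_entry, sketch_eq, sketch_dup, sketch_report_shutdown) is hypothetical and stands in for the real structures; this is not the libfabric code and not a proposed patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; these are NOT libfabric types. */
struct sketch_err_entry {
    void  *err_data;       /* heap copy of the error payload */
    size_t err_data_size;
};

struct sketch_eq {
    struct sketch_err_entry pending; /* one-slot queue, for brevity */
    int has_pending;
};

/* Counts live duplicated buffers so a leak is visible without a sanitizer. */
static int sketch_live_bufs;

static void *sketch_dup(const void *src, size_t size)
{
    void *dst = malloc(size);

    if (dst) {
        memcpy(dst, src, size);
        sketch_live_bufs++;
    }
    return dst;
}

static void sketch_free(void *buf)
{
    if (buf) {
        free(buf);
        sketch_live_bufs--;
    }
}

/*
 * Report a shutdown error. eq may be NULL (no event queue bound).
 * Returns 1 if the entry was queued (the queue now owns err_data),
 * 0 if it was dropped and freed here, -1 on allocation failure.
 */
static int sketch_report_shutdown(struct sketch_eq *eq,
                                  const void *err_data, size_t size)
{
    struct sketch_err_entry entry = {0};

    assert(err_data != NULL && size > 0);

    entry.err_data = sketch_dup(err_data, size);
    if (!entry.err_data)
        return -1;
    entry.err_data_size = size;

    if (!eq || eq->has_pending) {
        /* Nobody takes ownership of the copy, so release it here. */
        sketch_free(entry.err_data);
        return 0;
    }

    eq->pending = entry;   /* ownership moves to the queue */
    eq->has_pending = 1;
    return 1;
}
```

Calling sketch_report_shutdown(NULL, ...) leaves sketch_live_bufs at 0, which is the behavior the "Expected behavior" section describes. Dropping the sketch_free call in the no-queue branch reproduces the same kind of 8-byte direct leak when the payload is 8 bytes long.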

sydidelot commented 6 days ago

@piotrchmiel The memory that leaks corresponds to a FI_SHUTDOWN event added to the Event Queue after the endpoint shuts down. I'm not familiar with RDM, but I guess there is a bug where err_entry.err_data is not freed after the EQ event is consumed by RDM.
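
As a rough sketch of the consumer side described in this comment, the queue below owns each event's err_data copy until the event is read, frees it at that point, and also frees any unread events when it is drained. The names (demo_event, demo_eq, demo_eq_push, demo_eq_read, demo_eq_drain) are hypothetical and are not libfabric APIs; how the RDM code actually consumes the event is not shown in this thread, so treat this as an assumption about the intended lifetime.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_EQ_DEPTH 4

/* Hypothetical event and queue types; NOT libfabric structures. */
struct demo_event {
    void  *err_data;       /* heap copy owned by the queue until read */
    size_t err_data_size;
};

struct demo_eq {
    struct demo_event slots[DEMO_EQ_DEPTH];
    size_t head;           /* index of the oldest unread event */
    size_t count;          /* number of unread events */
};

/* Queue a copy of data. Returns 0 on success, -1 if full or out of memory. */
static int demo_eq_push(struct demo_eq *eq, const void *data, size_t size)
{
    struct demo_event *ev;

    assert(eq != NULL && data != NULL && size > 0);
    if (eq->count == DEMO_EQ_DEPTH)
        return -1;

    ev = &eq->slots[(eq->head + eq->count) % DEMO_EQ_DEPTH];
    ev->err_data = malloc(size);
    if (!ev->err_data)
        return -1;
    memcpy(ev->err_data, data, size);
    ev->err_data_size = size;
    eq->count++;
    return 0;
}

/*
 * Consume the oldest event: copy up to len bytes of its payload into buf,
 * then free the queue's copy. Returns bytes copied, or -1 if the queue is empty.
 */
static long demo_eq_read(struct demo_eq *eq, void *buf, size_t len)
{
    struct demo_event *ev;
    size_t n;

    assert(eq != NULL && buf != NULL);
    if (eq->count == 0)
        return -1;

    ev = &eq->slots[eq->head];
    n = ev->err_data_size < len ? ev->err_data_size : len;
    memcpy(buf, ev->err_data, n);

    free(ev->err_data);    /* the consumer is done, so the copy is released */
    ev->err_data = NULL;
    eq->head = (eq->head + 1) % DEMO_EQ_DEPTH;
    eq->count--;
    return (long)n;
}

/* Free every unread event, e.g. on close. Returns the number of events freed. */
static size_t demo_eq_drain(struct demo_eq *eq)
{
    size_t freed = 0;

    assert(eq != NULL);
    while (eq->count > 0) {
        struct demo_event *ev = &eq->slots[eq->head];

        free(ev->err_data);
        ev->err_data = NULL;
        eq->head = (eq->head + 1) % DEMO_EQ_DEPTH;
        eq->count--;
        freed++;
    }
    return freed;
}
```

With this layout there are two places where a payload can be released: when it is read and when the queue is drained. A leak like the one reported would correspond to a path that reaches neither.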