ornladios / ADIOS2

Next generation of ADIOS developed in the Exascale Computing Program
https://adios2.readthedocs.io/en/latest/index.html
Apache License 2.0
268 stars 125 forks source link

SSTReader/Writer: Memory leaks #2560

Open ax3l opened 3 years ago

ax3l commented 3 years ago

Describe the bug

Using the SSTReader/Writer, we (cc @franzpoeschel) see memory leaks in openPMD-api in:

leak:adios2::core::engine::SstReader::*
leak:adios2::core::engine::SstWriter::* 

To Reproduce See: https://github.com/openPMD/openPMD-api/pull/570

Expected behavior SST should be covered by the UBSan/ASan checks in ADIOS2 CI :)

Desktop (please complete the following information):

Additional context See: https://github.com/openPMD/openPMD-api/pull/570

eisenhauer commented 3 years ago

SstReader/Writer are covered by UBSan/Asan in ADIOS2 CI. Maybe you're using it in a way our tests don't exercise? Can you point me at ASan output somewhere?

franzpoeschel commented 3 years ago

I've observed these issues here.

eisenhauer commented 3 years ago

Hmm. Stuff under CMListen() is out of EVPath, a GaTech library that we use for network support. (I'm the maintainer there, so the only issue with that is that it increases the awkwardness of trying fixes, they have to be integrated with the external library, then pulled into ADIOS, then tried in your situation.)

Here's what I know about the memory usage under load_transports(): In order to optimize repeated usage, it maintains a cache of data with a pointer in static memory and which is designed to persist for the duration of the process. Originally that cache was designed to be free'd via an onexit/atexit() handler, but we found that this handler was responsible for unpredictable segmentation faults during otherwise-normal process exits. Since the memory in question was not really an unintended leak, but just in-use-at-exit, the atexit() routine was disabled as a simple solution. This situation does not, apparently, result in sanitizer errors in ADIOS CI.

We could try re-enabling the atexit() handler, but given past faults I'm not sure cleaning up a few bytes right before exiting is worth the risk of failures in normal execution. So I'd be tempted to just add no_sanitize("address") tags where appropriate to suppress the errors on our end so that you don't have to. But that's going to be a pain to get right because of the multi-step process involved. Given that we're doing ASAN, and the in-use-at-exit memory you're seeing happens in all of our tests to, I'm curious why you're getting complaints and we're not. (If we were getting them too, I could at least sort things out more easily.). Maybe @chuckatkins, can answer?