lballabio / QuantLib

The QuantLib C++ library
http://quantlib.org
Other
5.23k stars 1.78k forks source link

Indirect leaks with AddressSanitizer #1251

Closed sweemer closed 2 years ago

sweemer commented 2 years ago

Out of curiosity I added -fsanitize=address -fno-omit-frame-pointer to linux-full-tests.yml and got a bunch of indirect leaks when running the unit tests.

I couldn't find any mention of using AddressSanitizer with QuantLib since 2013. Has anyone tried using it more recently than that?

As far as I can tell, these indirect leaks are probably caused by circular references between ext::shared_ptr instances, but there might be other causes as well. False alarms with AddressSanitizer should be quite rare, so I believe this is worth looking into further.

Also I think that it would be a good idea to run the unit tests with AddressSanitizer enabled - if not in the main tests then at least in a new github action that can be run manually.

Side question, but what is the purpose of compiling the unit tests with -g0? There shouldn't be any overhead of running the unit tests with -g and indeed it is beneficial to have at least some debug info available when using AddressSanitizer to get helpful stack traces, even if the code is compiled with -O2 or higher.

lballabio commented 2 years ago

Hi, the -g0 is to reduce the size of the compiled files and have more of them fit into the space allocated to ccache. We're trading that info for compilation speed.

As for AddressSanitizer, no, I don't think anybody tried that so far. Some of them might be circular references, as you say. I'm also seeing some in methods like QuantLib::Thirty360::implementation, where I can't really see what's wrong. May you try a build with -g and see if you get more information?

sweemer commented 2 years ago

I rebuilt quantlib-test-suite with -g instead of -g0 and got basically the same output. But I would have been surprised if it made a difference in the result of the tests besides better stack traces.

At first glance I don't see any immediate issue with Thirty360 or the other indirect leak sources. Maybe I will get some time later this week to step through the test cases and find the circular references or whatever else is causing it.

lballabio commented 2 years ago

Yes, better stack traces were all I was after 😄 . Thanks for the effort!

pcaspers commented 2 years ago

I checked the leak reported from

#15 0x1f7b883 in MarkovFunctionalTest::testBermudanSwaption() (/__w/QuantLib/QuantLib/test-suite/.libs/quantlib-test-suite+0x1f7b883)

by looking at the graph of observers / observables and see whether there is a cycle in this graph. There doesn't seem to be such a cycle though, so the leak must have a different origin.

pcaspers commented 2 years ago

Next I checked running the test case with valgrind:

peter@peter:/media/sf_peter.caspers/openxva/build.linux/QuantLib/test-suite$ valgrind ./quantlib-test-suite --run_test='*/*/QuantLib__detail__quantlib_test_case(&MarkovFunctionalTest__testBermudanSwaption)*'
==18351== Memcheck, a memory error detector
==18351== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==18351== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==18351== Command: ./quantlib-test-suite --run_test=*/*/QuantLib__detail__quantlib_test_case(&MarkovFunctionalTest__testBermudanSwaption)*
==18351== 
Running 1 test case...

*** No errors detected
==18351== 
==18351== HEAP SUMMARY:
==18351==     in use at exit: 0 bytes in 0 blocks
==18351==   total heap usage: 1,722,699 allocs, 1,722,699 frees, 132,384,918 bytes allocated
==18351== 
==18351== All heap blocks were freed -- no leaks are possible
==18351== 
==18351== For lists of detected and suppressed errors, rerun with: -s
==18351== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

I'll try to run the whole test suite under valgrind.

sweemer commented 2 years ago

Thanks for helping to have a look at this @pcaspers. I also ran testBermudanSwaption with valgrind and confirmed it showed no leaks for me either.

However if I run InflationZCIISInterpolationTest_asIndexInterpolated with valgrind I now see a direct leak - as opposed to merely an indirect leak with AddressSanitizer.

sweemer@tesla-t4-1:~/QuantLib$ valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose ./build/linux-gcc-unit-test-valgrind/test-suite/quantlib-test-suite --run_test='*/*/QuantLib__detail__quantlib_test_case(&InflationZCIISInterpolationTest__asIndexInterpolated)*'
[...]
==34771== 85,872 (64 direct, 85,808 indirect) bytes in 1 blocks are definitely lost in loss record 126 of 126
==34771==    at 0x4C3217F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==34771==    by 0x6344D98: QuantLib::ObservableValue<QuantLib::Date>::ObservableValue(QuantLib::Date const&) (observablevalue.hpp:71)
==34771==    by 0x63449C0: QuantLib::Settings::DateProxy::DateProxy() (settings.cpp:27)
==34771==    by 0x6344A4B: QuantLib::Settings::Settings() (settings.cpp:34)
==34771==    by 0x63172C: QuantLib::Singleton<QuantLib::Settings, std::integral_constant<bool, false> >::instance() (singleton.hpp:176)
==34771==    by 0x11519D3: (anonymous namespace)::configure(QuantLib::Date) (quantlibtestsuite.cpp:251)
==34771==    by 0x1151D79: init_unit_test_suite(int, char**) (quantlibtestsuite.cpp:316)
==34771==    by 0x1377C31: init_function() (main.cpp:12)
==34771==    by 0x7BC1E48: boost::unit_test::framework::impl::invoke_init_func(bool (*)()) (in /usr/local/lib/libboost_unit_test_framework.so.1.75.0)
==34771==    by 0x7BC142D: boost::detail::function::function_obj_invoker0<boost::detail::forward, int>::invoke(boost::detail::function::function_buffer&) (in /usr/local/lib/libboost_unit_test_framework.so.1.75.0)
==34771==    by 0x7BBFC84: boost::execution_monitor::catch_signals(boost::function<int ()> const&) (in /usr/local/lib/libboost_unit_test_framework.so.1.75.0)
==34771==    by 0x7BBFD03: boost::execution_monitor::execute(boost::function<int ()> const&) (in /usr/local/lib/libboost_unit_test_framework.so.1.75.0)
==34771==
==34771== LEAK SUMMARY:
==34771==    definitely lost: 64 bytes in 1 blocks
==34771==    indirectly lost: 85,808 bytes in 1,116 blocks
==34771==      possibly lost: 0 bytes in 0 blocks
==34771==    still reachable: 463 bytes in 7 blocks
==34771==         suppressed: 0 bytes in 0 blocks
==34771==
==34771== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
==34771== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

My best guess is that it comes from the static initialization of QuantLib::Singleton not playing nicely with Boost Test. I'm still digging into it.

pcaspers commented 2 years ago

I ran the whole test suite under valgrind yielding "no leaks possible". This is our version of ql though (based off 1.22). I'll do the same for the current head in master now.

sweemer commented 2 years ago

@pcaspers Which version of Boost are you using? While I was looking into this, I noticed that Boost Test has gone through a lot of changes and I suspect that the usage of the obsolete initialization function leaks in some versions but not in others.

@lballabio On a related note, I noticed that the minimum Boost version is 1.58.0 in CMakeLists.txt but 1.48.0 in acinclude.m4. Is this intentional, or should the minimum version be bumped up to 1.58.0 everywhere?

Also note that Boost Test v3 was released in 1.59.0 so we might also want to consider bumping the minimum version to 1.59.0 to require the "modern" test features.

pcaspers commented 2 years ago

I was using 1.74.0. However, on the full run vs master I did get a direct leak reported. I'll rerun with leak details turned on to see where it is caused.

pcaspers commented 2 years ago

here are the details of the run with --leak-check=full (as I said boost is 1.74.0)

==33988== 
==33988== HEAP SUMMARY:
==33988==     in use at exit: 493,695 bytes in 5,580 blocks
==33988==   total heap usage: 1,528,770,816 allocs, 1,528,765,236 frees, 351,327,097,975 bytes allocated
==33988== 
==33988== 493,232 (24 direct, 493,208 indirect) bytes in 1 blocks are definitely lost in loss record 608 of 608
==33988==    at 0x4843FB3: operator new(unsigned long) (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==33988==    by 0x520E5D: QuantLib::Observer::registerWith(boost::shared_ptr<QuantLib::Observable> const&) (in /home/peter/QuantLib/build.linux/test-suite/quantlib-test-suite)
==33988==    by 0x53E0C85: QuantLib::ZeroInflationIndex::ZeroInflationIndex(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, QuantLib::Region const&, bool, bool, QuantLib::Frequency, QuantLib::Period const&, QuantLib::Currency const&, QuantLib::Handle<QuantLib::ZeroInflationTermStructure>) (in /home/peter/QuantLib/build.linux/ql/libQuantLib.so.1.25.0)
==33988==    by 0x53E0F46: QuantLib::ZeroInflationIndex::clone(QuantLib::Handle<QuantLib::ZeroInflationTermStructure> const&) const (in /home/peter/QuantLib/build.linux/ql/libQuantLib.so.1.25.0)
==33988==    by 0x5988730: QuantLib::ZeroCouponInflationSwapHelper::setTermStructure(QuantLib::ZeroInflationTermStructure*) (in /home/peter/QuantLib/build.linux/ql/libQuantLib.so.1.25.0)
==33988==    by 0xB7908A: QuantLib::IterativeBootstrap<QuantLib::PiecewiseZeroInflationCurve<QuantLib::Linear, QuantLib::IterativeBootstrap, QuantLib::ZeroInflationTraits> >::calculate() const (in /home/peter/QuantLib/build.linux/test-suite/quantlib-test-suite)
==33988==    by 0x511A22: QuantLib::LazyObject::calculate() const (in /home/peter/QuantLib/build.linux/test-suite/quantlib-test-suite)
==33988==    by 0xBDF665: ZCIIS::makeZeroInflationCurve(ZCIIS::Setup const&, boost::shared_ptr<QuantLib::ZeroInflationIndex> const&, std::vector<boost::shared_ptr<QuantLib::BootstrapHelper<QuantLib::ZeroInflationTermStructure> >, std::allocator<boost::shared_ptr<QuantLib::BootstrapHelper<QuantLib::ZeroInflationTermStructure> > > > const&) (in /home/peter/QuantLib/build.linux/test-suite/quantlib-test-suite)
==33988==    by 0xBDF940: ZCIIS::makeResult(ZCIIS::Setup const&) (in /home/peter/QuantLib/build.linux/test-suite/quantlib-test-suite)
==33988==    by 0xBE02F9: ZCIIS::runTest(ZCIIS::Setup const&) (in /home/peter/QuantLib/build.linux/test-suite/quantlib-test-suite)
==33988==    by 0xBE050A: InflationZCIISInterpolationTest::asIndexInterpolated() (in /home/peter/QuantLib/build.linux/test-suite/quantlib-test-suite)
==33988==    by 0x51C391: QuantLib::detail::quantlib_test_case::operator()() const (in /home/peter/QuantLib/build.linux/test-suite/quantlib-test-suite)
==33988== 
==33988== LEAK SUMMARY:
==33988==    definitely lost: 24 bytes in 1 blocks
==33988==    indirectly lost: 493,208 bytes in 5,572 blocks
==33988==      possibly lost: 0 bytes in 0 blocks
==33988==    still reachable: 463 bytes in 7 blocks
==33988==         suppressed: 0 bytes in 0 blocks
==33988== Reachable blocks (those to which a pointer was found) are not shown.
==33988== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==33988== 
==33988== For lists of detected and suppressed errors, rerun with: -s
==33988== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
peter@peter:~/QuantLib/build.linux/test-suite$ 
sweemer commented 2 years ago

I finally tracked down the root cause of the leak, and it was indeed a circular reference in inflationzciisinterpolation.cpp. I've submitted the following one-line pull request #1265 to fix the issue.

hz.linkTo(ext::shared_ptr<ZeroInflationTermStructure>());

I don't think I ever would have figured this out if I didn't come across the comment in other inflation related tests such as inflation.cpp. Even after figuring it out, I have more questions than answers:

Adding @ralfkonrad as he appears to be the original author of inflationzciisinterpolation.cpp.

ralfkonrad commented 2 years ago

Thanks Jonathan for letting me know. I was expecting something like that as I regocnized my code in Peter's last sanitizer results already.

Let's answer your last question first:

  • Why wasn't this line included in inflationzciisinterpolation.cpp originally? Was it simply missed when copy-pasting similar code from other inflation tests?

Yes, it was simply missed.

Regarding the other ones: I cannot really answer them. I was working on #1050 which looks simple at first but became more difficult than I was expecting (see e.g. my last comment https://github.com/lballabio/QuantLib/pull/1051#issuecomment-933353191 and #1068).

When I deprecated bool ZeroInflationIndex::interpolated() const; I had to touch all zero inflation "stuff": the coupons, the zero coupon inflation swap, the zero inflation curve. And as far as I can remember I did not find a real loose end to start with as all depends on each other or rather on the value of ZeroInflationIndex::interpolated(). Which in the end means nothing else what you have found out: there is a circular reference in the inflation code which is "solved" by the empty Handle<ZeroInflationTermStructure>.

I do not think it is an established "pattern" for inflation code. It is somehow build in and I guess Luigi's issue #1050 was a first step to solve these problems.

sweemer commented 2 years ago

Thanks for the info @ralfkonrad. I wasn't following #1050 but seems like an important (and thorny) problem to solve indeed.

@lballabio Are you fine to merge #1265 to fix the leak in the test and keep AddressSanitizer running on linux-full-tests?

lballabio commented 2 years ago

I'm fine to merge the fix — I'm not sure where to run the AddressSanitizer. linux-full-tests doesn't run automatically for all PR. Is there a performance hit in compile or test time? If so, we might have a specific job for it and run it with the latest gcc and/or clang instead of adding it to all runs.

sweemer commented 2 years ago

Right, I'm aware that linux-full-tests isn't run automatically, which is why I think it might be a good place to stick the AddressSanitizer - you don't pay the price for it except when you want to test absolutely everything; and when you do, you want it to provide that extra level of validation. We could also create a new CMake preset for AddressSanitizer along with a new workflow for it if you prefer.

According to the AddressSanitizer wiki, the average runtime slowdown is about 2x. I haven't observed any significant impact on compilation times.

I just reran linux-full-tests on my branch and see a lot more indirect leaks, so I'll have to look into that again. But I think that merging the fix for the inflation test in my PR is a good start.