Open eturck opened 4 years ago
Hi @eturck,
Sorry to hear about all the crashes, it's quite embarrassing to be honest.
By having a look at the call stack you provided, it reminds me of some memory corruption problem we've had in the past, working on some optimisations in georef. We were messing up the data but could only see the problem when the data was used during serialization.
The root cause and the actual crash were completely unrelated...
This brings me to a few questions:
Funny enough, when we last saw that type of crash on our platform, we had to bisect on our production servers to root cause the issue.... Not a glorious time at all :(
Sorry I couldn't help you much, but I hope this could give you a few leads. Olivier.
I've asked the team to retrieve the version that had the issue I mentioned... Hopefully this might be 2.87 🤞
We have had a random crash on our prod caused by a non initialized int in A*. here is the PR which fixed it: https://github.com/CanalTP/navitia/pull/2846/
Note that, this PR has been merged into v2.82.
As @nonifier has suggested, if you're short of ideas about what's happening in the code, you may have to bisect to find which the very PR caused the crash. Let's us know
Sorry for the delay I was away the last few days.
@nonifier :
@xlqian since this was merged in v2.82 we have this fix so it is something else :(.
I guess we are going to need to resort to bisect. Problems are :
Our main idea is trying to try to launch gatling or something like this and updating disruptions at the same time to see if we can make this happen semi-reliably to make a git bisect worthwhile.
I'd suggest plug a local memory-sanitized version of navitia on your prod disruption-queue, and request it a bit on the side... The problem would be the noise in the sanitize-log (probably getting unrelated warnings and maybe errors) but that's the lead I would try.
I'd suggest plug a local memory-sanitized version of navitia on your prod disruption-queue, and request it a bit on the side... The problem would be the noise in the sanitize-log (probably getting unrelated warnings and maybe errors) but that's the lead I would try.
Thanks @pbougue for the suggestion. I don't know much about memory sanitizers but I'll try it out and see if I can find something.
Welcome, that's a grouped effort here :) Maybe have a look at https://github.com/CanalTP/navitia/search?q=sanitize&type=Issues you could find some leads to ease the setup.
Well, I finally dived into Sanitizers, but I'm not sure of if I found something or not... I started by building a local navitia on the same version (2.87) plugged-in on our test Chaos server. I mainly tried to have journey requests at the time of the data cloning.
The potentially good news is that I managed to have 3 separate crashes while running Kraken compiled with AdressSanitizer during cloning of the data. The sad part is I managed to do it only twice in the span of half an hour at the start, but not once in the five or six hours after that trying... This morning I manage to do it a third time (also at the start of the day), not once since either.
I don't know if I did something wrong, but there is basically no informations whatsoever though... Here is the gist of it: https://gist.github.com/eturck/2ee7c7107f9eee98c362f7700f5ef5ca I fear that I did something wrong since @pbougue warned me about potentially lots of noise. The first stracktrace doesn't really match what we had in production even if it ends in libboost_serialization, the second one seems to match a bit more, the third crash was the same as the second one. However the end result is different, stacktrace 1 is a segfault and stacktrace 2 an std::invalid_argument. In production we always have sigabrt for std::bad_alloc.
I also ran ThreadSanitizer since number of requests seemed to play a role in our production crash. I have a lot of data races even between to standard threads so I don't know what is a false positive or not. But I saved the result for the data race between a journey request and the thread when data are copied : https://gist.github.com/eturck/ee0dfcfba885f3d3f6cf710981fba3b4 There is a few "SUMMARY: ThreadSanitizer: data race (/srv/navitia/kraken/kraken+0xcd7c20) in memcpy"
WARNING: ThreadSanitizer: data race (pid=18493)
Read of size 8 at 0x7dc40003c000 by thread T3 (mutexes: write M44502):
#0 memcpy <null> (kraken+0x000000cd7c20)
#1 threadbuf::sync() <null> (kraken+0x0000013549a8)
#2 boost::archive::basic_binary_iprimitive<boost::archive::binary_iarchive, char, std::char_traits<char> >::basic_binary_iprimitive(std::basic_streambuf<char, std::char_traits<char> >&, bool) <null> (kraken+0x000000fa5c95)
#3 boost::archive::binary_iarchive_impl<boost::archive::binary_iarchive, char, std::char_traits<char> >::binary_iarchive_impl(std::istream&, unsigned int) <null> (kraken+0x000000fa58fb)
#4 navitia::type::Data::clone_from(navitia::type::Data const&) <null> (kraken+0x000000f9d3ce)
#5 DataManager<navitia::type::Data>::get_data_clone() <null> (kraken+0x000000dadcda)
#6 navitia::MaintenanceWorker::handle_rt_in_batch(std::vector<boost::shared_ptr<AmqpClient::Envelope>, std::allocator<boost::shared_ptr<AmqpClient::Envelope> > > const&) <null> (kraken+0x000000da7dd3)
#7 navitia::MaintenanceWorker::listen_rabbitmq() <null> (kraken+0x000000da9a2b)
#8 navitia::MaintenanceWorker::operator()() <null> (kraken+0x000000da921b)
#9 boost::detail::thread_data<navitia::MaintenanceWorker>::run() <null> (kraken+0x000000d8b6ec)
#10 boost::detail::make_ready_at_thread_exit(boost::shared_ptr<boost::detail::shared_state_base>) <null> (libboost_thread.so.1.62.0+0x000000012115)
Previous write of size 8 at 0x7dc40003c000 by thread T6:
#0 memcpy <null> (kraken+0x000000cd7c20)
#1 std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long) <null> (libstdc++.so.6+0x00000011c52a)
#2 boost::archive::basic_binary_oarchive<boost::archive::binary_oarchive>::init() <null> (kraken+0x000001014aee)
#3 boost::archive::binary_oarchive_impl<boost::archive::binary_oarchive, char, std::char_traits<char> >::binary_oarchive_impl(std::ostream&, unsigned int) <null> (kraken+0x0000010145e8)
#4 std::thread::_State_impl<std::_Bind_simple<navitia::type::Data::clone_from(navitia::type::Data const&)::$_3 ()> >::_M_run() /home/tisseo/navitia/source/type/data.cpp (kraken+0x000000f9db66)
#5 std::error_code::default_error_condition() const <null> (libstdc++.so.6+0x0000000b9e6e)
I guess it could be related? But I fear we are way (way) passed my knowledges of multithreaded applications... I'm more than ready to dive a bit more into this but I need some guidance. Do you feel there is something that needs to be investigated there or not?
I'll be on Element in the afternoon to talk about all of this.
Thank you for your help.
Hi @eturck !
Thanks for you detailed investigation !
Some thoughts about this :
As @kinnou02 mentionned on Element, this may be related to the data clone trick.
Indeed, Data::clone_from
appears in the stack traces you obtained.
The data clone trick seems to be adapted from a stackoverflow answer.
Some differences between the stackoverflow answer and navitia code :
sbuf.close();
between the end of the write
and the end of the read
.sbuf.close();
has disappeared.Reading the Navitia code, I don't really understand why it works :man (or I read it wrongly). For me :
So for me, the main thread may run the reading operation to completion before the write thread has finished writing, which I think could result in the data races found by ThreadSanitizer.
Maybe you can try to modify navitia to bring it more in-line with the stackoverflow answer :
read
thread instead of the actual write
onesbuf.close();
after the write operationjoin
on the read threadOr even completely reproduce the stackoverflow answer with 3 threads.
Thanks for the feedback, I really appreciate the time you all took to take a look at this.
I can't say I really understand the code either. I had the same interrogations on the order of execution. I saw that @TeXitoi was the one responsible for this https://github.com/CanalTP/navitia/pull/632/files#diff-930d109fe1ca9c60e82996bda4dd2f8aR538 :), If he ever stumble upon this message and want to shed some light on those lines...
I was really hopeful with this data race but I unfortunately wasn't able to reproduce it a second time... I had a few others more or less related to data cloning, sometimes there was tens of data race on some totally unrelated parts too. Lots of noise but not once again this one.
I spend three days trying a few things to have a way to at least semi reliably reproduce this crash or data race. Even if it happened once every hour it would be frequent enough to maybe draw a conclusion on a fix, but right now there is just nothing and I really have no clear path to investigate more.
My current plan is:
To check if the data race comes from the this data clone trick, maybe it is possible to assign a low priority on the write thread, so that this write thread will be scheduled after the read in the main ?
Hey, @pascal-vs suggested an easy way to test whether the data clone trick is responsible for the crashes : add a sleep() in the write thread before the serializing to the archive. This should ensure that the read occurs before the write, and thus trigger some segfault.
Another update, probably the last one. But mostly positive ;). Well we finally had the time to try running navitia with boost 1.55 instead of boost 1.62 in production.
We started by re-enabling disruption on 1.62, there was still some crashes even with the hardware changes. We installed boost 1.55 beside the current 1.62, a kraken build with, build new nav files with ed2nav build in boost 1.55 and enabled disruptions march 18th. Since then we've loaded close to a thousand disruptions and... no crash whatsoever. It's impossible to say if it's 100% fixed, but we were crashing pretty quickly before (a few dozens disruptions before a crash at best). It will probably stay that way in production and we will start to take disruption into account in our journeys soon.
We tried looking for something in boost issue tracker or mailing list with @pbench but nothing was resembling this crash, I really don't know what could cause it. The next step would be probably to test this with a recent boost version, it will not be very practical to stay with boost 1.55 forever.
If you want to close this issue feel free to do it. There will probably be no changes on our side for quite a while on this.
Thank you everyone for your ideas and your help!
Hello everyone.
We have been using disruptions from chaos in production for a while. Our current install of navitia consists of, on the same server :
Until this week disruptions were only plugged on kraken-iv. We sometimes had a crash but it was not frequent (once per month). We talked about it with @kinnou02 once or twice but never found a solution. Yesterday, we plugged disruptions on kraken-prod and it quickly went wrong. We had one crash yesterday and 5 more between 7:17am and 8;50am today (it's peak hour for us if it has any impacts). However, the same exact disruption has been applied to kraken-iv at the exact same time, on the exact same server, and no crash on kraken-iv, probably excluding a problem on the disruption itself.
If you want the full stacktrace and syslog output : https://gist.github.com/eturck/afd981ed550432f53aaf76a8b9eb0df5
What I see :
tcmalloc: large alloc 2314885530818461696
. I don't know how we are trying to allocate that much RAM really :). The only thing I saw about this value is that in hex it's 0x2020202020204000. It's always this exact same value when kraken crashes. I don't know if we are really trying to allocate this much RAM or it's just a display glitch.What I know :
I am not super hopeful that you can help me on this issue. But let me know if you have any ideas or something similar on kraken or another project already happened to you and you have some inputs to share.
Thank you in advance.