Crash during navitia::type::Data::clone_from when applying disruption

eturck commented 4 years ago

Hello everyone.

We have been using disruptions from chaos in production for a while. Our current install of navitia consists of, on the same server :

One kraken named kraken-prod, used for every schedules and journey requests for end users. It has 16 threads and has the most traffic (tens of requests per second).
One kraken named kraken-iv. It's only use is for traffic_report API and a few other calls. It has 4 threads and is less used, at most a few requests per second.

Until this week disruptions were only plugged on kraken-iv. We sometimes had a crash but it was not frequent (once per month). We talked about it with @kinnou02 once or twice but never found a solution. Yesterday, we plugged disruptions on kraken-prod and it quickly went wrong. We had one crash yesterday and 5 more between 7:17am and 8;50am today (it's peak hour for us if it has any impacts). However, the same exact disruption has been applied to kraken-iv at the exact same time, on the exact same server, and no crash on kraken-iv, probably excluding a problem on the disruption itself.

If you want the full stacktrace and syslog output : https://gist.github.com/eturck/afd981ed550432f53aaf76a8b9eb0df5

What I see :

Crash in syslog talk about : tcmalloc: large alloc 2314885530818461696. I don't know how we are trying to allocate that much RAM really :). The only thing I saw about this value is that in hex it's 0x2020202020204000. It's always this exact same value when kraken crashes. I don't know if we are really trying to allocate this much RAM or it's just a display glitch.
According to the stacktrace it seems to happen during ValidityPattern (de)serialization.

What I know :

I am unable to reproduce this on an integration server or a local virtual machine. I think it's maybe related to the current load of kraken? I will try to test it soon by launching a lot of requests and applying disruption at the same time.
Both kraken are in the exact same version, actually using the same binary. Kraken-iv has slightly more datas (kraken-iv has school lines and a few shuttles added compare to kraken-prod).
We already had this crash in v2.67 and now in v2.87. We are on Debian 9.6 using boost 1.62.
In the past when talking with @kinnou02 he told me he couldn't find any trace of this crash on your instances loading disruptions.
The server has 16 vCPU 16GB of RAM. kraken-prod uses 1.2GB of RAM. We had more than 4GB of free RAM during crashes.

I am not super hopeful that you can help me on this issue. But let me know if you have any ideas or something similar on kraken or another project already happened to you and you have some inputs to share.

Thank you in advance.

nonifier commented 4 years ago

Hi @eturck,

Sorry to hear about all the crashes, it's quite embarrassing to be honest.

By having a look at the call stack you provided, it reminds me of some memory corruption problem we've had in the past, working on some optimisations in georef. We were messing up the data but could only see the problem when the data was used during serialization.

The root cause and the actual crash were completely unrelated...

This brings me to a few questions:

Are you using the exact version of Navitia without any modification ?
Have you tried to run your kraken with any of the Address/Memory Sanatizer - https://github.com/google/sanitizers ? (we should really take some time to integrate those tools to our ci though...)
Do you have the same crashing rate other versions ?

Funny enough, when we last saw that type of crash on our platform, we had to bisect on our production servers to root cause the issue.... Not a glorious time at all :(

Sorry I couldn't help you much, but I hope this could give you a few leads. Olivier.

nonifier commented 4 years ago

I've asked the team to retrieve the version that had the issue I mentioned... Hopefully this might be 2.87 🤞

xlqian commented 4 years ago

We have had a random crash on our prod caused by a non initialized int in A*. here is the PR which fixed it: https://github.com/CanalTP/navitia/pull/2846/

Note that, this PR has been merged into v2.82.

As @nonifier has suggested, if you're short of ideas about what's happening in the code, you may have to bisect to find which the very PR caused the crash. Let's us know

eturck commented 4 years ago

Sorry for the delay I was away the last few days.

@nonifier :

We are using release v2.87 https://github.com/CanalTP/navitia/releases/tag/v2.87.0 with this tiny patch from v3.2.0 : #3039. Before that we were using release v2.69 without any patches. However we have custom modules and a custom scenario in jormungandr (nothing too crazy in term of kraken requests I think).
We did not try many things since we really had no idea what to do :).
We already had crashes on v2.69 with approximately the same rate on the instance not crashing much. No idea for the other one since we just activated them. On our test instance we never had any crashes...

@xlqian since this was merged in v2.82 we have this fix so it is something else :(.

I guess we are going to need to resort to bisect. Problems are :

we currently don't know how to reproduce this crash... And since we only have one kraken in production it's not a possibility to play with this one.
the only version that was OK was the v.2.34 we had before but it was running on Debian 7 with boost 1.55 so there is another layer of complexity to identify the problem.

Our main idea is trying to try to launch gatling or something like this and updating disruptions at the same time to see if we can make this happen semi-reliably to make a git bisect worthwhile.

pbougue commented 4 years ago

I'd suggest plug a local memory-sanitized version of navitia on your prod disruption-queue, and request it a bit on the side... The problem would be the noise in the sanitize-log (probably getting unrelated warnings and maybe errors) but that's the lead I would try.

eturck commented 4 years ago

I'd suggest plug a local memory-sanitized version of navitia on your prod disruption-queue, and request it a bit on the side... The problem would be the noise in the sanitize-log (probably getting unrelated warnings and maybe errors) but that's the lead I would try.

Thanks @pbougue for the suggestion. I don't know much about memory sanitizers but I'll try it out and see if I can find something.

pbougue commented 4 years ago

Welcome, that's a grouped effort here :) Maybe have a look at https://github.com/CanalTP/navitia/search?q=sanitize&type=Issues you could find some leads to ease the setup.

eturck commented 4 years ago

Well, I finally dived into Sanitizers, but I'm not sure of if I found something or not... I started by building a local navitia on the same version (2.87) plugged-in on our test Chaos server. I mainly tried to have journey requests at the time of the data cloning.

The potentially good news is that I managed to have 3 separate crashes while running Kraken compiled with AdressSanitizer during cloning of the data. The sad part is I managed to do it only twice in the span of half an hour at the start, but not once in the five or six hours after that trying... This morning I manage to do it a third time (also at the start of the day), not once since either.

I don't know if I did something wrong, but there is basically no informations whatsoever though... Here is the gist of it: https://gist.github.com/eturck/2ee7c7107f9eee98c362f7700f5ef5ca I fear that I did something wrong since @pbougue warned me about potentially lots of noise. The first stracktrace doesn't really match what we had in production even if it ends in libboost_serialization, the second one seems to match a bit more, the third crash was the same as the second one. However the end result is different, stacktrace 1 is a segfault and stacktrace 2 an std::invalid_argument. In production we always have sigabrt for std::bad_alloc.

I also ran ThreadSanitizer since number of requests seemed to play a role in our production crash. I have a lot of data races even between to standard threads so I don't know what is a false positive or not. But I saved the result for the data race between a journey request and the thread when data are copied : https://gist.github.com/eturck/ee0dfcfba885f3d3f6cf710981fba3b4 There is a few "SUMMARY: ThreadSanitizer: data race (/srv/navitia/kraken/kraken+0xcd7c20) in memcpy"

WARNING: ThreadSanitizer: data race (pid=18493)
  Read of size 8 at 0x7dc40003c000 by thread T3 (mutexes: write M44502):
    #0 memcpy <null> (kraken+0x000000cd7c20)
    #1 threadbuf::sync() <null> (kraken+0x0000013549a8)
    #2 boost::archive::basic_binary_iprimitive<boost::archive::binary_iarchive, char, std::char_traits<char> >::basic_binary_iprimitive(std::basic_streambuf<char, std::char_traits<char> >&, bool) <null> (kraken+0x000000fa5c95)
    #3 boost::archive::binary_iarchive_impl<boost::archive::binary_iarchive, char, std::char_traits<char> >::binary_iarchive_impl(std::istream&, unsigned int) <null> (kraken+0x000000fa58fb)
    #4 navitia::type::Data::clone_from(navitia::type::Data const&) <null> (kraken+0x000000f9d3ce)
    #5 DataManager<navitia::type::Data>::get_data_clone() <null> (kraken+0x000000dadcda)
    #6 navitia::MaintenanceWorker::handle_rt_in_batch(std::vector<boost::shared_ptr<AmqpClient::Envelope>, std::allocator<boost::shared_ptr<AmqpClient::Envelope> > > const&) <null> (kraken+0x000000da7dd3)
    #7 navitia::MaintenanceWorker::listen_rabbitmq() <null> (kraken+0x000000da9a2b)
    #8 navitia::MaintenanceWorker::operator()() <null> (kraken+0x000000da921b)
    #9 boost::detail::thread_data<navitia::MaintenanceWorker>::run() <null> (kraken+0x000000d8b6ec)
    #10 boost::detail::make_ready_at_thread_exit(boost::shared_ptr<boost::detail::shared_state_base>) <null> (libboost_thread.so.1.62.0+0x000000012115)

  Previous write of size 8 at 0x7dc40003c000 by thread T6:
    #0 memcpy <null> (kraken+0x000000cd7c20)
    #1 std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long) <null> (libstdc++.so.6+0x00000011c52a)
    #2 boost::archive::basic_binary_oarchive<boost::archive::binary_oarchive>::init() <null> (kraken+0x000001014aee)
    #3 boost::archive::binary_oarchive_impl<boost::archive::binary_oarchive, char, std::char_traits<char> >::binary_oarchive_impl(std::ostream&, unsigned int) <null> (kraken+0x0000010145e8)
    #4 std::thread::_State_impl<std::_Bind_simple<navitia::type::Data::clone_from(navitia::type::Data const&)::$_3 ()> >::_M_run() /home/tisseo/navitia/source/type/data.cpp (kraken+0x000000f9db66)
    #5 std::error_code::default_error_condition() const <null> (libstdc++.so.6+0x0000000b9e6e)

I guess it could be related? But I fear we are way (way) passed my knowledges of multithreaded applications... I'm more than ready to dive a bit more into this but I need some guidance. Do you feel there is something that needs to be investigated there or not?

I'll be on Element in the afternoon to talk about all of this.

Thank you for your help.

pbench commented 4 years ago

Hi @eturck !

Thanks for you detailed investigation !

Some thoughts about this : As @kinnou02 mentionned on Element, this may be related to the data clone trick. Indeed, Data::clone_from appears in the stack traces you obtained. The data clone trick seems to be adapted from a stackoverflow answer.

Some differences between the stackoverflow answer and navitia code :

in stackoverflow there is three threads : the main one, a writer and a reader. Moreover, there is a sbuf.close(); between the end of the write and the end of the read.
in navitia, there is only two threads, the writer, and the main which acts as a reader. The sbuf.close(); has disappeared.

Reading the Navitia code, I don't really understand why it works :man (or I read it wrongly). For me :

we launch a write thread
we read from the main thread
when reading is done, we wait for the completion of the write thread

So for me, the main thread may run the reading operation to completion before the write thread has finished writing, which I think could result in the data races found by ThreadSanitizer.

Maybe you can try to modify navitia to bring it more in-line with the stackoverflow answer :

launch a read thread instead of the actual write one
launch the write from the main thread
apply the sbuf.close(); after the write operation
finally join on the read thread

Or even completely reproduce the stackoverflow answer with 3 threads.

eturck commented 4 years ago

Thanks for the feedback, I really appreciate the time you all took to take a look at this.

I can't say I really understand the code either. I had the same interrogations on the order of execution. I saw that @TeXitoi was the one responsible for this https://github.com/CanalTP/navitia/pull/632/files#diff-930d109fe1ca9c60e82996bda4dd2f8aR538 :), If he ever stumble upon this message and want to shed some light on those lines...

I was really hopeful with this data race but I unfortunately wasn't able to reproduce it a second time... I had a few others more or less related to data cloning, sometimes there was tens of data race on some totally unrelated parts too. Lots of noise but not once again this one.

I spend three days trying a few things to have a way to at least semi reliably reproduce this crash or data race. Even if it happened once every hour it would be frequent enough to maybe draw a conclusion on a fix, but right now there is just nothing and I really have no clear path to investigate more.

My current plan is:

Re-enable disruption in production. We had a few hardware changes, faster CPU and storage and I want to see if it makes any difference and confirm that there is still a problem before trying anything.
Try to run kraken in the exact same environment but with boost 1.55 instead of boost 1.62. Since the crash is always in boost_serialization and we didn't have the issue on debian 7 / boost 1.55 and you don't have it on debian 8 / boost 1.55 either... It's a shot in the dark but that is my best lead on this.
Cry a lot if nothing works 😭

pbench commented 4 years ago

To check if the data race comes from the this data clone trick, maybe it is possible to assign a low priority on the write thread, so that this write thread will be scheduled after the read in the main ?

pbench commented 4 years ago

Hey, @pascal-vs suggested an easy way to test whether the data clone trick is responsible for the crashes : add a sleep() in the write thread before the serializing to the archive. This should ensure that the read occurs before the write, and thus trigger some segfault.

eturck commented 3 years ago

Another update, probably the last one. But mostly positive ;). Well we finally had the time to try running navitia with boost 1.55 instead of boost 1.62 in production.

We started by re-enabling disruption on 1.62, there was still some crashes even with the hardware changes. We installed boost 1.55 beside the current 1.62, a kraken build with, build new nav files with ed2nav build in boost 1.55 and enabled disruptions march 18th. Since then we've loaded close to a thousand disruptions and... no crash whatsoever. It's impossible to say if it's 100% fixed, but we were crashing pretty quickly before (a few dozens disruptions before a crash at best). It will probably stay that way in production and we will start to take disruption into account in our journeys soon.

We tried looking for something in boost issue tracker or mailing list with @pbench but nothing was resembling this crash, I really don't know what could cause it. The next step would be probably to test this with a recent boost version, it will not be very practical to stay with boost 1.55 forever.

If you want to close this issue feel free to do it. There will probably be no changes on our side for quite a while on this.

Thank you everyone for your ideas and your help!

hove-io / navitia

Crash during navitia::type::Data::clone_from when applying disruption #3113