ut-osa / nightcore

Nightcore: Efficient and Scalable Serverless Computing for Latency-Sensitive, Interactive Microservices [ASPLOS '21]
Apache License 2.0

Gateway segfault when stressing the system #5

Open suraj44 opened 2 years ago

suraj44 commented 2 years ago

I have 4 machines each with 12 CPU cores and 64GB RAM. I deploy the Nightcore gateway on one of them and on each of the other three, I deploy an instance of the engine and a launcher for a hello-world function.

I have 3 other machines which act as clients and invoke the hello-world function by sending HTTP POST requests to the gateway. The segfault occurs only when there are a large number of client threads (10 or 14 client threads on each client machine). What happens is that in the middle of the experiment, the gateway returns the following error in the log:

When I first encountered the problem, the segfault happened at uv__count_bufs, but in my latest attempt to reproduce the error I got the following message from gdb:

Thread 3 "gateway" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7228700 (LWP 275299)]
__GI___libc_free (mem=0xffffffff00000000) at malloc.c:3102

and in the tail of the log, there was this message:

3102    malloc.c: No such file or directory.

Any ideas as to what causes this problem and how to resolve it?

Some more info that might be useful: When I deploy the engine and launcher on only 2 other machines (instead of 3), then this error does not show up regardless of the number of client threads I use to stress the system. The minWorkers and maxWorkers config parameters for the function are 20 and 80 respectively.

zhipeng-jia commented 2 years ago

This is an interesting issue. I cannot immediately think of which part of the code could segfault under heavy load.

A question: what is the (rough) aggregate QPS from the 3 client machines? I'll try to see if I can reproduce this problem on my machines.

suraj44 commented 2 years ago

When using 10 threads on each client machine, the aggregate QPS was about 17k. The segfault occurs once in a while with this many threads. If I increase the number of threads to 14 per client machine, it happens every time I rerun the experiment, about 10 to 20 seconds into each run.

Thanks for looking into this!

zhipeng-jia commented 2 years ago

Could you verify whether the segfault is caused by an rlimit, e.g., the maximum number of file descriptors?

DCsunset commented 1 year ago

Hi @zhipeng-jia, I encountered the same error. I tried increasing the number of max file descriptors (e.g. ulimit -n 102400) but the error still occurs. There are only two lines of the error message:

PC: @     0x55d4462a8180  (unknown)  uv__count_bufs
    @ ... and at least 1 more frames

It seems to be related to the uv library (libuv). Any ideas about what causes it?

Thanks

DCsunset commented 1 year ago

Hi, I enabled the debug mode and address sanitizer and I'm able to locate the bug now:

=================================================================
==3106988==ERROR: AddressSanitizer: heap-use-after-free on address 0x61b0000f51b0 at pc 0x563fc372d9b6 bp 0x7fc2f66fb340 sp 0x7fc2f66fb330
READ of size 8 at 0x61b0000f51b0 thread T3
    #0 0x563fc372d9b5 in faas::server::IOWorker::PipeWriteCallback(uv_write_s*, int) src/server/io_worker.cpp:223
    #1 0x563fc38575b0 in uv__write_callbacks (nightcore/bin/debug/gateway+0x28c5b0)
    #2 0x563fc38580af in uv__stream_io (nightcore/bin/debug/gateway+0x28d0af)
    #3 0x563fc38550ec in uv_run (nightcore/bin/debug/gateway+0x28a0ec)
    #4 0x563fc372a984 in faas::server::IOWorker::EventLoopThreadMain() src/server/io_worker.cpp:168
    #5 0x563fc374c37c in decltype (((*((declval<faas::server::IOWorker*&>)())).*((declval<void (faas::server::IOWorker::*&)()>)()))()) absl::lts_2020_02_25::base_internal::MemFunAndPtr::Invoke<void (faas::server::IOWorker::*&)(), faas::server::IOWorker*&>(void (faas::server::IOWorker::*&)(), faas::server::IOWorker*&) deps/out/include/absl/base/internal/invoke.h:105
    #6 0x563fc374966e in decltype (absl::lts_2020_02_25::base_internal::Invoker<void (faas::server::IOWorker::*&)(), faas::server::IOWorker*&>::type::Invoke((declval<void (faas::server::IOWorker::*&)()>)(), (declval<faas::server::IOWorker*&>)())) absl::lts_2020_02_25::base_internal::Invoke<void (faas::server::IOWorker::*&)(), faas::server::IOWorker*&>(void (faas::server::IOWorker::*&)(), faas::server::IOWorker*&) deps/out/include/absl/base/internal/invoke.h:180
    #7 0x563fc3742f2a in void absl::lts_2020_02_25::functional_internal::Apply<void, absl::lts_2020_02_25::container_internal::CompressedTuple<void (faas::server::IOWorker::*)(), faas::server::IOWorker*>&, 0ul, 1ul>(absl::lts_2020_02_25::container_internal::CompressedTuple<void (faas::server::IOWorker::*)(), faas::server::IOWorker*>&, absl::lts_2020_02_25::integer_sequence<unsigned long, 0ul, 1ul>) deps/out/include/absl/functional/internal/front_binder.h:36
    #8 0x563fc37397e7 in void absl::lts_2020_02_25::functional_internal::FrontBinder<void (faas::server::IOWorker::*)(), faas::server::IOWorker*>::operator()<, void>() & deps/out/include/absl/functional/internal/front_binder.h:56
    #9 0x563fc3734187 in std::_Function_handler<void (), absl::lts_2020_02_25::functional_internal::FrontBinder<void (faas::server::IOWorker::*)(), faas::server::IOWorker*> >::_M_invoke(std::_Any_data const&) /usr/include/c++/9/bits/std_function.h:300
    #10 0x563fc3731c07 in std::function<void ()>::operator()() const /usr/include/c++/9/bits/std_function.h:688
    #11 0x563fc3808558 in faas::base::Thread::Run() src/base/thread.cpp:41
    #12 0x563fc38086ae in faas::base::Thread::StartRoutine(void*) src/base/thread.cpp:90
    #13 0x7fc2fb274608 in start_thread /build/glibc-eX1tMB/glibc-2.31/nptl/pthread_create.c:477
    #14 0x7fc2fae47292 in __clone (/lib/x86_64-linux-gnu/libc.so.6+0x122292)

0x61b0000f51b0 is located 304 bytes inside of 1608-byte region [0x61b0000f5080,0x61b0000f56c8)
freed by thread T9 here:
    #0 0x7fc2fb39f025 in operator delete(void*, unsigned long) (/lib/x86_64-linux-gnu/libasan.so.5+0x111025)
    #1 0x563fc364a430 in faas::gateway::HttpConnection::~HttpConnection() (nightcore/bin/debug/gateway+0x7f430)
    #2 0x563fc37ff6ff in std::_Sp_counted_ptr<faas::gateway::HttpConnection*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() /usr/include/c++/9/bits/shared_ptr_base.h:377
    #3 0x563fc365befe in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() /usr/include/c++/9/bits/shared_ptr_base.h:155
    #4 0x563fc36577ef in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() /usr/include/c++/9/bits/shared_ptr_base.h:730
    #5 0x563fc3714457 in std::__shared_ptr<faas::server::ConnectionBase, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() /usr/include/c++/9/bits/shared_ptr_base.h:1169
    #6 0x563fc3714477 in std::shared_ptr<faas::server::ConnectionBase>::~shared_ptr() /usr/include/c++/9/bits/shared_ptr.h:103
    #7 0x563fc371e847 in std::pair<int, std::shared_ptr<faas::server::ConnectionBase> >::~pair() /usr/include/c++/9/bits/stl_pair.h:208
    #8 0x563fc371e86b in void __gnu_cxx::new_allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > >::destroy<std::pair<int, std::shared_ptr<faas::server::ConnectionBase> > >(std::pair<int, std::shared_ptr<faas::server::ConnectionBase> >*) /usr/include/c++/9/ext/new_allocator.h:153
    #9 0x563fc371d127 in decltype (({parm#2}.destroy)({parm#3})) absl::lts_2020_02_25::allocator_traits<std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > > >::destroy_impl<std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > >, std::pair<int, std::shared_ptr<faas::server::ConnectionBase> > >(int, std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > >&, std::pair<int, std::shared_ptr<faas::server::ConnectionBase> >*) deps/out/include/absl/memory/memory.h:587
    #10 0x563fc371c7ab in void absl::lts_2020_02_25::allocator_traits<std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > > >::destroy<std::pair<int, std::shared_ptr<faas::server::ConnectionBase> > >(std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > >&, std::pair<int, std::shared_ptr<faas::server::ConnectionBase> >*) (nightcore/bin/debug/gateway+0x1517ab)
    #11 0x563fc371bc86 in void absl::lts_2020_02_25::container_internal::map_slot_policy<int, std::shared_ptr<faas::server::ConnectionBase> >::destroy<std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > > >(std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > >*, absl::lts_2020_02_25::container_internal::map_slot_type<int, std::shared_ptr<faas::server::ConnectionBase> >*) (nightcore/bin/debug/gateway+0x150c86)
    #12 0x563fc371a458 in void absl::lts_2020_02_25::container_internal::FlatHashMapPolicy<int, std::shared_ptr<faas::server::ConnectionBase> >::destroy<std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > > >(std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > >*, absl::lts_2020_02_25::container_internal::map_slot_type<int, std::shared_ptr<faas::server::ConnectionBase> >*) deps/out/include/absl/container/flat_hash_map.h:561
    #13 0x563fc3718128 in void absl::lts_2020_02_25::container_internal::hash_policy_traits<absl::lts_2020_02_25::container_internal::FlatHashMapPolicy<int, std::shared_ptr<faas::server::ConnectionBase> >, void>::destroy<std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > > >(std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > >*, absl::lts_2020_02_25::container_internal::map_slot_type<int, std::shared_ptr<faas::server::ConnectionBase> >*) deps/out/include/absl/container/internal/hash_policy_traits.h:84
    #14 0x563fc3717885 in absl::lts_2020_02_25::container_internal::raw_hash_set<absl::lts_2020_02_25::container_internal::FlatHashMapPolicy<int, std::shared_ptr<faas::server::ConnectionBase> >, absl::lts_2020_02_25::hash_internal::Hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > > >::erase(absl::lts_2020_02_25::container_internal::raw_hash_set<absl::lts_2020_02_25::container_internal::FlatHashMapPolicy<int, std::shared_ptr<faas::server::ConnectionBase> >, absl::lts_2020_02_25::hash_internal::Hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > > >::iterator) deps/out/include/absl/container/internal/raw_hash_set.h:1175
    #15 0x563fc3715c7a in unsigned long absl::lts_2020_02_25::container_internal::raw_hash_set<absl::lts_2020_02_25::container_internal::FlatHashMapPolicy<int, std::shared_ptr<faas::server::ConnectionBase> >, absl::lts_2020_02_25::hash_internal::Hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > > >::erase<int>(int const&) deps/out/include/absl/container/internal/raw_hash_set.h:1152
    #16 0x563fc37b7683 in faas::gateway::Server::OnConnectionClose(faas::server::ConnectionBase*) src/gateway/server.cpp:125
    #17 0x563fc3767935 in operator() src/server/server_base.cpp:132
    #18 0x563fc376c2fe in _M_invoke /usr/include/c++/9/bits/std_function.h:300
    #19 0x563fc376c5ef in std::function<void (faas::server::ConnectionBase**)>::operator()(faas::server::ConnectionBase**) const /usr/include/c++/9/bits/std_function.h:688
    #20 0x563fc376a3f0 in void faas::utils::ReadMessages<faas::server::ConnectionBase*>(faas::utils::AppendableBuffer*, char const*, unsigned long, std::function<void (faas::server::ConnectionBase**)>) src/utils/appendable_buffer.h:128
    #21 0x563fc3767bf3 in faas::server::ServerBase::OnReturnConnection(long, uv_buf_t const*) src/server/server_base.cpp:129
    #22 0x563fc376764c in faas::server::ServerBase::ReturnConnectionCallback(uv_stream_s*, long, uv_buf_t const*) src/server/server_base.cpp:121
    #23 0x563fc3857cce in uv__read (nightcore/bin/debug/gateway+0x28ccce)

previously allocated by thread T9 here:
    #0 0x7fc2fb39d947 in operator new(unsigned long) (/lib/x86_64-linux-gnu/libasan.so.5+0x10f947)
    #1 0x563fc37beecb in faas::gateway::Server::OnHttpConnection(int) src/gateway/server.cpp:430
    #2 0x563fc37be9b7 in faas::gateway::Server::HttpConnectionCallback(uv_stream_s*, int) src/gateway/server.cpp:424
    #3 0x563fc38584fa in uv__server_io (nightcore/bin/debug/gateway+0x28d4fa)

It seems that the pipe write callback accesses an HTTP connection that has already been destructed, which causes the crash. Do you have any ideas on how to fix that?

zhipeng-jia commented 12 months ago

My guess is that a write is still in flight when the connection is closed. Maybe try not destructing the connection object (by removing line 125 in gateway/server.cpp).

DCsunset commented 12 months ago

Yeah I tried the same a few days ago and it did fix the crash issue. At least it could work now! Thanks for your reply anyway.