pistacheio / pistache

A high-performance REST toolkit written in C++
https://pistacheio.github.io/pistache/
Apache License 2.0
3.18k stars 698 forks source link

Race conditions with timers #161

Open ywahl opened 6 years ago

ywahl commented 6 years ago

When using timeouts: e.g.: response.timeoutAfter(std::chrono::milliseconds(100)); we get sometimes timer already armed error followed by a SEGV+core dump (running run_http_server)

The problem is apparently due to a race condition between the time the timerfd is closed, and the time it is erased from the timers map.

The timerfd is first closed, and then erased from the map, If we have the following sequence a problem can happen (it happens almost every time in my tests when load gets a bit high) e.g.: thread 1 close(timerfd) (http.h:) around line 210 thread 2 void arm(Duration duration) { (http.h) Async::Promise p([=](Async::Deferred deferred) { timerFd = TRY_RET(timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK));

it reuses the closed the timerfd thread 1: timers.erase(it) (transport.cc line 85)

We should actually close the timerfd only after doing timers.erase(it)

If wanted a can provide a pull request

xoac commented 6 years ago

If you are making pull request link it with issue. Linking for you #162 . Thanks for your effort!

ywahl commented 6 years ago

Let me know if there are any problems Cheers,

1Hyena commented 6 years ago

I think I can report the same problem. I have onTimeout handler. When making 1 request at a time everything works perfectly. However, when using the parallel command in Linux as shown below, the HTTP server crashes with SIGSEGV.

parallel sh -c "curl -X POST -H 'X-API-Signature:17' --data '{\"token\":\"xxx\"}' -v http://localhost:10000/game/url" -- 1 2 3

The above command makes 3 requests in parallel. Below is gdb output.

Tue May 29 16:56:12 2018 :::: Serving the API on 0.0.0.0:10000.
[New Thread 0x7ffff2055700 (LWP 13974)]
[New Thread 0x7ffff1854700 (LWP 13975)]
[New Thread 0x7ffff1053700 (LWP 13976)]
Tue May 29 16:56:16 2018 :7c: New connection from 127.0.0.1 (1 total).
Tue May 29 16:56:16 2018 :7c: Received 15 bytes from a POST request to /game/url resource.
[New Thread 0x7ffff0852700 (LWP 13984)]
[Thread 0x7ffff0852700 (LWP 13984) exited]
Tue May 29 16:56:26 2018 :7c: Responding with 75 bytes (Request Timeout).
Tue May 29 16:56:26 2018 :7c: Connection closed (0 remaining).
Tue May 29 16:56:31 2018 :f7: New connection from 127.0.0.1 (1 total).
Tue May 29 16:56:31 2018 :f7: Received 15 bytes from a POST request to /game/url resource.
[New Thread 0x7ffff0852700 (LWP 14030)]
[Thread 0x7ffff0852700 (LWP 14030) exited]
Tue May 29 16:56:41 2018 :f7: Responding with 75 bytes (Request Timeout).
Tue May 29 16:56:41 2018 :f7: Connection closed (0 remaining).
Tue May 29 16:56:45 2018 :e2: New connection from 127.0.0.1 (1 total).
Tue May 29 16:56:45 2018 :e2: Received 15 bytes from a POST request to /game/url resource.
[New Thread 0x7ffff0852700 (LWP 14038)]
[Thread 0x7ffff0852700 (LWP 14038) exited]
Tue May 29 16:56:55 2018 :e2: Responding with 75 bytes (Request Timeout).
Tue May 29 16:56:55 2018 :e2: Connection closed (0 remaining).
Tue May 29 16:57:02 2018 :62: New connection from 127.0.0.1 (1 total).
Tue May 29 16:57:02 2018 :62: Received 15 bytes from a POST request to /game/url resource.
[New Thread 0x7ffff0852700 (LWP 14066)]
[Thread 0x7ffff0852700 (LWP 14066) exited]
Tue May 29 16:57:02 2018 :b5: New connection from 127.0.0.1 (2 total).
Tue May 29 16:57:02 2018 :b5: Received 15 bytes from a POST request to /game/url resource.
[New Thread 0x7ffff0852700 (LWP 14069)]
[Thread 0x7ffff0852700 (LWP 14069) exited]
Tue May 29 16:57:12 2018 :c0: New connection from 127.0.0.1 (3 total).
Tue May 29 16:57:12 2018 :62: Responding with 75 bytes (Request Timeout).
Tue May 29 16:57:12 2018 :c0: Received 15 bytes from a POST request to /game/url resource.
[New Thread 0x7ffff0852700 (LWP 14080)]
[Thread 0x7ffff0852700 (LWP 14080) exited]
Tue May 29 16:57:12 2018 :b5: Responding with 75 bytes (Request Timeout).
Tue May 29 16:57:12 2018 :b5: Connection closed (2 remaining).
Tue May 29 16:57:22 2018 :62: Connection closed (1 remaining).

Thread 3 "sndapi" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff1854700 (LWP 13975)]
0x000000000050e693 in Pistache::Http::Timeout::onTimeout(unsigned long) ()
(gdb) bt
#0  0x000000000050e693 in Pistache::Http::Timeout::onTimeout(unsigned long) ()
#1  0x00000000004cf4dc in void Pistache::Http::Timeout::arm<std::chrono::duration<long, std::ratio<1l, 1l> > >(std::chrono::duration<long, std::ratio<1l, 1l> >)::{lambda(unsigned long)#2}::operator()(unsigned long) const (__closure=0x7fffe8004cb8, numWakeup=1)
    at /usr/local/include/pistache/http.h:211
#2  0x00000000004ec580 in Pistache::Async::Private::impl::Continuation<unsigned long, void Pistache::Http::Timeout::arm<std::chrono::duration<long, std::ratio<1l, 1l> > >(std::chrono::duration<long, std::ratio<1l, 1l> >)::{lambda(unsigned long)#2}, void Pistache::Http::Timeout::arm<std::chrono::duration<long, std::ratio<1l, 1l> > >(std::chrono::duration<long, std::ratio<1l, 1l> >)::{lambda(std::__exception_ptr::exception_ptr)#3}, void (unsigned long)>::doResolve(std::shared_ptr<Pistache::Async::Private::CoreT<unsigned long> > const&) (this=0x7fffe8004c90, 
    core=std::shared_ptr (count 3, weak 0) 0x7fffe80034e0) at /usr/local/include/pistache/async.h:438
#3  0x00000000004ec2b4 in Pistache::Async::Private::Continuable<unsigned long>::resolve (this=0x7fffe8004c90, 
    core=std::shared_ptr (count 3, weak 0) 0x7fffe80034e0) at /usr/local/include/pistache/async.h:265
#4  0x0000000000500103 in Pistache::Tcp::Transport::handleTimer(Pistache::Tcp::Transport::TimerEntry) ()
#5  0x0000000000503598 in Pistache::Tcp::Transport::onReady(Pistache::Aio::FdSet const&) ()
#6  0x000000000050ad63 in Pistache::Aio::SyncImpl::runOnce() ()
#7  0x0000000000508a8a in std::thread::_Impl<std::_Bind_simple<Pistache::Aio::AsyncImpl::Worker::run()::{lambda()#1} ()> >::_M_run() ()
#8  0x00007ffff7b0dc80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007ffff783f6ba in start_thread (arg=0x7ffff1854700) at pthread_create.c:333
#10 0x00007ffff69a541d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) 

@ywahl do you think it is the same issue?

@xoac when do you think the fix for this could be merged?

1Hyena commented 6 years ago

For the record, I tried the fix proposed in https://github.com/oktal/pistache/pull/162 but it still crashes as shown above when doing 3 requests in parallel.

ywahl commented 6 years ago

Hi @1Hyena Had a lot of various issues even after fixing that. I stopped trying to fix it.

1Hyena commented 6 years ago

@ywahl thanks for letting me know. I have disabled the framework's onTimeout handler for now. It fails to execute anyway when the working thread is being blocked (by curl for example), so it has no real use-cases.