ohwgiles / laminar

Fast and lightweight Continuous Integration
https://laminar.ohwg.net
GNU General Public License v3.0
298 stars 54 forks source link

SIGSEGV (with a few browser tabs and while fetching a large job log) #185

Closed jbglaw closed 1 year ago

jbglaw commented 1 year ago
130 kj::Promise<void> Http::cleanupPeers(kj::Timer& timer)
131 {
132     return timer.afterDelay(15 * kj::SECONDS).then([&]{
133         for(EventPeer* p : eventPeers) {
134             // an empty SSE message is a colon followed by two newlines
135             p->pendingOutput.push_back(":\n\n");
136             p->fulfiller->fulfill();
137         }
138         return cleanupPeers(timer);
139     }).eagerlyEvaluate(nullptr);
140 }

I had a crash here and there during the last time. This time with GDB attached (but too fast fingers, I failed to get a full bt.) It was in line 136, so likely p->fulfiller was a nullptr I guess. Maybe that's enough of a hint? If not, feel free to just close this issue.

jbglaw commented 1 year ago

Laminar was unresponsive for some time (multiple seconds, maybe a job finishing with lots of writes to the sqlite DB file?) when this happened the next time:

(gdb) bt full
#0  0x000056050c0565e7 in operator() (__closure=0x56050c97af60) at ./src/http.cpp:136
        p = 0x56050c983b70
        __for_range = std::set with 2 elements = {[0] = 0x56050c983b70, [1] = 0x56050c9bd7f0}
        __for_begin = <optimized out>
        __for_end = <optimized out>
        this = 0x56050c691a30
        timer = @0x56050c690ec8: {<kj::MonotonicClock> = {_vptr.MonotonicClock = 0x56050c0a8b08 <vtable for kj::TimerImpl+16>}, <No data fields>}
        this = <optimized out>
        timer = <optimized out>
        p = <optimized out>
        __for_range = <optimized out>
        __for_begin = <optimized out>
        __for_end = <optimized out>
#1  kj::_::MaybeVoidCaller<kj::_::Void, kj::Promise<void> >::apply<Http::cleanupPeers(kj::Timer&)::<lambda()> > (in=..., func=...) at /usr/include/kj/async-prelude.h:152
No locals.
#2  kj::_::TransformPromiseNode<kj::Promise<void>, kj::_::Void, Http::cleanupPeers(kj::Timer&)::<lambda()>, kj::_::PropagateException>::getImpl(kj::_::ExceptionOrValue &) (
    this=0x56050c97af40, output=...) at /usr/include/kj/async-inl.h:550
        depValue = 0x7ffdfc452659
        depException = 0x0
        depResult = {<kj::_::ExceptionOrValue> = {exception = {ptr = {isSet = false, {value = {ownFile = {content = {ptr = 0x7f41b43f5aab "H\213\204$(\003", size_ = 94579676836096, 
                        disposer = 0x0}}, file = 0x7ffdfc453a00 "\r", line = 0, type = kj::Exception::Type::FAILED, description = {content = {ptr = 0x0, size_ = 0, 
                        disposer = 0x7f41b4190456}}, context = {ptr = {disposer = 0x200000214, ptr = 0x0}}, remoteTrace = {content = {ptr = 0x0, size_ = 94579676841000, disposer = 0x0}}, 
                    trace = {0x0, 0x0, 0x0, 0x0, 0x7f41b417879f, 0x7f41b4178cad, 0x7f41b43b7544, 
                      0x7f41b43b7732 <capnp::writeMessage(kj::AsyncOutputStream&, kj::ArrayPtr<kj::ArrayPtr<capnp::word const> const>)+50>, 
                      0x7f41b43b778a <capnp::AsyncIoMessageStream::writeMessage(kj::ArrayPtr<int const>, kj::ArrayPtr<kj::ArrayPtr<capnp::word const> const>)+42>, 
                      0x7f41b441625d <kj::_::RunnableImpl<kj::evalNow<capnp::TwoPartyVatNetwork::OutgoingMessageImpl::send()::{lambda()#2}::operator()() const::{lambda()#1}>(capnp::TwoPartyVatNetwork::OutgoingMessageImpl::send()::{lambda()#2}::operator()() const::{lambda()#1}&&)::{lambda()#1}>::run()+125>, 
                      0x7f41b4415180 <capnp::TwoPartyVatNetwork::OutgoingMessageImpl::send()::{lambda()#2}::operator()() const::{lambda(kj::Exception&&)#2}::operator()(kj::Exception&&) const>, 
                      0x7f41b4415b5f <kj::_::TransformPromiseNode<kj::_::Void, kj::_::Void, kj::_::IdentityFunc<void>, capnp::TwoPartyVatNetwork::OutgoingMessageImpl::send()::{lambda()#2}::operator()() const::{lambda(kj::Exception&&)#2}>::getImpl(kj::_::ExceptionOrValue&)+687>, 0x7f41b4118911, 0x7f41b440ef20, 0x7f41b43e3c40, 0x7f41b411f72f <kj::_::EagerPromiseNodeBase::fire()+47>, 
                      0x7f41b411b797 <kj::EventLoop::turn()+119>, 0x7f41b4122fb8 <kj::_::waitImpl(kj::Own<kj::_::PromiseNode>&&, kj::_::ExceptionOrValue&, kj::WaitScope&)+712>, 
                      0x56050c072e71 <kj::Promise<void>::wait(kj::WaitScope&)+49>, 0x56050c06d502 <Server::start()+242>, 0x56050c02c9e3 <main(int, char**)+531>, 
                      0x7f41b39fa189 <__libc_start_call_main+121>, 0x7f41b39fa244 <__libc_start_main_impl+132>, 0x56050c02cbd0 <_start+32>, 0x56050c02cbd0 <_start+32>, 
                      0x56050c02cbd0 <_start+32>, 0x56050c02cbd0 <_start+32>, 0x56050c02cbd0 <_start+32>, 0x56050c02cbd0 <_start+32>, 0x56050c02cb00 <main(int, char**)+816>, 
                      0x56050c0a6f20 <kj::_::HeapDisposer<kj::_::ChainPromiseNode>::instance>, 0x0}, traceCount = 768}}}}}, value = {ptr = {isSet = true, {value = {<No data fields>}}}}}
#3  0x00007f41b4118912 in ?? () from /lib/x86_64-linux-gnu/libkj-async-0.9.2.so
No symbol table info available.
#4  0x00007f41b407f3e9 in kj::_::runCatchingExceptions(kj::_::Runnable&) () from /lib/x86_64-linux-gnu/libkj-0.9.2.so
No symbol table info available.
#5  0x00007f41b4117111 in kj::_::TransformPromiseNodeBase::get(kj::_::ExceptionOrValue&) () from /lib/x86_64-linux-gnu/libkj-async-0.9.2.so
No symbol table info available.
#6  0x00007f41b412155e in kj::_::ChainPromiseNode::fire() () from /lib/x86_64-linux-gnu/libkj-async-0.9.2.so
No symbol table info available.
#7  0x00007f41b411b798 in kj::EventLoop::turn() () from /lib/x86_64-linux-gnu/libkj-async-0.9.2.so
No symbol table info available.
#8  0x00007f41b4122fb9 in kj::_::waitImpl(kj::Own<kj::_::PromiseNode>&&, kj::_::ExceptionOrValue&, kj::WaitScope&) () from /lib/x86_64-linux-gnu/libkj-async-0.9.2.so
No symbol table info available.
#9  0x000056050c072e72 in kj::Promise<void>::wait (this=this@entry=0x7ffdfc453880, waitScope=...) at /usr/include/kj/async-inl.h:1120
        result = {<kj::_::ExceptionOrValue> = {exception = {ptr = {isSet = false, {value = {ownFile = {content = {
                        ptr = 0x56050c06c0c0 <PathWatcherImpl::addPath(char const*)> "SH\211\373\213\177\b\272\b\003", size_ = 0, disposer = 0x0}}, file = 0x0, line = 0,
                    type = kj::Exception::Type::FAILED, description = {content = {ptr = 0x56050c6914a0 " A\r\264A\177", size_ = 140728835848256, disposer = 0x56050c6914b8}}, context = {
                      ptr = {disposer = 0x56050c691488, ptr = 0x7ffdfc4537b0}}, remoteTrace = {content = {ptr = 0x7f41b411dedb "H\215\225P\377\377\377H\215\r\322U\006", 
                        size_ = 94579683037080, disposer = 0x4}}, trace = {0x0, 0x0, 0x7ffdfc453720, 0x0, 0x7f41b41834bb, 0x5, 0x1, 0x2060dbe7ce9bf800, 0x0, 
                      0x7f41b3a0ee88 <__bsd_signal+104>, 0x56050c0740a0 <laminar_quit(int)>, 0x4000, 0x0, 0x0, 0x56050c0a8428 <kj::_::HeapArrayDisposer::instance>, 0x2060dbe7ce9bf800, 0x4, 
                      0x56050c90acd8, 0x56050c90acd8, 0x7f41b411daa9 <kj::_::Event::armBreadthFirst()+41>, 0x7ffdfc453820, 0x7f41b411e0b7 <kj::_::Event::Event()+23>, 0x56050c90acd0, 
                      0x2060dbe7ce9bf800, 0x7f41b41b5a98 <vtable for kj::_::ChainPromiseNode+16>, 0x7ffdfc453880, 0x56050c90acd0, 0x7f41b411dbc1, 0x7ffdfc453820, 
                      0x56050c06d3ac <kj::_::maybeChain<void>(kj::Own<kj::_::PromiseNode>&&, kj::Promise<void>*)+76>, 
                      0x56050c0a7608 <kj::_::HeapDisposer<kj::_::TransformPromiseNode<kj::Promise<void>, kj::_::Void, Server::start()::{lambda()#1}, kj::_::PropagateException> >::instance>, 
                      0x0}, traceCount = 48}}}}}, value = {ptr = {isSet = false, {value = {<No data fields>}}}}}
#10 0x000056050c06d503 in Server::start (this=0x56050c691220) at ./src/server.cpp:57
No locals.
#11 0x000056050c02c9e4 in main (argc=<optimized out>, argv=<optimized out>) at ./src/main.cpp:98
        ioContext = {lowLevelProvider = {disposer = 0x7f41b41b4738, ptr = 0x56050c690eb0}, provider = {disposer = 0x7f41b41b4740, ptr = 0x56050c691090}, waitScope = @0x56050c690ff0, 
          unixEventPort = @0x56050c690eb8}
        settings = {home = 0x56050c07e7b2 "/var/lib/laminar", bind_rpc = 0x56050c07e7c3 "unix-abstract:laminar", bind_http = 0x56050c07e7d9 "*:8080", 
          archive_url = 0x56050c0798da "/archive/"}

Any specific stuff I'd pay attention to for the next time?

ohwgiles commented 1 year ago

Thanks for the detailed report. I can't see quite how this could happen. According to the log it's not a null pointer, and it looks like a reasonable value, so not sure what actually caused the segfault. Is there any console output when this happens?

jbglaw commented 1 year ago

p isn't a nullptr, but were you able to see that p->fulfiller is non-NULL? That's for sure something I'll check next time (though I haven't seen this since the bt full.)

jbglaw commented 1 year ago

Console output as in journalctl? Maybe this single line:

Feb 22 08:50:38 lili laminard[228041]: kj/compat/http.c++:5308: error: unhandled exception in HTTP server; exception = kj/compat/http.c++:1870: failed: previous HTTP message body incomplete; can't write more messages

(I'm not sure, but it would somewhat match my comment's timestamp.)

jbglaw commented 1 year ago

Got this one once more.

laminard logged:

Mar 01 10:26:26 lili laminard[69730]: kj/compat/http.c++:5308: error: unhandled exception in HTTP server; exception = kj/compat/http.c++:1870: failed: previous HTTP message body incomplete; can't write more messages
Mar 01 10:26:26 lili laminard[69730]: stack: 7f0d19293550 7f0d19293580

Same bt, additionally:

(gdb) p p
$1 = (EventPeer *) 0x560478289cc0
(gdb) p *p
$2 = {scope = {type = MonitorScope::ALL, job = "", num = 0, page = 0, field = "number", order_desc = true}, pendingOutput = std::__cxx11::list = {
    [0] = ":\n\n"}, fulfiller = {disposer = 0x0, ptr = 0x0}}

---> up to the Exception frame:
#2  kj::_::TransformPromiseNode<kj::Promise<void>, kj::_::Void, Http::cleanupPeers(kj::Timer&)::<lambda()>, kj::_::PropagateException>::getImpl(kj::_::ExceptionOrValue &) (this=0x5604786920f0, output=...) at /usr/include/kj/async-inl.h:550
warning: Source file is more recent than executable.
550           output.as<T>() = handle(MaybeVoidCaller<DepT, T>::apply(func, kj::mv(*depValue)));
(gdb) info args
this = 0x5604786920f0
output = @0x7ffcc5aa9e20: {exception = {ptr = {isSet = false, {value = {ownFile = {content = {ptr = 0x7ffcc5aaa080 "\210f\037\031\r\177", 
              size_ = 140723624779776, disposer = 0x7f0d1916217a <kj::_::ExclusiveJoinPromiseNode::Branch::fire()+106>}}, 
          file = 0x7f0d191f6840 " \230\025\031\r\177", line = 2020550696, type = 22020, description = {content = {ptr = 0x0, size_ = 0, 
              disposer = 0x560476741428 <kj::_::HeapArrayDisposer::instance>}}, context = {ptr = {disposer = 0x0, ptr = 0x0}}, remoteTrace = {
            content = {ptr = 0x0, size_ = 0, disposer = 0x0}}, trace = {0x78b34e5000000001, 0x7f0d00005604, 0x7f0d193fa544, 
            0x7f0d193fa732 <capnp::writeMessage(kj::AsyncOutputStream&, kj::ArrayPtr<kj::ArrayPtr<capnp::word const> const>)+50>, 
            0x7f0d193fa78a <capnp::AsyncIoMessageStream::writeMessage(kj::ArrayPtr<int const>, kj::ArrayPtr<kj::ArrayPtr<capnp::word const> const>)+42>, 
            0x7f0d1945925d <kj::_::RunnableImpl<kj::evalNow<capnp::TwoPartyVatNetwork::OutgoingMessageImpl::send()::{lambda()#2}::operator()() const::{lambda()#1}>(capnp::TwoPartyVatNetwork::OutgoingMessageImpl::send()::{lambda()#2}::operator()() const::{lambda()#1}&&)::{lambda()#1}>::run()+125>, 
            0x7f0d19458180 <capnp::TwoPartyVatNetwork::OutgoingMessageImpl::send()::{lambda()#2}::operator()() const::{lambda(kj::Exception&&)#2}::operator()(kj::Exception&&) const>, 
            0x7f0d19458b5f <kj::_::TransformPromiseNode<kj::_::Void, kj::_::Void, kj::_::IdentityFunc<void>, capnp::TwoPartyVatNetwork::OutgoingMessageImpl::send()::{lambda()#2}::operator()() const::{lambda(kj::Exception&&)#2}>::getImpl(kj::_::ExceptionOrValue&)+687>, 0x7f0d1915b911, 0x7f0d19451f20, 
            0x7f0d1915a110 <kj::_::TransformPromiseNodeBase::get(kj::_::ExceptionOrValue&)+64>, 
            0x7f0d1916272f <kj::_::EagerPromiseNodeBase::fire()+47>, 0x7f0d1915e797 <kj::EventLoop::turn()+119>, 
            0x7f0d19165fb8 <kj::_::waitImpl(kj::Own<kj::_::PromiseNode>&&, kj::_::ExceptionOrValue&, kj::WaitScope&)+712>, 
            0x7ffcc5b7695a <clock_gettime+90>, 0x7733261142e14700, 0x5604781ce128, 0x7ffcc5aa9fc0, 0x560477fa1fa8, 
            0x7f0d191604a2 <kj::_::Event::~Event()+50>, 0x7ffcc5aa9f60, 0x7f0d18ae52e9 <__GI___clock_gettime+25>, 0x5604766c5bd0 <_start+32>, 
            0x7f0d191c2f4e <kj::TimerImpl::timeoutToNextEvent(kj::Absolute<kj::Quantity<long, kj::_::NanosecondLabel>, kj::_::TimeLabel>, kj::Quantity<long, kj::_::NanosecondLabel>, unsigned long)+46>, 0x5604766c5b01 <main(int, char**)+817>, 0x7733261142e14700, 0xd84a1d, 0x560477fa1eb8, 
            0x7ffcc5aaa030, 0x7733261142e14700, 0x7ffcc5aaa5c0, 0xffffffffffffff40}, traceCount = 11}}}}}
jbglaw commented 1 year ago

Seems this happens when a browser tab is closed while retrieving data from laminard or when the tab is closed between data is requested and laminard fetched what it needed from the database to send it back to the client. (NB: My database is about 10 GB in size, with about 45k builds entries. So many requests initially take quite some time, as each builds row contains the compressed build log. For my setup, where many of those logs are in the range of 50 to 100 MB (uncompressed) it would be a hugh performance boost to separate build logs from build metadata.)

ohwgiles commented 1 year ago

Please try with the latest master.

it would be a hugh performance boost to separate build logs from build metadata

I'm not so sure about this. The db indexing should make it possible to efficiently query only the relevant parts, but perhaps some optimization is needed here. I'm not opposed to changing this if it can really boost performance.

jbglaw commented 1 year ago

Thanks for this fix! When the current round of builds is over (or I see a crash in between), I'll pick up this patch! From my experience, this is/was the only remaining crash reason I saw.

(As for DB access speed, I'll probably look a bit into it once the current build round is over. A vague guess could be that sqlite would load a whole row when any component of the row is needed. If you pull the build stats for the most recent n builds, that metadata would only be a handfull of bytes, while the complete table row could be quite larger. Median gzip'ed log size for the last NetBSD builds for example is just below 5 MB, max'ing at around 18 MB. And as log output changed while that round was running, I guess it'll shift more towards that 18 MB reign...)

jbglaw commented 1 year ago

Updated, couldn't trigger the crash since then. Good work, thank you very much!

jbglaw commented 1 year ago

Just a note: Ignoring slow laminard answers when there's other work to do (seems to be primarily the case when a job with lots of output finishes and it writes/commits it (compressed) to the DB), this was the last major stability issue. Maybe doing a release would be appropriate? That may gain some attention to distributions to update?

ohwgiles commented 1 year ago

Ignoring slow laminard answers when there's other work to do

Would be great to create a separate issue for this with a minimal reproducing example if possible

Maybe doing a release would be appropriate?

Ideally I would like to migrate the frontend from vue2 to vue3 as part of the next release, not sure when I'll have time for that though...

jbglaw commented 1 year ago

I'll create another ticket for tracking request speed.