Maybe https://github.com/koute/bytehound would be useful here? I think a good goal would be that `pd` can maintain many active connections without using excess memory.
I don't think that the use of an explicit `Box` is a meaningful indicator of what might be consuming significant per-connection memory. We should investigate using a heap profiler.
On the assumption that each connection uses the same amount of memory (a reasonable prior, though it may well not hold), it might be sufficient to collect a heap profile of what happens when a single client connection is held open. If there's significant per-connection memory overhead, that's a bug, and it should show up without having to open many connections.
It seems we do have a memory leak in pd. I can fairly easily get pd to consume a few gigs of memory if I bombard it with sync requests from multiple clients. Using bytehound to profile as recommended above, we see some never-freed allocations:
We can also show that bytehound believes these are leaks:
Unfortunately I don't yet have a root cause, but I will spend more time with the stack traces and try to piece together a clearer story. The bytehound guide gives a walkthrough of how to perform this kind of investigation. I pushed the testing script I used (a very simple loop over `pcli` operations) for visibility.
Nice digging!
Not sure if it will be helpful, but one thought about isolating a cause could be:

- run `pd` + `tendermint`
- stop `tendermint`, so `pd` is not processing any consensus messages or doing anything other than serving RPC requests
- sync `pclientd` against the stalled `pd` until chain tip, then turn it off
- restart `pd`
- run `pclientd`, causing it to open a single long-lived connection to `pd`

This way, we might be able to get information about what memory is used by a single long-lived connection.
Ahoy, thar she blows:
Reading through the stack trace associated with that leak, I see a lot of rocksdb references, so I'm guessing that we're not dropping a db handle in a service worker somewhere.
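If that's the case, the fix is usually a matter of scoping: anything a spawned streaming task captures stays alive until the task finishes, which for a streaming RPC can be the entire lifetime of the client connection. Here's a generic sketch of the pattern (not pd's actual code; `DbHandle`, `load_blocks`, and the channel wiring are illustrative stand-ins):

```rust
use std::sync::Arc;

// Illustrative stand-in for a storage handle: dropping the last clone is
// what releases the underlying resources (column handles, caches, etc.).
#[derive(Clone)]
struct DbHandle(Arc<()>);

fn spawn_stream_worker(db: DbHandle, tx: tokio::sync::mpsc::Sender<Vec<u8>>) {
    tokio::spawn(async move {
        // Everything captured by this task, including `db`, lives until the
        // task completes, potentially as long as the client stays connected.
        let blocks = load_blocks(&db);

        // Release the handle as soon as it's no longer needed, instead of
        // letting it ride along for the rest of a possibly slow stream.
        drop(db);

        for block in blocks {
            if tx.send(block).await.is_err() {
                break; // client went away
            }
        }
    });
}

fn load_blocks(_db: &DbHandle) -> Vec<Vec<u8>> {
    Vec::new() // placeholder for the real storage read
}
```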
I also saw a smaller leak that may be related to tracing instrumentation, or else I'm mistaken in reading the backtrace. Here's a PDF of the full report I generated this morning, mostly for posterity in reproducing these steps in future debugging sessions: pd-memory-profiling-testnet-56-report-1.pdf
Formatting's a bit wonky. Separately I'll paste in the console code I adapted from the bytehound guide, since that'll be fairly easily copy/pasteable in the future.
Initial drops in the linked PR seem to help, but aren't sufficient. I've encountered a new leak:
Stack trace:
#00 [libc.so.6] __clone3
#01 [libc.so.6] start_thread
#02 [pd] std::sys::unix::thread::Thread::new::thread_start [thread.rs:108]
#03 [pd] <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once [boxed.rs:1985]
#04 [pd] <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once [boxed.rs:1985]
#05 [pd] core::ops::function::FnOnce::call_once{{vtable.shim}}
#07 [pd] tokio::runtime::blocking::pool::Inner::run
#08 [pd] tokio::runtime::task::harness::Harness<T,S>::poll
#09 [pd] tokio::runtime::task::core::Core<T,S>::poll
#10 [pd] tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
#11 [pd] <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
#12 [pd] tokio::runtime::scheduler::multi_thread::worker::run
#13 [pd] tokio::runtime::context::runtime::enter_runtime
#14 [pd] tokio::runtime::context::scoped::Scoped<T>::set
#15 [pd] tokio::runtime::scheduler::multi_thread::worker::Context::run
#16 [pd] tokio::runtime::scheduler::multi_thread::worker::Context::run_task
#17 [pd] tokio::runtime::task::harness::Harness<T,S>::poll
#18 [pd] tokio::runtime::task::core::Core<T,S>::poll
#19 [pd] tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
#20 [pd] <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
#21 [pd] <hyper::proto::h2::server::H2Stream<F,B> as core::future::future::Future>::poll
#22 [pd] <hyper::proto::h2::PipeToSendStream<S> as core::future::future::Future>::poll
#23 [pd] <http_body::combinators::map_err::MapErr<B,F> as http_body::Body>::poll_data
#24 [pd] <http_body::combinators::map_err::MapErr<B,F> as http_body::Body>::poll_data
#25 [pd] <http_body::combinators::map_err::MapErr<B,F> as http_body::Body>::poll_data
#26 [pd] <tonic::codec::encode::EncodeBody<S> as http_body::Body>::poll_data
#27 [pd] <T as futures_util::fns::FnMut1<A>>::call_mut
#28 [pd] prost::message::Message::encode
#29 [pd] <penumbra_proto::penumbra::core::chain::v1alpha1::CompactBlock as prost::message::Message>::encode_raw
#30 [pd] prost::encoding::message::encode
#31 [pd] prost::encoding::<impl prost::encoding::sealed::BytesAdapter for alloc::vec::Vec<u8>>::append_to
#32 [pd] bytes::bytes_mut::BytesMut::reserve_inner
#33 [pd] alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle
#34 [pd] alloc::raw_vec::finish_grow
#35 [libbytehound.so] realloc [api.rs:378]
Currently pairing with @erwanor.
The second leak is more mysterious to me, since the allocation is happening inside the Tonic stack, and I'm not sure why it would be growing a `BytesMut` and then keeping it around. I think it's worth merging a fix for the part of the problem that's clearly in our part of the stack.
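For reference on the mechanism, here's a minimal standalone sketch using the `bytes` crate directly (not tonic's actual encode path): `BytesMut::reserve` grows the backing allocation to fit the largest message seen so far, and clearing the buffer doesn't shrink it, so a long-lived encode buffer can hold its peak capacity for as long as it (or any `Bytes` split from it) stays alive.

```rust
use bytes::BytesMut;

fn main() {
    let mut buf = BytesMut::new();
    println!("initial capacity: {} bytes", buf.capacity());

    // Reserving room for one large message (say, a big CompactBlock)
    // grows the backing allocation.
    buf.reserve(4 * 1024 * 1024);
    println!("after reserve(4 MiB): {} bytes", buf.capacity());

    // Clearing the buffer does not return the memory; the allocation is
    // held until the BytesMut (and any Bytes sharing it) is dropped.
    buf.clear();
    println!("after clear(): {} bytes", buf.capacity());
}
```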
Recent related changes:
We shipped point releases as 0.56.1 and 0.57.1 to evaluate performance improvements. At least one more PR should land in time for 0.58.0 (#2888).
Moving this issue back to Future, since we are prioritizing other items. Over the medium term, we should aim for `pd` to completely isolate the consensus module from inbound load on its RPCs. Although the current performance pattern is far from optimal, it is probably sufficient when combined with network-level rate limiters and load balancers.
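One lightweight version of that isolation, sketched under the assumption that separate tokio runtimes are acceptable (`serve_grpc` and `run_consensus` are hypothetical stand-ins, not pd's actual entry points), is to give the public RPC surface its own runtime and worker threads so a flood of sync requests competes with other RPC work rather than with consensus:

```rust
use std::thread;

// Hypothetical entry points; pd's real routines have different names,
// signatures, and wiring.
async fn serve_grpc() { /* bind the tonic server here */ }
async fn run_consensus() { /* drive the consensus/ABCI loop here */ }

fn main() {
    // Dedicated runtime (and OS threads) for the RPC surface.
    let rpc_thread = thread::spawn(|| {
        tokio::runtime::Builder::new_multi_thread()
            .worker_threads(2)
            .thread_name("pd-rpc")
            .enable_all()
            .build()
            .expect("failed to build RPC runtime")
            .block_on(serve_grpc());
    });

    // Consensus gets its own runtime, isolated from inbound RPC load.
    tokio::runtime::Builder::new_multi_thread()
        .thread_name("pd-consensus")
        .enable_all()
        .build()
        .expect("failed to build consensus runtime")
        .block_on(run_consensus());

    rpc_thread.join().expect("rpc thread panicked");
}
```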
Closing as completed since we addressed the memory leaks that were causing the original problem. While there is more work to do, it can be tracked in later issues.
Today on Testnet 56 we observed a large spike in client traffic to the `pd` endpoint at https://grpc.testnet.penumbra.zone. As for the provenance of the traffic, let's assume it's organic interest, in the form of many people downloading the web extension and synchronizing blocks for the first time. After about an hour or two, memory consumption, specifically in the `pd` container, balloons to the point that the OOM killer kicks in and terminates the pod. An example of resource consumption shortly before the kill:

According to the logs, `pd` is serving a lot of two types of requests, CompactBlockRange and ValidatorInfoStream. Intriguingly, both of those types are `Box`ed return values in our RPCs. Also intriguing is this comment: https://github.com/penumbra-zone/penumbra/blob/bfda3a85ba92b1d4112ebe402c0fb2a8e6271c60/crates/bin/pd/src/info/oblivious.rs#L332-L336
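For context on what "Boxed return values" means here: the usual tonic server-streaming shape looks roughly like the sketch below (names are illustrative stand-ins, not pd's actual service trait or generated types). Each open CompactBlockRange request holds one of these boxed streams, plus whatever the stream captures, for as long as the client keeps reading; a bounded channel like the one below is what applies backpressure when a client reads slowly.

```rust
use std::pin::Pin;

use futures::Stream;
use tokio_stream::wrappers::ReceiverStream;
use tonic::{Request, Response, Status};

// Stand-ins for the prost-generated request/response types.
struct CompactBlockRangeRequest;
struct CompactBlock;

// The "Boxed return value": a heap-allocated, type-erased stream of blocks.
type CompactBlockRangeStream =
    Pin<Box<dyn Stream<Item = Result<CompactBlock, Status>> + Send>>;

async fn compact_block_range(
    _request: Request<CompactBlockRangeRequest>,
) -> Result<Response<CompactBlockRangeStream>, Status> {
    // Bounded channel: when the client reads slowly, the producer waits here
    // instead of the server queueing an unbounded number of blocks.
    let (tx, rx) = tokio::sync::mpsc::channel(10);

    tokio::spawn(async move {
        // In real code: walk storage for the requested height range and
        // send each CompactBlock, stopping if the client disconnects.
        let _ = tx.send(Ok(CompactBlock)).await;
    });

    let stream: CompactBlockRangeStream = Box::pin(ReceiverStream::new(rx));
    Ok(Response::new(stream))
}
```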
We need to understand why `pd` consumes large amounts of memory when handling these types of concurrent requests. For now, I'm assuming the traffic is well-formed: honest clients trying to synchronize.