penumbra-zone / penumbra

Penumbra is a fully private proof-of-stake network and decentralized exchange for the Cosmos ecosystem.
https://penumbra.zone
Apache License 2.0

pd doesn't behave well under heavy sync load #2867

Closed: conorsch closed this issue 9 months ago

conorsch commented 1 year ago

Today on Testnet 56 we observed a large spike in client traffic to the pd endpoint at https://grpc.testnet.penumbra.zone.

(screenshot: active-client-conns-2)

As for provenance of the traffic, let's assume it's organic interest, in the form of many people downloading the web extension and synchronizing blocks for the first time. After about an hour or two, memory consumption—in the pd container specifically—balloons to the point that OOMkiller kicks in and terminates the pod. An example of resource consumption shortly before kill:

❯ kubectl top pod penumbra-testnet-fn-0-54d9f68d76-kl9gm --containers
POD                                      NAME   CPU(cores)   MEMORY(bytes)   
penumbra-testnet-fn-0-54d9f68d76-kl9gm   pd     3055m        11249Mi         
penumbra-testnet-fn-0-54d9f68d76-kl9gm   tm     47m          162Mi 

According to the logs, pd is serving a large volume of two request types, CompactBlockRange and ValidatorInfoStream:

❯ wc -l logs-penumbra-testnet-fn-0-54d9f68d76-kl9gm.txt
8329 logs-penumbra-testnet-fn-0-54d9f68d76-kl9gm.txt
❯ rg validator_info logs-penumbra-testnet-fn-0-54d9f68d76-kl9gm.txt -c
4639
❯ rg -i compact_block_range logs-penumbra-testnet-fn-0-54d9f68d76-kl9gm.txt -c
2107

Intriguingly, both of those types are Boxed return values in our RPCs. Also intriguing is this comment:

https://github.com/penumbra-zone/penumbra/blob/bfda3a85ba92b1d4112ebe402c0fb2a8e6271c60/crates/bin/pd/src/info/oblivious.rs#L332-L336

We need to understand why pd consumes large amounts of memory when handling these types of concurrent requests. For now, I'm assuming the traffic is well-formed, honest clients trying to synchronize.

hdevalence commented 1 year ago

Maybe https://github.com/koute/bytehound would be useful here? I think a good goal would be that pd can maintain many active connections without using excess memory.

hdevalence commented 1 year ago

I don't think that the use of an explicit Box is a meaningful indicator of what might be consuming significant per-connection memory. We should investigate using a heap profiler.

On the assumption that each connection uses roughly the same amount of memory (a reasonable prior, though it may well not hold), it might be sufficient to collect a heap profile while a single client connection is held open. If there's significant per-connection memory overhead, that's a bug, and it could show up without having to load up many connections.
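
For reference, collecting such a heap profile with bytehound follows its documented LD_PRELOAD workflow. A minimal sketch, assuming a locally built libbytehound.so and a pd binary on the PATH (the library path and output filename below are illustrative, not prescriptive):

```
# Preload bytehound's allocator shim so pd's allocations are recorded to a
# capture file; the output filename is controlled by MEMORY_PROFILER_OUTPUT.
LD_PRELOAD=/usr/local/lib/libbytehound.so \
MEMORY_PROFILER_OUTPUT=memory-profiling_pd_%p.dat \
  pd start

# After exercising the node (e.g. holding one client connection open), load the
# capture into bytehound's web UI to inspect flamegraphs and leaked allocations:
bytehound server memory-profiling_pd_*.dat
```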

conorsch commented 1 year ago

It seems we do have a memory leak in pd. I can fairly easily get pd to consume a few gigs of memory if I bombard it with sync requests from multiple clients. Using bytehound to profile as recommended above, we see some never-freed allocations:

(screenshot: pd-mem-leak-bytehound-1)

We can also show that bytehound believes these are leaks:

(screenshot: pd-mem-leak-bytehound-2)

Unfortunately I don't yet have a root cause, but I will spend more time with the stack traces and try to piece together a clearer story. The bytehound guide gives a walkthrough of how to perform this kind of investigation. I pushed the testing script I used (a very simple loop over pcli operations) for visibility; see the sketch below.
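
For context, the loop was shaped roughly like the following. This is a hedged reconstruction rather than the script itself: the pcli subcommand names are assumptions that may differ between versions, and pcli is assumed to already be configured to point at the pd endpoint under test.

```
#!/usr/bin/env bash
# Hedged sketch of a sync-load loop: each iteration wipes local view state and
# forces a full resync, which drives CompactBlockRange traffic against pd.
# Run several copies in parallel to simulate many clients syncing at once.
set -euo pipefail

iterations="${1:-50}"

for i in $(seq 1 "$iterations"); do
  echo "sync iteration ${i}/${iterations}"
  pcli view reset      # drop the cached view database (assumed subcommand name)
  pcli view balance    # any view query forces a fresh sync (assumed subcommand name)
done
```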

hdevalence commented 1 year ago

Nice digging!

Not sure if it will be helpful, but one thought about isolating a cause: hold a single client connection open against an instrumented pd (for instance with pclientd) and profile that. This way, we might be able to get information about what memory is used by a single long-lived connection.
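
As a hedged illustration of holding one long-lived sync stream open (so a heap profile captures only a single connection's cost), something like the grpcurl invocation below could work; the port, service, method, and field names are assumptions based on the v1alpha1 protos and should be checked against the deployed definitions.

```
# Open one CompactBlockRange stream against a local pd and keep it alive, so a
# bytehound capture reflects the cost of a single long-lived client connection.
# If pd does not expose gRPC server reflection, point grpcurl at the penumbra
# protos explicitly with -import-path/-proto.
grpcurl -plaintext \
  -d '{"start_height": 1, "keep_alive": true}' \
  localhost:8080 \
  penumbra.client.v1alpha1.ObliviousQueryService/CompactBlockRange
```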

conorsch commented 1 year ago

> This way, we might be able to get information about what memory is used by a single long-lived connection.

Ahoy, thar she blows:

(screenshot: pd-pclientd-single-conn-leak)

Reading through the stack trace associated with that leak, I see a lot of rocksdb references, so I'm guessing that we're not dropping a db handle in a service worker somewhere.

full stack trace for graphed leak

```
#00 [libc.so.6] __clone3
#01 [libc.so.6] start_thread
#02 [libstdc++.so.6.0.30] 7f57e02dbc42
#03 [pd] std::_Function_handler const&, unsigned long)::{lambda()#1}>::_M_invoke(std::_Any_data const&)
#04 [pd] rocksdb::TableCache::FindTable(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::BasicTypedCacheInterface::TypedHandle**, std::shared_ptr const&, bool, bool, rocksdb::HistogramImpl*, bool, int, bool, unsigned long, rocksdb::Temperature)
#05 [pd] rocksdb::TableCache::GetTableReader(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, bool, bool, rocksdb::HistogramImpl*, std::unique_ptr>*, std::shared_ptr const&, bool, int, bool, unsigned long, rocksdb::Temperature)
#06 [pd] rocksdb::BlockBasedTableFactory::NewTableReader(rocksdb::ReadOptions const&, rocksdb::TableReaderOptions const&, std::unique_ptr>&&, unsigned long, std::unique_ptr>*, bool) const [clone .localalias]
#07 [pd] rocksdb::BlockBasedTable::Open(rocksdb::ReadOptions const&, rocksdb::ImmutableOptions const&, rocksdb::EnvOptions const&, rocksdb::BlockBasedTableOptions const&, rocksdb::InternalKeyComparator const&, std::unique_ptr>&&, unsigned long, std::unique_ptr>*, std::shared_ptr, std::shared_ptr const&, bool, bool, int, bool, unsigned long, bool, rocksdb::TailPrefetchStats*, rocksdb::BlockCacheTracer*, unsigned long, std::string const&, unsigned long, std::array)
#08 [pd] rocksdb::BlockBasedTable::PrefetchIndexAndFilterBlocks(rocksdb::ReadOptions const&, rocksdb::FilePrefetchBuffer*, rocksdb::InternalIteratorBase*, rocksdb::BlockBasedTable*, bool, rocksdb::BlockBasedTableOptions const&, int, unsigned long, unsigned long, rocksdb::BlockCacheLookupContext*)
#09 [pd] rocksdb::BlockBasedTable::CreateIndexReader(rocksdb::ReadOptions const&, rocksdb::FilePrefetchBuffer*, rocksdb::InternalIteratorBase*, bool, bool, bool, rocksdb::BlockCacheLookupContext*, std::unique_ptr>*)
#10 [pd] rocksdb::BinarySearchIndexReader::Create(rocksdb::BlockBasedTable const*, rocksdb::ReadOptions const&, rocksdb::FilePrefetchBuffer*, bool, bool, bool, rocksdb::BlockCacheLookupContext*, std::unique_ptr>*)
#11 [pd] rocksdb::BlockBasedTable::IndexReaderCommon::ReadIndexBlock(rocksdb::BlockBasedTable const*, rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::CachableEntry*)
#12 [pd] std::enable_if<((rocksdb::Block_kIndex::kCacheEntryRole)==((rocksdb::CacheEntryRole)13))||(true), rocksdb::Status>::type rocksdb::BlockBasedTable::RetrieveBlock(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry*, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, bool, bool, bool) const
#13 [pd] rocksdb::Status rocksdb::(anonymous namespace)::ReadAndParseBlockFromFile(rocksdb::RandomAccessFileReader*, rocksdb::FilePrefetchBuffer*, rocksdb::Footer const&, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, std::unique_ptr>*, rocksdb::ImmutableOptions const&, rocksdb::BlockCreateContext&, bool, rocksdb::UncompressionDict const&, rocksdb::PersistentCacheOptions const&, rocksdb::MemoryAllocator*, bool, bool)
#14 [pd] rocksdb::BlockFetcher::ReadBlockContents()
#15 [pd] rocksdb::UncompressSerializedBlock(rocksdb::UncompressionInfo const&, char const*, unsigned long, rocksdb::BlockContents*, unsigned int, rocksdb::ImmutableOptions const&, rocksdb::MemoryAllocator*)
#16 [pd] rocksdb::UncompressBlockData(rocksdb::UncompressionInfo const&, char const*, unsigned long, rocksdb::BlockContents*, unsigned int, rocksdb::ImmutableOptions const&, rocksdb::MemoryAllocator*)
#17 [libstdc++.so.6.0.30] operator new(unsigned long)
#18 [libbytehound.so] malloc [api.rs:294]
```

I also saw a smaller leak that may be related to tracing instrumentation, unless I'm misreading the backtrace. Here's a PDF of the full report I generated this morning, kept mostly for posterity so these steps can be reproduced in future debugging sessions: pd-memory-profiling-testnet-56-report-1.pdf

Formatting's a bit wonky. Separately I'll paste in the console code I adapted from the bytehound guide, since that'll be fairly easily copy/pasteable in the future.

bytehound memleak console scripting

```
println("Temporary vs leaked allocations:");
graph()
    .add("Leaked", allocations().only_leaked())
    .add("Temporary", allocations())
    .save();

println("Sort leaks by size:");
let groups = allocations()
    .only_leaked()
    .group_by_backtrace()
    .sort_by_size();
graph().add(groups).save();

// Function to group leaks based on common backtraces
fn analyze_group(list) {
    let list_all = allocations().only_matching_backtraces(list);
    graph()
        .add("Leaked", list_all.only_leaked())
        .add("Temporary", list_all)
        .save();
    println("Total: {}", list_all.len());
    println("Leaked: {}", list_all.only_leaked().len());
    println();
    println("Backtrace:");
    println(list_all[0].backtrace().strip());
}

println("Analyzing most common leak by backtrace:");
analyze_group(groups[0]);

println("Analyzing second most common leak by backtrace:");
analyze_group(groups[1]);

println("Analyzing third most common leak by backtrace:");
analyze_group(groups[2]);
```

conorsch commented 1 year ago

The initial drops added in the linked PR seem to help, but aren't sufficient. I encountered a new leak:

(screenshot: pd-pclientd-leak-partially-fixed-1)

Stack trace:

#00 [libc.so.6] __clone3
#01 [libc.so.6] start_thread
#02 [pd] std::sys::unix::thread::Thread::new::thread_start [thread.rs:108]
#03 [pd] <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once [boxed.rs:1985]
#04 [pd] <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once [boxed.rs:1985]
#05 [pd] core::ops::function::FnOnce::call_once{{vtable.shim}}
#07 [pd] tokio::runtime::blocking::pool::Inner::run
#08 [pd] tokio::runtime::task::harness::Harness<T,S>::poll
#09 [pd] tokio::runtime::task::core::Core<T,S>::poll
#10 [pd] tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
#11 [pd] <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
#12 [pd] tokio::runtime::scheduler::multi_thread::worker::run
#13 [pd] tokio::runtime::context::runtime::enter_runtime
#14 [pd] tokio::runtime::context::scoped::Scoped<T>::set
#15 [pd] tokio::runtime::scheduler::multi_thread::worker::Context::run
#16 [pd] tokio::runtime::scheduler::multi_thread::worker::Context::run_task
#17 [pd] tokio::runtime::task::harness::Harness<T,S>::poll
#18 [pd] tokio::runtime::task::core::Core<T,S>::poll
#19 [pd] tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
#20 [pd] <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
#21 [pd] <hyper::proto::h2::server::H2Stream<F,B> as core::future::future::Future>::poll
#22 [pd] <hyper::proto::h2::PipeToSendStream<S> as core::future::future::Future>::poll
#23 [pd] <http_body::combinators::map_err::MapErr<B,F> as http_body::Body>::poll_data
#24 [pd] <http_body::combinators::map_err::MapErr<B,F> as http_body::Body>::poll_data
#25 [pd] <http_body::combinators::map_err::MapErr<B,F> as http_body::Body>::poll_data
#26 [pd] <tonic::codec::encode::EncodeBody<S> as http_body::Body>::poll_data
#27 [pd] <T as futures_util::fns::FnMut1<A>>::call_mut
#28 [pd] prost::message::Message::encode
#29 [pd] <penumbra_proto::penumbra::core::chain::v1alpha1::CompactBlock as prost::message::Message>::encode_raw
#30 [pd] prost::encoding::message::encode
#31 [pd] prost::encoding::<impl prost::encoding::sealed::BytesAdapter for alloc::vec::Vec<u8>>::append_to
#32 [pd] bytes::bytes_mut::BytesMut::reserve_inner
#33 [pd] alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle
#34 [pd] alloc::raw_vec::finish_grow
#35 [libbytehound.so] realloc [api.rs:378]

Currently pairing with @erwanor.

hdevalence commented 1 year ago

The second leak is more mysterious to me, since the allocation is happening inside the Tonic stack, and I'm not sure why it would be growing a BytesMut and then keeping it around. I think it's worth merging a fix for the part of the problem that's clearly in our part of the stack.

conorsch commented 1 year ago

Recent related changes:

We shipped point releases 0.56.1 and 0.57.1 to evaluate performance improvements. At least one more PR should land in time for 0.58.0 (#2888).

erwanor commented 1 year ago

Moving this issue back to Future since we are prioritizing other items. Over the medium term, we should aim for pd to completely isolate the consensus module from inbound load on its RPCs. Although the current performance pattern is far from optimal, it is probably sufficient when combined with network-level rate limiters and load balancers.

hdevalence commented 9 months ago

Closing as completed since we addressed the memory leaks that were causing the original problem. While there is more work to do, it can be tracked in later issues.