Initially the memory leak was reported in January of this year. It was particularly hard to find the cause, since it could be in any of the many components underlying Subgraph Radio.
From the start, the most likely culprits were the C bindings and how they interact with the Go and Rust code, since C is the most prone to memory leak issues, whereas Go (with its garbage collector) and Rust (with its ownership and borrowing system) are memory safe.
It's also worth noting that the leak is/was only happening on Linux (our Dockerfile uses the debian:bullseye-slim image), indicating that the issue originates in a native system library which the Radio uses, and not necessarily in the Radio code itself (or in the SDK/waku Rust bindings code).
After thorough investigation, the leak was traced to a discv5 issue on the waku-rust-bindings.
In simple terms, the memory usage pattern of the Radio was something like this:
                  __
                  /
                _/
             __/
           _/
        __/
_______/
Here we could see memory spiking often, in batches of 20-30MB at a time; over a few days this drove the Radio's memory usage up to 10GB, without it ever going down.
The Waku team deployed a fix for the issue on their side, and straight away we saw an improvement. The memory usage pattern turned to something like this:
                                            _____________
                            _______________/
            _______________/
          _/
        _/
      _/
    _/
  _/
_/
Which was a notable improvement! The big spikes were gone, and the memory only reached 4.5GB after a few days. We saw a huge initial spike after the Radio started, which then turned into a fairly stable plateau. But again, the memory usage never went down, and even though it rose far more slowly than before, we still had two big problems: the baseline consumption was still much too high, and it was still creeping up over time.
As mentioned above, it's very hard to trace memory issues and profile memory usage in a system that jumps through as many hoops as Subgraph Radio does. Three languages are involved in the different layers of the application - Rust, Go and C - and all of them interact with the host system in various ways.
Ultimately, only heaptrack proved useful (I tried valgrind before, but its output didn't reveal much).
I profiled the Radio's memory on a macOS machine that runs the Radio in a Docker container. For this I just added heaptrack to the Dockerfile along with our other packages:
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        wget \
        ...
        heaptrack \
        ...
To run heaptrack in the container, I point it at the subgraph-radio release binary:
docker exec -it subgraph-radio heaptrack /usr/local/bin/subgraph-radio
I let the Radio run for 10-15 minutes and then stop the container. That leaves me with a heaptrack report file (heaptrack.subgraph-radio.21.gz), which I copy over to my host machine (docker cp subgraph-radio:/heaptrack.subgraph-radio.21.gz ./heaptrack-output.gz). From there, you can use either heaptrack_gui or heaptrack_print to analyze the report in a human-friendly way (I used the print option - heaptrack_print heaptrack-output.gz | less).
Right after opening the heaptrack report, a few issues stand out. Raw sample from the report:
reading file "heaptrack-output.gz" - please wait, this might take some time...
Debuggee command was: /usr/local/bin/subgraph-radio
finished reading file, now analyzing data:
MOST CALLS TO ALLOCATION FUNCTIONS
36611151 calls to allocation functions with 38.94M peak consumption from
CRYPTO_zalloc
in /usr/lib/aarch64-linux-gnu/libcrypto.so.1.1
773090 calls with 0B peak consumption from:
OPENSSL_sk_new_reserve
...
From this we can already see that the issue is coming from the CRYPTO_zalloc function in libcrypto.so.1.1, the OpenSSL library shipped with Debian. For context, this function is responsible for allocating memory and initializing it to zero.
In the short time that the Radio ran (10-15 minutes), it called the CRYPTO_zalloc function almost 37 million times - that's a lot of heap allocations without corresponding deallocations! The main villain CRYPTO_zalloc also has a sidekick - OPENSSL_sk_new_reserve - which was called roughly 773 thousand times. Without a doubt, OpenSSL is our culprit.
The Radio uses OpenSSL when it's running in Linux (Debian) environments because both the SDK and the Radio use reqwest, which by default uses native-tls. And here's the catch that explains why this memory leak only happens on Linux (Debian) and not on macOS - on macOS, native-tls uses the operating system's native security framework, Secure Transport, but on Linux (Debian) it uses OpenSSL as the underlying TLS library.
native-tls
Before outlining the solution, it's worth touching on why native-tls is used in reqwest and so many other well-established Rust crates. The first and most obvious reason is that, at the time reqwest was created, there weren't many alternatives. reqwest was published on December 6, 2016, and native-tls shortly before that - on May 27, 2016.
native-tls is also useful because it relies on the platform's native TLS libraries (OpenSSL on Unix/Linux, Secure Transport on macOS, and SChannel on Windows), so using native-tls provides compatibility across various environments.
rustls
rustls was published on April 10, 2017. It's a modern TLS library written entirely in Rust, created to provide a memory-safe, efficient, and secure TLS implementation. Unlike native-tls, which relies on external libraries, rustls is built from the ground up in Rust, focusing on security and performance. It uses Rust's ownership and type system to ensure memory safety and thread safety.
OpenSSL, which caused all this, is written in C, a very memory-unsafe language. Secure Transport, the default TLS framework that native-tls uses on macOS, is built with Objective-C, which uses automatic reference counting to manage memory, making it memory safe, unlike plain C. That's why native-tls didn't cause any problems when we ran the Radio on macOS.
In order to use rustls instead of native-tls, I switched the way we configure reqwest. From:
reqwest = { version = "0.11.17", features = ["json"] }
To:
reqwest = { version = "0.11.24", features = ["json", "rustls-tls"] }
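One caveat worth noting: depending on how reqwest's features are combined, the default-tls feature (which is backed by native-tls) can stay enabled alongside rustls-tls, in which case native-tls is still compiled in. Below is a minimal, hedged sketch of forcing the choice at runtime - it assumes the rustls-tls feature is enabled and is not the Radio's actual code:
use reqwest::Client;

// Sketch only: explicitly select rustls for this client, so OpenSSL is never
// used at runtime even if native-tls also ends up in the build.
// `use_rustls_tls()` is only available with reqwest's "rustls-tls" feature.
fn build_client() -> reqwest::Result<Client> {
    Client::builder()
        .use_rustls_tls()
        .build()
}
Alternatively, setting default-features = false on the reqwest dependency drops the native-tls backend from reqwest's build entirely (though other dependencies may still pull it in).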
We have two other dependencies in the Graphcast SDK that rely on native-tls - slack-morphism and teloxide, which are used for our Slack and Telegram bots. I removed them for the purposes of this Spike.
After making those changes and profiling the Radio again, this is the result:
Initial Peak Memory Consumption: 38.94M
Updated Peak Memory Consumption: 9.46M
Improvement: ~75%
Initial Allocation Calls: 37 million
Updated Allocation Calls: 563 thousand
Improvement: ~98%
Why rustls works
OpenSSL relies on manual memory allocation and deallocation, which can lead to memory leaks if not handled correctly. OpenSSL also maintains a lot of global state, which can cause issues when used in multi-threaded environments like those in Rust applications. This global state management can result in memory being allocated and not freed correctly, especially in containerized environments, and especially when using FFI. OpenSSL requires explicit initialization and management of locks, which can be error-prone. The default behavior without proper initialization can lead to race conditions and memory leaks in multi-threaded applications.
rustls, on the other hand, follows Rust's memory safety guarantees. It leverages Rust's ownership, borrowing, and lifetime semantics to manage memory automatically, reducing the risk of leaks. It's also designed to work well in multi-threaded environments, avoiding the global state issues that can plague OpenSSL. And because rustls is written in Rust, it naturally integrates with Rust applications, avoiding the pitfalls of FFI and the associated complexity. It also inherently supports Rust's concurrency model: it doesn't rely on global state or require explicit lock management, thus avoiding common pitfalls associated with OpenSSL.
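To make the concurrency point concrete, here's a small, hedged sketch (not the Radio's actual code; it assumes tokio and reqwest with the rustls-tls feature): a single rustls-backed client is cloned into several tasks, with no global TLS state or manual lock management involved.
use reqwest::Client;

// Sketch only: one rustls-backed client shared across concurrent tasks.
// reqwest::Client is internally reference-counted, so cloning it is cheap.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::builder().use_rustls_tls().build()?;
    let mut handles = Vec::new();
    for i in 0..4 {
        let client = client.clone();
        handles.push(tokio::spawn(async move {
            // example.com is just a placeholder endpoint for illustration
            let status = client.get("https://example.com").send().await.map(|r| r.status());
            println!("task {i}: {status:?}");
        }));
    }
    for handle in handles {
        handle.await?;
    }
    Ok(())
}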
The recommendation is to switch from native-tls to rustls everywhere. This includes the reqwest, slack-morphism and teloxide crates.

Wow, this is an awesome report! Thanks for sharing your methodology, I didn't know about heaptrack. It seems that the actual problem is either in reqwest, or in whatever the FFI wrapper crate is for OpenSSL. But OpenSSL is an infinite pit of CVEs anyway, so good riddance. The only argument I've heard for sticking with the system OpenSSL lib is that it gets updated by the OS, so you should always have a patched version regardless of whether the actual app gets patched. Probably not worth the hassle though, especially if users rely on the official container image, which by definition will ship a frozen OpenSSL lib anyway.
Could heaptrack help for graphprotocol/indexer#41? 😁
hopefully! 🤞🏻
This Issue comes on the back of the memory leak issue that we tackled in the Subgraph Radio 1.0.2 release. While the initial issue has been fixed and the Radio's RAM is no longer climbing all the time, it still appears to be consuming too much RAM (~4.7GB), and it still looks like it's slowly growing over time (although far less noticeably than before).
A basic instance of the Graphcast SDK only takes about 250MB of RAM when running continuously, so the Radio's usage is almost 20x that. We need to understand why it's using so much RAM and whether we can improve this somehow.