graphops / subgraph-radio

Gossip about Subgraphs with other Graph Protocol Indexers
docs.graphops.xyz/graphcast/radios/subgraph-radio/intro

Investigate memory usage #147

Closed · pete-eiger closed this issue 5 months ago

pete-eiger commented 5 months ago

This Issue comes on the back of the memory leak issue that we tackled in the Subgraph Radio 1.0.2 release. While the initial Issue has been fixed and the Radio's RAM is no longer climbing constantly, it still appears to consume too much RAM (±4.7GB), and it still looks like it's slowly growing over time (although much less noticeably than before).


A basic instance of the Graphcast SDK only takes about 250MB of RAM when running continuously, so the Radio is using roughly 20x that. We need to understand why it's using so much RAM and whether we can improve this somehow.

pete-eiger commented 5 months ago

Spike findings

Context, complexity of the Issue & recap on previous work

Initially, the memory leak was reported in January of this year. It was particularly hard to find the cause, since it could be in any one of the many components underlying Subgraph Radio:

From the start, the most likely culprits were the C bindings and how they interact with the Go and Rust code, since C is the most prone to memory leak issues, whereas Go (with its garbage collector) and Rust (with its ownership and borrowing system) are memory safe.

It's also worth noting that the leak is/was only happening on Linux (our Dockerfile uses the debian:bullseye-slim base image), indicating that the issue happens in a native system library which the Radio uses, and not necessarily in the Radio code itself (or in the SDK/waku Rust bindings code).

After thorough investigation, the leak was traced to a discv5 issue in the waku-rust-bindings.

In simple terms, the memory usage pattern of the Radio was something like this:

                   __
                  /
                _/ 
             __/   
           _/      
        __/        
_______/           

We could see memory spiking often, in batches of 20-30MB at a time; ultimately that caused the Radio's memory usage to rise to 10GB over a few days, without ever going down.

The Waku team deployed a fix for the issue on their side, and straight away we saw an improvement. The memory usage pattern turned into something like this:

                                            _____________
                            _______________/
            _______________/
          _/       
        _/        
      _/         
    _/          
  _/           
_/             

This was a notable improvement! The big spikes were gone, and memory now reached about 4.5GB after a few days. We saw a huge initial spike after the Radio started, which then settled into a fairly stable plateau. But again, the memory usage never went down, and despite it rising much more slowly than before, we still had two big problems:

  1. The memory kept going up, indicating that there was still a memory leak.
  2. 3-5GB of RAM is too much for the Radio, given that a simple instance of the Graphcast SDK/Waku Rust bindings/Go waku node typically needs 200-250MB (thanks to @aasseman for reviving the memory leak issue with these concerns, btw).

Investigation

As mentioned above, it's very hard to trace memory issues and profile memory usage in a system that jumps through as many hoops as Subgraph Radio does. There are three languages involved in the different layers of the application - Rust, Go and C - and all of them interact with the host system in various ways.

Ultimately, only heaptrack proved useful (I tried valgrind before but its output didn't reveal much).

Profiling the Radio's memory usage

I profiled the Radio's memory usage on a macOS machine running the Radio in a Docker container. For this, I just added heaptrack to the Dockerfile along with our other packages:

RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        wget \
        ...
        heaptrack \
        ...

To run heaptrack in the container, I point it to the subgraph-radio release binary:

docker exec -it subgraph-radio heaptrack /usr/local/bin/subgraph-radio

I let the Radio run for 10-15 minutes and then stop the container, which leaves me with a heaptrack report file (heaptrack.subgraph-radio.21.gz) that I copy over to my host machine (docker cp subgraph-radio:/heaptrack.subgraph-radio.21.gz ./heaptrack-output.gz). From there, you can use either heaptrack_gui or heaptrack_print to analyze the report in a human-friendly way (I used the print option - heaptrack_print heaptrack-output.gz | less).

The culprit

Right after opening the heaptrack report, a few issues stand out. Raw sample from the report:

reading file "heaptrack-output.gz" - please wait, this might take some time...
Debuggee command was: /usr/local/bin/subgraph-radio
finished reading file, now analyzing data:

MOST CALLS TO ALLOCATION FUNCTIONS
36611151 calls to allocation functions with 38.94M peak consumption from
CRYPTO_zalloc
  in /usr/lib/aarch64-linux-gnu/libcrypto.so.1.1
773090 calls with 0B peak consumption from:
    OPENSSL_sk_new_reserve
    ...

From this we can already see that the issue is coming from the CRYPTO_zalloc function in libcrypto.so.1.1, the OpenSSL shared library shipped with Linux/Debian. For context, this function is responsible for allocating memory and initializing it to zero.

In the short time that the Radio ran (10-15 minutes), it called the CRYPTO_zalloc function almost 37 million times - that's a lot of heap allocations without corresponding deallocations! The main villain CRYPTO_zalloc also has a sidekick - OPENSSL_sk_new_reserve, which was called roughly 773 thousand times. Without a doubt, OpenSSL is our culprit.

The Radio uses OpenSSL when it's running in Linux (Debian) environments because both the SDK and the Radio use reqwest, which by default uses native-tls. And here's the catch that explains why this memory leak only happens on Linux (Debian) and not macOS: on macOS, native-tls uses the native security framework provided by the operating system, Secure Transport, but on Linux (Debian), native-tls uses OpenSSL as the underlying TLS library.
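
To make the dependency chain concrete, here's a minimal, hypothetical sketch (not the Radio's actual code; the endpoint and the tokio runtime are assumptions) of the kind of HTTP call the SDK and Radio make through reqwest. Nothing in the application code mentions TLS at all - the backend is chosen at compile time by reqwest's cargo features, and with the defaults that backend is native-tls (OpenSSL on Debian, Secure Transport on macOS):

// Minimal sketch - not the Radio's actual code. The TLS backend never shows up
// here: reqwest picks it from its cargo features, and the default feature set
// links native-tls (which means OpenSSL on Debian).
#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let client = reqwest::Client::new();
    // Placeholder endpoint standing in for whatever the SDK/Radio actually query.
    let status = client
        .get("https://example.com/graphql")
        .send()
        .await?
        .status();
    println!("upstream status: {status}");
    Ok(())
}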

Solution

History of native-tls

Before outlining the solution, it's worth touching on why native-tls is used in reqwest and so many other well-established Rust crates. The first and most obvious reason is that, at the time reqwest was created, there weren't many alternatives: reqwest was published on December 6, 2016, and native-tls shortly before that, on May 27, 2016.

native-tls is also useful because it relies on the platform's native TLS libraries (OpenSSL on Unix/Linux, Secure Transport on macOS, and SChannel on Windows), so using native-tls provides compatibility across various environments.

Our Saviour - rustls

rustls was published on April 10, 2017. It's a modern TLS library written entirely in Rust, created to provide a memory-safe, efficient, and secure TLS implementation. Unlike native-tls, which relies on external libraries, rustls is built from the ground up in Rust with a focus on security and performance, and it uses Rust's ownership and type system to ensure memory safety and thread safety.

A bit of language beef and the importance of memory safety

OpenSSL, which caused all this, is built with C, a very memory-unsafe language. Secure Transport, the default TLS framework that native-tls uses on macOS, is built with Objective-C, which uses automatic reference counting to manage memory, making it far less leak-prone than plain C. That's why native-tls didn't cause any problems when we ran the Radio on macOS.

Numbers don't lie / vires in numeris

In order to use rustls instead of native-tls I switched the way we configure reqwest. From:

reqwest = { version = "0.11.17", features = ["json"] }

To:

reqwest = { version = "0.11.24", features = ["json", "rustls-tls"] }

We have two other dependencies in the Graphcast SDK that rely on native-tls - slack-morphism and teloxide, which are used for our Slack and Telegram bots. I removed them for the purposes of this Spike.
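
For illustration, here's a minimal sketch (assuming reqwest is compiled with the rustls-tls feature as above; build_client is a hypothetical helper, not code from the Radio) showing that the backend can also be selected explicitly where the client is built, so the choice doesn't silently depend on which cargo features end up enabled:

// Illustrative sketch, assuming the `rustls-tls` feature is enabled:
// selecting rustls explicitly at client-construction time makes the backend
// choice visible in code rather than implied by cargo features alone.
fn build_client() -> Result<reqwest::Client, reqwest::Error> {
    reqwest::Client::builder()
        .use_rustls_tls()
        .build()
}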

After making those changes and profiling the Radio again, this is the result:

Initial Peak Memory Consumption: 38.94M
Updated Peak Memory Consumption: 9.46M
Improvement: ~75%

Initial Allocation Calls: 37 million
Updated Allocation Calls: 563 thousand
Improvement: ~98%

Why rustls works

OpenSSL relies on manual memory allocation and deallocation, which can lead to memory leaks if not handled correctly. It also maintains a lot of global state, which can cause issues when used in multi-threaded environments like those in Rust applications. This global state management can result in memory being allocated and never freed correctly, especially in containerized environments, and especially when crossing FFI boundaries. Historically, OpenSSL also required explicit initialization and management of locks, which is error-prone; without proper setup, multi-threaded applications can run into race conditions and memory leaks.

rustls, on the other hand, follows Rust's memory safety guarantees. It leverages Rust's ownership, borrowing, and lifetime semantics to manage memory automatically, reducing the risk of leaks. It's also designed to work well in multi-threaded environments, avoiding the global state issues that can plague OpenSSL. And because rustls is written in Rust, it integrates naturally with Rust applications, avoiding the pitfalls and complexity of FFI. It also fits Rust's concurrency model: it doesn't rely on global state or require explicit lock management.
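
To illustrate the concurrency point, here's a small hypothetical sketch (again assuming the rustls-tls feature and a tokio runtime): a single rustls-backed reqwest client can be cloned cheaply and shared across tasks, with no global TLS state to initialize and no locks to manage:

// Hypothetical sketch: one rustls-backed client shared across concurrent tasks.
// reqwest::Client is internally reference-counted, so cloning is cheap and
// there is no global TLS state to initialize or lock.
#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let client = reqwest::Client::builder().use_rustls_tls().build()?;

    let tasks: Vec<_> = (0..4)
        .map(|i| {
            let client = client.clone();
            tokio::spawn(async move {
                // Placeholder endpoint; the Radio would hit its own HTTP targets.
                let res = client.get("https://example.com/").send().await;
                println!("task {i}: {:?}", res.map(|r| r.status()));
            })
        })
        .collect();

    for t in tasks {
        let _ = t.await;
    }
    Ok(())
}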

Next Steps

aasseman commented 5 months ago

Wow, this is an awesome report! Thanks for sharing your methodology, I didn't know about heaptrack. It seems that the actual problem is either in reqwest or in whatever the FFI wrapper crate for OpenSSL is. But OpenSSL is an infinite pit of CVEs anyway, so good riddance. The only argument I've heard for sticking with the system OpenSSL lib is that it gets updated by the OS, so you should always have a patched version regardless of whether the actual app gets patched. Probably not worth the hassle though, especially if users rely on the official container image, which by definition will ship a frozen OpenSSL lib anyway.

aasseman commented 5 months ago

Could heaptrack help for https://github.com/graphprotocol/indexer/issues/41? :grin:

pete-eiger commented 5 months ago

Could heaptrack help for graphprotocol/indexer#41? 😁

hopefully! 🤞🏻