paritytech/polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

Using warp sync on parachain triggers OOM #5053

Open · Dinonard opened this issue 2 months ago

Dinonard commented 2 months ago

Description of bug

As the title suggests, using warp sync on a parachain causes an OOM, crashing the client.

We've had this problem on Astar for a few months, and we recently uplifted to polkadot-sdk v1.9.0 but are still seeing it. There are no notable traces in the log; the node just crashes at some point.

There's an issue opened in our repo, https://github.com/AstarNetwork/Astar/issues/1110, with steps to reproduce as well as images of resource consumption before the crash.

We haven't been able to find similar issues or discussion related to the topic.

Steps to reproduce

Run the latest Astar node as described in the linked issue.

bkchr commented 2 months ago

Hey, this is basically a duplicate of https://github.com/paritytech/polkadot-sdk/issues/4

The problem is that right now we keep the entire state in memory while downloading. For chains that have a big state (I'm not sure what Astar's state size is), this can lead to OOM on machines that don't have enough main memory.
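
(For context, the pattern described here boils down to something like the sketch below. The types and names are hypothetical, not the actual polkadot-sdk sync code: every downloaded key/value pair is buffered in memory and only handed to the import pipeline once the whole state has arrived, so peak memory grows with the total state size.)

// Rough sketch of the accumulate-then-import pattern described above.
// `StateResponse`, `StateSync` and `on_response` are hypothetical names.
use std::collections::HashMap;

struct StateResponse {
    /// (storage key, value) pairs contained in one network response.
    entries: Vec<(Vec<u8>, Vec<u8>)>,
    /// Whether the peer reports the state transfer as complete.
    complete: bool,
}

#[derive(Default)]
struct StateSync {
    /// Every key/value of the target block's state is buffered here until the
    /// download finishes -- this is the part that scales with the chain's state size.
    state: HashMap<Vec<u8>, Vec<u8>>,
}

impl StateSync {
    /// Returns the full state once the download is complete; until then, keeps
    /// accumulating entries in memory.
    fn on_response(&mut self, resp: StateResponse) -> Option<HashMap<Vec<u8>, Vec<u8>>> {
        for (key, value) in resp.entries {
            self.state.insert(key, value);
        }
        resp.complete.then(|| std::mem::take(&mut self.state))
    }
}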

Dinonard commented 2 months ago

Hey @bkchr, I'm not sure this is the same as the issue you linked - the memory consumption suddenly spikes; it doesn't grow slowly over time.

Please check the image here. There are huge spikes that happen abruptly.

bkchr commented 2 months ago

Hmm yeah. I still think it is related to having everything in memory.

Did you try running the node with gdb, to get a stack trace when it OOMs?

Dinonard commented 2 months ago

We haven't, but I'll ask our devops team to do that and I'll post the traces here.

bkchr commented 2 months ago

Ty.

Dinonard commented 2 months ago

Does this help? threads.log

We've used this command to generate it: sudo gdb --batch --quiet -ex "thread apply all bt full" -p <...>

bkchr commented 2 months ago

@Dinonard but this is not from the point when it OOMs. You need to run the node with gdb attached the whole time.

Dinonard commented 1 month ago

TBH I checked the logs myself and couldn't see anything wrong, but AFAIK this dump was taken right after the node started. Let me get back to you.

Dinonard commented 1 month ago

I've run it on the same server as before, with gdb now properly wrapping the entire service. And it's weird:

Thread 12 "tokio-runtime-w" received signal SIGPIPE, Broken pipe.

gdb_dump.log


Sorry about the missing symbols, but I haven't used GDB with a Rust app before. I figured that building with the debug flag (-g) would include the debug info (and symbols?) in the binary itself, but it doesn't seem to have helped.

liuchengxu commented 1 month ago

Just to share my experience with state sync on a chain that has a huge state. The subcoin node crashed the first time it imported the downloaded state at a certain height. I then reran it with the same command, syncing the state at the same height, with gdb attached. Unfortunately (or fortunately :P), it didn't crash under gdb. I observed that this successful state import consumed almost all of my memory (my machine has 128 GiB) at its peak.

2024-08-19 16:35:01.747  INFO tokio-runtime-worker sync: State sync is complete (5438 MiB), restarting block sync.

liuchengxu commented 3 weeks ago

diff --git a/substrate/primitives/trie/src/lib.rs b/substrate/primitives/trie/src/lib.rs
index ef6b6a5743..e0a2cf3b30 100644
--- a/substrate/primitives/trie/src/lib.rs
+++ b/substrate/primitives/trie/src/lib.rs
@@ -296,23 +296,30 @@ where
        V: Borrow<[u8]>,
        DB: hash_db::HashDB<L::Hash, trie_db::DBValue>,
 {
-       {
+       // {
                let mut trie = TrieDBMutBuilder::<L>::from_existing(db, &mut root)
                        .with_optional_cache(cache)
                        .with_optional_recorder(recorder)
                        .build();

+               tracing::info!("====================== Collecting delta");
                let mut delta = delta.into_iter().collect::<Vec<_>>();
+               tracing::info!("====================== Finished Collecting delta: {}", delta.len());
                delta.sort_by(|l, r| l.0.borrow().cmp(r.0.borrow()));
+               tracing::info!("====================== Sorted delta");

-               for (key, change) in delta {
+               tracing::info!("====================== Starting to write trie, mem usage: {:.2?}GiB", memory_stats::memory_stats().map(|usage| usage.physical_mem as f64 / 1024.0 / 1024.0 / 1024.0));
+               for (index, (key, change)) in delta.into_iter().enumerate() {
                        match change.borrow() {
                                Some(val) => trie.insert(key.borrow(), val.borrow())?,
                                None => trie.remove(key.borrow())?,
                        };
                }
-       }
+               tracing::info!("====================== Finished writing delta to trie, mem usage: {:.2?}GiB", memory_stats::memory_stats().map(|usage| usage.physical_mem as f64 / 1024.0 / 1024.0 / 1024.0));
+               drop(trie);
+       // }

+       tracing::info!("====================== End of delta_trie_root, mem usage: {:.2?}GiB", memory_stats::memory_stats().map(|usage| usage.physical_mem as f64 / 1024.0 / 1024.0 / 1024.0));
        Ok(root)
 }

2024-08-29 10:52:53 ====================== Collecting delta
2024-08-29 10:52:55 ====================== Finished Collecting delta: 104674943
2024-08-29 10:52:56 ====================== Sorted delta
2024-08-29 10:52:57 ====================== Starting to write trie, mem usage: Some(26.23)GiB
2024-08-29 10:54:09 ====================== Finished writing delta to trie, mem usage: Some(76.36)GiB
2024-08-29 11:00:28 ====================== End of delta_trie_root, mem usage: Some(90.53)GiB

I added some logging for memory usage in the block import pipeline. It turned out that https://github.com/subcoin-project/polkadot-sdk/blob/13ca1b64692b05b699f49e729d0522ed4be730b9/substrate/primitives/trie/src/lib.rs#L285 is the culprit: memory usage surged from 26 GiB to 76 GiB after the trie was built. Importing the same state does not always succeed; when it crashes due to OOM, "Finished writing delta to trie..." is printed and then the node dies, so "End of delta_trie_root" is never printed. Any clue for further investigation and a potential fix? @bkchr
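
(Side note: the ad-hoc logging above can be folded into a small reusable helper. This is only a sketch built on the same memory-stats and tracing crates the patch uses; with_memory_log is our own name, not a polkadot-sdk API.)

// Sketch of a reusable wrapper around the instrumentation in the patch above.
// Depends on the `memory-stats` and `tracing` crates; `with_memory_log` is a
// hypothetical helper, not part of polkadot-sdk.
fn with_memory_log<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let rss_gib = || {
        memory_stats::memory_stats()
            .map(|usage| usage.physical_mem as f64 / 1024.0 / 1024.0 / 1024.0)
    };
    tracing::info!("{label}: start, physical mem: {:.2?} GiB", rss_gib());
    let out = f();
    tracing::info!("{label}: end, physical mem: {:.2?} GiB", rss_gib());
    out
}

// Usage, e.g. around the delta write loop:
// let result = with_memory_log("write delta to trie", || write_delta(&mut trie, delta));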

liuchengxu commented 2 weeks ago

Constructing the entire trie from the state at a specific height in memory seems to be the primary cause of the OOM issue. This is a critical design flaw, in my opinion, especially since the chain state will continue to grow over time. While Polkadot may be fine for now, it will inevitably face the same problem in the long run if we don't address this. Please prioritize pushing this issue forward. @bkchr

bkchr commented 2 weeks ago

Constructing the entire trie from the state at a specific height in memory seems to be the primary cause of the OOM issue.

Yeah, sounds reasonable. I need to think about how to improve this, but yeah, this should not happen ;)
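
(Purely as an illustration of the direction being discussed, not polkadot-sdk code: if the downloaded state were applied in bounded chunks, with each chunk committed to persistent storage before the next one is processed, peak memory would be proportional to the chunk size rather than to the full state. `StateBackend` and `apply_state_in_chunks` below are hypothetical names.)

// Illustrative sketch only; the trait and function are hypothetical.
trait StateBackend {
    type Error;
    /// Insert a batch of key/value pairs and flush the resulting trie nodes to
    /// persistent storage, returning the new intermediate state root.
    fn commit_chunk(&mut self, chunk: &[(Vec<u8>, Vec<u8>)]) -> Result<[u8; 32], Self::Error>;
}

/// Applies `state` to `backend` in chunks of at most `chunk_size` entries, so
/// memory usage is bounded by the chunk size instead of the full state.
fn apply_state_in_chunks<B: StateBackend>(
    backend: &mut B,
    state: impl Iterator<Item = (Vec<u8>, Vec<u8>)>,
    chunk_size: usize,
) -> Result<Option<[u8; 32]>, B::Error> {
    let mut root = None;
    let mut chunk = Vec::with_capacity(chunk_size);
    for entry in state {
        chunk.push(entry);
        if chunk.len() == chunk_size {
            root = Some(backend.commit_chunk(&chunk)?);
            chunk.clear();
        }
    }
    if !chunk.is_empty() {
        root = Some(backend.commit_chunk(&chunk)?);
    }
    Ok(root)
}

Whether this actually helps depends on the backend being able to persist intermediate trie nodes instead of buffering them in memory, which is the part that still needs design work.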