Closed. jimmygchen closed this issue 3 months ago.
Looks like a panic inside publish_block (after the block is produced).
Yeah, I think Pawan figured it's most likely something to do with the compute_cells_and_kzg_proofs
function (which constructs data column sidecars from blobs), which is invoked from here: https://github.com/sigp/lighthouse/blob/ba8126e02fa8c18a6b8aafde4550088f0721b6c1/beacon_node/beacon_chain/src/block_verification.rs#L740
I think the stack overflow in this case was a red herring -- running the LH node in debug mode on the devnet shows the issue was that rust-embed
was, for some reason, either:
The fix, replacing rust-embed with include_str, has been merged into the das branch -- currently waiting for a few folks who reproduced the issue to confirm the issue is fixed on their end. For quick testing, I've uploaded docker images with the fix here: kevaundray/lighthouse-eth:das, for use in kurtosis.
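For context on why that swap helps (a minimal illustration; the folder path, struct, and constant names below are hypothetical, not Lighthouse's actual code): by default, rust-embed's derive reads the embedded folder from disk at runtime in debug builds, whereas include_str! bakes the file contents into the binary at compile time, so a binary copied into a slim runtime image still works.

// Hypothetical sketch -- folder and file names are illustrative only.
// rust-embed: in a debug build the derive reads "assets/" from disk at
// runtime, so a binary copied into another image without that folder cannot
// find the files.
#[derive(rust_embed::RustEmbed)]
#[folder = "assets/"]
struct Assets;

// include_str!: the file contents are embedded at compile time, so the
// resulting binary is self-contained in both debug and release builds.
const TRUSTED_SETUP: &str = include_str!("../assets/trusted_setup.txt");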
Hi @kevaundray
I've tried the kevaundray/lighthouse-eth:das image and managed to reproduce the error:
thread '<unknown>' has overflowed its stack
fatal runtime error: stack overflow
I can take another look tomorrow too.
Oh that's surprising -- if you run it in debug mode, then that should give a stack trace.
No rush on this, I can try running it on a different x86_64 machine too, and in the meantime I will update the image to use debug mode.
To help debug further, I have created three docker images in debug mode based on the last three commits on the das branch and pasted their output in gist files.
In lighthouse/src/main.rs, I changed:

if std::env::var("RUST_BACKTRACE").is_err() {
    std::env::set_var("RUST_BACKTRACE", "1");
}
to
if std::env::var("RUST_BACKTRACE").is_err() {
    std::env::set_var("RUST_BACKTRACE", "full");
}
In hindsight, I could have just modified the Dockerfile and used the dev profile. Instead, I went a roundabout way: adding --debug to the Makefile's cargo install commands and commenting out the places in the Dockerfile and Makefile that use the profile flag/env variable.
I created a bash script that checked out the following commits from the das branch:
These are in order of newest to oldest.
The script then built and published a docker image for each of these commits, with the commit hash in the tag name.
# filename: peerdas.yaml
participants:
  # Supernodes
  - cl_type: prysm
    cl_image: ethpandaops/prysm-beacon-chain:peerdas-initial-sync
    cl_extra_params: [--subscribe-all-subnets]
  - cl_type: lighthouse
    cl_image: kevaundray/lighthouse-eth:das-dbg-{COMMIT_HASH}
    cl_extra_params: [--subscribe-all-data-column-subnets]
  # Non supernodes
  - cl_type: prysm
    cl_image: ethpandaops/prysm-beacon-chain:peerdas-initial-sync
  - cl_type: lighthouse
    cl_image: kevaundray/lighthouse-eth:das-dbg-{COMMIT_HASH}
network_params:
  eip7594_fork_epoch: 0
  eip7594_fork_version: "0x50000038"
snooper_enabled: true
global_log_level: debug
additional_services:
  - dora
  - goomy_blob
Using Kurtosis, I ran the following command:
kurtosis run --enclave local-testnet github.com/ethpandaops/ethereum-package --args-file peerdas.yaml --image-download always
The main difference from the start_local_testnet script is that I pass --image-download always -- by default, Kurtosis seems to use any cached image on the system locally, regardless of the tag. The docs mention it here.
On commit 018f38257, I get the rust-embed issue I mentioned before. Contents have been copied into this gist.
On commit 55a3be72b, I get the rust-embed issue also. Contents have been copied into this gist.
On commit 04d9eef8a, I don't get the rust-embed issue, however there seems to be an issue with the vc-lighthouse-geth pairs. This is possibly related to the timeout. Contents have been copied into this gist.
A lot of time was spent on "Execution in progress" when adding a lighthouse node:
This is Kurtosis waiting for Lighthouse to complete the startup process and become healthy before proceeding.
Yeah I'm not exactly sure what's taking a long time, but I don't think it's to do with KZG as it loads in about 500ms. I'll look into the slow startup issue.
Seems to load pretty quickly on my amd machine. Would be useful to check the Lighthouse container logs while "Execution in progress" is happening.
For the latest commit (04d9eef8a), the logs for vc-geth-lighthouse were:
docker logs -f vc-2-geth-lighthouse--66190c11946b4c7980461b123f308375
Jul 18 08:24:39.178 INFO Logging to file path: "/validator-keys/keys/logs/validator.log"
Jul 18 08:24:39.274 INFO Lighthouse started version: Lighthouse/v5.2.1-04d9eef+
Jul 18 08:24:39.274 INFO Configured for network name: custom (/network-configs)
Unable to initialize validator config: Unable to parse proposer_nodes: Unknown argument or group id. Make sure you are using the argument id and not the short or long flags
My current hypothesis is that the lengthy periods of time are due to retries, because the error messages Kurtosis prints when a service fails to start note the retry attempts:
Caused by: Unsuccessful ports check for IP '172.16.0.17' and port spec '{privatePortSpec:0x400088c570}', even after '240' retries with '500' milliseconds in between retries. Timeout '2m0s' has been reached
If this is correct, then the long startup times on 018f38257 and 55a3be72b would be due to Kurtosis retrying on the rust-embed issue, and the long startup times on 04d9eef8a would possibly be due to Kurtosis retrying on the Unable to initialize validator config issue.
For completeness, I'll add what the Dockerfile and Makefile that I used to build the docker images look like:
FROM rust:1.78.0-bullseye AS builder
RUN apt-get update && apt-get -y upgrade && apt-get install -y cmake libclang-dev
COPY . lighthouse
ARG FEATURES
# ARG PROFILE=release
ARG CARGO_USE_GIT_CLI=true
ENV FEATURES $FEATURES
# ENV PROFILE $PROFILE
ENV CARGO_NET_GIT_FETCH_WITH_CLI=$CARGO_USE_GIT_CLI
RUN cd lighthouse && make
FROM ubuntu:22.04
RUN apt-get update && apt-get -y upgrade && apt-get install -y --no-install-recommends \
libssl-dev \
ca-certificates \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
COPY --from=builder /usr/local/cargo/bin/lighthouse /usr/local/bin/lighthouse
The main change is that I removed the PROFILE env variable.
The only parts I changed in the Makefile were the install commands:
install:
    cargo install --debug --path lighthouse --force --locked \
        --features "$(FEATURES)" \
        $(CARGO_INSTALL_EXTRA_FLAGS)

# Builds the lcli binary in release (optimized).
install-lcli:
    cargo install --debug --path lcli --force --locked \
        --features "$(FEATURES)" \
        $(CARGO_INSTALL_EXTRA_FLAGS)
The main changes are that I removed the profile flag and added --debug.
I've pushed a new image with the fix in PR 6125 -- the tag is kevaundray/lighthouse-eth:das-dbg-cli-fix-04d9eef8a.
For startup times, I get ~10 seconds to start up a cl-2-lighthouse-X pair and almost instantaneous for a vc-lighthouse-X pair.
I am still getting a stack overflow, though it's a bit more detailed:
Jul 18 06:00:39.835 DEBG Beacon proposer found validator_index: 233, slot: 21, head_block_root: 0xc65d38627d411aaee3617a8019e3f942cd57c9b85da1101862649641ab696619, payload_attributes: V3(PayloadAttributesV3 { timestamp: 1721282440, prev_randao: 0x6568eee59d352cf3e7dcbd1aed7783a7d8ed5c0dcd609b5fdfd5dae6445927f7, suggested_fee_recipient: 0x8943545177806ed17b9f23f0a21ee5948ecaa776, withdrawals: [], parent_beacon_block_root: 0xc65d38627d411aaee3617a8019e3f942cd57c9b85da1101862649641ab696619 }), service: exec
Jul 18 06:00:40.000 DEBG Processed HTTP API request method: GET, path: /eth/v1/node/version, status: 200 OK, elapsed: 1.281542ms
Jul 18 06:00:40.076 DEBG Running beacon chain per slot tasks slot: Slot(21), service: beacon
Jul 18 06:00:40.077 DEBG No change in canonical head head: 0xc65d38627d411aaee3617a8019e3f942cd57c9b85da1101862649641ab696619, service: beacon
Jul 18 06:00:40.079 DEBG Processed HTTP API request method: GET, path: /eth/v1/validator/duties/proposer/0, status: 200 OK, elapsed: 5.268625ms
Jul 18 06:00:40.081 DEBG Processed HTTP API request method: GET, path: /eth/v1/config/spec, status: 200 OK, elapsed: 6.118875ms
Jul 18 06:00:40.169 DEBG Processed HTTP API request method: POST, path: /eth/v1/validator/duties/attester/0, status: 200 OK, elapsed: 11.719875ms
Jul 18 06:00:40.174 DEBG Received proposer preparation data count: 64
Jul 18 06:00:40.175 DEBG Processed HTTP API request method: POST, path: /eth/v1/validator/prepare_beacon_proposer, status: 200 OK, elapsed: 16.40725ms
Jul 18 06:00:40.175 DEBG Block production request from HTTP API slot: 21
Jul 18 06:00:40.175 DEBG Fork choice successfully updated before block production, fork_choice_slot: 21, slot: 21, service: beacon
Jul 18 06:00:40.176 DEBG Not attempting re-org reason: head not late, service: beacon
Jul 18 06:00:40.327 ERRO No valid eth1_data votes, `votes_to_consider` empty, outcome: casting `state.eth1_data` as eth1 vote, genesis_time: 1721282188, earliest_block_timestamp: 1721282168, lowest_block_number: 0, service: deposit_contract_rpc
Jul 18 06:00:40.328 DEBG Produced vote for eth1 chain block_hash: 0x78ce9187e998a0436cec8cc1d2d72d43d6263f8f257b2c479ad2764cc5637b69, deposit_count: 0, deposit_root: 0xd70a234731285c6804c2a4f56711ddb8c82c99740f207854891028af34e27e5e, service: deposit_contract_rpc
Jul 18 06:00:40.336 DEBG Issuing engine_getPayload parent_hash: 0x3a41a02252cd0769ff78d2d58366706c02c252dfab081d372b7d45b37bd954ab, timestamp: 1721282440, prev_randao: 0x6568eee59d352cf3e7dcbd1aed7783a7d8ed5c0dcd609b5fdfd5dae6445927f7, suggested_fee_recipient: 0x8943545177806ed17b9f23f0a21ee5948ecaa776, service: exec
Jul 18 06:00:41.229 DEBG Produced block on state block_size: 3781, service: beacon
thread 'tokio-runtime-worker' has overflowed its stack
fatal runtime error: stack overflow
Full logs: https://gist.github.com/kevaundray/04504a2bbb175a9485f120251b9a24a3
Unable to initialize validator config: Unable to parse proposer_nodes: Unknown argument or group id. Make sure you are using the argument id and not the short or long flags
Yep, your image actually revealed a bug we've had for a while! We rarely run the validator client in debug mode, so it's gone unnoticed. Fixed in #6125
Oh great, can confirm that that does fix it in debug mode!
@jimmygchen I was away for a bit. I added debug lines into the code, and they seem to imply that the issue was never peerdas-kzg:
Jul 20 06:44:46.174 DEBG let kzg_proofs = Vec::from(proofs) service: beacon
Jul 20 06:44:46.174 DEBG let kzg = self.kzg service: beacon
Jul 20 06:44:46.174 DEBG let blobs_ref: Vec<_> service: beacon
[beacon_node/beacon_chain/src/kzg_utils.rs:85:5] blobs.len() = 4
thread 'tokio-runtime-worker' has overflowed its stack
fatal runtime error: stack overflow
In the complete_partial_beacon_block method, I added the following debug lines, corresponding to the log snippet above.
let kzg_proofs = Vec::from(proofs);
debug!(
    self.log,
    "let kzg_proofs = Vec::from(proofs)";
);
let kzg = self
    .kzg
    .as_ref()
    .ok_or(BlockProductionError::TrustedSetupNotInitialized)?;
debug!(
    self.log,
    "let kzg = self.kzg";
);
let blobs_ref: Vec<_> = blobs.iter().collect();
debug!(
    self.log,
    "let blobs_ref: Vec<_>";
);
kzg_utils::validate_blobs::<T::EthSpec>(
    kzg,
    expected_kzg_commitments,
    blobs_ref,
    &kzg_proofs,
)
.map_err(BlockProductionError::KzgError)?;
debug!(
    self.log,
    "kzg_utils::validate_blobs";
);
In validate_blobs, I added the following debug lines. Note that I use dbg!, so it will print differently in the terminal.
pub fn validate_blobs<E: EthSpec>(
    kzg: &Kzg,
    expected_kzg_commitments: &[KzgCommitment],
    blobs: Vec<&Blob<E>>,
    kzg_proofs: &[KzgProof],
) -> Result<(), KzgError> {
    let _timer = crate::metrics::start_timer(&crate::metrics::KZG_VERIFICATION_BATCH_TIMES);
    dbg!("blobs size: ", blobs.len());
    let blobs = blobs
        .into_iter()
        .map(|blob| ssz_blob_to_crypto_blob::<E>(blob))
        .collect::<Result<Vec<_>, KzgError>>()?;
    dbg!("blobs collected");
    let ok = kzg.verify_blob_kzg_proof_batch(&blobs, expected_kzg_commitments, kzg_proofs);
    dbg!("verified verify_blob_kzg_proof_batch");
    ok
}
This implies that the stack overflow is happening with four blobs on this line:
let blobs = blobs
    .into_iter()
    .map(|blob| ssz_blob_to_crypto_blob::<E>(blob))
    .collect::<Result<Vec<_>, KzgError>>()?;
Looking further, ssz_blob_to_crypto_blob calls KzgBlob::from_bytes(blob.as_ref()).map_err(Into::into). The from_bytes method looks like this:
pub fn from_bytes(bytes: &[u8]) -> Result<Self, Error> {
    if bytes.len() != BYTES_PER_BLOB {
        return Err(Error::InvalidBytesLength(format!(
            "Invalid byte length. Expected {} got {}",
            BYTES_PER_BLOB,
            bytes.len(),
        )));
    }
    let mut new_bytes = [0; BYTES_PER_BLOB];
    new_bytes.copy_from_slice(bytes);
    Ok(Self::new(new_bytes))
}
With four blobs this would allocate 4 * 4096 * 32 bytes ≈ 524 KB onto the stack -- this is likely not the only contributor, but a big part of it.
I had assumed that the stack overflow was coming from rust-embed and was being converted into a panic in debug mode.
The issue with rust-embed only happens in debug mode and it only happens because the Dockerfile copies the binary from one image to another. I have created a proof of concept of this here: https://github.com/kevaundray/poc-rust-embed
This would imply that the original issue was always the stack overflow due to validate_blobs.
I'll note that a possible quick fix for this is to decrease the batch size to 1 and verify each blob individually using a for loop. This should allow the compiler to free each blob after every iteration, because it can see the blob variable going out of scope.
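A minimal sketch of that per-blob loop, reusing only the names from the validate_blobs snippet above and assuming KzgCommitment and KzgProof are cheap to clone (this illustrates the workaround, not the actual fix):

pub fn validate_blobs<E: EthSpec>(
    kzg: &Kzg,
    expected_kzg_commitments: &[KzgCommitment],
    blobs: Vec<&Blob<E>>,
    kzg_proofs: &[KzgProof],
) -> Result<(), KzgError> {
    for ((blob, commitment), proof) in blobs
        .into_iter()
        .zip(expected_kzg_commitments.iter().cloned())
        .zip(kzg_proofs.iter().cloned())
    {
        // Convert and verify a single blob per iteration; the converted blob
        // goes out of scope at the end of each iteration instead of all
        // conversions accumulating on the stack before the batch call.
        let blob = ssz_blob_to_crypto_blob::<E>(blob)?;
        kzg.verify_blob_kzg_proof_batch(&[blob], &[commitment], &[proof])?;
    }
    Ok(())
}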
Is the crash only happening in debug builds? AFAIK Lighthouse is basically not expected to work compiled in debug, and fixing this might be very involved.
One of the main sources of stack bloat for us is futures, and previous investigations using the compiler's -Zprint-type-sizes flag showed a bunch of massive futures:
RUSTFLAGS='-Zprint-type-sizes' cargo +nightly build --release > type_sizes.txt
I suspect running that command on the relevant branch without --release will reveal futures that are even larger than when we compile in release, due to rustc's inability to optimise them.
We made some improvements when we were getting stack overflows on the HTTP API in this PR, although the root of the issue still wasn't fully established:
It seems that warp does some very deep nesting of futures that we have little control over. Warp has been annoying in many other ways, so there is some push to move the HTTP API server to axum. This may be the best way to solve this problem for good.
If it is of interest to you, I would appreciate your insights and research here @kevaundray, but I also think you'd be justified in tapping out and focussing on PeerDAS. As a workaround, if you'd like to run Lighthouse with debug assertions, you could turn them on in the release profile:
# In top-level Cargo.toml
[profile.release]
debug-assertions = true
Or add a new debug-release profile:
[profile.debug-release]
inherits = "release"
debug-assertions = true
This would also have the advantage of optimising the BLS crypto code, which is usually very slow in debug builds too.
with four blobs this would allocate 4 * 4096 * 32 bytes ≈ 524 KB onto the stack -- this is likely not the only contributor but a big part
Oh yeah, this has been a known issue for a while, and I raised an issue for it last week: #6120
Although I think people were getting stack overflow with the image built by ethPandaOps, which I suppose is built with the release profile?
I've managed to get a core dump of the crash with Lighthouse running with the release profile.
Here's the backtrace - no debug symbols, so it's not nice to read, but we know it's failing at kzg::Kzg::compute_cells_and_proofs:
(gdb) bt
#0 0x0000ffffa027f200 in ?? () from /lib/aarch64-linux-gnu/libc.so.6
#1 0x0000ffffa023a67c in raise () from /lib/aarch64-linux-gnu/libc.so.6
#2 0x0000ffffa0227130 in abort () from /lib/aarch64-linux-gnu/libc.so.6
#3 0x0000aaaadb6b5bc0 in std::sys::pal::unix::abort_internal ()
#4 0x0000aaaadb6b444c in std::sys::pal::unix::stack_overflow::imp::signal_handler ()
#5 <signal handler called>
#6 0x0000aaaad9e7ad00 in <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<T,I>>::spec_extend ()
#7 0x0000aaaada010800 in rayon::iter::plumbing::bridge_producer_consumer::helper ()
#8 0x0000aaaad9bbcdb4 in _ZN10rayon_core4join12join_context28_$u7b$$u7b$closure$u7d$$u7d$17hac9acf36c7bc6fdcE.llvm.4150297595829751684 ()
#9 0x0000aaaada0108e4 in rayon::iter::plumbing::bridge_producer_consumer::helper ()
#10 0x0000aaaad9c8899c in <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute ()
#11 0x0000aaaad9126e5c in rayon_core::registry::WorkerThread::wait_until_cold ()
#12 0x0000aaaad9124578 in _ZN10rayon_core8registry8Registry15in_worker_cross17he57b8724c687b801E.llvm.9124983992865838625 ()
#13 0x0000aaaadafa28bc in kzg::Kzg::compute_cells_and_proofs ()
#14 0x0000aaaad9e7ae70 in <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<T,I>>::spec_extend ()
#15 0x0000aaaada010800 in rayon::iter::plumbing::bridge_producer_consumer::helper ()
#16 0x0000aaaad9bbcdb4 in _ZN10rayon_core4join12join_context28_$u7b$$u7b$closure$u7d$$u7d$17hac9acf36c7bc6fdcE.llvm.4150297595829751684 ()
#17 0x0000aaaada0108e4 in rayon::iter::plumbing::bridge_producer_consumer::helper ()
#18 0x0000aaaad9c8899c in <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute ()
#19 0x0000aaaad9126e5c in rayon_core::registry::WorkerThread::wait_until_cold ()
#20 0x0000aaaad9124578 in _ZN10rayon_core8registry8Registry15in_worker_cross17he57b8724c687b801E.llvm.9124983992865838625 ()
#21 0x0000aaaadafa28bc in kzg::Kzg::compute_cells_and_proofs ()
#22 0x0000aaaad9e7ae70 in <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<T,I>>::spec_extend ()
#23 0x0000aaaada010800 in rayon::iter::plumbing::bridge_producer_consumer::helper ()
#24 0x0000aaaad9c8899c in <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute ()
#25 0x0000aaaad9126e5c in rayon_core::registry::WorkerThread::wait_until_cold ()
#26 0x0000aaaad9124578 in _ZN10rayon_core8registry8Registry15in_worker_cross17he57b8724c687b801E.llvm.9124983992865838625 ()
#27 0x0000aaaadafa28bc in kzg::Kzg::compute_cells_and_proofs ()
#28 0x0000aaaad9e7ae70 in <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<T,I>>::spec_extend ()
#29 0x0000aaaada010800 in rayon::iter::plumbing::bridge_producer_consumer::helper ()
#30 0x0000aaaad9bbcdb4 in _ZN10rayon_core4join12join_context28_$u7b$$u7b$closure$u7d$$u7d$17hac9acf36c7bc6fdcE.llvm.4150297595829751684 ()
#31 0x0000aaaada0108e4 in rayon::iter::plumbing::bridge_producer_consumer::helper ()
#32 0x0000aaaad9c8899c in <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute ()
#33 0x0000aaaad9126e5c in rayon_core::registry::WorkerThread::wait_until_cold ()
#34 0x0000aaaad9124578 in _ZN10rayon_core8registry8Registry15in_worker_cross17he57b8724c687b801E.llvm.9124983992865838625 ()
#35 0x0000aaaadafa28bc in kzg::Kzg::compute_cells_and_proofs ()
#36 0x0000aaaad9e7ae70 in <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<T,I>>::spec_extend ()
#37 0x0000aaaada010800 in rayon::iter::plumbing::bridge_producer_consumer::helper ()
#38 0x0000aaaad9bbcdb4 in _ZN10rayon_core4join12join_context28_$u7b$$u7b$closure$u7d$$u7d$17hac9acf36c7bc6fdcE.llvm.4150297595829751684 ()
#39 0x0000aaaada0108e4 in rayon::iter::plumbing::bridge_producer_consumer::helper ()
#40 0x0000aaaad9bbcdb4 in _ZN10rayon_core4join12join_context28_$u7b$$u7b$closure$u7d$$u7d$17hac9acf36c7bc6fdcE.llvm.4150297595829751684 ()
#41 0x0000aaaad9c86640 in _ZN83_$LT$rayon_core..job..StackJob$LT$L$C$F$C$R$GT$$u20$as$u20$rayon_core..job..Job$GT$7execute17h9327bac41e84fa88E.llvm.4150297595829751684 ()
#42 0x0000aaaad9126e9c in rayon_core::registry::WorkerThread::wait_until_cold ()
#43 0x0000aaaadafcc710 in rayon_core::registry::ThreadBuilder::run ()
#44 0x0000aaaadafcf6b4 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#45 0x0000aaaadafd08b0 in core::ops::function::FnOnce::call_once{{vtable.shim}} ()
#46 0x0000aaaadb6b49d0 in std::sys::pal::unix::thread::Thread::new::thread_start ()
#47 0x0000ffffa027d5c8 in ?? () from /lib/aarch64-linux-gnu/libc.so.6
#48 0x0000ffffa02e5edc in ?? () from /lib/aarch64-linux-gnu/libc.so.6
Ooh, you're using rayon inside there? That could also be blowing out the stack by saving stack frames when it switches between tasks using work-stealing.
Ah, I didn't realize that debug was not expected to work; I was mainly using it to get better stack traces, though they were roughly the same in both profiles.
I'm happy to continue looking into this until a solution is found!
Yep, so it seems that there is a stack overflow in debug mode due to validate_blobs, which is separate from the stack overflow seen with the ethPandaOps image, which is built in release mode.
Yeah, the stack overflow in debug mode is probably the same issue reported in #6120 (the conversion (array copy) from Blob to kzg blob).
The issue in release mode seems to be rayon-related; I just had a chat with Michael about this. Going to try some workarounds (increase the number of threads, or increase the thread pool stack size).
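For reference, a sketch of how both workarounds map onto rayon's ThreadPoolBuilder; the numbers below are placeholders, not whatever values end up being used:

use rayon::ThreadPoolBuilder;

// Sketch of the two workarounds: more worker threads, so queued jobs get
// stolen sooner, and a larger per-thread stack, so deep frames still fit.
let pool = ThreadPoolBuilder::new()
    .num_threads(8)                 // placeholder thread count
    .stack_size(8 * 1024 * 1024)    // placeholder 8 MiB stack per worker
    .build()
    .expect("failed to build rayon pool");

// Run the KZG work inside this pool instead of the default one.
pool.install(|| {
    // e.g. compute cells and proofs for each blob here
});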
Based on my rudimentary understanding of rayon, it will store jobs awaiting completion on the stack while it works on one job per thread. If the thread pool size is 1 (as we were using in LH) and the running job creates more jobs, then there will be a buildup of jobs waiting to execute on the stack.
I think the relevant part of the code in rayon is:
The join is visible in the stacktrace above, and AFAIK is used under the hood by the par_iter() methods.
Increasing the thread pool size may alleviate the issue somewhat because jobs will get stolen off the stack before too many can build up. But I think it's still not foolproof. We may want to optimise peerdas-kzg so it avoids allocating too much stack space in each frame, e.g. this method would make room for two large arrays on the stack as its return value:
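As a rough sketch of that concern (the names and sizes below are illustrative, not the real peerdas-kzg signature): returning two fixed-size arrays by value forces every calling frame to reserve space for them, while boxing them shrinks the return slot to two pointers.

// Illustrative only -- not the actual peerdas-kzg API.
const CELLS_PER_EXT_BLOB: usize = 128;
type Cell = [u8; 2048];
type Proof = [u8; 48];

// By value: each caller's frame reserves ~128 * (2048 + 48) bytes for the
// return slot alone, which adds up quickly across nested rayon frames.
fn compute_by_value() -> ([Cell; CELLS_PER_EXT_BLOB], [Proof; CELLS_PER_EXT_BLOB]) {
    ([[0u8; 2048]; CELLS_PER_EXT_BLOB], [[0u8; 48]; CELLS_PER_EXT_BLOB])
}

// Boxed: the arrays live on the heap and the return slot is just two pointers.
fn compute_boxed() -> (Box<[Cell; CELLS_PER_EXT_BLOB]>, Box<[Proof; CELLS_PER_EXT_BLOB]>) {
    (
        Box::new([[0u8; 2048]; CELLS_PER_EXT_BLOB]),
        Box::new([[0u8; 48]; CELLS_PER_EXT_BLOB]),
    )
}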
If multiple of these jobs were spawned from different threads outside Rayon's pool, I think they could end up on the same Rayon thread and therefore the same stack. I guess that method doesn't often get called multiple times in a short period though?
I believe this is correct; there was an issue for it here: https://github.com/rayon-rs/rayon/issues/1105, with the docs updated here: https://docs.rs/rayon-core/1.12.1/src/rayon_core/thread_pool/mod.rs.html#85
Yeah, I thought that comment looked suss. But I don't think it's relevant to us because we are not usually calling that function (with install) from a different Rayon pool. We use Rayon in a few other places in LH (e.g. in milhouse for parallel hashing), but AFAIK that should never invoke the peerdas-kzg lib?
I was more referring to the case where we call compute_cells_and_kzg_proofs from a bunch of non-Rayon threads and then Rayon saves jobs for them on a single thread's stack.
I added a regression test here that seems to reliably give me a stack overflow: https://github.com/sigp/lighthouse/pull/6151
I think one possible fix, which should also validate the hypothesis above, is to not call the thread pool inside of a parallel iterator, so instead of:
let blob_cells_and_proofs_vec = blobs
    .into_par_iter()
    .map(|blob| {
        let blob = KzgBlob::from_bytes(blob).map_err(KzgError::from)?;
        kzg.compute_cells_and_proofs(&blob)
    })
    .collect::<Result<Vec<_>, KzgError>>()?;
We would do:
let blob_cells_and_proofs_vec = blobs
    .into_iter()
    .map(|blob| {
        let blob = KzgBlob::from_bytes(blob).map_err(KzgError::from)?;
        kzg.compute_cells_and_proofs(&blob)
    })
    .collect::<Result<Vec<_>, KzgError>>()?;
where I changed into_par_iter to into_iter.
I can also Box the return type of compute_cells_and_proofs so that those arrays are allocated on the heap, as Michael mentioned above. The Cell type is already heap allocated: https://github.com/crate-crypto/rust-eth-kzg/blob/51a4beceffbfd4767a4ac473b8baccf509ad6fbd/eip7594/src/lib.rs#L17 -- so I'll experiment with just Boxing the KzgProof / Boxing everything.
My understanding (could be wrong) was that in the following code:
let blob_cells_and_proofs_vec = blobs
    .into_par_iter()
    .map(|blob| {
        let blob = KzgBlob::from_bytes(blob).map_err(KzgError::from)?;
        kzg.compute_cells_and_proofs(&blob)
    })
    .collect::<Result<Vec<_>, KzgError>>()?;
into_par_iter uses the global rayon pool, while the call to compute_cells_and_proofs invokes the peerdas-kzg lib and uses the local rayon thread pool that peerdas-kzg instantiated, which would give rise to that issue -- does that sound congruent with your thoughts?
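A standalone illustration of that shape (toy code, not Lighthouse's): the outer par_iter runs on the global rayon pool, and each closure then re-enters a separate, locally built pool, mirroring the pattern discussed in rayon-rs/rayon#1105.

use rayon::prelude::*;
use rayon::ThreadPoolBuilder;

fn main() {
    // Stand-in for the pool that peerdas-kzg builds internally.
    let local_pool = ThreadPoolBuilder::new()
        .num_threads(1)
        .build()
        .expect("failed to build local pool");

    // Outer parallel iterator on the global pool; each item then calls into a
    // different pool, so pending jobs from several callers can queue up on a
    // single worker's stack.
    let results: Vec<u64> = (0u64..4)
        .into_par_iter()
        .map(|i| local_pool.install(|| (0..1_000u64).map(|j| i + j).sum()))
        .collect();

    println!("{results:?}");
}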
Yes, agree. Are we using this pattern anywhere outside of your unit test?
Yep, I believe we are still doing this in the build_sidecars method -- for reference: https://github.com/sigp/lighthouse/blob/04d9eef8a74ae23eee7f473cbafd48f64f0ffa01/consensus/types/src/data_column_sidecar.rs#L129
I added a test here showing CI failing even if we use vectors as return types, and then passing once the stack allocation done inside of map is removed.
The stack allocation is akin to the let blob = KzgBlob::from_bytes(blob).map_err(KzgError::from)?; line in the original codebase.
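A sketch of that change in the shape of the earlier snippet (only reusing names that already appear above; not the exact test code): convert the blobs before entering the parallel iterator, so the large KzgBlob values are never built up on a rayon worker's stack.

// Do the stack-heavy conversion up front, outside the rayon pool...
let kzg_blobs = blobs
    .into_iter()
    .map(|blob| KzgBlob::from_bytes(blob).map_err(KzgError::from))
    .collect::<Result<Vec<_>, KzgError>>()?;

// ...and keep only the proof computation inside the parallel iterator.
let blob_cells_and_proofs_vec = kzg_blobs
    .par_iter()
    .map(|blob| kzg.compute_cells_and_proofs(blob))
    .collect::<Result<Vec<_>, KzgError>>()?;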
This has been fixed after upgrading to rust_eth_kzg = "0.3.4".
Description
We've recently switched from the c-kzg library to the Rust peerdas-kzg library on the das branch (#5941). The purpose of the switch was to allow merging the das branch to unstable without having to update c-kzg to a development version.
However, we've recently run into some issues during interop testing after the change, and we're getting stack overflow panics, likely during block production:
Kurtosis config to reproduce: