Investigate PeerDAS KZG library stack overflow during block production

jimmygchen commented 4 months ago

Description

We've recently switched from the c-kzg library to the Rust peerdas-kzg library on the das branch (#5941). The purpose for the switch was to:

Try out a new alternative to c-kzg;
Allow us to merge das branch to unstable without having to update c-kzg to a development version

However we've recently ran into some issues during interop testing after the change, and we're getting stack overflow panics, likely during block production:

Jul 15 10:03:55.007 DEBG Fork choice successfully updated before block production, fork_choice_slot: 60, slot: 60, service: beacon
Jul 15 10:03:55.007 DEBG Not attempting re-org                   reason: head not late, service: beacon
Jul 15 10:03:55.008 ERRO No valid eth1_data votes, `votes_to_consider` empty, outcome: casting `state.eth1_data` as eth1 vote, genesis_time: 1721037115, earliest_block_timestamp: 1721037095, lowest_block_number: 0, service: deposit_contract_rpc
Jul 15 10:03:55.008 DEBG Produced vote for eth1 chain            block_hash: 0xcc93bda48e1227e94e768ee757160974267c72c872f4c89213fbee552bb2ecf2, deposit_count: 0, deposit_root: 0xd70a234731285c6804c2a4f56711ddb8c82c99740f207854891028af34e27e5e, service: deposit_contract_rpc
Jul 15 10:03:55.008 DEBG Issuing engine_getPayload               parent_hash: 0xd32b8265b9e5694d48a42d14175d2692fdb918d305b91d336f2fd2c3167a8df5, timestamp: 1721037835, prev_randao: 0x4c74c85276c06a77410fb641d54224b09045bd03cf19c131ac158b534cd31ad6, suggested_fee_recipient: 0x8943545177806ed17b9f23f0a21ee5948ecaa776, service: exec
Jul 15 10:03:55.011 DEBG Processed HTTP API request              method: GET, path: /eth/v1/node/version, status: 200 OK, elapsed: 86.916µs
Jul 15 10:03:55.103 DEBG Produced block on state                 block_size: 2141, service: beacon
Jul 15 10:03:55.115 DEBG Processed HTTP API request              method: GET, path: /eth/v2/validator/blocks/60, status: 200 OK, elapsed: 108.519167ms
Jul 15 10:03:55.115 DEBG Processed HTTP API request              method: GET, path: /eth/v1/config/spec, status: 200 OK, elapsed: 104.2355ms
Jul 15 10:03:55.117 DEBG Processed HTTP API request              method: GET, path: /eth/v1/node/syncing, status: 200 OK, elapsed: 199.834µs
Jul 15 10:03:55.257 DEBG Signed block received in HTTP API       slot: 60

thread '<unknown>' has overflowed its stack
fatal runtime error: stack overflow

Kurtosis config to reproduce:

participants:
# Supernodes
  - cl_type: prysm
    cl_image: ethpandaops/prysm-beacon-chain:peerdas-initial-sync
    cl_extra_params: [--subscribe-all-subnets]
  - cl_type: lighthouse
    cl_image: ethpandaops/lighthouse:das-devnet-2
    cl_extra_params: [--subscribe-all-data-column-subnets]
# Non supernodes
  - cl_type: prysm
    cl_image: ethpandaops/prysm-beacon-chain:peerdas-initial-sync
  - cl_type: lighthouse
    cl_image: ethpandaops/lighthouse:das-devnet-2
network_params:
  eip7594_fork_epoch: 0
  eip7594_fork_version: "0x50000038"
snooper_enabled: true
global_log_level: debug
additional_services:
  - dora
  - goomy_blob

michaelsproul commented 4 months ago

Looks like a panic inside publish_block (after the block is produced)

jimmygchen commented 4 months ago

Yeah, I think Pawan figured it's most likely something to do with the compute_cells_and_kzg_proofs function (which constructs data column sidecars from blobs), which is invoked from here: https://github.com/sigp/lighthouse/blob/ba8126e02fa8c18a6b8aafde4550088f0721b6c1/beacon_node/beacon_chain/src/block_verification.rs#L740

kevaundray commented 4 months ago

I think the stackoverflow in this case was a red-herring -- running the LH node in debug mode on the devnet, shows the issue was that rust-embed was, for some, either:

Not embedding the trusted setup file
Not correctly parsing the path to the embedded file

The fix, replacing rust-embed with include_str, has been merged into the das branch -- currently waiting for a few folks who reproduced the issue to confirm the issue is fixed on their end. For quick testing, I've uploaded docker images with the fix here kevaundray/lighthouse-eth:das for use in kurtosis

jimmygchen commented 4 months ago

Hi @kevaundray I've tried the kevaundray/lighthouse-eth:das image and managed to reproduced the error:

thread '<unknown>' has overflowed its stack
fatal runtime error: stack overflow

I can take another look tomorrow too.

kevaundray commented 4 months ago

Hi @kevaundray

I've tried the kevaundray/lighthouse-eth:das image and managed to reproduced the error:
thread '<unknown>' has overflowed its stack

fatal runtime error: stack overflow
I can take another look tomorrow too.

Oh that's surprising -- if you run it in debug mode, then that should give a stacktrace

No rush on this, I can try running it on a different machine x86_64 too and in the meantime, I will update the image to use debug mode

kevaundray commented 4 months ago

To help debugging further, I have created three docker images in debug mode based on the last three commits on the das branch and pasted the output of them in gist files.

Parts of the code I modified

Full backtrace

Went into lighthouse/src/main.rs and changed:

    if std::env::var("RUST_BACKTRACE").is_err() {
        std::env::set_var("RUST_BACKTRACE", "1");
    }

to

    if std::env::var("RUST_BACKTRACE").is_err() {
        std::env::set_var("RUST_BACKTRACE", "full");
    }

Enabling debug mode

So I could have just modified the Dockerfile and used the dev profile. I went a roundabout way by adding --debug to the MakeFile on cargo install and commenting out the places in the DockerFile and MakeFile that use the profile flag/env variable.

How the images were created

Created a bash script that checked out the following commits from the das branch:

04d9eef8a74ae23eee7f473cbafd48f64f0ffa01 (chore: update peerdas-kzg library)
55a3be72bb5923033ce39eeddb2c16497bbdbaea (Various fixes to custody range sync)
018f38257f791b47aa570db9ba7ce5245a7c58f3 (Avoid retrying same sampling peer that previously failed)

These are in order of newest to oldest.

Then just ran, built and published each of these commits with the commit hash in the tag name.

Docker Images

kevaundray/lighthouse-eth:das-dbg-04d9eef8a
kevaundray/lighthouse-eth:das-dbg-55a3be72b
kevaundray/lighthouse-eth:das-dbg-018f38257

Kurtosis config

# filename: peerdas.yaml
participants:
  # Supernodes
  - cl_type: prysm
    cl_image: ethpandaops/prysm-beacon-chain:peerdas-initial-sync
    cl_extra_params: [--subscribe-all-subnets]
  - cl_type: lighthouse
    cl_image: kevaundray/lighthouse-eth:das-dbg-{COMMIT_HASH}
    cl_extra_params: [--subscribe-all-data-column-subnets]
  # Non supernodes
  - cl_type: prysm
    cl_image: ethpandaops/prysm-beacon-chain:peerdas-initial-sync
  - cl_type: lighthouse
    cl_image: kevaundray/lighthouse-eth:das-dbg-{COMMIT_HASH}
network_params:
  eip7594_fork_epoch: 0
  eip7594_fork_version: "0x50000038"
snooper_enabled: true
global_log_level: debug
additional_services:
  - dora
  - goomy_blob

Running the docker images

Using Kurtosis, I run the following command:

kurtosis run --enclave local-testnet github.com/ethpandaops/ethereum-package --args-file peerdas.yaml --image-download always

The main difference here between the start_local_testnet script is that I have --image-download always -- the default for Kurtosis seems to be to use any cached image on the system locally, regardless of the tag. The docs mention it here

Results

On the 018f38257 branch, I get the rust-embed issue I mentioned before. Contents have been copied into this gist

On the 55a3be72b branch, I get the rust-embed issue also. Contents have been copied into this gist

On the 04d9eef8a branch, I don't get the rust-embed issue, however there seems to be an issue with the vc-lighthouse-geth pairs. This is possibly related to the timeout. Contents have been copied into this gist

Observations

A lot of time was spent on Execution in progress, when adding a lighthouse node:

jimmygchen commented 4 months ago

A lot of time was spent on Execution in progress, when adding a lighthouse node:

This is kurtosis waiting for Lighthouse to complete the startup process and wait until it's healthy before proceeding.

jimmygchen commented 4 months ago

Yeah I'm not exactly sure what's taking a long time, but I don't think it's to do with KZG as it loads in about 500ms. I'll look into the slow startup issue.

jimmygchen commented 4 months ago

Seems to load pretty quickly on my amd machine. Would be useful to check the Lighthouse container logs of the while "Execution in progress" is happening.

kevaundray commented 4 months ago

Seems to load pretty quickly on my amd machine. Would be useful to check the Lighthouse container logs of the while "Execution in progress" is happening.

For the latest commit (04d9eef8a) the logs for vc-geth-lighthouse was :

docker logs -f vc-2-geth-lighthouse--66190c11946b4c7980461b123f308375
Jul 18 08:24:39.178 INFO Logging to file                         path: "/validator-keys/keys/logs/validator.log"
Jul 18 08:24:39.274 INFO Lighthouse started                      version: Lighthouse/v5.2.1-04d9eef+
Jul 18 08:24:39.274 INFO Configured for network                  name: custom (/network-configs)
Unable to initialize validator config: Unable to parse proposer_nodes: Unknown argument or group id.  Make sure you are using the argument id and not the short or long flags

Yeah I'm not exactly sure what's taking a long time, but I don't think it's to do with KZG as it loads in about 500ms.

My current hypothesis is that the lengthy periods of time are due to retries. This is because in the error messages by Kurtosis where the service did not start up, it notes the retry attempts:

  Caused by: Unsuccessful ports check for IP '172.16.0.17' and port spec '{privatePortSpec:0x400088c570}', even after '240' retries with '500' milliseconds in between retries. Timeout '2m0s' has been reached

If this is correct, then the long startup times on 55a3be72b and 04d9eef8a would be due to Kurtosis retrying with the rust-embed issue, and the long startup times on 018f38257 would possibly be due to the Kurtosis retrying on the Unable to initialize validator config issue

kevaundray commented 4 months ago

For completeness, I'll add what the Dockerfile and Makefile looks like, that I used to build the docker images:

FROM rust:1.78.0-bullseye AS builder
RUN apt-get update && apt-get -y upgrade && apt-get install -y cmake libclang-dev
COPY . lighthouse
ARG FEATURES
# ARG PROFILE=release
ARG CARGO_USE_GIT_CLI=true
ENV FEATURES $FEATURES
# ENV PROFILE $PROFILE
ENV CARGO_NET_GIT_FETCH_WITH_CLI=$CARGO_USE_GIT_CLI
RUN cd lighthouse && make

FROM ubuntu:22.04
RUN apt-get update && apt-get -y upgrade && apt-get install -y --no-install-recommends \
  libssl-dev \
  ca-certificates \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*
COPY --from=builder /usr/local/cargo/bin/lighthouse /usr/local/bin/lighthouse

The main change is that I remove the PROFILE env variable.

The only parts I changed in the Makefile were the install commands:

install:
    cargo install --debug --path lighthouse --force --locked \
        --features "$(FEATURES)" \
        $(CARGO_INSTALL_EXTRA_FLAGS)

# Builds the lcli binary in release (optimized).
install-lcli:
    cargo install --debug --path lcli --force --locked \
        --features "$(FEATURES)" \
        $(CARGO_INSTALL_EXTRA_FLAGS)

The main changes are that I removed the profile flag and added --debug.

kevaundray commented 4 months ago

I've pushed a new image with the fix in PR 6125 -- the tag is kevaundray/lighthouse-eth:das-dbg-cli-fix-04d9eef8a

For startup times, I get ~10 seconds to startup a cl-2-lighthouse-X pair and almost instantanious to startup a vc-lighthouse-X pair

kevaundray commented 4 months ago

I am still getting a stackoverflow though its a bit more detailed:

Jul 18 06:00:39.835 DEBG Beacon proposer found                   validator_index: 233, slot: 21, head_block_root: 0xc65d38627d411aaee3617a8019e3f942cd57c9b85da1101862649641ab696619, payload_attributes: V3(PayloadAttributesV3 { timestamp: 1721282440, prev_randao: 0x6568eee59d352cf3e7dcbd1aed7783a7d8ed5c0dcd609b5fdfd5dae6445927f7, suggested_fee_recipient: 0x8943545177806ed17b9f23f0a21ee5948ecaa776, withdrawals: [], parent_beacon_block_root: 0xc65d38627d411aaee3617a8019e3f942cd57c9b85da1101862649641ab696619 }), service: exec
Jul 18 06:00:40.000 DEBG Processed HTTP API request              method: GET, path: /eth/v1/node/version, status: 200 OK, elapsed: 1.281542ms
Jul 18 06:00:40.076 DEBG Running beacon chain per slot tasks     slot: Slot(21), service: beacon
Jul 18 06:00:40.077 DEBG No change in canonical head             head: 0xc65d38627d411aaee3617a8019e3f942cd57c9b85da1101862649641ab696619, service: beacon
Jul 18 06:00:40.079 DEBG Processed HTTP API request              method: GET, path: /eth/v1/validator/duties/proposer/0, status: 200 OK, elapsed: 5.268625ms
Jul 18 06:00:40.081 DEBG Processed HTTP API request              method: GET, path: /eth/v1/config/spec, status: 200 OK, elapsed: 6.118875ms
Jul 18 06:00:40.169 DEBG Processed HTTP API request              method: POST, path: /eth/v1/validator/duties/attester/0, status: 200 OK, elapsed: 11.719875ms
Jul 18 06:00:40.174 DEBG Received proposer preparation data      count: 64
Jul 18 06:00:40.175 DEBG Processed HTTP API request              method: POST, path: /eth/v1/validator/prepare_beacon_proposer, status: 200 OK, elapsed: 16.40725ms
Jul 18 06:00:40.175 DEBG Block production request from HTTP API  slot: 21
Jul 18 06:00:40.175 DEBG Fork choice successfully updated before block production, fork_choice_slot: 21, slot: 21, service: beacon
Jul 18 06:00:40.176 DEBG Not attempting re-org                   reason: head not late, service: beacon
Jul 18 06:00:40.327 ERRO No valid eth1_data votes, `votes_to_consider` empty, outcome: casting `state.eth1_data` as eth1 vote, genesis_time: 1721282188, earliest_block_timestamp: 1721282168, lowest_block_number: 0, service: deposit_contract_rpc
Jul 18 06:00:40.328 DEBG Produced vote for eth1 chain            block_hash: 0x78ce9187e998a0436cec8cc1d2d72d43d6263f8f257b2c479ad2764cc5637b69, deposit_count: 0, deposit_root: 0xd70a234731285c6804c2a4f56711ddb8c82c99740f207854891028af34e27e5e, service: deposit_contract_rpc
Jul 18 06:00:40.336 DEBG Issuing engine_getPayload               parent_hash: 0x3a41a02252cd0769ff78d2d58366706c02c252dfab081d372b7d45b37bd954ab, timestamp: 1721282440, prev_randao: 0x6568eee59d352cf3e7dcbd1aed7783a7d8ed5c0dcd609b5fdfd5dae6445927f7, suggested_fee_recipient: 0x8943545177806ed17b9f23f0a21ee5948ecaa776, service: exec
Jul 18 06:00:41.229 DEBG Produced block on state                 block_size: 3781, service: beacon

thread 'tokio-runtime-worker' has overflowed its stack
fatal runtime error: stack overflow

Full logs: https://gist.github.com/kevaundray/04504a2bbb175a9485f120251b9a24a3

jimmygchen commented 4 months ago

Unable to initialize validator config: Unable to parse proposer_nodes: Unknown argument or group id. Make sure you are using the argument id and not the short or long flags

yep your image actually revealed a bug we've had for a while! We rarely run validator client in debug mode so this bug has been around for a while without us noticing! Fixed in #6125

kevaundray commented 4 months ago

Unable to initialize validator config: Unable to parse proposer_nodes: Unknown argument or group id. Make sure you are using the argument id and not the short or long flags

yep your image actually revealed a bug we've had for a while! We rarely run validator client in debug mode so this bug has been around for a while without us noticing! Fixed in #6125

Oh great, can confirm that that does fix it in debug mode!

kevaundray commented 4 months ago

@jimmygchen was away for a bit, I added debug lines into the code and seems to imply that the issue was never peerdas kzg:

Jul 20 06:44:46.174 DEBG let kzg_proofs = Vec::from(proofs)      service: beacon
Jul 20 06:44:46.174 DEBG let kzg = self.kzg                      service: beacon
Jul 20 06:44:46.174 DEBG let blobs_ref: Vec<_>                   service: beacon

[beacon_node/beacon_chain/src/kzg_utils.rs:85:5] blobs.len() = 4

thread 'tokio-runtime-worker' has overflowed its stack
fatal runtime error: stack overflow

complete_partial_beacon_block

In complete_partial_beacon_block method, I added the following debug lines corresponding to the snippet above.

let kzg_proofs = Vec::from(proofs);
debug!(
    self.log,
    "let kzg_proofs = Vec::from(proofs)";
);
let kzg = self
    .kzg
    .as_ref()
    .ok_or(BlockProductionError::TrustedSetupNotInitialized)?;
debug!(
    self.log,
    "let kzg = self.kzg";
);
let blobs_ref: Vec<_> = blobs.iter().collect();
debug!(
    self.log,
    "let blobs_ref: Vec<_>";
);
kzg_utils::validate_blobs::<T::EthSpec>(
    kzg,
    expected_kzg_commitments,
    blobs_ref,
    &kzg_proofs,
)
.map_err(BlockProductionError::KzgError)?;
debug!(
    self.log,
    "kzg_utils::validate_blobs";
);

kzg_utils::validate_blobs

In validate_blobs, I added the following debug lines. Note, I use dbg! so it will print out differently in terminal.

pub fn validate_blobs<E: EthSpec>(
    kzg: &Kzg,
    expected_kzg_commitments: &[KzgCommitment],
    blobs: Vec<&Blob<E>>,
    kzg_proofs: &[KzgProof],
) -> Result<(), KzgError> {
    let _timer = crate::metrics::start_timer(&crate::metrics::KZG_VERIFICATION_BATCH_TIMES);
    dbg!("blobs size: ", blobs.len());
    let blobs = blobs
        .into_iter()
        .map(|blob| ssz_blob_to_crypto_blob::<E>(blob))
        .collect::<Result<Vec<_>, KzgError>>()?;
    dbg!("blobs collected");

    let ok = kzg.verify_blob_kzg_proof_batch(&blobs, expected_kzg_commitments, kzg_proofs);
    dbg!("verified verify_blob_kzg_proof_batch");
    ok
}

Stackoverflow

This implies that the stackoverflow is happening with four blobs on this line:

  let blobs = blobs
        .into_iter()
        .map(|blob| ssz_blob_to_crypto_blob::<E>(blob))
        .collect::<Result<Vec<_>, KzgError>>()?;

Looking further,

ssz_blob_to_crypto_blob calls KzgBlob::from_bytes(blob.as_ref()).map_err(Into::into)

The from_bytes method looks like this:

    pub fn from_bytes(bytes: &[u8]) -> Result<Self, Error> {
        if bytes.len() != BYTES_PER_BLOB {
            return Err(Error::InvalidBytesLength(format!(
                "Invalid byte length. Expected {} got {}",
                BYTES_PER_BLOB,
                bytes.len(),
            )));
        }
        let mut new_bytes = [0; BYTES_PER_BLOB];
        new_bytes.copy_from_slice(bytes);
        Ok(Self::new(new_bytes))
    }

with four blobs this would allocate 4 4096 32 = 524KB onto the stack -- this is likely not the only contributor but a big part

My mistake

I had assumed that the stack overflow was coming from rust-embed and was being converted into a panic in debug mode.

The issue with rust-embed only happens in debug mode and it only happens because the Dockerfile copies the binary from one image to another. I have created a proof of concept of this here: https://github.com/kevaundray/poc-rust-embed

This would imply that the original issue was always the stackoverflow due to validate_blobs

kevaundray commented 4 months ago

I'll note that a possible quick fix for this is to decrease the batch size to 1 and verify each blob individually for now using a for loop. This should allow the compiler to free each blob after every for loop/verification because it can see the blob variable going out of scope.

michaelsproul commented 4 months ago

Is the crash only happening in debug builds? AFAIK Lighthouse is basically not expected to work compiled in debug, and fixing this might be very involved.

One of the main sources of stack bloat for us is futures, and previous investigations using the compiler's -Ztype-sizes flag showed a bunch of massive futures:

RUSTFLAGS='-Zprint-type-sizes' cargo +nightly build --release > type_sizes.txt

I suspect running that command on the relevant branch without --release will reveal futures that are even larger than when we compile in release, due to rustc's inability to optimise them.

We made some improvements when we were getting stack overflows on the HTTP API in this PR, although the root of the issue still wasn't fully established:

https://github.com/sigp/lighthouse/pull/5104

It seems that warp does some very deep nesting of futures that we have little control over. Warp has been annoying in many other ways, so there is some push to move the HTTP API server to axum. This may be the best way to solve this problem for good.

If it is of interest to you, I would appreciate your insights and research here @kevaundray, but I also think you'd be justified in tapping out and focussing on PeerDAS. As a workaround if you'd like to run Lighthouse with debug assertions, you could turn them on in the release profile:

# In top-level Cargo.toml
[profile.release]
debug-assertions = true

Or add a new debug-release profile:

[profile.debug-release]
inherits = "release"
debug-assertions = true

This would also have the advantage of optimising the BLS crypto code, which is usually very slow in debug builds too.

jimmygchen commented 4 months ago

with four blobs this would allocate 4 4096 32 = 524KB onto the stack -- this is likely not the only contributor but a big part

Oh yeah this has been a known issue for a while, and I raised an issue for this last week #6120

Although i think people were getting stack overflow with the image built by ethPandaOps, which I suppose is built with release profile?

jimmygchen commented 4 months ago

I've managed to get a core dump of the crash with Lighthouse running with release profile.

Here's the backtrace - no debug symbols so it's not nice to read, but we know it's failing at kzg::Kzg::compute_cells_and_proofs

(gdb) bt
#0  0x0000ffffa027f200 in ?? () from /lib/aarch64-linux-gnu/libc.so.6
#1  0x0000ffffa023a67c in raise () from /lib/aarch64-linux-gnu/libc.so.6
#2  0x0000ffffa0227130 in abort () from /lib/aarch64-linux-gnu/libc.so.6
#3  0x0000aaaadb6b5bc0 in std::sys::pal::unix::abort_internal ()
#4  0x0000aaaadb6b444c in std::sys::pal::unix::stack_overflow::imp::signal_handler ()
#5  <signal handler called>
#6  0x0000aaaad9e7ad00 in <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<T,I>>::spec_extend ()
#7  0x0000aaaada010800 in rayon::iter::plumbing::bridge_producer_consumer::helper ()
#8  0x0000aaaad9bbcdb4 in _ZN10rayon_core4join12join_context28_$u7b$$u7b$closure$u7d$$u7d$17hac9acf36c7bc6fdcE.llvm.4150297595829751684 ()
#9  0x0000aaaada0108e4 in rayon::iter::plumbing::bridge_producer_consumer::helper ()
#10 0x0000aaaad9c8899c in <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute ()
#11 0x0000aaaad9126e5c in rayon_core::registry::WorkerThread::wait_until_cold ()
#12 0x0000aaaad9124578 in _ZN10rayon_core8registry8Registry15in_worker_cross17he57b8724c687b801E.llvm.9124983992865838625 ()
#13 0x0000aaaadafa28bc in kzg::Kzg::compute_cells_and_proofs ()
#14 0x0000aaaad9e7ae70 in <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<T,I>>::spec_extend ()
#15 0x0000aaaada010800 in rayon::iter::plumbing::bridge_producer_consumer::helper ()
#16 0x0000aaaad9bbcdb4 in _ZN10rayon_core4join12join_context28_$u7b$$u7b$closure$u7d$$u7d$17hac9acf36c7bc6fdcE.llvm.4150297595829751684 ()
#17 0x0000aaaada0108e4 in rayon::iter::plumbing::bridge_producer_consumer::helper ()
#18 0x0000aaaad9c8899c in <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute ()
#19 0x0000aaaad9126e5c in rayon_core::registry::WorkerThread::wait_until_cold ()
#20 0x0000aaaad9124578 in _ZN10rayon_core8registry8Registry15in_worker_cross17he57b8724c687b801E.llvm.9124983992865838625 ()
#21 0x0000aaaadafa28bc in kzg::Kzg::compute_cells_and_proofs ()
#22 0x0000aaaad9e7ae70 in <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<T,I>>::spec_extend ()
#23 0x0000aaaada010800 in rayon::iter::plumbing::bridge_producer_consumer::helper ()
#24 0x0000aaaad9c8899c in <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute ()
#25 0x0000aaaad9126e5c in rayon_core::registry::WorkerThread::wait_until_cold ()
#26 0x0000aaaad9124578 in _ZN10rayon_core8registry8Registry15in_worker_cross17he57b8724c687b801E.llvm.9124983992865838625 ()
#27 0x0000aaaadafa28bc in kzg::Kzg::compute_cells_and_proofs ()
#28 0x0000aaaad9e7ae70 in <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<T,I>>::spec_extend ()
#29 0x0000aaaada010800 in rayon::iter::plumbing::bridge_producer_consumer::helper ()
#30 0x0000aaaad9bbcdb4 in _ZN10rayon_core4join12join_context28_$u7b$$u7b$closure$u7d$$u7d$17hac9acf36c7bc6fdcE.llvm.4150297595829751684 ()
#31 0x0000aaaada0108e4 in rayon::iter::plumbing::bridge_producer_consumer::helper ()
#32 0x0000aaaad9c8899c in <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute ()
#33 0x0000aaaad9126e5c in rayon_core::registry::WorkerThread::wait_until_cold ()
#34 0x0000aaaad9124578 in _ZN10rayon_core8registry8Registry15in_worker_cross17he57b8724c687b801E.llvm.9124983992865838625 ()
#35 0x0000aaaadafa28bc in kzg::Kzg::compute_cells_and_proofs ()
#36 0x0000aaaad9e7ae70 in <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<T,I>>::spec_extend ()
#37 0x0000aaaada010800 in rayon::iter::plumbing::bridge_producer_consumer::helper ()
#38 0x0000aaaad9bbcdb4 in _ZN10rayon_core4join12join_context28_$u7b$$u7b$closure$u7d$$u7d$17hac9acf36c7bc6fdcE.llvm.4150297595829751684 ()
#39 0x0000aaaada0108e4 in rayon::iter::plumbing::bridge_producer_consumer::helper ()
#40 0x0000aaaad9bbcdb4 in _ZN10rayon_core4join12join_context28_$u7b$$u7b$closure$u7d$$u7d$17hac9acf36c7bc6fdcE.llvm.4150297595829751684 ()
#41 0x0000aaaad9c86640 in _ZN83_$LT$rayon_core..job..StackJob$LT$L$C$F$C$R$GT$$u20$as$u20$rayon_core..job..Job$GT$7execute17h9327bac41e84fa88E.llvm.4150297595829751684 ()
#42 0x0000aaaad9126e9c in rayon_core::registry::WorkerThread::wait_until_cold ()
#43 0x0000aaaadafcc710 in rayon_core::registry::ThreadBuilder::run ()
#44 0x0000aaaadafcf6b4 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#45 0x0000aaaadafd08b0 in core::ops::function::FnOnce::call_once{{vtable.shim}} ()
#46 0x0000aaaadb6b49d0 in std::sys::pal::unix::thread::Thread::new::thread_start ()
#47 0x0000ffffa027d5c8 in ?? () from /lib/aarch64-linux-gnu/libc.so.6
#48 0x0000ffffa02e5edc in ?? () from /lib/aarch64-linux-gnu/libc.so.6

michaelsproul commented 4 months ago

Ooh you're using rayon inside there? That could also be blowing out the stack by saving stack frames when it switches between tasks using work-stealing

kevaundray commented 4 months ago

Is the crash only happening in debug builds? AFAIK Lighthouse is basically not expected to work compiled in debug, and fixing this might be very involved.

One of the main sources of stack bloat for us is futures, and previous investigations using the compiler's -Ztype-sizes flag showed a bunch of massive futures:
RUSTFLAGS='-Zprint-type-sizes' cargo +nightly build --release > type_sizes.txt
I suspect running that command on the relevant branch without --release will reveal futures that are even larger than when we compile in release, due to rustc's inability to optimise them.

We made some improvements when we were getting stack overflows on the HTTP API in this PR, although the root of the issue still wasn't fully established:

Reduce size of futures in HTTP API to prevent stack overflows #5104

It seems that warp does some very deep nesting of futures that we have little control over. Warp has been annoying in many other ways, so there is some push to move the HTTP API server to axum. This may be the best way to solve this problem for good.

If it is of interest to you, I would appreciate your insights and research here @kevaundray, but I also think you'd be justified in tapping out and focussing on PeerDAS. As a workaround if you'd like to run Lighthouse with debug assertions, you could turn them on in the release profile:
# In top-level Cargo.toml
[profile.release]
debug-assertions = true
Or add a new debug-release profile:
[profile.debug-release]
inherits = "release"
debug-assertions = true
This would also have the advantage of optimising the BLS crypto code, which is usually very slow in debug builds too.

Ah didn't realize that debug was not expected to work, was mainly using it to get better stacktraces, however they were roughly the same in both profiles.

I'm happy to continue looking into this until a solution is found!

kevaundray commented 4 months ago

with four blobs this would allocate 4 4096 32 = 524KB onto the stack -- this is likely not the only contributor but a big part

Oh yeah this has been a known issue for a while, and I raised an issue for this last week #6120

Although i think people were getting stack overflow with the image built by ethPandaOps, which I suppose is built with release profile?

Yep, so it seems that there is a stackoverflow in debug mode due to validate_blobs which is separate to the stackoverflow in release mode for the ethPandaOps which is built in release mode

jimmygchen commented 4 months ago

Yeah the stack overflow in debug mode is probably the same issue reported in #6120 (the conversion (array copy) from Blob to kzg blob)

The issue in release mode seems to be rayon related, I just had a chat with Michael about this. Going to try some workarounds (increase num of threads, or increase the thread pool stack size)

michaelsproul commented 4 months ago

Based on my rudimentary understanding of rayon, it will store jobs awaiting completion on the stack while it works on 1 job per thread. If the thread pool size is 1 (as we were using in LH) and the running job create more jobs, then there will be a buildup of jobs waiting to execute on the stack.

I think the relevant part of code in rayon is:

https://github.com/rayon-rs/rayon/blob/3e3962cb8f7b50773bcc360b48a7a674a53a2c77/rayon-core/src/join/mod.rs#L133-L146

The join is visible in the stacktrace above, and AFAIK is used under the hood by par_iter() methods.

Increasing the thread pool size may alleviate the issue somewhat because jobs will get stolen off the stack before too many can build up. But I think it's still not foolproof. We may want to optimise peerdas-kzg so it avoids allocating too much stack space in each frame, e.g. this method would make room for two large arrays on the stack as its return value:

https://github.com/crate-crypto/rust-eth-kzg/blob/51a4beceffbfd4767a4ac473b8baccf509ad6fbd/eip7594/src/prover.rs#L82

If multiple of these jobs were spawned from different threads outside Rayon's pool, I think they could end up on the same Rayon thread and therefore the same stack. I guess that method doesn't often get called multiple times in a short period though?

kevaundray commented 4 months ago

If multiple of these jobs were spawned from different threads outside Rayon's pool, I think they could end up on the same Rayon thread and therefore the same stack. I guess that method doesn't often get called multiple times in a short period though?

I believe this is correct; there was an issue for it here: https://github.com/rayon-rs/rayon/issues/1105

With the docs updated here: https://docs.rs/rayon-core/1.12.1/src/rayon_core/thread_pool/mod.rs.html#85

michaelsproul commented 4 months ago

Yeah, I thought that comment looked suss. But I don't think it's relevant to us because we are not usually calling that function (with install) from a different Rayon pool. We use Rayon in a few other places in LH (e.g. in milhouse for parallel hashing) but AFAIK that should never invoke the peerdas-kzg lib?

I was more referring to the case where we call compute_cells_and_kzg_proofs from a bunch of non-Rayon threads and then Rayon saves jobs for them on a single thread's stack

kevaundray commented 4 months ago

Yeah the stack overflow in debug mode is probably the same issue reported in #6120 (the conversion (array copy) from Blob to kzg blob)

The issue in release mode seems to be rayon related, I just had a chat with Michael about this. Going to try some workarounds (increase num of threads, or increase the thread pool stack size)

I added a regression test here that seems to reliably give me a stackoverflow : https://github.com/sigp/lighthouse/pull/6151

I think one possible way which should also validate the hypothesis above is to not call the threadpool inside of a parallel iterator, so instead of:

        let blob_cells_and_proofs_vec = blobs
            .into_par_iter()
            .map(|blob| {
                let blob = KzgBlob::from_bytes(blob).map_err(KzgError::from)?;
                kzg.compute_cells_and_proofs(&blob)
            })
            .collect::<Result<Vec<_>, KzgError>>()?;

We would do:

        let blob_cells_and_proofs_vec = blobs
            .into_iter()
            .map(|blob| {
                let blob = KzgBlob::from_bytes(blob).map_err(KzgError::from)?;
                kzg.compute_cells_and_proofs(&blob)
            })
            .collect::<Result<Vec<_>, KzgError>>()?;

Where I changed into_par_iter to into_iter

I can also Box the return type for compute_cells_and_proofs so that those arrays are allocated on the heap as Michael mentioned above; The Cell type is already heap allocated: https://github.com/crate-crypto/rust-eth-kzg/blob/51a4beceffbfd4767a4ac473b8baccf509ad6fbd/eip7594/src/lib.rs#L17 so I'll experiment with just Boxing the KzgProof/Boxing everything

kevaundray commented 4 months ago

Yeah, I thought that comment looked suss. But I don't think it's relevant to us because we are not usually calling that function (with install) from a different Rayon pool. We use Rayon in a few other places in LH (e.g. in milhouse for parallel hashing) but AFAIK that should never invoke the peerdas-kzg lib?

I was more referring to the case where we call compute_cells_and_kzg_proofs from a bunch of non-Rayon threads and then Rayon saves jobs for them on a single thread's stack

My understanding (could be wrong) was that in the following code:

        let blob_cells_and_proofs_vec = blobs
            .into_par_iter()
            .map(|blob| {
                let blob = KzgBlob::from_bytes(blob).map_err(KzgError::from)?;
                kzg.compute_cells_and_proofs(&blob)
            })
            .collect::<Result<Vec<_>, KzgError>>()?;

into_par_iter was using the global rayon pool, while the call to compute_cells_and_proofs will invoke the peerdas-kzg lib and use the local rayon threadpool that peerdas-kzg instantiated, which would give rise to that issue -- does that sound congruent with your thoughts?

michaelsproul commented 4 months ago

into_par_iter was using the global rayon pool, while the call to compute_cells_and_proofs will invoke the peerdas-kzg lib and use the local rayon threadpool that peerdas-kzg instantiated, which would give rise to that issue -- does that sound congruent with your thoughts?

Yes, agree. Are we using this pattern anywhere outside of your unit test?

kevaundray commented 4 months ago

sing this pattern anywhere outside of your unit test?

Yep, I believe we are still doing this in the build_sidecars method -- For reference: https://github.com/sigp/lighthouse/blob/04d9eef8a74ae23eee7f473cbafd48f64f0ffa01/consensus/types/src/data_column_sidecar.rs#L129

kevaundray commented 4 months ago

Added a test here showing the CI failing even if we use vectors as return types, and then removing the stack allocation being done inside of map and showing it passing.

The stack allocation is akin to the let blob = KzgBlob::from_bytes(blob).map_err(KzgError::from)?; line in the original codebase.

jimmygchen commented 3 months ago

This has been fixed after upgrading to rust_eth_kzg = "0.3.4"

sigp / lighthouse