grandinetech / grandine

High performance Ethereum consensus client
GNU General Public License v3.0
176 stars, 22 forks

Windows support #41

Open mjzk opened 2 months ago

mjzk commented 2 months ago

This issue tracks all the problems and changes needed to support building Grandine on Windows.
More about this Project idea.

Tasks:

sauliusgrigaitis commented 2 months ago

I'm not sure that Jemalloc is necessary on Windows. Many years ago, when we had support for Windows, we found that the default allocator was doing very well. You can simply run some Holesky validators with Jemalloc and with the default allocator and compare memory usage patterns.

mjzk commented 2 months ago

> I'm not sure that Jemalloc is necessary on Windows. Many years ago, when we had support for Windows, we found that the default allocator was doing very well. You can simply run some Holesky validators with Jemalloc and with the default allocator and compare memory usage patterns.

@sauliusgrigaitis thanks. You are right. In my testing, jemalloc-sys could be compiled with the correct build environment and some tweaks, but the setup is too involved for users. Jemalloc is also not well tested on Windows in real-world usage, and Lighthouse simply disables jemalloc on Windows. So I think it is still better to just disable it for now, for simplicity.
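Disabling jemalloc on Windows can be expressed with conditional compilation. The following is a minimal sketch, not Grandine's actual code: the `jemalloc` Cargo feature and the `tikv_jemallocator` crate named in the comment are assumptions for illustration, and the jemalloc line is left commented out so the snippet compiles with only the standard library.

```rust
use std::alloc::System;

// On Windows, or whenever the hypothetical `jemalloc` feature is off,
// register the operating system's default allocator explicitly.
#[cfg(any(windows, not(feature = "jemalloc")))]
#[global_allocator]
static GLOBAL: System = System;

// Assumption: on non-Windows targets with the feature enabled, a jemalloc
// allocator from an external crate would be registered instead, e.g.
//
// #[cfg(all(not(windows), feature = "jemalloc"))]
// #[global_allocator]
// static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

/// Reports which allocator this build configuration selects.
pub fn allocator_name() -> &'static str {
    if cfg!(all(not(windows), feature = "jemalloc")) {
        "jemalloc"
    } else {
        "system"
    }
}
```

With this layout, Windows users never need a jemalloc build environment, and comparing memory usage on other platforms only requires toggling the feature.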

mjzk commented 2 months ago

@sauliusgrigaitis The current PoC now builds successfully after I added a small cross-platform abstraction for metrics.

One notable change is in the lints in the workspace Cargo.toml: the new metrics abstraction contains unsafe code, because the windows crate only exposes the relevant APIs as unsafe functions.

I don't know if you mind this change, but currently there is no particularly simple way for me to override the lints. In fact, if you want to collect statistics like idle time on Windows, unsafe code is necessary. It's just a matter of whether the unsafe code lives in a library or in your own code.
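To make the trade-off concrete, here is a hedged sketch of a cross-platform idle-time helper. It is not the code from the PoC: it declares the Win32 `GetSystemTimes` function directly instead of going through the windows crate, and the module and function names (`platform`, `idle_ticks`, `filetime_ticks`) are made up for illustration. Note that a local `#[allow(unsafe_code)]` only works when the workspace lint level is `deny`; a `forbid` level cannot be overridden this way, which would explain why a change to the workspace lints was needed.

```rust
/// Combines the two 32-bit halves of a Win32 FILETIME into 100 ns ticks.
pub fn filetime_ticks(low: u32, high: u32) -> u64 {
    (u64::from(high) << 32) | u64::from(low)
}

#[cfg(windows)]
#[allow(unsafe_code)] // local opt-out; has no effect if the lint is `forbid`
mod platform {
    use super::filetime_ticks;

    // Mirrors the layout of the Win32 FILETIME struct.
    #[repr(C)]
    #[derive(Default)]
    struct FileTime {
        low: u32,
        high: u32,
    }

    #[link(name = "kernel32")]
    extern "system" {
        fn GetSystemTimes(idle: *mut FileTime, kernel: *mut FileTime, user: *mut FileTime) -> i32;
    }

    /// System-wide idle time in 100 ns ticks, or `None` if the call fails.
    pub fn idle_ticks() -> Option<u64> {
        let mut idle = FileTime::default();
        let mut kernel = FileTime::default();
        let mut user = FileTime::default();
        // SAFETY: all three pointers refer to live, writable FILETIME-sized structs.
        let ok = unsafe { GetSystemTimes(&mut idle, &mut kernel, &mut user) };
        (ok != 0).then(|| filetime_ticks(idle.low, idle.high))
    }
}

#[cfg(not(windows))]
mod platform {
    /// Placeholder for other platforms; a real implementation would read
    /// the platform's own counters instead of returning `None`.
    pub fn idle_ticks() -> Option<u64> {
        None
    }
}

pub use platform::idle_ticks;
```

Keeping the unsafe call inside one small module confines the lint exception to a single place that is easy to review.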

sauliusgrigaitis commented 2 months ago

@mjzk did you try to run Holesky validators on Windows? Let's get back to code review after the entire functionality is confirmed.

mjzk commented 2 months ago

> @mjzk did you try to run Holesky validators on Windows? Let's get back to code review after the entire functionality is confirmed.

Not yet. I have 69 HolETH, so I guess this is enough if the staking requirement on the Holesky testnet is the same as mainnet's 32 ETH. In fact, I haven’t run a consensus client as a validator yet. My previous attempts with Lighthouse or our Grandine only ran them as beacon nodes. I will probably give it a try tomorrow. How do we determine if a validator is working correctly? By the relevant log output? Do you have any suggestions? @sauliusgrigaitis

mjzk commented 2 months ago

A stack overflow happens at runtime during initialization, in the ecdsa crate's signing path (k256 lookup-table initialization). This needs more investigation:

stack trace:

7: once_cell::imp::impl$4::initialize::closure$0<array$<k256::arithmetic::mul::LookupTable,33>,once_cell::sync::impl$6::get_or_init::closure_env$0<array$<k256::arithmetic::mul::LookupTable,33>,once_cell::sync::impl$11::force::closure_env$0<array$<k256::arithmetic::mul::LookupTable,33>,array$<k256::arithmetic::mul::LookupTable,33> (*)()> >,enum2$<once_cell::sync::impl$6::get_or_init::Void> >
        at C:\Users\conta\.cargo\registry\src\index.crates.io-6f17d22bba15001f\once_cell-1.19.0\src\imp_pl.rs:52
8: once_cell::imp::initialize_inner
        at C:\Users\conta\.cargo\registry\src\index.crates.io-6f17d22bba15001f\once_cell-1.19.0\src\imp_pl.rs:146
9: once_cell::imp::OnceCell<array$<k256::arithmetic::mul::LookupTable,33> >::initialize<array$<k256::arithmetic::mul::LookupTable,33>,once_cell::sync::impl$6::get_or_init::closure_env$0<array$<k256::arithmetic::mul::LookupTable,33>,once_cell::sync::impl$11::force::closure_env$0<array$<k256::arithmetic::mul::LookupTable,33>,array$<k256::arithmetic::mul::LookupTable,33> (*)()> >,enum2$<once_cell::sync::impl$6::get_or_init::Vo
        at C:\Users\conta\.cargo\registry\src\index.crates.io-6f17d22bba15001f\once_cell-1.19.0\src\imp_pl.rs:52
10: once_cell::sync::OnceCell<array$<k256::arithmetic::mul::LookupTable,33> >::get_or_try_init<array$<k256::arithmetic::mul::LookupTable,33>,once_cell::sync::impl$6::get_or_init::closure_env$0<array$<k256::arithmetic::mul::LookupTable,33>,once_cell::sync::impl$11::force::closure_env$0<array$<k256::arithmetic::mul::LookupTable,33>,array$<k256::arithmetic::mul::LookupTable,33> (*)()> >,enum2$<once_cell::sync::impl$6::get_or_in
        at C:\Users\conta\.cargo\registry\src\index.crates.io-6f17d22bba15001f\once_cell-1.19.0\src\lib.rs:1161
11: once_cell::sync::OnceCell<array$<k256::arithmetic::mul::LookupTable,33> >::get_or_init<array$<k256::arithmetic::mul::LookupTable,33>,once_cell::sync::impl$11::force::closure_env$0<array$<k256::arithmetic::mul::LookupTable,33>,array$<k256::arithmetic::mul::LookupTable,33> (*)()> >
        at C:\Users\conta\.cargo\registry\src\index.crates.io-6f17d22bba15001f\once_cell-1.19.0\src\lib.rs:1120
12: once_cell::sync::Lazy<array$<k256::arithmetic::mul::LookupTable,33>,array$<k256::arithmetic::mul::LookupTable,33> (*)()>::force<array$<k256::arithmetic::mul::LookupTable,33>,array$<k256::arithmetic::mul::LookupTable,33> (*)()>
        at C:\Users\conta\.cargo\registry\src\index.crates.io-6f17d22bba15001f\once_cell-1.19.0\src\lib.rs:1313
13: once_cell::sync::impl$12::deref<array$<k256::arithmetic::mul::LookupTable,33>,array$<k256::arithmetic::mul::LookupTable,33> (*)()>
        at C:\Users\conta\.cargo\registry\src\index.crates.io-6f17d22bba15001f\once_cell-1.19.0\src\lib.rs:1377
14: k256::arithmetic::mul::impl$6::mul_by_generator
        at C:\Users\conta\.cargo\registry\src\index.crates.io-6f17d22bba15001f\k256-0.13.3\src\arithmetic\mul.rs:396
15: ecdsa::hazmat::sign_prehashed<k256::Secp256k1,k256::arithmetic::scalar::Scalar>
        at C:\Users\conta\.cargo\registry\src\index.crates.io-6f17d22bba15001f\ecdsa-0.16.9\src\hazmat.rs:245
16: k256::ecdsa::impl$1::try_sign_prehashed<k256::arithmetic::scalar::Scalar>
        at C:\Users\conta\.cargo\registry\src\index.crates.io-6f17d22bba15001f\k256-0.13.3\src\ecdsa.rs:192
17: ecdsa::hazmat::SignPrimitive::try_sign_prehashed_rfc6979<k256::arithmetic::scalar::Scalar,k256::Secp256k1,digest::core_api::wrapper::CoreWrapper<digest::core_api::ct_variable::CtVariableCoreWrapper<sha2::core_api::Sha256VarCore,typenum::uint::UInt<typenum::uint::UInt<typenum::uint::UInt<typenum::uint::UInt<typenum::uint::UInt<typenum::uint::UInt<typenum::uint::UTerm,typenum::bit::B1>,typenum::bit::B0>,typenum::bit::B0>,t
        at C:\Users\conta\.cargo\registry\src\index.crates.io-6f17d22bba15001f\ecdsa-0.16.9\src\hazmat.rs:111
18: ecdsa::signing::impl$5::sign_prehash_with_rng<k256::Secp256k1,rand_core::os::OsRng>
        at C:\Users\conta\.cargo\registry\src\index.crates.io-6f17d22bba15001f\ecdsa-0.16.9\src\signing.rs:212
19: ecdsa::signing::impl$4::try_sign_digest_with_rng<k256::Secp256k1,digest::core_api::wrapper::CoreWrapper<sha3::Keccak256Core>,rand_core::os::OsRng>
        at C:\Users\conta\.cargo\registry\src\index.crates.io-6f17d22bba15001f\ecdsa-0.16.9\src\signing.rs:194
20: enr::keys::k256_key::impl$0::sign_v4
        at C:\Users\conta\.cargo\registry\src\index.crates.io-6f17d22bba15001f\enr-0.10.0\src\keys\k256_key.rs:32
21: enr::keys::combined::impl$2::sign_v4
        at C:\Users\conta\.cargo\registry\src\index.crates.io-6f17d22bba15001f\enr-0.10.0\src\keys\combined.rs:47
22: enr::builder::Builder<enum2$<enr::keys::combined::CombinedKey> >::signature<enum2$<enr::keys::combined::CombinedKey> >
        at C:\Users\conta\.cargo\registry\src\index.crates.io-6f17d22bba15001f\enr-0.10.0\src\builder.rs:127
23: enr::builder::Builder<enum2$<enr::keys::combined::CombinedKey> >::build<enum2$<enr::keys::combined::CombinedKey> >
        at C:\Users\conta\.cargo\registry\src\index.crates.io-6f17d22bba15001f\enr-0.10.0\src\builder.rs:164
24: eth2_libp2p::discovery::enr::build_enr
        at C:\repo\grandine\eth2_libp2p\src\discovery\enr.rs:231
25: eth2_libp2p::discovery::enr::build_or_load_enr<types::preset::Mainnet>
        at C:\repo\grandine\eth2_libp2p\src\discovery\enr.rs:138
26: eth2_libp2p::service::impl$0::new::async_fn$0<usize,types::preset::Mainnet>
        at C:\repo\grandine\eth2_libp2p\src\service\mod.rs:169
27: core::future::future::impl$1::poll<alloc::boxed::Box<enum2$<eth2_libp2p::service::impl$0::new::async_fn_env$0<usize,types::preset::Mainnet> >,alloc::alloc::Global> >
        at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23\library\core\src\future\future.rs:123
28: p2p::network::impl$0::new::async_fn$0<types::preset::Mainnet>
        at C:\repo\grandine\p2p\src\network.rs:164
29: runtime::runtime::run_after_genesis::async_fn$0<types::preset::Mainnet>
        at C:\repo\grandine\runtime\src\runtime.rs:551
30: grandine::impl$0::run::async_fn$0<types::preset::Mainnet>
        at C:\repo\grandine\grandine\src\main.rs:286
31: tokio::runtime::park::impl$4::block_on::closure$0<enum2$<grandine::impl$0::run::async_fn_env$0<types::preset::Mainnet> > >
        at C:\Users\conta\.cargo\registry\src\index.crates.io-6f17d22bba15001f\tokio-1.38.0\src\runtime\park.rs:281
32: tokio::runtime::park::CachedParkThread::block_on<enum2$<grandine::impl$0::run::async_fn_env$0<types::preset::Mainnet> > >
        at C:\Users\conta\.cargo\registry\src\index.crates.io-6f17d22bba15001f\tokio-1.38.0\src\runtime\park.rs:281
33: tokio::runtime::context::blocking::BlockingRegionGuard::block_on<enum2$<grandine::impl$0::run::async_fn_env$0<types::preset::Mainnet> > >
        at C:\Users\conta\.cargo\registry\src\index.crates.io-6f17d22bba15001f\tokio-1.38.0\src\runtime\context\blocking.rs:66
34: tokio::runtime::scheduler::multi_thread::impl$0::block_on::closure$0<enum2$<grandine::impl$0::run::async_fn_env$0<types::preset::Mainnet> > >
        at C:\Users\conta\.cargo\registry\src\index.crates.io-6f17d22bba15001f\tokio-1.38.0\src\runtime\scheduler\multi_thread\mod.rs:87
35: tokio::runtime::context::runtime::enter_runtime<tokio::runtime::scheduler::multi_thread::impl$0::block_on::closure_env$0<enum2$<grandine::impl$0::run::async_fn_env$0<types::preset::Mainnet> > >,enum2$<core::result::Result<tuple$<>,anyhow::Error> > >
        at C:\Users\conta\.cargo\registry\src\index.crates.io-6f17d22bba15001f\tokio-1.38.0\src\runtime\context\runtime.rs:65
36: tokio::runtime::scheduler::multi_thread::MultiThread::block_on<enum2$<grandine::impl$0::run::async_fn_env$0<types::preset::Mainnet> > >
        at C:\Users\conta\.cargo\registry\src\index.crates.io-6f17d22bba15001f\tokio-1.38.0\src\runtime\scheduler\multi_thread\mod.rs:89
37: tokio::runtime::runtime::Runtime::block_on<enum2$<grandine::impl$0::run::async_fn_env$0<types::preset::Mainnet> > >
        at C:\Users\conta\.cargo\registry\src\index.crates.io-6f17d22bba15001f\tokio-1.38.0\src\runtime\runtime.rs:349
38: grandine::block_on<enum2$<grandine::impl$0::run::async_fn_env$0<types::preset::Mainnet> > >
        at C:\repo\grandine\grandine\src\main.rs:774
39: grandine::impl$0::run_with_restart::closure$0<types::preset::Mainnet>
        at C:\repo\grandine\grandine\src\main.rs:112
40: core::ops::function::FnOnce::call_once<grandine::impl$0::run_with_restart::closure_env$0<types::preset::Mainnet>,tuple$<> >
        at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23\library\core\src\ops\function.rs:250
41: core::panic::unwind_safe::impl$25::call_once<enum2$<core::result::Result<tuple$<>,anyhow::Error> >,grandine::impl$0::run_with_restart::closure_env$0<types::preset::Mainnet> >
        at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23\library\core\src\panic\unwind_safe.rs:273
42: std::panicking::try::do_call<core::panic::unwind_safe::AssertUnwindSafe<grandine::impl$0::run_with_restart::closure_env$0<types::preset::Mainnet> >,enum2$<core::result::Result<tuple$<>,anyhow::Error> > >
        at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23\library\std\src\panicking.rs:559
43: std::panicking::try::do_catch<core::panic::unwind_safe::AssertUnwindSafe<rayon_core::join::join_context::call_a::closure_env$0<rayon::iter::collect::consumer::CollectResult<bls::signature::Signature>,rayon::iter::plumbing::bridge_producer_consumer::helper::closure_env$0<rayon::slice::IterProducer<helper_functions::verifier::Triple>,rayon::iter::map::MapConsumer<rayon::iter::map::MapConsumer<rayon::iter::while_some::While
44: std::panicking::try<enum2$<core::result::Result<tuple$<>,anyhow::Error> >,core::panic::unwind_safe::AssertUnwindSafe<grandine::impl$0::run_with_restart::closure_env$0<types::preset::Mainnet> > >
        at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23\library\std\src\panicking.rs:523
45: std::panic::catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<grandine::impl$0::run_with_restart::closure_env$0<types::preset::Mainnet> >,enum2$<core::result::Result<tuple$<>,anyhow::Error> > >
        at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23\library\std\src\panic.rs:149
46: grandine::Context::run_with_restart<types::preset::Mainnet>
        at C:\repo\grandine\grandine\src\main.rs:109
47: grandine::try_main
        at C:\repo\grandine\grandine\src\main.rs:517
48: grandine::main
        at C:\repo\grandine\grandine\src\main.rs:311
49: core::ops::function::FnOnce::call_once<std::process::ExitCode (*)(),tuple$<> >
        at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23\library\core\src\ops\function.rs:250
50: std::sys_common::backtrace::__rust_begin_short_backtrace<std::process::ExitCode (*)(),std::process::ExitCode>
        at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23\library\std\src\sys_common\backtrace.rs:155
51: std::rt::lang_start::closure$0<std::process::ExitCode>
        at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23\library\std\src\rt.rs:159
52: std::rt::lang_start_internal
        at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library\std\src\rt.rs:141
53: std::rt::lang_start<std::process::ExitCode>
        at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23\library\std\src\rt.rs:158
54: main
55: __scrt_common_main_seh
        at D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:288
56: BaseThreadInitThunk
57: RtlUserThreadStart

mjzk commented 2 months ago

Today I spent more time on this problem. I found that the call from Grandine into our eth2_libp2p triggers it, but the underlying libraries such as enr are just the crates maintained by Lighthouse. The weird thing is that neither Lighthouse nor my standalone test code causes the stack overflow. So one guess is that this problem is related to the project-wide configuration in Grandine. More investigation is needed.

sauliusgrigaitis commented 2 months ago

Do you build with --release?

mjzk commented 2 months ago

> Do you build with --release?

Yes, the BN works in release mode! This suggests that the compiler's optimizations avoid the stack problem seen in the debug build.

If this is acceptable, we are close to completing this task.
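For debug builds, a common alternative to relying on `--release` is to run the stack-heavy work on a thread with an explicitly larger stack. This is not what was done in this thread; it is a minimal sketch under the assumption that the overflow comes from large values built on the stack (unoptimized builds tend to use more stack, and the default main-thread stack for MSVC executables is 1 MiB, smaller than the usual 8 MiB on Linux). The names `run_with_stack` and `sum_large_table` are hypothetical.

```rust
use std::thread;

/// Runs `f` on a new thread with `stack_bytes` of stack and returns its result.
pub fn run_with_stack<T, F>(stack_bytes: usize, f: F) -> std::io::Result<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let handle = thread::Builder::new()
        .name("big-stack".to_owned())
        .stack_size(stack_bytes)
        .spawn(f)?;
    // Re-raise a panic from the worker instead of swallowing it.
    match handle.join() {
        Ok(value) => Ok(value),
        Err(payload) => std::panic::resume_unwind(payload),
    }
}

/// Stand-in for a stack-heavy initialization (hypothetical workload):
/// builds a 4 MiB table on the stack and sums it.
pub fn sum_large_table() -> u64 {
    let table = [1u8; 4 * 1024 * 1024];
    table.iter().map(|&b| u64::from(b)).sum()
}
```

Calling `run_with_stack(16 * 1024 * 1024, sum_large_table)` gives the workload 16 MiB of stack regardless of the platform's default for the main thread.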

mjzk commented 1 month ago

After a week of intensive work, Grandine can now run a validator on Windows. My validator's status can be seen here.

Some details:

  1. Lighthouse also encounters a stack overflow issue on Windows.
  2. Reth does not work well on Windows, showing "Database commit error code: 998", though this is not due to database file corruption.
  3. With proper configuration, using Grandine + Reth as a validator node works fine on a laptop, although the synchronization process is relatively slow.

Currently, there are still some issues with cargo test, stemming from the consensus-spec-tests, specifically related to SSZ deserialization. However, these tests did not run on my Linux machine. Synchronizing the consensus-spec-tests is challenging, and the details require further investigation.

Overall, the progress on Windows has been good. However, there are many shared engineering flaws in Ethereum infrastructure when it comes to Rust projects, which is concerning.

mjzk commented 1 month ago

Adding some logs to track the test fixes:


failures:

---- spec_tests::invalid::basic_vector_uint16_4_consensus_spec_tests_tests_general_phase0_ssz_generic_basic_vector_invalid_vec_uint16_4_max_one_more stdout ----
thread 'spec_tests::invalid::basic_vector_uint16_4_consensus_spec_tests_tests_general_phase0_ssz_generic_basic_vector_invalid_vec_uint16_4_max_one_more' panicked at C:\repo\grandine\spec_test_utils\src\lib.rs:100:14:
the file should be compressed with Snappy: Offset { offset: 65316, dst_pos: 0 }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- spec_tests::valid::basic_vector_uint16_5_consensus_spec_tests_tests_general_phase0_ssz_generic_basic_vector_valid_vec_uint16_5_max stdout ----
thread 'spec_tests::valid::basic_vector_uint16_5_consensus_spec_tests_tests_general_phase0_ssz_generic_basic_vector_valid_vec_uint16_5_max' panicked at C:\repo\grandine\spec_test_utils\src\lib.rs:100:14:
the file should be compressed with Snappy: Offset { offset: 65316, dst_pos: 0 }

---- spec_tests::valid::basic_vector_uint16_5_consensus_spec_tests_tests_general_phase0_ssz_generic_basic_vector_valid_vec_uint16_5_random stdout ----
thread 'spec_tests::valid::basic_vector_uint16_5_consensus_spec_tests_tests_general_phase0_ssz_generic_basic_vector_valid_vec_uint16_5_random' panicked at C:\repo\grandine\spec_test_utils\src\lib.rs:100:14:
the file should be compressed with Snappy: Offset { offset: 20260, dst_pos: 0 }

failures:
    spec_tests::invalid::basic_vector_uint16_4_consensus_spec_tests_tests_general_phase0_ssz_generic_basic_vector_invalid_vec_uint16_4_max_one_more
    spec_tests::valid::basic_vector_uint16_5_consensus_spec_tests_tests_general_phase0_ssz_generic_basic_vector_valid_vec_uint16_5_max
    spec_tests::valid::basic_vector_uint16_5_consensus_spec_tests_tests_general_phase0_ssz_generic_basic_vector_valid_vec_uint16_5_random

test result: FAILED. 1884 passed; 3 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.30s

error: test failed, to rerun pass `-p ssz --lib`

mjzk commented 1 month ago

Added a companion PR in dedicated_executor.

mjzk commented 1 month ago

The problem has been located. On Windows, git's autocrlf setting is true by default, which silently rewrites line endings in some of the tests' binary artifacts (just the files mentioned above).

A change to the CI action will come soon.
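To illustrate why this breaks the Snappy-compressed fixtures, here is a small sketch that mimics the LF to CRLF conversion of a text-mode checkout on raw bytes. It is an illustration only, not git's actual implementation, and the function names are hypothetical. Any file git misclassifies as text and that happens to contain a 0x0A byte gains an extra 0x0D byte, so the compressed stream no longer decodes. The usual remedies are checking out with `core.autocrlf` set to `false`, or marking such paths as binary in `.gitattributes`; which one the CI change uses is not stated here.

```rust
/// Mimics a text-mode checkout: every LF not already preceded by CR becomes CRLF.
pub fn lf_to_crlf(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len());
    let mut prev = 0u8;
    for &byte in input {
        if byte == b'\n' && prev != b'\r' {
            out.push(b'\r');
        }
        out.push(byte);
        prev = byte;
    }
    out
}

/// Returns true if the checkout-style conversion would alter these bytes.
pub fn would_be_corrupted(input: &[u8]) -> bool {
    lf_to_crlf(input).as_slice() != input
}
```

A check like `would_be_corrupted` could be run over fixture files on Linux to predict which ones are at risk, since only files containing a bare 0x0A byte can be affected.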

mjzk commented 1 month ago

The new CI action has passed.

Details of the new CI action:

  1. Add a runs-on matrix so the jobs run on multiple OSes.
  2. The format check and clippy steps run only on Linux, while the cargo test step runs on both OSes. Running formatting and clippy on both OSes is not very meaningful.

mjzk commented 1 month ago

PR #42 is now ready for review.

mjzk commented 1 month ago

@sauliusgrigaitis I can add a new step just for cargo build of the binary to resolve issue #39, if you don't object.

mjzk commented 1 month ago

@sauliusgrigaitis The screensaver and sleep on Windows have been tested.

Both work without problems.

FYI, before sleeping, Reth output the following:

2024-10-09T02:53:58.069057Z ERROR Invalid JWT: IAT (issued-at) claim is not within ±60 seconds from the current time
2024-10-09T02:54:00.106834Z ERROR Invalid JWT: IAT (issued-at) claim is not within ±60 seconds from the current time
2024-10-09T02:54:00.555525Z ERROR Invalid JWT: IAT (issued-at) claim is not within ±60 seconds from the current time
2024-10-09T02:54:00.617978Z ERROR Invalid JWT: IAT (issued-at) claim is not within ±60 seconds from the current time
2024-10-09T02:54:00.633123Z ERROR Invalid JWT: IAT (issued-at) claim is not within ±60 seconds from the current time

Grandine just reports something like "error while downloading Eth1 blocks".
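The Reth message refers to the freshness check on the JWT's `iat` (issued-at) claim used for Engine API authentication: a token is rejected when its `iat` is more than 60 seconds away from the execution client's clock. The sketch below shows that check in isolation; it is not Reth's or Grandine's code, and the function and constant names are hypothetical.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Tolerance taken from the log message above: ±60 seconds.
pub const IAT_TOLERANCE_SECS: u64 = 60;

/// True if `iat` lies within the tolerance of `now` (both in Unix seconds).
pub fn iat_is_fresh(iat: u64, now: u64) -> bool {
    iat.abs_diff(now) <= IAT_TOLERANCE_SECS
}

/// Current Unix time in seconds, e.g. for minting a fresh `iat` per request.
/// Falls back to 0 if the system clock is set before the Unix epoch.
pub fn unix_now() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0)
}
```

Under this check, a token minted with `unix_now()` immediately before each request passes, while a token reused across a long pause, or one issued by a process whose clock differs by more than a minute, is rejected.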