informalsystems / tendermint-rs

Client libraries for Tendermint/CometBFT in Rust!
Apache License 2.0
607 stars 224 forks source link

increase tendermint_p2p::secret_connection::DATA_MAX_SIZE #1356

Closed mkaczanowski closed 7 months ago

mkaczanowski commented 1 year ago

What went wrong?

Link to the issue: https://github.com/iqlusioninc/tmkms/issues/729#issuecomment-1722363214

TMKMS (signer) use:

tendermint_p2p::secret_connection::DATA_MAX_SIZE

to define the buffer size for incoming (signing / privval) requests. It happened that SEI network pushes the block proposals that are up to 6 times bigger than 1024 bytes (current value of DATA_MAX_SIZE) and that makes the tmkms to fail

@tony-iqlusion suggested this might be the issue of tendermint-p2p. I would like to understand if the value should be bumped, or maybe tmkms uses it incorrectly?

Steps to reproduce

  1. setup SEI validator node
  2. setup tmkms
  3. see tmkms fail on the underflow
  4. change the tmkms buffer size (tendermint_p2p::secret_connection::DATA_MAX_SIZE)
  5. see tmkms succeed

Definition of "done"

either the DATA_MAX_SIZE is bumped or there is a clear explanation of how and where things should be defined

tony-iqlusion commented 1 year ago

Perhaps we could raise DATA_MAX_SIZE to something much larger like 64kB.

I'm not sure how it was originally chosen, but it's not like the systems we're running it on are hurting for memory.

qezz commented 8 months ago

Following up on https://github.com/iqlusioninc/tmkms/issues/729#issuecomment-1894761567

Simply bumping the DATA_MAX_SIZE doesn't seem to help. Unix sock connection still works fine.

But TCP connection gets interrupted at https://github.com/informalsystems/tendermint-rs/blob/bcc0b377812b8e53a02dff156988569c5b3c81a2/p2p/src/secret_connection.rs#L486

with

Jan 17 17:25:06 tmkms[1191569]: 2024-01-17T17:25:06.025779Z  INFO tmkms::connection::tcp: KMS node ID: ...
Jan 17 17:25:08 tmkms[1191569]: 2024-01-17T17:25:08.899507Z ERROR tmkms::client: [atlantic-2@tcp://....:...] protocol error:
Jan 17 17:25:08 tmkms[1191569]:    0: io error
Jan 17 17:25:08 tmkms[1191569]:    1: Connection reset by peer (os error 104)
Jan 17 17:25:08 tmkms[1191569]:   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ BACKTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Jan 17 17:25:08 tmkms[1191569]:                                 ⋮ 5 frames hidden ⋮
Jan 17 17:25:08 tmkms[1191569]:    6: flex_error::tracer_impl::eyre::<impl flex_error::tracer::ErrorMessageTracer for eyre::Report>::new_message::h0b8460f0e4a979f4
Jan 17 17:25:08 tmkms[1191569]:       at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/flex-error-0.4.4/src/tracer_impl/eyre.rs:10
Jan 17 17:25:08 tmkms[1191569]:    7: <flex_error::source::DisplayOnly<E> as flex_error::source::ErrorSource<Tracer>>::error_details::h57d5877ceb76caa7
Jan 17 17:25:08 tmkms[1191569]:       at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/flex-error-0.4.4/src/source.rs:142
Jan 17 17:25:08 tmkms[1191569]:    8: tendermint_p2p::error::Error::trace_from::ha6080d371e5d66bf
Jan 17 17:25:08 tmkms[1191569]:       at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/flex-error-0.4.4/src/macros.rs:547
Jan 17 17:25:08 tmkms[1191569]:    9: tendermint_p2p::error::Error::io::h426e6ce68aac4f61
Jan 17 17:25:08 tmkms[1191569]:       at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/flex-error-0.4.4/src/macros.rs:863
Jan 17 17:25:08 tmkms[1191569]:   10: <tendermint_p2p::error::Error as core::convert::From<std::io::error::Error>>::from::hbbb4244039ac05b8
Jan 17 17:25:08 tmkms[1191569]:       at /root/.cargo/git/checkouts/tendermint-rs-8ec10096483ea4be/823e79c/p2p/src/error.rs:67
Jan 17 17:25:08 tmkms[1191569]:   11: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual::h3f5e03ebe27c2693
Jan 17 17:25:08 tmkms[1191569]:       at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/result.rs:1962
Jan 17 17:25:08 tmkms[1191569]:   12: tendermint_p2p::secret_connection::share_auth_signature::h6159957cf21c45a1
Jan 17 17:25:08 tmkms[1191569]:       at /root/.cargo/git/checkouts/tendermint-rs-8ec10096483ea4be/823e79c/p2p/src/secret_connection.rs:483
Jan 17 17:25:08 tmkms[1191569]:   13: tendermint_p2p::secret_connection::SecretConnection<IoHandler>::new::h7573f282a1fcd921
Jan 17 17:25:08 tmkms[1191569]:       at /root/.cargo/git/checkouts/tendermint-rs-8ec10096483ea4be/823e79c/p2p/src/secret_connection.rs:315
Jan 17 17:25:08 tmkms[1191569]:   14: tmkms::connection::tcp::open_secret_connection::h9c5cb53469e7f11a
Jan 17 17:25:08 tmkms[1191569]:       at /checkout/src/connection/tcp.rs:45
Jan 17 17:25:08 tmkms[1191569]:   15: tmkms::session::Session::open::ha505adff198f368f
Jan 17 17:25:08 tmkms[1191569]:       at /checkout/src/session.rs:40
Jan 17 17:25:08 tmkms[1191569]:   16: tmkms::client::run_client::{{closure}}::hcd7575440f34a02d
Jan 17 17:25:08 tmkms[1191569]:       at /checkout/src/client.rs:104
Jan 17 17:25:08 tmkms[1191569]:   17: std::panicking::try::do_call::heeb3168f3784dd06
Jan 17 17:25:08 tmkms[1191569]:       at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:502
Jan 17 17:25:08 tmkms[1191569]:   18: __rust_try<unknown>
Jan 17 17:25:08 tmkms[1191569]:       at <unknown source file>:<unknown line>
Jan 17 17:25:08 tmkms[1191569]:   19: std::panicking::try::hc37a7f7faff34a19
Jan 17 17:25:08 tmkms[1191569]:       at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:466
Jan 17 17:25:08 tmkms[1191569]:   20: std::panic::catch_unwind::h4db7d0408a4313cc
Jan 17 17:25:08 tmkms[1191569]:       at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panic.rs:142
Jan 17 17:25:08 tmkms[1191569]:   21: tmkms::client::run_client::ha91ae47579a3f2a2
Jan 17 17:25:08 tmkms[1191569]:       at /checkout/src/client.rs:104
Jan 17 17:25:08 tmkms[1191569]:   22: tmkms::client::main_loop::h13c002e8a7fbb6e0
Jan 17 17:25:08 tmkms[1191569]:       at /checkout/src/client.rs:68
Jan 17 17:25:08 tmkms[1191569]:   23: tmkms::client::Client::spawn::{{closure}}::h1859ca7e2f68f87b
Jan 17 17:25:08 tmkms[1191569]:       at /checkout/src/client.rs:46
Jan 17 17:25:08 tmkms[1191569]:                                 ⋮ 14 frames hidden ⋮

On the node's side

Error: error creating node: error="error with private validator socket client: can't get pubkey: send: endpoint connection timed out" closerError=""

To overwrite the tmkms deps, I used

[patch.crates-io]
tendermint = { git = "https://github.com/qezz/tendermint-rs", tag = "v0.34.0-tcp-conn-patch.4", package = "tendermint", features = ["secp256k1"] }
tendermint-config = { git = "https://github.com/qezz/tendermint-rs", tag = "v0.34.0-tcp-conn-patch.4", package = "tendermint-config" }
tendermint-p2p = { git = "https://github.com/qezz/tendermint-rs", tag = "v0.34.0-tcp-conn-patch.4", package = "tendermint-p2p" }
tendermint-proto = { git = "https://github.com/qezz/tendermint-rs", tag = "v0.34.0-tcp-conn-patch.4", package = "tendermint-proto" }

tried these values

pub const DATA_MAX_SIZE: usize = 10240;
pub const DATA_MAX_SIZE: usize = 1024000;
pub const DATA_MAX_SIZE: usize = 65536;

I didn't have a chance to look deeper, unfortunately

tony-iqlusion commented 8 months ago

This seems to be some sort of bug in the tendermint-p2p crate.

It's certainly not helped by the API: where Secret Connection is a message-oriented protocol, tendermint-p2p uses io::{Read, Write} for its API, which are intended for a stream-oriented protocol like TCP/TLS.

mzabaluev commented 7 months ago

Let's try with #1393 and re-consider if that does not help. This proposal looks like suppressing the symptom to me.

qezz commented 7 months ago

shall we try the new version on a testnet?