Closed mkaczanowski closed 7 months ago
Perhaps we could raise DATA_MAX_SIZE
to something much larger like 64kB.
I'm not sure how it was originally chosen, but it's not like the systems we're running it on are hurting for memory.
Following up on https://github.com/iqlusioninc/tmkms/issues/729#issuecomment-1894761567
Simply bumping the DATA_MAX_SIZE
doesn't seem to help. Unix sock connection still works fine.
But TCP connection gets interrupted at https://github.com/informalsystems/tendermint-rs/blob/bcc0b377812b8e53a02dff156988569c5b3c81a2/p2p/src/secret_connection.rs#L486
with
Jan 17 17:25:06 tmkms[1191569]: 2024-01-17T17:25:06.025779Z INFO tmkms::connection::tcp: KMS node ID: ...
Jan 17 17:25:08 tmkms[1191569]: 2024-01-17T17:25:08.899507Z ERROR tmkms::client: [atlantic-2@tcp://....:...] protocol error:
Jan 17 17:25:08 tmkms[1191569]: 0: io error
Jan 17 17:25:08 tmkms[1191569]: 1: Connection reset by peer (os error 104)
Jan 17 17:25:08 tmkms[1191569]: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ BACKTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Jan 17 17:25:08 tmkms[1191569]: ⋮ 5 frames hidden ⋮
Jan 17 17:25:08 tmkms[1191569]: 6: flex_error::tracer_impl::eyre::<impl flex_error::tracer::ErrorMessageTracer for eyre::Report>::new_message::h0b8460f0e4a979f4
Jan 17 17:25:08 tmkms[1191569]: at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/flex-error-0.4.4/src/tracer_impl/eyre.rs:10
Jan 17 17:25:08 tmkms[1191569]: 7: <flex_error::source::DisplayOnly<E> as flex_error::source::ErrorSource<Tracer>>::error_details::h57d5877ceb76caa7
Jan 17 17:25:08 tmkms[1191569]: at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/flex-error-0.4.4/src/source.rs:142
Jan 17 17:25:08 tmkms[1191569]: 8: tendermint_p2p::error::Error::trace_from::ha6080d371e5d66bf
Jan 17 17:25:08 tmkms[1191569]: at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/flex-error-0.4.4/src/macros.rs:547
Jan 17 17:25:08 tmkms[1191569]: 9: tendermint_p2p::error::Error::io::h426e6ce68aac4f61
Jan 17 17:25:08 tmkms[1191569]: at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/flex-error-0.4.4/src/macros.rs:863
Jan 17 17:25:08 tmkms[1191569]: 10: <tendermint_p2p::error::Error as core::convert::From<std::io::error::Error>>::from::hbbb4244039ac05b8
Jan 17 17:25:08 tmkms[1191569]: at /root/.cargo/git/checkouts/tendermint-rs-8ec10096483ea4be/823e79c/p2p/src/error.rs:67
Jan 17 17:25:08 tmkms[1191569]: 11: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual::h3f5e03ebe27c2693
Jan 17 17:25:08 tmkms[1191569]: at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/result.rs:1962
Jan 17 17:25:08 tmkms[1191569]: 12: tendermint_p2p::secret_connection::share_auth_signature::h6159957cf21c45a1
Jan 17 17:25:08 tmkms[1191569]: at /root/.cargo/git/checkouts/tendermint-rs-8ec10096483ea4be/823e79c/p2p/src/secret_connection.rs:483
Jan 17 17:25:08 tmkms[1191569]: 13: tendermint_p2p::secret_connection::SecretConnection<IoHandler>::new::h7573f282a1fcd921
Jan 17 17:25:08 tmkms[1191569]: at /root/.cargo/git/checkouts/tendermint-rs-8ec10096483ea4be/823e79c/p2p/src/secret_connection.rs:315
Jan 17 17:25:08 tmkms[1191569]: 14: tmkms::connection::tcp::open_secret_connection::h9c5cb53469e7f11a
Jan 17 17:25:08 tmkms[1191569]: at /checkout/src/connection/tcp.rs:45
Jan 17 17:25:08 tmkms[1191569]: 15: tmkms::session::Session::open::ha505adff198f368f
Jan 17 17:25:08 tmkms[1191569]: at /checkout/src/session.rs:40
Jan 17 17:25:08 tmkms[1191569]: 16: tmkms::client::run_client::{{closure}}::hcd7575440f34a02d
Jan 17 17:25:08 tmkms[1191569]: at /checkout/src/client.rs:104
Jan 17 17:25:08 tmkms[1191569]: 17: std::panicking::try::do_call::heeb3168f3784dd06
Jan 17 17:25:08 tmkms[1191569]: at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:502
Jan 17 17:25:08 tmkms[1191569]: 18: __rust_try<unknown>
Jan 17 17:25:08 tmkms[1191569]: at <unknown source file>:<unknown line>
Jan 17 17:25:08 tmkms[1191569]: 19: std::panicking::try::hc37a7f7faff34a19
Jan 17 17:25:08 tmkms[1191569]: at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:466
Jan 17 17:25:08 tmkms[1191569]: 20: std::panic::catch_unwind::h4db7d0408a4313cc
Jan 17 17:25:08 tmkms[1191569]: at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panic.rs:142
Jan 17 17:25:08 tmkms[1191569]: 21: tmkms::client::run_client::ha91ae47579a3f2a2
Jan 17 17:25:08 tmkms[1191569]: at /checkout/src/client.rs:104
Jan 17 17:25:08 tmkms[1191569]: 22: tmkms::client::main_loop::h13c002e8a7fbb6e0
Jan 17 17:25:08 tmkms[1191569]: at /checkout/src/client.rs:68
Jan 17 17:25:08 tmkms[1191569]: 23: tmkms::client::Client::spawn::{{closure}}::h1859ca7e2f68f87b
Jan 17 17:25:08 tmkms[1191569]: at /checkout/src/client.rs:46
Jan 17 17:25:08 tmkms[1191569]: ⋮ 14 frames hidden ⋮
On the node's side
Error: error creating node: error="error with private validator socket client: can't get pubkey: send: endpoint connection timed out" closerError=""
To overwrite the tmkms deps, I used
[patch.crates-io]
tendermint = { git = "https://github.com/qezz/tendermint-rs", tag = "v0.34.0-tcp-conn-patch.4", package = "tendermint", features = ["secp256k1"] }
tendermint-config = { git = "https://github.com/qezz/tendermint-rs", tag = "v0.34.0-tcp-conn-patch.4", package = "tendermint-config" }
tendermint-p2p = { git = "https://github.com/qezz/tendermint-rs", tag = "v0.34.0-tcp-conn-patch.4", package = "tendermint-p2p" }
tendermint-proto = { git = "https://github.com/qezz/tendermint-rs", tag = "v0.34.0-tcp-conn-patch.4", package = "tendermint-proto" }
tried these values
pub const DATA_MAX_SIZE: usize = 10240;
pub const DATA_MAX_SIZE: usize = 1024000;
pub const DATA_MAX_SIZE: usize = 65536;
I didn't have a chance to look deeper, unfortunately
This seems to be some sort of bug in the tendermint-p2p
crate.
It's certainly not helped by the API: where Secret Connection is a message-oriented protocol, tendermint-p2p
uses io::{Read, Write}
for its API, which are intended for a stream-oriented protocol like TCP/TLS.
Let's try with #1393 and re-consider if that does not help. This proposal looks like suppressing the symptom to me.
shall we try the new version on a testnet?
What went wrong?
Link to the issue: https://github.com/iqlusioninc/tmkms/issues/729#issuecomment-1722363214
TMKMS (signer) use:
to define the buffer size for incoming (signing / privval) requests. It happened that SEI network pushes the block proposals that are up to 6 times bigger than 1024 bytes (current value of
DATA_MAX_SIZE
) and that makes the tmkms to fail@tony-iqlusion suggested this might be the issue of tendermint-p2p. I would like to understand if the value should be bumped, or maybe tmkms uses it incorrectly?
Steps to reproduce
tendermint_p2p::secret_connection::DATA_MAX_SIZE
)Definition of "done"
either the DATA_MAX_SIZE is bumped or there is a clear explanation of how and where things should be defined