Open tony-iqlusion opened 9 months ago
This might be related to #1356, although curiously people claim things work with Unix domain sockets and the errors only occur over TCP.
I did confirm read_exact
is used: https://github.com/informalsystems/tendermint-rs/blob/64e9b33/p2p/src/secret_connection.rs#L622
We're continuing to receive reports of this problem even with tendermint-p2p
v0.34.1: https://github.com/iqlusioninc/tmkms/issues/729#issuecomment-1979571645
There's now a PR to tmkms to have it aggregate chunks of data itself and keep trying to decode them as a proto: https://github.com/iqlusioninc/tmkms/pull/903
...and it really seems like tendermint-p2p
should probably be doing that. Really I think the API should be refactored to be message-oriented, rather than stream-oriented.
What went wrong?
There are ongoing reports of failure to read a complete message in the
tendermint-p2p
crate which seem to be manifesting as an incomplete Protobuf message in TMKMS:https://github.com/iqlusioninc/tmkms/issues/729
Steps to reproduce
Unfortunately this seems very chain-specific and I have not been given proper repro steps myself in the aforementioned issue.
Definition of "done"
It's unclear what actually needs to change here.
A somewhat related issue is the API that
tendermint-p2p
exposes: it's a message-oriented protocol, but the interface uses thestd::io::{Read, Write}
traits, which are stream-oriented. It's possible TMKMS is misusing the API, but it's the wrong abstraction to begin with.