// vs sei-tendermint proposal
type Proposal struct {
Type SignedMsgType
Height int64
Round int32
PolRound int32
BlockID BlockID
Timestamp time.Time
Signature []byte
// this is a list, and can be very long...
TxKeys []*TxKey
Evidence *EvidenceList
LastCommit *Commit
Header Header
ProposerAddress []byte
}
As you can see, Tendermint protocol has a slight design flaw: Proposal is implicitly assumed to have a fixed length, but it doesn't declare it explicitly. Remote signer implementations assume that protobuf message, which is sent over the wire is always less than 1024 bytes.
And this is not an issue for networks where Proposal holds the original structure, but not for Sei. When Sei accumulates a certain amount of TxKeys, Proposal becomes bigger than 1024 bytes. When remote signer reads it, it reads 1024 bytes and tries to unmarshal it it fails with buffer underflow.
Why not to fix it in remote signer?
Yes, at first glance, it feels that remote signer can just aggregate multiple chunks and unmarshal it once it gets the all the chunks. Why not do just do that?
The reason why it is not possible (at least, in efficient way) is hidden in another issue with old Tendermint (and Sei) tcp signer implementation.
When protobuf message is being send over the wire, it suppose to have a length encoded in the beginning of the message. And Sei puts the length, however, it is not the length of the whole message -- this is the length of a chunk.
So instead of sending length in a first chunk, it sends a length of each chunk in each chunk, which makes it impossible to aggregate it on wire level. It becomes possible on protobuf level, in a very ugly way -- see the implementation here https://github.com/iqlusioninc/tmkms/pull/903 .
In short, I've made a loop which aggregates chunks (with removed length) and tries to unmarshall it on each new chunk. Which means, if Proposal is e.g. 7kb it will try to unmarshal it 7 times: this solution is pretty fragile.
What happens in this PR
While I was debugging the issue, I've noticed that even though tmkms gets Sei's Proposal, since it doesn't have Sei's .proto files it uses original Tendermint .proto to unmrashal the Proposal. So it signs the original proposal, or, "canonical", if you will, but not the extended Sei's Proposal. However, it works fine on Sei Testnet without any issues. I dig further and found that local signer also uses stripped down Proposal, doing CanonicalizeProposal before signing (see privval/file.go#L445).
[!NOTE]
I don't understand why only part of proposal is being signed -- maybe you could point me where to read about this, how does this work?
In any case, I figured that since Sei accepts short Proposal we can just remove additional fields. In this PR, I do it right before sending bytes because of the nature of protobuf: I don't see a sane way of how to send a different proposal without changing all the types in function definitions.
If you know a better way how to do that please let me know: for now, you can treat this PR as an expression of intention of what I want to happen, rather then complete implementation.
Misc
The described problem of chunk length instead of msg length is out of scope here. As far as I can see, it is the case for cometBFT as well, since the code looks the same: cometbft/p2p/conn/secret_connection.go#L221 -- however, I didn't try to break it, it is hard to do without an extended Proposal.
As for tmkms, even though I have a PR with workaround open, it is rather a hotfix for us and other validators.
The length problem should be addressed on a wire level, in tendermint-p2p: it doesn't do any aggregation and reads only one chunk now (see tendermint-rs/p2p/secret_connection.rs#L618). But it makes sense to fix only after it will be fixed in producers.
How to test?
Testing could be simple or could be tricky, depending on what tooling you already have. In order to see failure, you just need to plug Sei to tmkms over TCP. You need to have bigger transactions, so that TxKeys will become big enough to cause overflow. You can see observe it on testnet without any efforts, since oracle-feeder puts 30-50 txs almost in every block.
Locally, you can send a lot of transactions manually, something like
This pull request addresses the problem of buffer underflow problem, which happens when tmkms signer is plugged over tcp to Sei.
Original issue, opened in tmkms repo https://github.com/iqlusioninc/tmkms/issues/729
Even though the issue is opened in tmkms, I believe that fix should be applied in Sei.
What happens
This snippet illustrates the difference between Sei's
Proposal
and originalProposal
structure.As you can see, Tendermint protocol has a slight design flaw:
Proposal
is implicitly assumed to have a fixed length, but it doesn't declare it explicitly. Remote signer implementations assume that protobuf message, which is sent over the wire is always less than 1024 bytes.And this is not an issue for networks where
Proposal
holds the original structure, but not for Sei. When Sei accumulates a certain amount ofTxKeys
,Proposal
becomes bigger than 1024 bytes. When remote signer reads it, it reads 1024 bytes and tries to unmarshal it it fails withbuffer underflow
.Why not to fix it in remote signer?
Yes, at first glance, it feels that remote signer can just aggregate multiple chunks and unmarshal it once it gets the all the chunks. Why not do just do that?
The reason why it is not possible (at least, in efficient way) is hidden in another issue with old Tendermint (and Sei) tcp signer implementation.
When protobuf message is being send over the wire, it suppose to have a length encoded in the beginning of the message. And Sei puts the length, however, it is not the length of the whole message -- this is the length of a chunk.
See here: https://github.com/sei-protocol/sei-tendermint/blob/256802838ef226092062790b6fb4340d25243589/privval/secret_connection.go#L210-L219
So instead of sending length in a first chunk, it sends a length of each chunk in each chunk, which makes it impossible to aggregate it on wire level. It becomes possible on protobuf level, in a very ugly way -- see the implementation here https://github.com/iqlusioninc/tmkms/pull/903 .
In short, I've made a loop which aggregates chunks (with removed length) and tries to unmarshall it on each new chunk. Which means, if
Proposal
is e.g. 7kb it will try to unmarshal it 7 times: this solution is pretty fragile.What happens in this PR
While I was debugging the issue, I've noticed that even though tmkms gets Sei's
Proposal
, since it doesn't have Sei's.proto
files it uses original Tendermint.proto
to unmrashal theProposal
. So it signs the original proposal, or, "canonical", if you will, but not the extended Sei'sProposal
. However, it works fine on Sei Testnet without any issues. I dig further and found that local signer also uses stripped down Proposal, doingCanonicalizeProposal
before signing (see privval/file.go#L445).In any case, I figured that since Sei accepts short
Proposal
we can just remove additional fields. In this PR, I do it right before sending bytes because of the nature of protobuf: I don't see a sane way of how to send a different proposal without changing all the types in function definitions.If you know a better way how to do that please let me know: for now, you can treat this PR as an expression of intention of what I want to happen, rather then complete implementation.
Misc
The described problem of chunk length instead of msg length is out of scope here. As far as I can see, it is the case for cometBFT as well, since the code looks the same: cometbft/p2p/conn/secret_connection.go#L221 -- however, I didn't try to break it, it is hard to do without an extended
Proposal
.As for tmkms, even though I have a PR with workaround open, it is rather a hotfix for us and other validators.
The length problem should be addressed on a wire level, in tendermint-p2p: it doesn't do any aggregation and reads only one chunk now (see tendermint-rs/p2p/secret_connection.rs#L618). But it makes sense to fix only after it will be fixed in producers.
How to test?
Testing could be simple or could be tricky, depending on what tooling you already have. In order to see failure, you just need to plug Sei to tmkms over TCP. You need to have bigger transactions, so that TxKeys will become big enough to cause overflow. You can see observe it on testnet without any efforts, since oracle-feeder puts 30-50 txs almost in every block.
Locally, you can send a lot of transactions manually, something like
-- this works for me.
So you can check that without patch tmkms (and Sei) keep failing with buffer overflow, and with patch it works well.