sei-protocol / sei-tendermint

Tendermint fork with ABCI++ support, bug fixes, and custom features
Other
39 stars 36 forks source link

remove extended Proposal payload before sending to remote signer (fixes buffer underflow) #240

Open zarkone opened 4 months ago

zarkone commented 4 months ago

This pull request addresses the problem of buffer underflow problem, which happens when tmkms signer is plugged over tcp to Sei.

Original issue, opened in tmkms repo https://github.com/iqlusioninc/tmkms/issues/729

Even though the issue is opened in tmkms, I believe that fix should be applied in Sei.

What happens

This snippet illustrates the difference between Sei's Proposal and original Proposal structure.

// tendermint/cometbft proposal:
type Proposal struct {
  Type      SignedMsgType
  Height    int64
  Round     int32
  PolRound  int32
  BlockID   BlockID
  Timestamp time.Time
  Signature []byte
}
// vs sei-tendermint proposal
type Proposal struct {
  Type            SignedMsgType
  Height          int64
  Round           int32
  PolRound        int32
  BlockID         BlockID
  Timestamp       time.Time
  Signature       []byte

  // this is a list, and can be very long...
  TxKeys          []*TxKey
  Evidence        *EvidenceList
  LastCommit      *Commit
  Header          Header
  ProposerAddress []byte
}

As you can see, Tendermint protocol has a slight design flaw: Proposal is implicitly assumed to have a fixed length, but it doesn't declare it explicitly. Remote signer implementations assume that protobuf message, which is sent over the wire is always less than 1024 bytes.

And this is not an issue for networks where Proposal holds the original structure, but not for Sei. When Sei accumulates a certain amount of TxKeys, Proposal becomes bigger than 1024 bytes. When remote signer reads it, it reads 1024 bytes and tries to unmarshal it it fails with buffer underflow.

Why not to fix it in remote signer?

Yes, at first glance, it feels that remote signer can just aggregate multiple chunks and unmarshal it once it gets the all the chunks. Why not do just do that?

The reason why it is not possible (at least, in efficient way) is hidden in another issue with old Tendermint (and Sei) tcp signer implementation.

When protobuf message is being send over the wire, it suppose to have a length encoded in the beginning of the message. And Sei puts the length, however, it is not the length of the whole message -- this is the length of a chunk.

See here: https://github.com/sei-protocol/sei-tendermint/blob/256802838ef226092062790b6fb4340d25243589/privval/secret_connection.go#L210-L219

So instead of sending length in a first chunk, it sends a length of each chunk in each chunk, which makes it impossible to aggregate it on wire level. It becomes possible on protobuf level, in a very ugly way -- see the implementation here https://github.com/iqlusioninc/tmkms/pull/903 .

In short, I've made a loop which aggregates chunks (with removed length) and tries to unmarshall it on each new chunk. Which means, if Proposal is e.g. 7kb it will try to unmarshal it 7 times: this solution is pretty fragile.

What happens in this PR

While I was debugging the issue, I've noticed that even though tmkms gets Sei's Proposal, since it doesn't have Sei's .proto files it uses original Tendermint .proto to unmrashal the Proposal. So it signs the original proposal, or, "canonical", if you will, but not the extended Sei's Proposal. However, it works fine on Sei Testnet without any issues. I dig further and found that local signer also uses stripped down Proposal, doing CanonicalizeProposal before signing (see privval/file.go#L445).

[!NOTE] I don't understand why only part of proposal is being signed -- maybe you could point me where to read about this, how does this work?

In any case, I figured that since Sei accepts short Proposal we can just remove additional fields. In this PR, I do it right before sending bytes because of the nature of protobuf: I don't see a sane way of how to send a different proposal without changing all the types in function definitions.

If you know a better way how to do that please let me know: for now, you can treat this PR as an expression of intention of what I want to happen, rather then complete implementation.

Misc

The described problem of chunk length instead of msg length is out of scope here. As far as I can see, it is the case for cometBFT as well, since the code looks the same: cometbft/p2p/conn/secret_connection.go#L221 -- however, I didn't try to break it, it is hard to do without an extended Proposal.

As for tmkms, even though I have a PR with workaround open, it is rather a hotfix for us and other validators.

The length problem should be addressed on a wire level, in tendermint-p2p: it doesn't do any aggregation and reads only one chunk now (see tendermint-rs/p2p/secret_connection.rs#L618). But it makes sense to fix only after it will be fixed in producers.

How to test?

Testing could be simple or could be tricky, depending on what tooling you already have. In order to see failure, you just need to plug Sei to tmkms over TCP. You need to have bigger transactions, so that TxKeys will become big enough to cause overflow. You can see observe it on testnet without any efforts, since oracle-feeder puts 30-50 txs almost in every block.

Locally, you can send a lot of transactions manually, something like

    ~/go/bin/seid tx bank send $admin_0 $admin_1 1usei --fees 4000usei -y &
    ~/go/bin/seid tx bank send $admin_1 $admin_2 1usei --fees 4000usei -y &
    ~/go/bin/seid tx bank send $admin_2 $admin_3 1usei --fees 4000usei -y &
    ~/go/bin/seid tx bank send $admin_3 $admin_4 1usei --fees 4000usei -y &
    # etc...

-- this works for me.

So you can check that without patch tmkms (and Sei) keep failing with buffer overflow, and with patch it works well.