iqlusioninc / tmkms

Tendermint KMS: Key Management System for Tendermint Validators
https://tendermint.com/
Apache License 2.0
323 stars 117 forks source link

Protobuf: buffer underflow #729

Open activenodes opened 1 year ago

activenodes commented 1 year ago

Chains with:

Every ~40 sigratures (softsign due to low time blocks) connection go down with this error:

Let me know if you need any more details.

Thanks @tony-iqlusion

tony-iqlusion commented 1 year ago

What is on the other side of the privval connection? You're the first to report an error like this in many years of TMKMS existing, so it seems like the other side is the likely culprit.

activenodes commented 1 year ago

It's a testnet (Sei Network). Other 3/4 guys can confirm the same issue (like @mkaczanowski)

imperator-co commented 1 year ago

Hi, I can confirm that we have the same problem, afaik "buffer underflow" indicates an attempt to read a message that is larger than the amount of data available in the buffer (on tmkms side)

tony-iqlusion commented 1 year ago

It indicates the data is misframed. It seems very unlikely this is a TMKMS bug, and something wrong with the specific app you are trying to run.

activenodes commented 1 year ago

It indicates the data is misframed. It seems very unlikely this is a TMKMS bug, and something wrong with the specific app you are trying to run.

I'm not a Horcrux user, but it works fine in this chain too. What can we provide you to analyze the cause?

tony-iqlusion commented 1 year ago

I need a much better description of the problem than this, and instructions on how to reproduce it. Ideally it would be nice to have a dump of the incoming Protobuf which TMKMS can't parse.

What chain are you running?

How did you install it?

How are you running it?

Which message is causing the problem? Can you link to the Protobuf Schema for it?

qezz commented 1 year ago

Expanding a bit on this issue, we, too, faced this issue on Sei testnet (atlantic-2).

From what I understand, Sei has its own fork of tendermint - https://github.com/sei-protocol/sei-tendermint. So I believe the schema is https://github.com/sei-protocol/sei-tendermint/tree/main/proto, but I'm not 100% sure.

As for us, we were running tmkms as usual, as a separate process, and pointing seid (https://github.com/sei-protocol/sei-chain) to it via config parameter (which sets unix socket). The mode was softsign.

tony-iqlusion commented 1 year ago

I'm curious if they're expecting to use gRPC for the privval connection: https://github.com/sei-protocol/sei-tendermint/tree/main/privval

We don't support that yet: #73. We've been waiting for it to be added to upstream Tendermint, since it was removed in v0.35.

What version of Horcrux is it working with?

activenodes commented 1 year ago

My idea is to prepare you an environment with everything configured (chain/validator/tmkms) and give you access. Can you be interested?

tony-iqlusion commented 1 year ago

That would be very helpful, thank you

activenodes commented 1 year ago

That would be very helpful, thank you

I'm getting that server ready for you. Where I can send you logins (SSH)? If you want to contact me on Discord (Simo | Active Nodes#3233) or give me an email. Thanks!

tony-iqlusion commented 1 year ago

I will attempt to contact you via Discord

activenodes commented 1 year ago

I will attempt to contact you via Discord

Changed permissions... Could you try again with Discord (Simo | Active Nodes#3233)? I'm not using Signal... Thanks

tony-iqlusion commented 1 year ago

Request sent

IbrarMakaveli commented 10 months ago

Hey @tony-iqlusion , Any update on this issue ? Let me know how I can help to deal with, thanks

activenodes commented 10 months ago

Note. regarding previous tests and SSH access, now devnet is suspended. So we have to use testnet or mainnet chain

tony-iqlusion commented 10 months ago

I haven't had time to look, sorry. Note that there's a Hub upgrade today and a Stride upgrade tomorrow so I don't have much time this week either.

What would be very, very helpful is if someone could log the message that TMKMS can't parse for inspection.

activenodes commented 10 months ago

What would be very, very helpful is if someone could log the message that TMKMS can't parse for inspection.

Should we increase the debug level? Or how to collect as many logs as possible?

Thanks

tony-iqlusion commented 10 months ago

You can try adding a dbg!(&msg) here: https://github.com/iqlusioninc/tmkms/blob/b6ce617/src/rpc.rs#L36

Out of curiosity, do you have protocol_version = "0.34" in the config?

activenodes commented 10 months ago

You can try adding a dbg!(&msg) here: https://github.com/iqlusioninc/tmkms/blob/b6ce617/src/rpc.rs#L36

Thanks, I will try later in testnet

Out of curiosity, do you have protocol_version = "0.34" in the config?

Yes protocol_version = "v0.34" (with the "v") ... I think I have all the chains like this. is it right?

activenodes commented 10 months ago

Here: https://pastebin.com/4wDMbwP3

ERROR tmkms::client: [atlantic-2@tcp://127.0.0.1:26658] protocol error: malformed message packet: failed to decode Protobuf message: buffer underflow`

tony-iqlusion commented 10 months ago

Thanks, I believe this should be the message which is causing the issue encoded in hex:

ba2b2ab72b0aa82b08201091a6cc0d20ffffffffffffffffff012a480a20a33e4740e0b974cf1e010e6794ff620487e52ad63718d77e46ba3a62e1adc406122408011220cf19b022e389f5cd701b17b800caf505171bb98ae967349da72513b2138c1820320c08f1fbf8a60610a7bab0920242220a20bb985439aebaf357fad861849528d82e763031abd79fcc87c3a305bc1485388442220a2039825ada697b2d6959d75fa711636c7627c450c87ed8fe1aeaba4f20e28d435042220a20d9c22318e69463191372f92f66e5bdbf859c9b0e52020753785df5f53ea4bacb42220a20fc7ae857c36f4ea2f56285281e283ccdd600f4934183574bf5fd0ebb4c278cf942220a20ab49231edaa3fec0921d42b85cf90ba61d5251144980fa16f62a842a201755ad42220a20a7a17a651bfc02228490723a752a96e9f8ac3d64d261377726e2cdb0ccc646e342220a203178335d0ebe7e0ce2905df5b23aa05b9295e09caf80ba45472d2f61ccc9fb2f42220a20887cddac11b90e7a71888fefec183600dd79b94e9c8c43fb24d66038970eb91242220a202bfaf4b02410e8ba736e0443d63543e855cb4fb1eb99522a66d764802392700e42220a209a3e9332f29019d03097bcf290f377939b6bf8ee6a9d1952920cb9433515ca9542220a20dd148712e5a43ca35a0ed3560d6e753598da4e2ff71bd16d9dd553eddc4adcde42220a207e1df69adb7f32259154e375e20e98cf0a1c95b9f78ef1349df3ab539ab615be42220a2070a1e82287b3775ab287f7fe4e803c9bce2b03cbb4d0dba83b86104a114cb6e542220a207783770f1b6f8bdb347a8a8f113ecc67842783b4afbebb608aced8a51bb29bce42220a2008f7c95794aec249d2eba33161199c28574bc77f13accff28e5d70c23c3ebbd642220a20b53a7d779ebd7d0778aa9bb66e4565fe80903cfac23268ec016f2cf00818be9042220a20fa2f7fe3daa69c0cdd44058a44734bb6d44fce3e0c4db585e77a5190869137a142220a20f26cd81b78fa41bd46df409302c976f3451b3f0065a0e1e3e8dcb9947fd7d07c42220a20d002698eebbfd05a20492eaae6ed769068fac6cf0f3a55b6869ae8cacf5b92bf42220a20c1a848f44dfc893511262c49dd48072eb50472d49398c44b27044a028c76113b42220a2089c0affd2d4c2151cc654eebec6f0d29302ed611664965616e534e3b854eb43f42220a2084df4847e6d5998ac091ff0d6905400c3c7264855277e88cf06a184b81d2676b42220a2010e5cb10974a839fdd1584f1efc256d20fbf9d710c40ac4b968e23cb4927125b42220a20b12308f12fac8ecdc13b84feba49bf617ae0fb084b8d5fa8e8c975ee7f38c7954a0052ad200890a6cc0d1a480a203dee68aaa35b98433eba6c0416ec13403e50e3a2afb6ed1beaca515bbd7a98fc

I'll try to take a look and see if I can figure out what it's supposed to be

tony-iqlusion commented 10 months ago

If I plug that string into: https://protobuf-decoder.netlify.app/

...it doesn't appear to have a valid field number, which would need to match one of these field types (1 - 8):

https://docs.rs/tendermint-proto/latest/src/tendermint_proto/prost/v0_38/tendermint.privval.rs.html#79-96

...but the decoder says if that is a protobuf, its field number is 695, which would be a very unusual field number for a proto (they start at 1 and generally don't much higher than 30 typically)

So I have no idea what that message is supposed to be or how/why Horcrux would support it because it doesn't seem to be encoded correctly as a proto as far as I can tell.

It's possible Horcrux simply ignores messages it doesn't understand, whereas TMKMS considers it an error.

tony-iqlusion commented 10 months ago

Does anyone happen to know what that message might be, or know where to ask about it?

IbrarMakaveli commented 10 months ago

I don't really know too, if you could ignore this message, and deploy the code on a new branch, I could test and compile the tmkms on a machine and do the test on the testnet. We could ask Sei dev, but I'm not sure they know what it means either.

mkaczanowski commented 9 months ago

I found the issue and our (Chorus One) testnet node runs with the patched / hacked tmkms version and everything seems to be fine.

TL;DR

The issue is that the serialized block proposal (SIGNED_MSG_TYPE_PROPOSAL) for weights from 4000 bytes to 6000 (eyeballed logs with additional log messages). The buffer defined below is 1024 bytes only: https://github.com/iqlusioninc/tmkms/blob/main/src/rpc.rs#L197

resulting in message being always cut, thus malformed.

Proof of concept fix

I simply bumped the buffer size to:

let mut buf = vec![0; DATA_MAX_SIZE * 10];

Proper fix

Now I wonder whether this limit is actually correct in this context: https://docs.rs/tendermint-p2p/latest/tendermint_p2p/secret_connection/constant.DATA_MAX_SIZE.html

because this is not strictly p2p message (between peers in tendermint network), but rather a node <-> signer relationship.

Now the question is how to fix it, shall I make the buffer len configurable or hardcode higher value, or does anyone has a better idea?

another option to fix it, could be simply using dynamic allocation via Vec::new(): https://doc.rust-lang.org/std/io/trait.Read.html#method.read_to_end

@tony-iqlusion could you share your thoughts here?

tony-iqlusion commented 9 months ago

@mkaczanowski sounds like a bug/issue in the tendermint-p2p crate. I would suggest opening an upstream issue on tendermint-rs

mkaczanowski commented 9 months ago

issue created: https://github.com/informalsystems/tendermint-rs/issues/1356

mkaczanowski commented 9 months ago

funny enough this value comes from tmkms: https://github.com/informalsystems/tendermint-rs/pull/696/files

(git blame)

tony-iqlusion commented 9 months ago

Yeah, all of tendermint-rs was originally extracted from TMKMS.

I don't know the origins of this particular constant but regardless tendermint-rs is the right place for a fix.

Thanks for filing the issue.

IbrarMakaveli commented 8 months ago

@mkaczanowski @tony-iqlusion For your information, I've modified the buffer size, and I still get the error. Am I the only one, or has anyone else been able to test it ?

tony-iqlusion commented 8 months ago

FWIW I just merged #775 which also made the buffer size larger.

If increasing it doesn't help, I'm not sure what the problem is.

mkaczanowski commented 8 months ago

thanks, it should help. We run our sei validator with patched (buffer size) tmkms for weeks and there is no issues

tony-iqlusion commented 8 months ago

As of #775 the buffer size is now 64kB. It was released as v0.13.0-pre.1 in https://github.com/iqlusioninc/tmkms/pull/779.

Going to go ahead and close this. Please report back if there are problems.

tony-iqlusion commented 8 months ago

I've cut a v0.13.0-rc.0 release of TMKMS which we're now running in production at @iqlusioninc.

It would be great if someone can confirm it addresses this issue before I cut a final release.

qezz commented 8 months ago

Hey Tony, thanks

We will try the new release on Sei Testnet shortly

qezz commented 8 months ago

We were running TMKMS v0.13.0-rc.0 for ~2 hours on Sei Testnet, and we don't see any issues with it so far

On the graphs, the marker is when the version has been switched

image

We'll keep running the node with the new version. If any issues arise, will let you know.

P.S. Also, if I read Cargo.lock correctly, it has yubihsm 0.42.1 with the remote connection fix (https://github.com/iqlusioninc/yubihsm.rs/pull/464)

tmkms (tags/v0.13.0-rc.0?) $ cat Cargo.lock | rg "name.+yubi" -A 4 -B 1
[[package]]
name = "yubihsm"
version = "0.42.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "467a4c054be41ff657a6823246b0194cd727fadc3c539b265d7bc125ac6d4884"
dependencies = [

This is awesome! We haven't checked the new version with the remote signers yet, but will do soon (late this week, or next week)

Thank you @tony-iqlusion!

tony-iqlusion commented 8 months ago

Awesome, thanks!

I will wrap up a few more small issues and cut a final release in the next few days.

IbrarMakaveli commented 8 months ago

@tony-iqlusion I thing, I still have the issue on our testnet because we use softsign, any idea maybe our config is wrong, below our config for testnet Sei :

### Network

[[chain]]
id = "atlantic-2"
key_format = { type = "bech32", account_key_prefix = "seipub", consensus_key_prefix = "seivalconspub" }
state_file = "sei-testnet-consensus.json"

### Software-based Signer Configuration

[[providers.softsign]]
chain_ids = ["atlantic-2"]
key_type = "consensus"
path = "sei-testnet-consensus.key"

## Validator Configuration

[[validator]]
chain_id = "atlantic-2"
addr = "tcp://IP:PORT"
secret_key = "sei-testnet-consensus.key"
protocol_version = "v0.34"
reconnect = true
tony-iqlusion commented 8 months ago

@IbrarMakaveli can you paste the complete error?

IbrarMakaveli commented 8 months ago

@tony-iqlusion Here below the error : protocol error: malformed message packet: failed to decode Protobuf message: buffer underflow

tony-iqlusion commented 8 months ago

@IbrarMakaveli can you confirm you see the following on startup?

tmkms 0.13.0-rc.0 starting up...

I'll try to add some better debug logging so we can appropriately gauge how large these messages can be.

IbrarMakaveli commented 8 months ago

Yes I have this message on my logs on startup : Oct 16 23:33:12 tmkms bash[123858]: 2023-10-16T23:33:12.036546Z INFO tmkms::commands::start: tmkms 0.13.0-rc.0 starting up...

Thanks you, let me know if I can help

qezz commented 8 months ago

Our config looks very similar, except we don't set validator.secret_key, and the validator.addr is local socket

[[validator]]
addr = "unix:///path/to/sei-testnet.sock"
chain_id = "atlantic-2"
protocol_version = "v0.34"
reconnect = true
tony-iqlusion commented 8 months ago

Unfortunately the tendermint-p2p crate handles the message framing so it's not easy to add debug logging and adjust the buffer size automatically, however I tried bumping it to 256kB: #792

IbrarMakaveli commented 8 months ago

To give you some info, I have added print on buffer at read_msg function :

fn read_msg(conn: &mut impl Read) -> Result<Vec<u8>, Error> {
    let mut buf = vec![0; DATA_MAX_SIZE];
    let buf_read = conn.read(&mut buf)?;
    println!("Number bytes read: {}", buf_read); 
    buf.truncate(buf_read);
    Ok(buf)
}

The error (buffer underflow) happen when the number bytes read is more than 1024, I still thing the buffer is limit to 1024

tony-iqlusion commented 8 months ago

The constant is defined here: https://github.com/iqlusioninc/tmkms/blob/main/src/rpc.rs#L14

You can always try changing DATA_MAX_SIZE to a numeric literal of your choosing to eliminate it as a culprit:

let mut buf = vec![0; 1048576];

The error (buffer underflow) happen when the number bytes read is more than 1024, I still thing the buffer is limit to 1024

What numbers are you seeing here? That's what I really need to know.

tony-iqlusion commented 8 months ago

I've released TMKMS v0.13.0!

@IbrarMakaveli can you try it out and see if it resolves the issue?

IbrarMakaveli commented 7 months ago

@tony-iqlusion I've just tried again with this tag, I still have this error, has anyone been able to test with SoftSign ? Is there a Linux dependency ? A parameter to change ?

qezz commented 5 months ago

Getting back to this issue. Today I discovered that

more specifically,

2024-01-16T18:18:31.875640Z DEBUG tmkms::session: [atlantic-2@tcp://...:26759] sending response: PublicKey(PubKeyResponse { pub_key: Some(PublicKey { sum: Some(Ed25519([ ... < cut >
2024-01-16T18:18:32.056127Z ERROR tmkms::client: [atlantic-2@tcp://...:26759] protocol error: malformed message packet: failed to decode Protobuf message: buffer underflow
2024-01-16T18:18:33.056223Z DEBUG tmkms::session: [atlantic-2@tcp://...:26759] connecting to validator...
2024-01-16T18:18:33.056288Z  INFO tmkms::connection::tcp: KMS node ID: <nodeid>
2024-01-16T18:18:33.056706Z  INFO tmkms::session: [atlantic-2@tcp://...:26759] connected to validator successfully
2024-01-16T18:18:33.056712Z  WARN tmkms::session: [atlantic-2@tcp://...:26759]: unverified validator peer ID! (<id>)
2024-01-16T18:18:33.156651Z ERROR tmkms::client: [atlantic-2@tcp://...:26759] protocol error: malformed message packet: failed to decode Protobuf message: buffer underflow
2024-01-16T18:18:34.156745Z DEBUG tmkms::session: [atlantic-2@tcp://...:26759] connecting to validator...
2024-01-16T18:18:34.156817Z  INFO tmkms::connection::tcp: KMS node ID: <nodeid>

then it manages to connect to validator for several blocks, and then fails again