Bidon15 commented 2 years ago

Introduction

At Celestia we rely on the ability to generate and broadcast large transactions. Using testground, we emulated a scenario in which a set of validators in the network each generate and broadcast a transaction of up to 500kb, all of which should be included in the next block (e.g. 3 validators generating and broadcasting 500kb each results in a ~1.5mb next block).

Below we demonstrate how the same test scenario and environment produced different outcomes for tendermint v0.35.6(*) and for v0.34.20, the latest v0.34.x release, which we downgraded to.

(*) - We have added two ABCI++ methods for our needs, plus this change: https://github.com/celestiaorg/celestia-core/pull/793

Environment

Env Number	Tendermint version	Cosmos sdk version
1	v0.35.6	v0.46.0-beta2
2	v0.34.20	v0.46.0

Testground Network Configuration

Bandwidth: 100Mib and 256Mib

Latency: 0ms

Config.toml for each of the validators

Mempool

max_txs_bytes	1073741824
max_tx_bytes	1048576
size	5000

Consensus

timeout_propose	3 sec
timeout_prevote	1 sec
timeout_precommit	1 sec
timeout_commit	30 sec

RPC

timeout_broadcast_tx_commit	40 sec
max_body_bytes	1000000
max_header_bytes	1048576
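
As a rough sanity check, the sketch below compares the test's transaction size against the limits in the Mempool and RPC tables above. It rests on assumptions that are not stated in the report: that "500kb" means 500 * 1024 bytes, that the tx reaches the node base64-encoded inside the RPC request body, and that signatures and other tx overhead are negligible next to the payload. The helper names are made up for illustration.

```python
import math

# Limits copied from the Mempool and RPC tables above.
MAX_TX_BYTES = 1_048_576        # mempool max_tx_bytes
MAX_TXS_BYTES = 1_073_741_824   # mempool max_txs_bytes
MAX_BODY_BYTES = 1_000_000      # rpc max_body_bytes


def base64_len(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes raw bytes."""
    return 4 * math.ceil(n_bytes / 3)


def fits_limits(tx_bytes: int, n_txs: int) -> dict:
    """Check one tx size, sent by n_txs validators, against each limit."""
    return {
        "single tx within max_tx_bytes": tx_bytes <= MAX_TX_BYTES,
        "all txs within max_txs_bytes": tx_bytes * n_txs <= MAX_TXS_BYTES,
        # Assumption: the request body is dominated by the base64-encoded tx.
        "encoded tx within max_body_bytes": base64_len(tx_bytes) <= MAX_BODY_BYTES,
    }


# Assumption: "500kb" means 500 * 1024 bytes; 3 validators as in the first scenario.
for name, ok in fits_limits(500 * 1024, n_txs=3).items():
    print(f"{name}: {ok}")
```

Under these assumptions a 500kb tx stays below every configured limit, so the limits themselves would not be expected to reject it.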

Notes:

In v0.35.x the underscores in config keys are replaced with hyphens (e.g. timeout_propose becomes timeout-propose). I’ve still left the v0.34.x style in the tables
In v0.35.x the new mode config option has been set to “validator”
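
To make the key renaming in the first note concrete, here is a minimal sketch that maps the v0.34.x-style keys from the tables above to their v0.35.x spelling. It assumes the rename is a pure underscore-to-hyphen substitution for these keys; the function name `to_v035_key` is hypothetical, and the result should be checked against the config.toml generated by the v0.35.x binary.

```python
def to_v035_key(key: str) -> str:
    """Convert a v0.34.x snake_case config key to the v0.35.x hyphenated form."""
    return key.replace("_", "-")


# Keys taken from the Mempool, Consensus and RPC tables above.
V034_KEYS = [
    "max_txs_bytes",
    "max_tx_bytes",
    "size",
    "timeout_propose",
    "timeout_prevote",
    "timeout_precommit",
    "timeout_commit",
    "timeout_broadcast_tx_commit",
    "max_body_bytes",
    "max_header_bytes",
]

for key in V034_KEYS:
    print(f"{key} -> {to_v035_key(key)}")
```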

Test Scenario

Pre-Requisites:

Cobra commands are used for interacting with each of the validators(**)
Each of the validators starts from the genesis block
Connection is established between each of them
1 block is produced

(**) - this means that we are communicating with the BL of the node as if we were node operators using CLI commands

Steps:

Each validator generates a random 500kb tx
Each validator executes tx.GenerateOrBroadcastTxCLI with the 500kb data and:
1. flag -b block
Waits 5 minutes for the tx to be included in the next block
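
The waiting step can be pictured as a bounded polling loop. The sketch below is not the code used in the test plan; `wait_for_inclusion` and the `is_included` callback are hypothetical names, and the 1-second polling interval is an assumption. Only the 5-minute bound comes from the steps above.

```python
import time
from typing import Callable


def wait_for_inclusion(
    is_included: Callable[[], bool],              # hypothetical check, e.g. a lookup by tx hash
    timeout_s: float = 300.0,                     # the 5-minute wait from the steps above
    poll_interval_s: float = 1.0,                 # assumed polling cadence
    clock: Callable[[], float] = time.monotonic,  # injectable for testing
    sleep: Callable[[float], None] = time.sleep,  # injectable for testing
) -> bool:
    """Return True once is_included() succeeds, False if the deadline passes first."""
    deadline = clock() + timeout_s
    while True:
        if is_included():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

A caller would treat a False result as the "timed out waiting for tx to be included in a block" case, which keeps the failure explicit instead of hanging indefinitely.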

Expected Results:

All txs get included in the next block
Each of the validators fails gracefully with an error like: timed out waiting for tx to be included in a block

Actual Results

v0.35.x

Number of Validators	TX size	Next block size
3	500kb	~1.5Mb
20	200kb	~2Mb

The whole chain gets stuck in rounds for 5 minutes, without either:

An error message like a timeout waiting for the tx to be included in a block
Including all txs in the next block within the 30-40 second consensus.timeout_commit window

v0.34.x

Number of Validators	TX size	Next block size
3	500kb	~1.5Mb
20	200kb	~2Mb

After a successful downgrade to the latest version of v0.34.x, we observed:

With 100Mib bandwidth for each of the validators, error messages are provided and the chain continues to produce more blocks
With 256Mib bandwidth → 1.5Mb and 2Mb blocks are stably produced within a 30-40 second time window

More Info:

Logs from testground and each of the validators can be found in this issue: https://github.com/celestiaorg/celestia-app/issues/563

In addition, we are continuing the investigation on our side to understand whether the root cause might be in our fork, and to provide more data: https://github.com/celestiaorg/celestia-core/issues/814

thanethomson commented 2 years ago

Given #9155, is this issue still relevant?

sergio-mena commented 1 year ago

@Bidon15 Further to @thanethomson's comment, is the bad behaviour seen in v0.37.x (or in a later release of v0.34.x)? Note that all releases in the v0.35.x branch have been retracted as @thanethomson is pointing out. If this bad behaviour is only seen in v0.35.x we will close this issue. Feel free to re-open in the future if you see it happening in v0.34.x, v0.37.x or any future release.

Bidon15 commented 1 year ago

Hey @sergio-mena

On v0.34.x the issue is not reproducible. Can't say for 0.37.x

thanethomson commented 1 year ago

Cool, thanks @Bidon15. We'll close this issue for now then. If you see this happen again, please feel free to reopen it.

tendermint / tendermint

Broadcasting 200~500kb tx on v0.35.6 vs 0.34.20 #9220