Closed jferrant closed 1 month ago
@kantai I feel like this case could and is acceptable to happen. This test says it expects EXACTLY 3 burn blocks gap, but isn't it more accurate to assume AT LEAST a block gap of 3? Don't want to change this test if this is legit a bug that the test uncovers...
Hmm... I think there's actually a race in the test. I don't think its a bug in the miner or node implementations, but it would cause flakiness (beyond just the failed assertion: it'll cause later ones too).
I think the race is here:
boot_to_pre_epoch_3_boundary(
&naka_conf,
&blocks_processed,
&[stacker_sk],
&[sender_signer_sk],
&mut Some(&mut signers),
&mut btc_regtest_controller,
);
let burnchain = naka_conf.get_burnchain();
let sortdb = burnchain.open_sortition_db(true).unwrap();
let tip = SortitionDB::get_canonical_burn_chain_tip(sortdb.conn()).unwrap();
let block_height_before_mining = tip.block_height;
// Mine 3 Bitcoin blocks rapidly without waiting for Stacks blocks to be processed.
// These blocks won't be considered "mined" until the next_block_and_wait call.
for _i in 0..3 {
btc_regtest_controller.build_next_block(1);
let tip = SortitionDB::get_canonical_burn_chain_tip(sortdb.conn()).unwrap();
// Verify that the canonical burn chain tip hasn't advanced yet
assert_eq!(
tip.block_height,
btc_regtest_controller.get_headers_height() - 1
);
assert_eq!(tip.block_height, block_height_before_mining);
}
info!("Bootstrapped to Epoch-3.0 boundary, starting nakamoto miner");
//
// THIS IS THE RACING ZONE?
//
// Mine a new block and wait for it to be processed.
// This should update the canonical burn chain tip to include all 4 new blocks.
next_block_and_wait(&mut btc_regtest_controller, &blocks_processed);
let tip = SortitionDB::get_canonical_burn_chain_tip(sortdb.conn()).unwrap();
// Verify that the burn chain tip has advanced by 4 blocks
assert_eq!(
tip.block_height,
block_height_before_mining + 4,
"Burn chain tip should have advanced by 4 blocks"
);
I think what happens is that if next_block_and_wait
is invoked before the stacks node has booted into nakamoto, then the test assertions will all be off by one. In the test run you linked above, you see three lines right next to each other:
2024-10-24T16:04:57.9392982Z INFO [1729785897.939022] [testnet/stacks-node/src/tests/nakamoto_integrations.rs:1113] [tests::nakamoto_integrations::flash_blocks_on_epoch_3] Bootstrapped to one block before Epoch 3.0 boundary, Epoch 2.x miner should continue for one more block
2024-10-24T16:04:57.9454686Z INFO [1729785897.945206] [testnet/stacks-node/src/tests/nakamoto_integrations.rs:1752] [tests::nakamoto_integrations::flash_blocks_on_epoch_3] Bootstrapped to Epoch-3.0 boundary, starting nakamoto miner
2024-10-24T16:04:57.9459772Z INFO [1729785897.945219] [testnet/stacks-node/src/tests/neon_integrations.rs:666] [tests::nakamoto_integrations::flash_blocks_on_epoch_3] Issuing block at 1729785897, waiting for bump (29)
Which I suspect is the issue... but I'll need to poke around a little more at what the passing runs look like.
Ah, nevermind! I think the race is between the epoch2x miner broadcasting their last winning block and shutting down for the epoch transition. This probably is buggy behavior (it's probably better for the miner to wait for the switch to the epoch3 miner until they've broadcasted any winning blocks), but it's not the worst behavior either (perhaps the stacks node really should prioritize getting to the new mining path).
I think the test should be updated as you are proposing -- a larger gap than 3 in the blocks is "okay" as long as the nakamoto blocks eventually start being produced.
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
See https://github.com/stacks-network/stacks-core/actions/runs/11489470403/job/32019267705?pr=5261#logs