stacks-network / stacks-core

The Stacks blockchain implementation
https://docs.stacks.co
GNU General Public License v3.0

☂️ Tenure extends for increasing tenure budgets #5434

Closed kantai closed 4 days ago

kantai commented 2 weeks ago

The goal of the performance improvements (#5430, #5431, #5432) is to make the stacks-node more performant right now and, in particular, to free up CPU time so that even when the node spends more time on block processing, it can stay in sync and remain responsive on its network interfaces (this is important for stackerdb messages to propagate, signers and miners to stay in sync, etc.).

Once these improvements are in place, the tenure budget can be safely increased. This can (and should) be done without consensus changes: the miner simply issues a tenure extend and the signer set approves it.

The basic idea is to have the miner thread time the length of its tenure, along with a configuration setting that tells it when it should try to perform a tenure extend. The signer set will similarly hold timers (measured from when they last signed off on a block proposal which spent some amount of block budget: the timer isn’t reset when they sign a block with just transfers, but it is reset if they process a block with contract-calls, e.g.), and when the timer expires, they would allow a tenure extend.

This would still allow the “spikiness” in budget consumption that we have today (the spikiness issue is somewhat orthogonal, and would be treated by #5433), but the budget would itself be higher, and the timing of extends would enforce some metering of the spikes (so that contract call budgets would be reset at, e.g., every 5 minutes rather than every bitcoin block, or whatever the timeout is set to). During initial rollout, this timeout will need to be set conservatively, but could be made more aggressive through configuration changes in miners and signers.

hstove commented 2 weeks ago

> The signer set will similarly hold timers (measured from when they last signed off on a block proposal which spent some amount of block budget: the timer isn’t reset when they sign a block with just transfers, but it is reset if they process a block with contract-calls, e.g.), and when the timer expires, they would allow a tenure extend.

Can you expand on this - specifically, why would the timer only reset when a block with contract calls is processed? Wouldn't that encourage spikiness? My assumption would be that this timer resets when a block includes a TenureExtend.

kantai commented 2 weeks ago

> Can you expand on this - specifically, why would the timer only reset when a block with contract calls is processed? Wouldn't that encourage spikiness? My assumption would be that this timer resets when a block includes a TenureExtend.

Yes, I think you’re right, the timer should start when a tenure begins (or the extension begins: basically the timer should start whenever there's a tenure change payload).

However, the signers must also do some metering according to the block evaluation time. Right now, the tenure budget is often expended with just a few seconds of evaluation, but the cost tracker is an imperfect (and pessimistic) estimator of runtime. Things like cache locality in the MARF definitely impact block evaluation time, so the signers should take that into account. This is simple enough for them to do naively by just tracking the wall clock time of processing the block proposals.

I think the way to do this is to "bump" the budget timer by the amount of time they spend processing proposals during the tenure: so if a proposal takes them 1 minute to evaluate, they bump the budget timer by 1 minute (so if they would have allowed the budget to be reset at time t, they instead will allow the reset at time t + 60s).
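The "bump" idea can be illustrated with a minimal Python sketch. The class and method names (`BudgetTimer`, `bump`, `allows_extend`) are hypothetical and do not come from stacks-core; times are plain seconds passed in by the caller so the example stays deterministic.

```python
from dataclasses import dataclass


@dataclass
class BudgetTimer:
    """Hypothetical signer-side timer; every name here is illustrative."""

    # Earliest time (in seconds) at which a tenure extend would be accepted.
    extend_allowed_at: float

    @classmethod
    def start(cls, tenure_change_time: float, timeout_secs: float) -> "BudgetTimer":
        # The timer starts whenever a tenure change payload is seen.
        return cls(extend_allowed_at=tenure_change_time + timeout_secs)

    def bump(self, processing_secs: float) -> None:
        # Push the deadline back by the wall-clock time spent on a proposal.
        self.extend_allowed_at += processing_secs

    def allows_extend(self, now: float) -> bool:
        return now >= self.extend_allowed_at


# Tenure starts at t = 0 with an assumed 5 minute timeout.
timer = BudgetTimer.start(tenure_change_time=0.0, timeout_secs=300.0)
# A proposal took one minute to evaluate, so the reset moves from t to t + 60s.
timer.bump(60.0)
```

With these assumed numbers the signer would refuse an extend at 300 seconds and accept one from 360 seconds onward.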

jferrant commented 1 week ago

Could you have signers track based on the last time they saw a tenure change payload rather than a contract call? I ask because the tenure change payload is always guaranteed to be the very first transaction in the block, so it might be easier to track that than the last block with a contract call.

obycode commented 1 week ago

I had not thought about it this way, but I like the idea of factoring in the actual block processing time. To put it another way, we can think of it as the signer saying, "once I have seen X minutes of downtime, I will allow a tenure extend." So when the tenure starts or extends, I start a countdown at X minutes. When I get a block proposal, I pause the countdown, process the block, send my signature, and resume the countdown. When my countdown reaches 0, I will allow a tenure extension in the next block I process.
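The pause/resume countdown described above could be sketched roughly as follows. This is only an illustration in Python: `IdleCountdown` and its methods are made-up names, and the caller supplies the clock readings.

```python
class IdleCountdown:
    """Hypothetical countdown that only runs while the signer is idle."""

    def __init__(self, idle_secs: float, now: float) -> None:
        self.remaining = idle_secs  # idle seconds still required
        self.resumed_at = now       # None while paused

    def pause(self, now: float) -> None:
        # Called when a block proposal arrives: stop counting idle time.
        if self.resumed_at is not None:
            self.remaining -= now - self.resumed_at
            self.resumed_at = None

    def resume(self, now: float) -> None:
        # Called after the signature is sent: idle time counts again.
        if self.resumed_at is None:
            self.resumed_at = now

    def expired(self, now: float) -> bool:
        running = 0.0 if self.resumed_at is None else now - self.resumed_at
        return self.remaining - running <= 0


countdown = IdleCountdown(idle_secs=600.0, now=0.0)  # X = 10 minutes, assumed
countdown.pause(now=100.0)    # proposal arrives after 100s of idle time
countdown.resume(now=160.0)   # processing and signing took 60s
```

In this run the signer needs 600 seconds of idle time, so the 60 seconds spent processing push the point at which it allows an extend from 600 to 660.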

One simple way to synchronize between miners and signers could be for the signer to include a flag in its block signature message indicating to the miner that it is ready for a tenure extend. When the miner sees that enough signers have set this flag, it can go ahead and issue one in its next block.

hstove commented 1 week ago

Assuming we do some measurement of wall-clock time on block processing, I just want to note that we should have the node track this and return it to the signer in the HTTP response to the block proposal. If the signer tries to track this, we can end up with too much variance from other reasons for latency.

aldur commented 1 week ago

@obycode, to enlist the help of @hstove and @jferrant can you split this into smaller issues that they can handle in parallel?

obycode commented 1 week ago

EDIT (by @aldur): See below for an updated design; this is left for historical reference.


Here are my initial thoughts for the design:

Overview

The task here is to allow a miner to extend its tenure based on time since the last tenure extension. The signers decide when a miner is allowed to extend, so we need some mechanism to communicate this between the miner and the signers. I propose adding a field, extend_countdown, to the BlockResponse message that a signer sends to a miner as a result of a block proposal. The value of extend_countdown is the number of seconds of idle time that must pass before the signer will allow a tenure extension. The miner can track these countdowns from all signers and decide when to extend its tenure based on when it thinks it can get 70% of the signers to approve it.

Signer Details

The signer configuration will specify a tenure extend time period. The first version of this to go live on mainnet should start off with this value defaulting to 10 minutes, to ensure minimal impact on the network. As we validate that these tenure extends do not cause any problems, we can spread the word to signers to iteratively lower this number.

When a new burn block arrives, record the current time, idle_start, and initialize an idle_countdown counter to the configured duration. When a block proposal arrives, compute the time passed since the idle_start and subtract it from idle_countdown, then begin evaluation of the proposal. Append the idle_countdown into the BlockResponse for this proposal. Once the response is sent, record the current time again to idle_start. Repeat.

If a block proposal arrives that contains a TenureExtend transaction with cause IdleTimeExtension, check that the current idle_countdown is less than or equal to 0 (letting this value go negative is useful feedback to the miner). If so, process the block as usual; otherwise, reject the block. This rejection would have a new reason code.

Miner Details

The miner needs to now keep track of the signers’ current idle time countdowns and decide when it can refresh its budget with an IdleTimeExtension. The sign coordinator can keep track of the signer countdowns as it receives BlockResponses and report back to the miner. Since the sign coordinator returns as soon as 70% approve the current block, we may need to do something different to handle tracking the countdowns from responses that come in after this threshold is reached. After each round of signing, the miner should record its estimated time to extend. It can compute this by ordering the countdown responses in ascending order, and selecting a time at which > 70% will have reached 0, then adding that to the current time and saving the value. This calculation is needed in the case where the miner is not able to mine any blocks (either because there is no budget or there are no transactions in the mempool), so it will not get any new countdown values from the signers.
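As a rough illustration of the miner's estimate, the following Python sketch sorts the reported countdowns and picks the one at which strictly more than 70% of signers will have reached 0. The function name and parameters are hypothetical. It assumes one equally weighted countdown per signer and that all countdowns are measured as of `now`; neither assumption is stated in the design above.

```python
from typing import List, Optional


def estimated_extend_time(
    countdowns: List[float],
    total_signers: int,
    now: float,
    threshold_pct: int = 70,
) -> Optional[float]:
    """Return the time at which > threshold_pct of signers should allow an extend.

    Returns None if too few signers have reported a countdown.
    """
    # Smallest signer count that is strictly more than threshold_pct percent.
    needed = total_signers * threshold_pct // 100 + 1
    if len(countdowns) < needed:
        return None
    ordered = sorted(countdowns)
    # Negative countdowns mean the signer is already willing, so wait 0 seconds.
    wait = max(0.0, ordered[needed - 1])
    return now + wait


reported = [30.0, -5.0, 120.0, 60.0, 10.0, 90.0, 45.0, 15.0, 200.0, 75.0]
estimate = estimated_extend_time(reported, total_signers=10, now=1000.0)
```

With 10 signers, 8 are needed to exceed 70%, and the eighth-smallest countdown here is 90 seconds, so the estimate is 1090.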

Testing

Designing good integration tests for this new behavior is important. We will need to test several different scenarios:

obycode commented 6 days ago

Updated design after discussion with @jferrant and @hstove:

Overview

The task here is to allow a miner to extend its tenure based on time since the last tenure extension. The signers decide when a miner is allowed to extend, so we need some mechanism to communicate this between the miner and the signers. I propose adding a field, extend_timestamp, to the BlockResponse message that a signer sends to a miner as a result of a block proposal. The value of extend_timestamp is the wall clock time after which the signer will allow a tenure extension. The miner can track these times from all signers and decide when to extend its tenure based on when it thinks it can get 70% of the signers to approve it.

Signer Details

The signer configuration will specify a tenure extend time period. The first version of this to go live on mainnet should start off with this value defaulting to something like 5 minutes. As we validate that these tenure extends do not cause any problems, we can spread the word to signers to iteratively lower this number.

When a new burn block arrives, record the current time, idle_start, and initialize an idle_countdown counter to the configured duration. When a block proposal arrives, record the time, process_start. The block validation endpoint will validate the block and return the cost of that block. If the block has a non-zero cost, subtract (process_start - idle_start) from the idle_countdown. If it has a 0 cost, then subtract (now - idle_start) from the idle_countdown. This difference in how the idle time is computed is important to encourage miners to continue mining blocks with STX transfers after their budget is spent but before enough idle time has passed for a tenure extend.

In the BlockResponse for this proposal, include a timestamp which is current time plus idle_countdown. Once the response is sent, record the current time again to idle_start. Repeat with each block proposal.

We keep track of "idle" time instead of just flat wall time because it allows the signers to factor in how long it actually takes to process the blocks. This will flatten out the total processing time in scenarios where the cost budgeting is overly pessimistic, causing us to see some blocks that can spend the entire budget and be processed in 3 seconds, while others that spend the entire budget take 3 minutes to process.

If a block proposal arrives that contains a TenureExtend transaction and the tenure_consensus_hash is equal to the burn_view_consensus_hash, check that the current idle_countdown is less than or equal to 0 (letting this value go negative is useful feedback to the miner). If so, process the block as usual; otherwise, reject the block. This rejection would have a new reason code.
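The signer bookkeeping above can be sketched in Python as follows. This is a hedged illustration, not the stacks-core implementation: `SignerIdleTracker` and its methods are hypothetical, the block cost is modeled as a single number, and the "current" countdown checked for a tenure extend is assumed to include the idle time since the last response.

```python
from dataclasses import dataclass


@dataclass
class SignerIdleTracker:
    """Hypothetical model of the idle_start / idle_countdown bookkeeping."""

    tenure_extend_secs: float  # configured tenure extend time period
    idle_start: float = 0.0
    idle_countdown: float = 0.0

    def on_burn_block(self, now: float) -> None:
        self.idle_start = now
        self.idle_countdown = self.tenure_extend_secs

    def may_extend(self, process_start: float) -> bool:
        # Assumption: idle time since the last response counts toward the check.
        return self.idle_countdown - (process_start - self.idle_start) <= 0

    def on_proposal(self, process_start: float, process_end: float, cost: int) -> float:
        """Update the countdown and return extend_timestamp for the BlockResponse."""
        if cost > 0:
            # Non-zero cost: only the idle time before processing is credited.
            self.idle_countdown -= process_start - self.idle_start
        else:
            # Zero cost (e.g. STX transfers only): processing time is credited too.
            self.idle_countdown -= process_end - self.idle_start
        extend_timestamp = process_end + self.idle_countdown
        self.idle_start = process_end
        return extend_timestamp


signer = SignerIdleTracker(tenure_extend_secs=300.0)
signer.on_burn_block(now=0.0)
ts_costly = signer.on_proposal(process_start=100.0, process_end=160.0, cost=5)
ts_free = signer.on_proposal(process_start=200.0, process_end=202.0, cost=0)
```

In this run both responses report an extend_timestamp of 360: the 60 seconds spent validating the costly block delay the extend, while the zero-cost block leaves the timestamp unchanged.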

Miner Details

The miner now needs to keep track of the signers’ current idle timestamps and decide when it can refresh its budget with a tenure extension. A new component will process the StackerDB messages as they arrive, rather than directly in the sign coordinator. This is important because the sign coordinator stops listening for block responses from signers as soon as it hits the 70% threshold, but the miner needs to track the idle timestamps of all signers that report them. This component will be responsible for keeping track of the signers' latest idle timestamps, queryable from the miner. It will also provide the sign coordinator with block signatures. After each round of signing, the miner should record its estimated time to extend. It can compute this by sorting the reported timestamps in ascending order and selecting the time at which > 70% of the signing power will have passed their timestamp. Before each attempt to mine a block, check if this timestamp has passed and, if so, issue the tenure extension.
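One way to picture the weighted selection is the Python sketch below. The function name, the dictionary layout, and the integer weights standing in for signing power are all assumptions made for illustration; the actual component and data structures are not specified in this thread.

```python
from typing import Dict, Optional


def tenure_extend_time(
    signer_timestamps: Dict[str, float],
    signer_weights: Dict[str, int],
    threshold_pct: int = 70,
) -> Optional[float]:
    """Earliest time at which > threshold_pct of signing power allows an extend.

    Returns None if the signers that reported do not hold enough signing power.
    """
    total = sum(signer_weights.values())
    # Pair each reported timestamp with that signer's weight, earliest first.
    reported = sorted(
        (ts, signer_weights[signer])
        for signer, ts in signer_timestamps.items()
        if signer in signer_weights
    )
    accumulated = 0
    for ts, weight in reported:
        accumulated += weight
        # Integer arithmetic keeps the strict "> 70%" comparison exact.
        if accumulated * 100 > total * threshold_pct:
            return ts
    return None


weights = {"a": 40, "b": 30, "c": 20, "d": 10}
timestamps = {"a": 100.0, "b": 200.0, "c": 150.0, "d": 50.0}
extend_at = tenure_extend_time(timestamps, weights)
```

Here signers d, a, and c together hold exactly 70% of the weight, which does not exceed the threshold, so the miner has to wait for b's timestamp of 200. Before each mining attempt it would then compare the current time against `extend_at`.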

Testing

Designing good integration tests for this new behavior is important. We will need to test several different scenarios: