Protocol Request: change how the close time is used when applying transactions

What problem does your feature solve?

Right now when applying transactions, we use the close time decided during consensus, which is a different close time than what was used to validate transactions. This creates the possibility that a transaction gets accepted in a ledger only to fail later. This can break certain smart contracts if transactions are submitted at exactly the wrong time.

What would you like to see?

close time should be set for the ledger after transactions get applied, in the same way the protocol version is updated after.

What alternatives are there?

we could try to move the checks upstream, but this may be challenging as "combine candidates" (SCP) can modify the transaction set and close time independently.

Expanding on this a bit, on why this is annoying.

For regular transactions submitted by SDKs that set expiration time to make it easier to know when they can safely re-submit (with a potentially new sequence number) a transaction. This is trying to guard people against clients running in parallel and issues caused by mem pool & network congestion/surge pricing. With this bug, they're charged a transaction fee for no good reason.

The larger issue is with smart contracts that today are written as chains of pre-signed transaction, some have "branches" with 2 mutually exclusive "paths" based on time.

For example, one of the transaction is only valid before a certain time (let's call it A), and the other is only valid after a certain time (B).

With this bug in the protocol, it's possible for a transaction to be accepted in the candidate tx set even though it will fail during consensus because a future timestamp is used when executing that transaction.

So basically A is valid, but then fails, yet consumes a sequence number, therefore unlocking transactions that are child of A. If that transaction was anything other than a "no-op", this can have very bad effects on the smart contract (not just unlocking the branch, there is no way to "retry" that transaction).

Running into this in the context of smart contracts is not super likely as the timescale for expiration for those contracts are in the order of hours or days (or even years); but nevertheless, we should close the possibility of failure.

Actually there might be a different way to fix this: what we can do is change the combine candidates function to remove transactions from the transaction set (and dependent transactions) based on the new close time.

See: https://github.com/stellar/stellar-core/blob/c097a3fe36f108bcbb52e4995b81bfe617f3109a/src/herder/HerderSCPDriver.cpp#L724

Right now this function is a no-op; only downside with this is that when we change the transaction set here, that means that we introduce a new value on the network - that said, as it's a deterministic function so nodes that are in sync should actually construct the new transaction set without having to rely on peers.

Actually this last solution (change the way we combine candidates) doesn't work: if we were to remove transactions at that stage it would imply that somebody spamming the network with "close to expiration" transactions would not pay any transaction fees; worst they could spam with "very large fees" and take over ledgers entirely (only to see those transactions removed for the ballot protocol) so the network would happily close empty ledgers.

possible alternative: use the ct associated with the txset (instead of max) and punt the requirement on the txset before they get into the candidate set.

possible alternative: use the ct associated with the txset (instead of max) and punt the requirement on the txset before they get into the candidate set.

I've started experimenting this, and testing has brought up a question. Nicolas pointed out that if we just take the best txset, with its own specific closetime, rather than combining txsets, and (by taking the maximum of all candidates) their closetimes, then we will have an answer to the question "who proposed this chosen txset?" (currently there is no answer to that question because a txset, in particular its closetime, may be the result of combining several candidates, and not equal to any one specific candidate proposed by any given node), which in turn would allow us to pass around a SIGNED rather than a BASIC StellarValue as the chosen candidate to externalize. I don't know of any existing idea to use that information, but it would be another piece of information, so we might come up with a use for it eventually.

However, we're currently combining not only the candidates' closetimes (by taking the maximum) to produce the externalized value, but also all the candidates' upgrades (again, by taking various maxima, such as of proposed ledger version and maximum txset size). If we continue to do that, then we still won't be able to identify an externalized StellarValue as having been proposed by a specific node, and therefore still won't be able to sign it, because the upgrades might constitute some combination of the candidates' upgrades that doesn't equal any one candidate's.

I have only a tiny amount of context so far with which to consider the question: should we also change the selection of upgrades to be that of the StellarValue containing the best txset, or should we continue to combine upgrades while changing only the closetime to be that of the StellarValue containing the best txset?

I have only a tiny amount of context so far with which to consider the question: should we also change the selection of upgrades to be that of the StellarValue containing the best txset, or should we continue to combine upgrades while changing only the closetime to be that of the StellarValue containing the best txset?

Some musings:

Jon and I discussed this, and while we didn't come to a conclusion, I do gather that the upgrades do happen independently of the transactions: we could view the ledger closes as an optimization of alternating transaction set applies and ledger upgrades. So we do have a free choice between proposing to change upgrades in the same way as we are proposing to change closetime, or not doing so and letting them work differently (single-candidate-selection for closetimes and txsets, candidate-combination for upgrades).
I think it's clear that we'll gain the new information of being able to use SIGNED StellarValues, identifying one particular node which nominated the whole StellarValue, only if we do change upgrades in the same way as we're proposing to change closetimes. What isn't clear is how valuable that is (if at all). I don't think we have any specific ideas for it yet, but if someone were to come up with something that could put it to clear use, then that could provide some push in the direction of changing upgrades to match closetimes.
As far as I know, there isn't any correlation among the different upgrades that we combine, in the sense that today, I think, if changing any given parameter A from 1 to 2 and changing any given parameter B from 1 to 2 would each be sensible individually, then it would also be sensible to do both at once. But if the type of possible upgrades could expand to one in which certain combinations of parameter changes would be insensible, then it might no longer make sense just to combine each parameter individually -- the simple union of individual changes could produce a result which was inconsistent. In that case, the right way to combine them consistently would become ambiguous; there could be multiple options of which none would include all candidate parameter changes. That would not be inherently wrong -- we have the same situation with transaction sets, and we make a heuristic but deterministic choice as to which is the best. But it might be some argument for choosing a single nominated upgrade in the same way as we already choose a single nominated tx set, and are proposing to choose a single nominated closetime (the one in the same StellarValue as the chosen tx set), just because choosing the best single nominated set of upgrades could be simpler (there might be far fewer options to choose among) than choosing the best consistent combination of all tx sets.
A risk of not combining upgrades, which I'm not sure whether it exists or not because I'm not familiar enough with the protocol yet, might be that upgrades that some nodes proposed, which would be positives from their perspective, and happened to be neutral from other nodes' perspectives, and therefore good for the network as a whole, might never happen, because the nodes that considered the upgrades neutral (and therefore did not propose them) always had their tx sets nominated, perhaps because they were much higher-traffic validators. I'm not sure whether that's what would happen or not, though -- would the high-traffic validators, upon seeing upgrade nominations that they found neutral, start flooding them too on the grounds that some nodes' wanting them and the others' not actively not wanting them meant that they were a net good for the network as a whole?

I'm going to start the code and the CAP by proposing to change upgrades as well as closetimes/tx-sets to come from one single specific nomination; I like the idea of being able to identify the unique proposer of a specific ledger, and I like the sound of making the means of choosing a value to externalize consistent across all components of the ledger (closetime, txset, upgrades). But I am extremely far from being a protocol expert, so I would easily be swayed by an argument to the contrary by such an expert.

So basically A is valid, but then fails, yet consumes a sequence number, therefore unlocking transactions that are child of A. If that transaction was anything other than a "no-op", this can have very bad effects on the smart contract (not just unlocking the branch, there is no way to "retry" that transaction).

Is there a reason that this is not also a problem when a transaction consumes a sequence number but fails for some other reason besides txTOO_LATE? For example, what if a transaction was valid when accepted into a transaction set but no longer had a high enough balance by the time it was about to be applied, because of previously-applied transactions? That, I think, would also consume a sequence number and fail; if it were part of a smart contract, wouldn't that also lead to dependent transactions potentially being executed despite the transaction that they depended on having failed?

This sounds almost tautological, but any good smart contract should be designed such that transactions (Tc) contingent on the success of another transaction (T) are not executable unless the first transaction succeeds. There are a few ways that this can be achieved. Some examples:

Make it so that T cannot fail, thereby avoiding the problem entirely.
Make it so that T has a side-effect which enables the contingent transactions. This can be achieved, for example, with a sequence number gap between T and Tc. Then T contains a BumpSequenceOp which bridges the gap. Transactions which fail have no side-effects (other than fee charged, sequence number consumed, and one-time signers removed) so the sequence number gap can only be bridged upon success.
Make it so that T configures the signers of some account into a particular state.

The second technique is very powerful, because it allows conditionality on whether T succeeds or fails. If T succeeds, then the sequence number gap is bridged. If T fails, then the next sequence number becomes available. This can permit two different branches of execution (for example, perhaps some kind of cooperative recovery in the event of failure).

The issue with txTOO_LATE is that it makes it effectively impossible to use the first approach if you are using time-bounds.

The issue with txTOO_LATE is that it makes it effectively impossible to use the first approach if you are using time-bounds.

Thanks -- I think I might be closer to getting this now. But possibly not all the way -- let me see if I can state this in words of my own, and find out whether they're sensible or not.

I can see why you call that second approach powerful -- it does sound rigorous to me. I guess that the BumpSequenceOp is the last operation of T in such cases, and that the BumpSequenceOp has no failure case?

What I'm not sure I understand is how you can ever use the first approach, even without time bounds. Is the point that the txTOO_LATE path that we're addressing in this issue the only way that a transaction can currently consume a sequence number without being applied, and that if it is applied, it can restrict itself to using only operations that have no failure cases, or that it can guarantee don't fail because of preconditions that it had established through previous checks and transactions?

Indeed, BumpSequenceOp does not fail. It doesn't even have to be the last operation, it just has to be one of the operations.

The gist of what you wrote is correct, but it's not how I would state it. I would flip the statement around. It is possible to use operations that can be guaranteed not to fail due to established preconditions (and the required preconditions might be the empty set), and the only exception to that is failures caused by txTOO_LATE. Let me give you some example of operations and combinations of operations that never fail given that you have the right preconditions:

BumpSequenceOp always succeeds
PaymentOp to self always succeeds for any amount
Account A has balance and limit B of asset X. Account A' has balance and limit B' of asset X. Transaction contains ChangeTrustOp with source account A', asset X, and limit (B+B'); PaymentOp with source account A, destination A', and amount B. As long as A' cannot change it's limit unilaterally (eg. it is a cooperative escrow account) then this can be guaranteed to succeed.

The first two are kind of silly, but the third is a real example where you can guarantee the preconditions for a payment. Many situations will be simplified with CAP-23.

Adding a few comments from the CAP.

Once the network has upgraded to the new protocol, the following semantic changes occur:

nit: use a numbered list as it makes it a lot easier to reference

The selection during the nomination protocol of a StellarValue to externalize from the set of candidates occurs by choosing the closeTime from the StellarValue which contains the "best" transaction set (according to the same heuristic metric that already exists for "best transaction set"). The signature of that StellarValue is preserved, so it remains SIGNED. The externalized upgrades set is still computed as a maximal combination of all the candidate upgrade sets, but that does not affect the signature -- we were already (even in older protocols) signing only a version of the StellarValue with the upgrades stripped.

I mentioned this in previous conversations. There is a missing bullet here (possible thanks to the previous bullet) that describes how we change the validity criteria for a StellarValue (during all stages of SCP): we now consider invalid a StellarValue that contains transactions that are not valid at the associated time. In particular, this implies that a StellarValue introduced by validators will not contain transactions that would have been valid at the previous ledger but are invalid in the (about to be closed) next ledger.

In fact you're actually already referring to this in the "design rationale", so you just forgot to describe it.

Design Rationale

That section should only contain information that supports the existing approach and avoid describing what it's not.

In particular:

the commentary on older approaches can be found in the link at the top of the document (that happens to be this issue for this proposal)
the commentary on why we're not changing upgrades is not super interesting either - I think it would have been worth discussing if we were to change it.

If you feel that this is adding value you can still include something like this in the CAP (in an appendix/separate sections) - but written in such a way as to support the existing approach (similarly to how other CAPs answer questions like "Why are Signer Sponsoring IDs Stored Separately from Signers?" for CAP-0033). In this particular case, we would be justifying why we don't want to change the behavior later in consensus (either by trimming transactions at the end of nomination or by changing the behavior when externalizing).

The selection of a single StellarValue from the set of candidates during the nomination protocol reflects the literature on and formal models of the Stellar Consensus Protocol better than the current implementation's combination of candidates.

I am not sure I understand what this means : the composite function described in the white paper doesn't say anything about this; and we do not propagate any one of the values from nomination into the ballot protocol either.

The selection of a single StellarValue from the set of candidates is the simplest way of ensuring that the combination function, given a set of self-consistent candidates, produces a consistent externalized value.

Same thing, I do not understand what this means.

Backwards Incompatibilities

Based on today's protocol meeting, I think it would be worth expanding on practical implications to contrast the conditions necessary for a transaction to be included (only to fail later during consensus) compared to what happens with the proposed behavior.

In particular, I think it can be useful to list out all the things that need to happen "just right" in order to have a practical difference, and put some commentary on how this highlights broken/very precarious situations: for a transaction to be included like this, a ledger has to have capacity at that very moment and the transaction has to be in the transaction queue of the validator that happens to win nomination at that exact time.

I've opened up PR #679 , a draft pull request for commentary on the in-progress second draft of the document (the first draft having been merged in PR #671 ). I'll reply to Nicolas's comments here, and commit changes to that branch, and we'll have the option of making further comments in the PR as well.

Adding a few comments from the CAP.

Once the network has upgraded to the new protocol, the following semantic changes occur:

nit: use a numbered list as it makes it a lot easier to reference

I'm not sure whether you meant that you'd like all lists to be numbered, or whether it's just "Semantics" specifically that would benefit from being able to be referred to by number -- for now I've only changed that one, but if you meant the former, I'll change all the lists.

The selection during the nomination protocol of a StellarValue to externalize from the set of candidates occurs by choosing the closeTime from the StellarValue which contains the "best" transaction set (according to the same heuristic metric that already exists for "best transaction set"). The signature of that StellarValue is preserved, so it remains SIGNED. The externalized upgrades set is still computed as a maximal combination of all the candidate upgrade sets, but that does not affect the signature -- we were already (even in older protocols) signing only a version of the StellarValue with the upgrades stripped.

I mentioned this in previous conversations. There is a missing bullet here (possible thanks to the previous bullet) that describes how we change the validity criteria for a StellarValue (during all stages of SCP): we now consider invalid a StellarValue that contains transactions that are not valid at the associated time. In particular, this implies that a StellarValue introduced by validators will not contain transactions that would have been valid at the previous ledger but are invalid in the (about to be closed) next ledger.

In fact you're actually already referring to this in the "design rationale", so you just forgot to describe it.

Yes, it's in the rationale -- and also the code! -- but I forgot it in the "Semantics". I've added it now.

Design Rationale

That section should only contain information that supports the existing approach and avoid describing what it's not.

In particular:

the commentary on older approaches can be found in the link at the top of the document (that happens to be this issue for this proposal)

the commentary on why we're not changing upgrades is not super interesting either - I think it would have been worth discussing if we were to change it.

If you feel that this is adding value you can still include something like this in the CAP (in an appendix/separate sections) - but written in such a way as to support the existing approach (similarly to how other CAPs answer questions like "Why are Signer Sponsoring IDs Stored Separately from Signers?" for CAP-0033). In this particular case, we would be justifying why we don't want to change the behavior later in consensus (either by trimming transactions at the end of nomination or by changing the behavior when externalizing).

Okay, I've reorganized that section to begin with the explanations of what we are proposing, and I've moved the descriptions of rejected ideas to sub-sections at the end of the "Rationale" (like appendixes to the Rationale section). Let me know if that reads well now.

The selection of a single StellarValue from the set of candidates during the nomination protocol reflects the literature on and formal models of the Stellar Consensus Protocol better than the current implementation's combination of candidates.

I am not sure I understand what this means : the composite function described in the white paper doesn't say anything about this; and we do not propagate any one of the values from nomination into the ballot protocol either.

The selection of a single StellarValue from the set of candidates is the simplest way of ensuring that the combination function, given a set of self-consistent candidates, produces a consistent externalized value.

Same thing, I do not understand what this means.

You're right to not understand it; those were me not understanding how the paper and the code corresponded. Having gone back to re-read that part of the paper after having been in the code for a bit, I see that the paper is much more flexible than I remembered it as having been, admitting an arbitrary (pure) compositing function in the nominating phase (which we take advantage of; it's combineCandidates()).

Backwards Incompatibilities

Based on today's protocol meeting, I think it would be worth expanding on practical implications to contrast the conditions necessary for a transaction to be included (only to fail later during consensus) compared to what happens with the proposed behavior.

In particular, I think it can be useful to list out all the things that need to happen "just right" in order to have a practical difference, and put some commentary on how this highlights broken/very precarious situations: for a transaction to be included like this, a ledger has to have capacity at that very moment and the transaction has to be in the transaction queue of the validator that happens to win nomination at that exact time.

I'll work on that.

I also plan to add one or two tables to try to illustrate the proposed semantic changes in what I hope will be a clearer way (rows being places in the code whose semantics change, with two columns, for old and new behavior -- a sort of document version of a State pattern).

I mentioned this in previous conversations. There is a missing bullet here (possible thanks to the previous bullet) that describes how we change the validity criteria for a StellarValue (during all stages of SCP): we now consider invalid a StellarValue that contains transactions that are not valid at the associated time. In particular, this implies that a StellarValue introduced by validators will not contain transactions that would have been valid at the previous ledger but are invalid in the (about to be closed) next ledger.

By the way, it occurs to me that I don't understand why we weren't already validating this. It's true that before my changes, we didn't know exactly what the closeTime was going to be at next ledger close if that StellarValue's transaction set turned out to be the best. But we were taking the maximum of all closeTimes in combineCandidates(), so we did know that if that StellarValue's transaction set turned out to be the best -- indeed, if it were considered as a valid candidate at all in the combineCandidates() that produced the best transaction set -- then the accompanying ledger close time would be at least as high as that in the StellarValue. And if a transaction were expired with respect to its own associated closeTime, then it would also be expired with respect to any greater closeTime. What am I missing that was preventing us from already validating nominated transactions against the closeTime in the same StellarValue (which is already validated as being greater than the last ledger close time, so validating transactions against it is a strictly stronger condition than validating against the last ledger close time)?

I mentioned this in previous conversations. There is a missing bullet here (possible thanks to the previous bullet) that describes how we change the validity criteria for a StellarValue (during all stages of SCP): we now consider invalid a StellarValue that contains transactions that are not valid at the associated time. In particular, this implies that a StellarValue introduced by validators will not contain transactions that would have been valid at the previous ledger but are invalid in the (about to be closed) next ledger.

By the way, it occurs to me that I don't understand why we weren't already validating this. It's true that before my changes, we didn't know exactly what the closeTime was going to be at next ledger close if that StellarValue's transaction set turned out to be the best. But we were taking the maximum of all closeTimes in combineCandidates(), so we did know that if that StellarValue's transaction set turned out to be the best -- indeed, if it were considered as a valid candidate at all in the combineCandidates() that produced the best transaction set -- then the accompanying ledger close time would be at least as high as that in the StellarValue. And if a transaction were expired with respect to its own associated closeTime, then it would also be expired with respect to any greater closeTime. What am I missing that was preventing us from already validating nominated transactions against the closeTime in the same StellarValue (which is already validated as being greater than the last ledger close time, so validating transactions against it is a strictly stronger condition than validating against the last ledger close time)?

After more thought, that gives me an idea for a whole new proposal for fixing this protocol bug (v4!). If my quoted paragraph above is wrong, then the following is nonsense, but if it's right, then this new idea is my favorite so far.

This new proposal combines ideas from three sources, with a couple slight modifications:

The v2 proposal in this issue to trim transactions based on the new close time in combineCandidates(). This proposal makes a slight modification, by doing this in a different place in that same method.
The v3 proposal in this issue, published in the first-draft CAP, specifically the validation of the transactions in a StellarValue based on the closeTime in that StellarValue (which I argue in the quoted paragraph that we could have been doing already).
Siddharth's recently-merged code (in stellar-core PR #2608 for stellar-core-internal issue #69; see also stellar-core-internal issue #66) to trim transactions before flooding them based on an estimate of the next ledger-close time (a strictly stronger form of trimming than what the previous code did, which was to trim only those transactions that were expired based on the last ledger close time).

(It makes no use of the idea from the v1 proposal in the description of this issue, the one that would have deferred updating the "last ledger close time" in the ledger header until after applying transactions.)

Here are the new set of changes I propose (contingent, again, upon the quoted paragraph being correct!):

We trim transactions that turn out to be expired based on the externalized closeTime in combineCandidates(). However, I propose to do this at a different place than the v2 proposal in this issue did: I propose to trim each individual candidate before combining them into a nominated value (but after computing the externalized closeTime, in the same way as we do in the current protocol, by taking the maximum closeTime of all candidates). The point of this change is that it allows us to take into account that some transactions may be expired before we decide which is the "best" transaction set. Thus, if transaction set A were bigger than transaction set B, but transaction set B would be bigger than transaction set A after trimming, this change will lead to our using transaction set B.
Adopt the change in the quoted paragraph: in all phases of the protocol, wherever we validate StellarValues in the current protocol (up to the point where they become fully validated candidates that are passed into combineCandidates()), we perform the validation that there are no guaranteed-expired transactions in the StellarValue's transaction set based on the closeTime in the StellarValue, not the last ledger close time. This is a strictly stronger validation, because we already validate that a StellarValue's closeTime is greater than the last ledger close time. Specifically, the place where this validation needs to be strengthened is in HerderSCPDriver::validateValueHelper(). It does not need to change in HerderImpl::triggerNextLedger(), for the reason detailed in the next list item: Siddharth has already done that work.
Siddharth's new code in HerderImpl::triggerNextLedger() already guarantees that we trim all transactions that would exceed our proposed closeTime, because our proposed closeTime is just "now" (according to our system's clock), whereas Siddharth's code makes an estimate which is later than "now". Therefore, this proposal does not need to change Siddharth's code there. So this list item is not really a change in this proposal; it's a note that Siddharth has already made a change (indeed, a stronger change) than one that this proposal would have had to make prior to his change.

Here are some of the things that I like about this v4 proposal:

The proposal prevents the race that motivated the CAP; it ensures in all cases that a transaction that returns txTOO_LATE does not consume a sequence number and does not pay a fee. This is manifestly ensured in a single place in the code: in combineCandidates(), where we would now explicitly trim any transactions that would be expired with respect to the computed closeTime in the to-be-nominated StellarValue.
The proposal does not change the current semantics of how closeTime is generated in combineCandidates(); that remains a maximum of all the candidate closeTimes. So any concerns deriving from the closeTime generated by the new proposal being potentially less than the closeTime that would have been generated by the existing protocol become moot. (Because the closeTime remains a combination, we can't make the nominated values SIGNED, either, so that prevents me from wandering into any tangential discussions of SIGNED versus BASIC StellarValues, or potential changes in how we combine upgrades.) There's less potential "distraction" than in the v3 proposal.
Because it does not change the point in closeLedger() at which the last ledger close time is updated, it does not suffer from the "surprisingly committed transaction" problem that scuttled the v1 proposal.
The change to validate against the StellarValue's closeTime, which it makes on top of the change to trim transaction sets in combineCandidates(), avoids the introduction of the denial-of-service attack that scuttled the v2 proposal, for the following reason: an attacker who tried to spam the network with close-to-expiration transactions would have to reckon with a different definition of "close-to-expiration". In the v2 proposal, it would suffice for an attacker to set an expiration time just after the last ledger close. But with this proposal, most such transactions would be dismissed before getting into the transaction set because the validation would be stronger: it would be against a newly-proposed closeTime. And a newly- proposed closeTime would have an unpredictably high chance of being the winning closeTime, or very close to it. Therefore, an attacker that paid very high fees and set expiration times that gave the transactions a chance of getting into the transaction set but being trimmed in combineCandidates() would have to face a substantial fraction of the transactions not being trimmed, and therefore of paying a subtantial fraction of the very high fees that the attacked had offered in order to get into the transaction sets. This would disincentivize such attacks to, as far as I can think so far, at least as great a degree as they are already disincentivized, and possibly a greater one, as opposed to a lesser one in the v2 proposal (but I'll have to do some calculations to be confident of this). Siddharth's code helps here by trimming transactions early (in HerderImpl::triggerNextLedger()) to an even stricter standard than is required just for correctness in this v4 proposal.
This proposal potentially offers higher-quality transaction sets to come out of combineCandidates() than we have today, because it trims likely-expired transactions more aggressively in the nomination phase than the current protocol. (It also offers higher-quality transaction sets than the v2 proposal, because it takes into account (in combineCandidates()) which transactions will be trimmed before deciding what constitutes the best transaction set.)

Sorry I am not really following.

On the quoted paragraph, right now we use lcl.ct to verify, in the current CAP, we're going to use the associated ct from the same StellarValue but that change by itself is not enough: if we continue to use the max(ct), some transactions may be considered expired later in the protocol (see below).

We trim transactions that turn out to be expired based on the externalized closeTime in combineCandidates(). However, I propose to do this at a different place than the v2 proposal in this issue did: I propose to trim each individual candidate before combining them into a nominated value (but after computing the externalized closeTime, in the same way as we do in the current protocol, by taking the maximum closeTime of all candidates)

nit: it's not "externalized" (that's what happens at the end of the ballot protocol), it's just the current best candidate.

I am not sure I understand what this means. Are you saying that you'd still want to remove transactions from within the combineCandidate function (the only place where we have a potentially different close time)? If so, this is still vulnerable to the problems outlined in this response; I guess a slight variation of it where the system will bias towards the txset that came with the larger ct (as other txsets will lose some transactions because of it), so this effectively makes it easier to "win" nomination. Also note that if we have this bias, it's basically a worst version (as it has bias to keep the StellarValue with the highest close time) of the current proposal that at least randomizes between proposed values.

As for 2. I imagine that you're listing is for completeness as we need to do this with the current version of the CAP.

Of note (not super important, but helps to understand why the solution has to work holistically) : 3. should be thought of as "best effort" as it's just feeding into the nomination protocol, a bad actor controls that code and may decide to inject whatever they want into the transaction set, in particular it can construct "worst case" scenarios with ct = lcl.ct+1 and a txset filled with transactions that expire at ct+1 - very likely any other candidate will have a ct bigger than that ; or inject the maximum possible ct ~60 seconds in the future as to maximize the number of expired transactions on other proposed StellarValue as oulined earlier.

Sorry I am not really following.

On the quoted paragraph, right now we use lcl.ct to verify, in the current CAP, we're going to use the associated ct from the same StellarValue but that change by itself is not enough: if we continue to use the max(ct), some transactions may be considered expired later in the protocol (see below).

Yes.

We trim transactions that turn out to be expired based on the externalized closeTime in combineCandidates(). However, I propose to do this at a different place than the v2 proposal in this issue did: I propose to trim each individual candidate before combining them into a nominated value (but after computing the externalized closeTime, in the same way as we do in the current protocol, by taking the maximum closeTime of all candidates)

nit: it's not "externalized" (that's what happens at the end of the ballot protocol), it's just the current best candidate.

Right, sorry, I keep doing that. I meant "nominated".

I am not sure I understand what this means. Are you saying that you'd still want to remove transactions from within the combineCandidate function (the only place where we have a potentially different close time)?

Yes.

If so, this is still vulnerable to the problems outlined in this response; I guess a slight variation of it where the system will bias towards the txset that came with the larger ct (as other txsets will lose some transactions because of it), so this effectively makes it easier to "win" nomination. Also note that if we have this bias, it's basically a worst version (as it has bias to keep the StellarValue with the highest close time) of the current proposal that at least randomizes between proposed values.

That bias would exist, yes, but I don't see how it's a worst case -- specifically, it seems to me that it's strictly better than what we have today. It can't be the worst case -- today's is worse! Here's what I mean: that bias will indeed favor sets with the larger closeTime. But, as you point out, that bias comes from precisely the transactions that I'm proposing that we trim in combineCandidates(). And today, that trimming does not occur; instead, those transactions make it all the way through being charged a fee and consuming a sequence number before returning txTOO_LATE -- in other words, they're precisely the transactions that suffer from this bug!

So, yes, I would very much like to have a bias in favor of transactions that have a chance of being applied over transactions that are doomed to fail and (if they're part of smart contracts that behave as you described) potentially trigger catastrophic bugs. :)

Therefore, I argue that introducing that bias represents a strict improvement over what we do today. You might argue that it's not as much of an improvement as v3 of the CAP would make in some respects, and I would agree, but I'll get to more detail on that later.

I want to note that when you say "I guess a slight variation of it", I think you are acknowledging (and I think it is true) that this v4 proposal is not actually vulnerable to the same spamming issue that you linked to, which is the one that doomed the v2 proposal. The reason is that the additional change of validating against the same closeTime as in the StellarValue, instead of only against the last ledger close time, means that an attacker can no longer get transactions into the transaction set just by making them expire just past the last ledger close time. To get into a transaction set and avoid paying a fee, which would mean creating a transaction that would ultimately be trimmed in combineCandidates(), would, because of v4's stronger validation than v2, have to predict what the window will be -- if any -- between the closeTime that the validator that the attacker was sending transactions to and the eventual nominated closeTime, which would be the maximum of all candidate closeTimes. And, as you've pointed out, it's the common case that there's only one candidate, so that there is no window. And if there is a window, it's still not predictable when it will start or end. Consequently, an attacker would be running a substantial risk of getting their transactions into the ledger and paying the fee, which they've made very high so that they can spam the ledger. So spam attempts are economically disincentivized, just as we want them to be.

As for 2. I imagine that you're listing is for completeness as we need to do this with the current version of the CAP.

No, that's not just for completeness. I'm specifying the v4 proposal with respect to the current master branch, not with respect to the v3 proposal. The v4 proposal shares code change #2 with v3, and it adds code change #1 with respect to v3, but it also lacks one of the code changes in v3, namely the change in combined closeTime selection. If I were specifying v4 with respect to v3, I'd have to list a removed proposed change as well as an added one. I thought it was cleaner to list the exact two proposed code changes from the current master.

Of note (not super important, but helps to understand why the solution has to work holistically) : 3. should be thought of as "best effort" as it's just feeding into the nomination protocol, a bad actor controls that code and may decide to inject whatever they want into the transaction set, in particular it can construct "worst case" scenarios with ct = lcl.ct+1 and a txset filled with transactions that expire at ct+1 - very likely any other candidate will have a ct bigger than that ;

Yes, understood and agreed.

or inject the maximum possible ct ~60 seconds in the future as to maximize the number of expired transactions on other proposed StellarValue as oulined earlier.

Yes, also agreed. This is a good point to return to my earlier discussion that I paused with "not as much of an improvement as v3 of the CAP would make in some respects".

The "maximum closeTime injection" attack which biases the nomination protocol towards your transaction set does indeed exist against this v4 proposal. But, as far as I can tell, it already exists today, in the same form, and I don't see how the v4 proposal is making it any worse. And it's not the bug which is described in this issue at all -- that one's something completely separate. It's not even the same class of attack, in the sense that the "maximum closeTime injection" requires control of a validator, not just the bandwidth to spam one of them. So I don't think that we need to be conflating the attempt to fix that problem with the fixing of the txTOO_LATE bug. I would not want to make that problem worse, but I don't think that I am making it worse.

The v3 proposal, unlike my v4 proposal, does make an attempt to solve the "maximum closeTime injection" attack. That's good. But, since it's separate, I think it also suggests a way forward for getting the subject of this issue, the txTOO_LATE bug, fixed with less risk and less controversy than v3, while still preserving our ability to move on to the full v3 as a further step if we can come to agreement on it -- or, if we discover any even better further step than v3 in the meantime, to move on to that step instead of to v3. In other words -- maybe I should have referred to the v4 proposal as "v2.5"!

Here's what I mean, in case it's not already clear. The v4 proposal is an attempt to fix the txTOO_LATE bug and only the txTOO_LATE bug, minimizing new attack surfaces or risks by minimizing any change which isn't necessary just to fix the txTOO_LATE bug. In particular, unlike v3, it makes no attempt to address the "maximum closeTime injection".

And this makes v3 technically safer and cognitively less of a load, and therefore less prone to resistance, because it is or is very close to, as far as I can tell, a strict improvement over the current protocol's behavior. In other words, I think it's close to demonstrable that there's no sequence of events that v4 could be reasonably suggested to make worse.

Here's the demonstration, expressed in terms of the two code changes I propose. What I called "#3" above was actually just a note that the code didn't need to change because Siddharth has already done what would otherwise have been necessary. So that one can be ignored in this argument. Also, with apologies, I'm going to reverse the order of #1 and #2, because I realize now that in my argument, the "pure-improvement" quality of what I called #2 above does not depend on #1 having been made, but the "pure-improvement" quality of what I called #1 above does depend on #2 having been made. So I should have flipped them in the first place.

Perform the validation that there are no guaranteed-expired transactions in the StellarValue's transaction set based on the closeTime in the StellarValue, not the last ledger close time. As you noted, this is also part of proposal v3. This code change occurs in HerderSCPDriver::validateValueHelper(). I have already argued in my comment above that this is a pure win: we might as well already be doing that validation, because any transaction that failed it would have been doomed to return txTOO_LATE in some future code path anyway. And the place that it ended up failing might have ended up being closeLedger(), meaning that it would have triggered this bug. So, by doing the stronger validation, we at least drop a doomed transaction sooner, meaning we spend less time and space on doomed transactions, and sometimes we avoid a potentially critical bug. That extra validation is therefore a pure win.
Trim transactions that turn out to be expired based on the combined closeTime in computed in combineCandidates(). (So, given the same set of candidates, the new combineCandidates() would produce the same combined closeTime as the current combineCandidates(), but possibly a smaller transaction set.) Perform this trimming on each individual candidate before combining them into a nominated value (as opposed to trimming the combined set after computing that from the individual candidates, as in the v2 proposal). I argue that this is also a pure win. Here's why: the only transactions that we trim are transactions that, if we hadn't trimmed them (which we do not today), would have triggered this bug, and the effect of trimming is therefore to eliminate buggy txTOO_LATE transactions at the cost of not charging them a fee. Not charging them a fee is what we want if they are good-faith transactions, because they didn't get as far as being applied before expiring. So the only question is whether failing to charge bad-faith transactions a fee at this point would open up an attack vector. In the v2 proposal, when this was the only code change, that was clearly the case. But, as I argued above (and I think you agreed, because you stated that bias rather than starvation would be the new variant), the presence of code change #1 has disincentivized the attack: filling up the ledger with soon-to-expire transactions now requires hitting an unpredictable and often non-existent window rather than just setting the expiration time to lastLedgerCloseTime + 1. That means that an attacker who wants a chance at hitting that window also runs a risk of having their transactions applied. And that means a risk of paying the very high fees that they're paying to try to keep other transactions out of the transaction set. And that, again, is what we want.

To me, that suggests the following plan:

Propose v4 (AKA v2.5) as the new CAP-34.
Open up a separate protocol request issue for the further change of replacing the new CAP-34 trimming code in combineCandidates() with the different change of choosing the closeTime associated with the best transaction set instead of the maximum closeTime of all candidates. That would be a further upgrade of "v2.5" (v4) to "v3". And it would have a separate purpose: the txTOO_LATE bug would already have been solved; the further change (CAP-35+) would be intended to address the "maximum closeTime injection" issue, which, again, already exists; it's not newly-introduced by proposal v4/v2.5.

Then we could fight those two battles separately, and have separably- reviewable code changes. I think it would make the first and more urgent battle -- fixing the txTOO_LATE bug, which is after all the subject of this issue -- much easier, because there's less to think about and I think we can argue (as I did above) that it's a clear win.

And if we can win that smaller battle, then we can build upon it a further argument that changing from the maximum closeTime to the best-set-associated closeTime would represent an even further improvement, because it solves another bug, namely the "maximum closeTime injection".

Am I making sense? :)

That bias would exist, yes, but I don't see how it's a worst case -- specifically, it seems to me that it's strictly better than what we have today. It can't be the worst case -- today's is worse! Here's what I mean: that bias will indeed favor sets with the larger closeTime. But, as you point out, that bias comes from precisely the transactions that I'm proposing that we trim in combineCandidates(). And today, that trimming does not occur; instead, those transactions make it all the way through being charged a fee and consuming a sequence number before returning txTOO_LATE -- in other words, they're precisely the transactions that suffer from this bug!

I was talking about worst than solution 3 as your comment is a counter to proposal 3.

So, yes, I would very much like to have a bias in favor of transactions that have a chance of being applied over transactions that are doomed to fail and (if they're part of smart contracts that behave as you described) potentially trigger catastrophic bugs. :) Therefore, I argue that introducing that bias represents a strict improvement over what we do today. You might argue that it's not as much of an improvement as v3 of the CAP would make in some respects, and I would agree, but I'll get to more detail on that later.

kind, of: the reason solution 2 was bad was that we could end up with "empty ledgers" (or close to) instead of failed transactions. This achieves the same result. Basically the network does less (potentially close to no) work instead of doing "bad" work like processing failed transactions. Reason we landed on solution 3 was that we thought that it was better to maximize the amount of real work that the network can do.

The other bad property from 2 (that is preserved in this proposal 4) is that we introduce new values to the network, which may have to be propagated (as the set of accumulated values may not be the same on all nodes until convergence of the nomination protocol), which potentially increases latency quite a bit.

Stepping back though: it sounds like you've been looking into solution 4 with the assumption that the controversial bit in this CAP is the choice of close time, it's not. I don't think that the point that dm raised during the protocol meeting on potential issues with cross chain swap is related to the choice between max(ct) and a different (smaller) close time; but instead that in some cases transactions will not be included in the ledger at all instead of showing up as failed, and therefore not consume the sequence number. We'll have to wait to hear from him to understand exactly the exact sequence that he was trying to protect against - the idea was that we need something, anything to happen on the network when a pre-image is revealed (so that the preimage is attached to the ledger close) or we introduce an asymmetry in the cross chain swap protocol. I suspect that those protocols actually need this CAP in order to work correctly, not the opposite.

I suggest we sync offline and post back here with the conclusion.

I was talking about worst than solution 3 as your comment is a counter to proposal 3.

Yes, I agree about that comparison.

So, yes, I would very much like to have a bias in favor of transactions that have a chance of being applied over transactions that are doomed to fail and (if they're part of smart contracts that behave as you described) potentially trigger catastrophic bugs. :) Therefore, I argue that introducing that bias represents a strict improvement over what we do today. You might argue that it's not as much of an improvement as v3 of the CAP would make in some respects, and I would agree, but I'll get to more detail on that later.

kind, of: the reason solution 2 was bad was that we could end up with "empty ledgers" (or close to) instead of failed transactions. This achieves the same result. Basically the network does less (potentially close to no) work instead of doing "bad" work like processing failed transactions. Reason we landed on solution 3 was that we thought that it was better to maximize the amount of real work that the network can do.

Agreed, taking the closeTime from the winning transactionSet is even better at maximizing useful work.

The other bad property from 2 (that is preserved in this proposal 4) is that we introduce new values to the network, which may have to be propagated (as the set of accumulated values may not be the same on all nodes until convergence of the nomination protocol), which potentially increases latency quite a bit.

Agreed there too.

Stepping back though: it sounds like you've been looking into solution 4 with the assumption that the controversial bit in this CAP is the choice of close time, it's not.

I don't think I'm assuming that -- it's more like I'm not even sure whether there is a single "the" most controversial part, and simply splitting it up into two steps each of which has smaller individual justifications (first, fix the txTOO_LATE bug with as few semantic changes as can be made to fix it; second, add the throughput/latency improvements by selecting a more precise closeTime as you describe above, thereby also fixing the "maximum closeTIme injection" bad-validator attack) might make it more palatable.

I don't think that the point that dm raised during the protocol meeting on potential issues with cross chain swap is related to the choice between max(ct) and a different (smaller) close time; but instead that in some cases transactions will not be included in the ledger at all instead of showing up as failed, and therefore not consume the sequence number. We'll have to wait to hear from him to understand exactly the exact sequence that he was trying to protect against - the idea was that we need something, anything to happen on the network when a pre-image is revealed (so that the preimage is attached to the ledger close) or we introduce an asymmetry in the cross chain swap protocol. I suspect that those protocols actually need this CAP in order to work correctly, not the opposite.

If that turns out to be the case, then maybe there will be no further need to convince anyone of anything.

I do agree v3 would leave us in the best state (that we've thought of so far) long-term. I'll keep refactoring the code and CAP under the assumption/hope that we'll get to implement that version in one step.

I have marked the second revision of the document, which I hope integrates the changes from the discussions above, as ready for review: https://github.com/stellar/stellar-protocol/pull/679

There is now a PR ready for review: https://github.com/stellar/stellar-core/pull/2625

A few comments on terminology and correctness

2 | Which transactions to include (not to trim) in nominated StellarValue?

Terminology is slightly wrong here. "Nominated" implies confirmed nominated. Here, we're talking about transactions to include in a nominate message at the very beginning of a consensus round (so, we haven't confirmed anything yet). It would be clearer to say "Which transactions to include (not to trim) in StellarValue a node votes to nominate?"

4 | Which close time to produce when compositing candidates into single nominee?

Let's use language consistent with the whitepaper, there is a definition for "compositing candidates into single nominee" - composite value. So you can simply say "Which close time to use in a composite value?"

5 | Generate BASIC or SIGNED StellarValues when combining candidates (in "compositing" function)? 6 | Expect BASIC or SIGNED StellarValues in ballot protocol?

Having both bullets is redundant here. You are producing a composite value to feed into the ballot protocol, so the answers to 5 and 6 will always be the same. I recommend combining these into a single question: "Expect BASIC or SIGNED StellarValue in a composite value to be used in the ballot protocol"?

The proposal additionally repairs a degree of lack of Byzantine fault tolerance which we shall call "maximum closeTime injection": an ill-behaved validator can no longer reduce the quality of transaction sets produced by the compositing function by nominating as large a closeTime as allowed (the maximum time slip is currently one minute) and thus possibly causing some other nodes' nominated transactions to expire. As explained below in the "Detailed illustration of failure mode" section, a node would have to provide a transaction set with the best set of unexpired transactions in order to influence the composited closeTime.

This is unrelated to Byzantine fault tolerance, which is needed to detect if some nodes failed based on messages received from other nodes. Nominating a value with maximum close time is valid according to the protocol. Instead, this proposal helps mitigate a specific attack you describe later.

A few comments on terminology and correctness

Thanks -- I've tried to incorporate these in an in-progress third draft, in this commit: https://github.com/rokopt/stellar-protocol/commit/99e2b0e1cc96b51931e3e2ac39a6ffb268470771 .

A few more things, as I spent more time with the implementation and was referring to the CAP for details:

The proposal additionally repairs a degree of lack of Byzantine fault tolerance which we shall call "maximum closeTime injection": an ill-behaved validator can no longer reduce the quality of transaction sets produced by the compositing function by nominating as large a closeTime as allowed (the maximum time slip is currently one minute) and thus possibly causing some other nodes' nominated transactions to expire. As explained below in the "Detailed illustration of failure mode" section, a node would have to provide a transaction set with the best set of unexpired transactions in order to influence the composited closeTime.

This sounds like the proposal eliminates this attack completely, but I don't think it does. If a transaction set introduced by the misbehaving node ends up being the composite value, then you end up with the same problem. So while this proposal makes it harder to execute the attack, it still seems possible to me.

In the presence of change #3, which causes any StellarValue which contains any transaction which is expired with respect to the closeTime in the same StellarValue to fail validation, the change to use the closeTime from the StellarValue which contains the chosen best transaction set becomes optimal.

I had a hard time parsing this sentence, and would really appreciated it if you re-flowed it into multiple simpler sentences

A few more terminology nits (which I think are important, nevertheless):

Please avoid using expressions such as "best transaction set" when referring to composite values. It creates a false sense of transaction set quality, i.e. selected transaction set is somehow better than other transaction sets -- it's not. Combining candidates is just a function for applications that implement SCP to deterministically select something to give to the ballot protocol. In Stellar, we select the biggest transaction set (we know very little of its quality; it could be all failed transactions).
Similar thing about "winning" nomination, it's a bit misleading. When a composite value is produced, and the ballot protocol is started, nomination keeps running in the background. So your composite value might change while the ballot protocol is running. What you are describing is "a value selected for the ballot protocol".

See this conversation for exploration of the semantics of minTime/txTOO_EARLY (so far we have only been explicitly mentioning maxTime/txTOO_LATE).

I've attempted to correct the problems Marta pointed out in her second round of review in this commit: https://github.com/rokopt/stellar-protocol/commit/4e9feaec4b52c795d9c45fc7778c9c6338966005

The CAP has also been updated to reflect the latest implementation changes; see PR #691 .

stellar / stellar-protocol