Limit core to ever store 1 transaction per source account

dmkozh commented 1 year ago

Background

Currently the Core implementation contains a pretty complex machinery built around maintaining multiple transactions per account, as long as their sequence numbers make a sequence starting from the ledger seqNum for that acount. This allows submitting multiple transactions to the Core in a short succession that most likely will be executed eventually (as long as fees are high enough or the ledger load is low). This also makes the Core implementation a fair bit more complex and results in a pretty tricky to interpret behavior (for example, it's not possible to 'kick out' your own transaction from the queue in case of traffic surge, which leads to a special sort of transaction rejections).

Problem

The current approach may not guarantee that the fee can be charged from the account at the transaction application time, as in order to do that, the Core would need to simulate all the effects of the transaction to make sure that the account balance is enough. So it is possible, for example, to send a payment of the whole account balance in one transaction and another transaction with arbitrary large fee. If both transactions are to be included in the ledger, then the second transaction will fail due to insufficient fee. However, it has to be flooded in order to be included into ledger in the first place.

This hasn't been much of on issue before, because all the transactions have a similar size. But with the upcoming Soroban launch, the current approach is unlikely to remain feasible. Soroban transactions may reach hundreds of kilboytes in size. Soroban works around such large transactions by introducing a 'bandwidth' fee dependent on the transaction size, However, due to the problem described above, it's possible to make the network flood very large transactions without actually paying any fees.

Also purely from the implementation standpoint, the Soroban transactions will likely need a separate transaction queue, so maintaining the already complicated logic would become even harder.

Solution

The most straightforward solution to the problem is to only allow storing and flooding one transaction per source account at any given time. Basically, every transaction that for which sequence number doesn't match the account sequence number stored in the ledger will be immediately rejected by the Core node.

Transition plan

This change may be staged in steps:

Introduce a core configuration parameter that allows to store only a single transaction per account, disabled by default
Enable the parameter on some validators to make sure there is no significant impact on the network behavior
Enable the parameter by default
Cleanup the Core code from the old logic

Only the first step has to happen before Soroban launch; other steps can be combined with it, although it is preferable to perform as many steps as possible before Soroban launch (as they don't require protocol upgrades)

Consequences

The described behavior change may affect the client systems (mainly Horizon), as we are moving all the batching capabilities out of Core.

Some clear way of communicating the transaction rejection reasons may need to be provided, although likely something existing can be reused (like TRY_AGAIN_LATER).

mollykarcher commented 1 year ago

Overall, I would see this as a positive change. I'm hugely in favor of simplifying the contract between horizon and core, particularly around transaction submission failure states. And there has been a lot of confusion around this that has arisen with partners as of late, particularly around the bulk transaction submission/disbursement use cases. However, it definitely has the potential to negatively impact some (maybe many) downstream clients, and we would need to socialize it pretty well to existing customers submitting transactions in advance of actually making this change.

Any partners utilizing channel accounts should be unaffected, and this is the behavior that we want to encourage and espouse moving forward. But this practice been less publicized/adopted to date as we would like, making it likely that a lot of existing partners will not have such a setup, and therefore may be implicitly relying on multiple tx/account/ledger to achieve their current level of throughput.

Horizon also has an internal queue to assist in lining up account transactions in the right submission order based on sequence number, before submitting them to core. We'd want to remove that as part of this work as well, but I think these should both actually be able to be implemented/deployed independently; having a horizon queue would be silly/superfluous without the core queue, but it wouldn't necessarily break anything.

dmkozh commented 1 year ago

Horizon change definitely can wait and maybe even help making the transition easier for whoever relies on this. The main issue that core needs to solve is transaction flooding, which is much more expensive than just storing the transaction in memory would be. So staging the migration makes a lot of sense.

marta-lokhova commented 1 year ago

We discussed with @dmkozh that it might be useful to evaluate the current network usage patterns, to get an idea how many accounts would get affected. Specifically, we are looking for:

Source accounts that wish to execute more than 100 ops in a single ledger. If we are to restrict core to 1 tx/source account, users will still have an option to bundle ops in a single transaction, but only up to 100 ops.
For accounts with less than 100 ops per ledger, we look for the ones that had multiple transactions per ledger, and some of those transactions failed at application time (with txFAILED result code). If users were to switch to multiple ops per transactions, it would be atomic, potentially affecting operations that used to execute successfully. Note that we exclude accounts where all transactions in that ledger failed, as bundling those into a single transactions wouldn't have made a difference.

Looking at ~2.5 days worth of historical data (45,247 ledgers, [45849152, 45894399]):

There were no ledgers with more than 100 ops per tx source account
Total of 29 ledgers had multiple transactions per source accounts, where some transactions failed.
In those 29 ledgers, a total of 35 accounts were affected (most of them had 2 transactions in a ledger, where 1 failed)

So it seems like this pattern is not super common (at least in the recent data).

marta-lokhova commented 1 year ago

One important detail that we haven't articulated here (and I did not take into account when running the network usage experiment above) is that always storing 1 tx/source account in core doesn't actually imply 1 tx/source account per ledger. This is because in the current implementation, we use mempool to vote on the current ledger and select transactions for the next ledger. Closing a ledger, removing applied txs from the mempool and selecting the next tx set all happens in the same synchronous sequence. This prevents transactions from the same source account from being considered for consecutive ledgers, resulting in one transaction per source account every alternate ledger. Edit: the above statement is not exactly correct. We allow some time to accumulate new transactions for the next ledger, but it depends on consensus latency. The buffer is (5 seconds - (consensus+ledger application) time). During this window, we allow new transactions in the tx queue. Note that this means that if the window size is not sufficient (it can be 0, for example), the transaction will be considered for the ledger after next, not the next one. The exact frequency of this happening depends on network utilization (how fast can consensus finish) and tx submission patterns (for example, Horizon ingestion lag may cause the transaction to miss the window).

I wonder if this is too limiting for downstream users. I'll look into extending the usage analysis to get a better idea of network patterns, but wanted to make sure we're aware of this @dmkozh @MonsieurNicolas

marta-lokhova commented 1 year ago

Taking a step back (and maybe this is an argument for going forward with this change anyway): it is sorta in line with what we wanted to do in the future, i.e. move ledger application to the background, so ledger close and overlay could run in parallel. This does imply independent tx sets for ledgers N vs N+1 (so can’t touch the same ledger entries)

tomerweller commented 1 year ago

If we are to restrict core to 1 tx/source account, users will still have an option to bundle ops in a single transaction, but only up to 100 ops.

I don't think that bundling unrelated operations should be considered a viable alternative given that it introduces unnecessary dependencies.

@marta-lokhova can we look at a larger dataset (~month) and cross references the affected accounts with the stelar.expert directory? It would be great to known if any of the affected accounts are known entities that we could proactively warn in advance.

leighmcculloch commented 1 year ago

Have we considered other solutions to the problem?

This hasn't been much of on issue before, because all the transactions have a similar size. But with the upcoming Soroban launch, the current approach is unlikely to remain feasible.

It sounds like the problem being addressed is specifically related to Soroban transactions and not transactions that are being executed today.

Soroban transactions will be executed in a different phase, and as I understand it potentially a different quota per ledger. Soroban transactions also already have their own unique rules, like their own set of operation types separate to existing operations.

Given that Soroban transactions are already quite isolated from existing transactions, have we considered limiting this change to Soroban transactions, such that the impact of the change is zero for any existing user of the network where the problem doesn't exist?

(I acknowledge this might not be feasible for implementation reasons, or might not be wise if it introduces obscure behavior into transaction submission that clients struggle to understand.)

dmkozh commented 1 year ago

have we considered limiting this change to Soroban transactions

We indeed have considered that. There are, however, some issues:

While Soroban transactions are kind of separate, they still are paid for with the classic accounts, so there would need to be some pretty tricky rules for the behavior (like you can do multiple classic ops, but not when you have a pending soroban tx and you won't be able to send a soroban tx as well etc.)
There isn't much isolation during flooding or at least we'd like to avoid having it. Ideally, all the transactions should be flooded and compared together using the same set of metrics, as to maximize the ledger utilization. That probably won't happen at full scale in v1, but we still can do something rough in terms of estimating the classic operation costs. This would allow us to utilize ledger for classic txs when we don't have much soroban traffic (full isolation would mean that we might have a big chunk of ledger capacity unused while the transactions are surge priced).
The implementation and maintenance aren't really straightforward with a special-cased approach (these are already pretty messy) and the previous point makes things even harder. Given the tight timelines this solution is probably the most optimal in terms of time-to-benefit ratio (especially given that now we know that impact is limited).

marta-lokhova commented 1 year ago

Closing this issue: the limit is on by default as of v19.13.0, and will be required in v20.0.0

stellar / stellar-core