stellar / stellar-core

Reference implementation for the peer-to-peer agent that manages the Stellar network.
https://www.stellar.org
Other
3.12k stars 967 forks source link

expose transaction queue information to downstream clients #2920

Open tomerweller opened 3 years ago

tomerweller commented 3 years ago

Problem

Transaction submission is currently a black box. From the moment a transaction is submitted until it makes it's way to consensus the client can not reason about it's status. If it doesn't make it to consensus then things get even more complicated.

Core to the rescue

However, core can answer some questions to help clients reason about the status of transactions in flight. For example:

Implementation Proposal

  1. Core exposes a stream of TxQueue events which Horizon ingests
  2. Horizon provides a subscription API for transaction submitters that includes all TxQueue events for a specific transaction

Notes

MonsieurNicolas commented 3 years ago

I'm open to suggestions on how to make this better.

Exposing more complexity to clients is not necessarily the right way to do this.

I think at the end of the day, clients don't (should not) care to know why things fail at the local node level: they just want to be able to make sane assumptions on what should they do (including nothing).

Lets first start to try to answer the easier question, which is from the network point of view, and this is a tri state: "this transaction is being considered for the current ledger", "no this transaction is not being considered" and "maybe".

The problem is that the cases where we can answer "yes" or "no" at the network level (as in "I can see inside the memory of all computers on that network") is fairly limited because transitions (each node's state a second later) are randomized by a bunch of factors (new transactions submited to the network, how network traffic flows, when nodes receive messages, etc).

List of cases where we know "for sure" that a transaction is not going to be included:

List of cases where we know "for sure" that a transaction is going to be included:

I think we'd have to settle for some probabilistic approach already at the network level.

For example if a transaction transaction (and its predecessors) is one of the most expensive transactions in the validator that will introduce the tx set to the network (as defined by the nomination protocol) and we're close to nomination time, we're pretty confident that the transaction will be included in the next ledger.

Obviously all clients can't just talk to validators like this, and unless we can get high probability answers to those questions we're back to "maybe" type of answers that are not super useful to tease out.

In any case, I'll close this comment with how I started: I'm open to suggestions on how to make this better.

tomerweller commented 3 years ago

@MonsieurNicolas thanks for this reply. I think this is an important discussion to have.

Exposing more complexity to clients is not necessarily the right way to do this.

I somewhat agree with the sentiment of not giving devs too much rope to hang themselves with, but in this case we're giving them close to nothing. Yes, Stellar is a non-trivial distributed system - but trying to obfuscate that with an overly simplified tx submission contract is backfiring - tx submitters are at loss when attempting to debug. I argue that simplification should happen in the upper levels of the stack (horizon/sdk/wallets). Core should output known facts - for example, if a tx was dropped from the queue because it became invalid - that's a useful fact that clients have no access to right now.

The problem is that the cases where we can answer "yes" or "no" at the network level is fairly limited.

Fully agreed. I'm not advocating for core to try and guesstimate network level behavior - just to output it's own local node knowledge. Downstream clients should then build their own heuristics. We can keep horizon tx submission opinionatedly simple but give developers that are interested access to more fine grained information

As mentioned before, there's a significant amount of prior art in the blockchain space of queryable transactions queues (or "mempools"). Stellar is "cursed" by being much simpler and faster so clients usually don't need these capabilities. But when they do - there is very little.

As a bonus, I think that exposing this information will allow ecosystem devs to build better network analysis tools.

leighmcculloch commented 3 years ago

The way that transaction submission is implemented in Horizon and promoted in guides/docs it is described like any other HTTP call to a non-distributed application where a result is determined. In some ways this is a lie, because if you cover all edge cases that is never sufficient and Horizon provides very little help past the happy path. It would be great to keep it the way it is if we can improve the handling of edge cases.


Is T in the queue?

Do submitters of transactions want their transactions to hang around in the queue? Why do transactions get kept in the transaction queue? What problem is that solving? Is it creating more problems than it is solving?

I think some of the pain comes from transaction being submitted, failing or timing out, then finding out later it succeeded. Obviously there's no way to 100% prevent that because anyone who saw the transaction could resubmit it even if it didn't make it into a ledger, but we could limit the problem under normal operation by not keeping transactions in the queue.

If transactions are not kept in the queue this also limits the questions @tomerweller raised. Is it in the queue? No. If submission failed, its not in the queue.

We could also maybe limit these cases by encouraging small single ledger sized timebounds in our guides and SDKs, but even after conversation with @jonjove and others I don't feel like I completely understand how to set a small enough timebounds to guarantee it will only get into the 1 or 2 ledgers.


Is T banned?

Has T been dumped from the queue for becoming invalid? What's the reason?

It would be great if Horizon knew this and could immediately respond to the submitter saying it is rejected because it is banned.


Has T been flooded?

What does this mean?


We should probably work our way backwards and propose the changes we want to the Horizon transaction submission API, and then identify what information we need from Core. cc @ire-and-curses @stellar/horizon-committers

tomerweller commented 3 years ago

Thanks @leighmcculloch. I agree that the immediate goal should be improving the Horizon tx-submission API. The low hanging fruit that I see are:

  1. Fail immediately when submitting a banned tx .
  2. Fail fast when a tx gets dropped/banned and provide reasoning. If the tx is invalid then the wallet can respond faster. If it got banned because of insufficient fee offering the wallet can immediately prompt the user to suggest a higher fee offering.
  3. Disambiguate slow ledgers from surge pricing. If a wallet is routinely seeing slow success it can be either be because of long ledger processing times or their txs are getting pushed because of surge pricing and insufficient fee offering. In the latter case the wallet can increase fee offering or prompt the user to do that.

With that said, I would consider taking a bottom up approach and just expose all the information that stellar-core has about the queue. This can allow alternative tx submission APIs to have their own heuristics and facilitate external analysis tools.

Interested to hear what @jonjove, @ire-and-curses and @bartekn think.

jonjove commented 3 years ago

@tomerweller I think "low hanging fruit" 1 and 2 are interesting ideas.

With that said, I would consider taking a bottom up approach and just expose all the information that stellar-core has about the queue. This can allow alternative tx submission APIs to have their own heuristics and facilitate external analysis tools.

I kind of see this from the exact opposite perspective. Let's get a concrete vision of what we want Horizon to implement (as a reference and test case, not because there can't be other transaction submission APIs) and then figure out how to expose the information. I don't like the idea of exposing everything with the hope that people will eventually consume it, because it's hard to evaluate the merit of that work relative to other work (one could argue that it's useless since no one is planning to consume it immediately, or one could argue that the upside is huge because people might make great use of the information) and prioritize it.

Perhaps let's start from what @ire-and-curses @bartekn would actually want to implement in Horizon in the short term? Then we can make a plan for exposing the right information in the right way.

MonsieurNicolas commented 3 years ago

I realized the same thing while discussing with @Shaptic in parallel to this issue. Right now clients are not doing the right thing when hitting much simpler (and more likely to happen) glitches than surge pricing (like what can happen when calling Horizon from a mobile device). What I proposed to him was that we should figure out what the overall pattern should be client side, once we understand that, we can have the Horizon team see what can be done to move the complexity away from the client so that we get back to Horizon being the thing that turns "gnarly async patterns" (what we get when dealing with the core network) into patterns that are easier to deal with from the client point of view (synchronous/idempotent/retry forever type of contracts).

What I suspect will happen is that dealing with "surge pricing" related issues will be much simpler to integrate safely after that iteration.

ire-and-curses commented 3 years ago

We spent some time discussing this in the Horizon team. We identified some problems with the Horizon API contract, as well as places where Core's non-determinism remains a problem.

The way I think about it, a user should always be able to make clear decisions if they always use sequence numbers as the source of truth, bounded in time. To do that safely they need confirmation on the ledger.

When Horizon responds to a client's tx submission, they get a kind of "optimised" response - they are told earlier than is strictly correct about what's going on, and they can sometimes infer the wrong things.

The problem is, that response hides a lot of complexity, and when there is no response (i.e. timeout), people sometimes are at a loss to proceed. Behind the scenes many things could be happening.

The other problem is that in non-surge pricing times, dealing exclusively with the response (and not checking the ledger) can get you a performant solution. But that strategy doesn't hold up during surge pricing.

This is compounded by certain historical decisions in the Horizon API, such as queuing of submissions (so that you can submit N, N+1, N+2 seqnums even out of order, and they can be submitted and succeed), and a lot of latitude in how you craft submissions (e.g. allowing txs with no timebounds).

Taking this view, I think what we want is 1) a way to tighten the expectations in the API to shrink the rope with which the client can hang themselves 2) a little less abstraction around the distributed systems problem so that clients are aware that e.g. crashing their wallet on a mobile device is something they must code for 3) improving the API contract with Core wherever possible to reduce ambiguity (e.g. failing fast) 4) exposing additional information from Core that Horizon can use to encourage correct client-side behaviour 5) clear concise guidelines for client best practices 6) some way to test your stuff progammatically against a Horizon or simulated Horizon that is experiencing surge pricing

I realise that's pretty hand-wavy but I think it's a reasonable way to describe the abstract problem.

I'll let @bartekn post a specific detailed proposal that we can use as a starting point for implementation discussions. Then I think we should meet. We'll need expertise from both Core and Horizon to work through edge cases and understand where we can make tradeoffs on the design, and Partner Engineering to keep our use cases honest.

bartekn commented 3 years ago

Thanks @ire-and-curses!

I was thinking about how we can improve the txsub experience for clients. I started with listing common mistakes users are making when submitting transactions:

With this in mind I came up with an idea for the extra txsub endpoint designed for very simple apps like web/mobile wallets (ex. our Account Viewer) with built-in checks that would prevent many common mistakes in transaction submission. In general:

An extra table to temporarily store tx hashes (account, sequence, txhash, max_time) is required. Here’s how it would work for a transaction T. Please note that I didn't take banning into account - need some help from the Stallar-Core team to better understand it.

How does it protect from common mistakes?

There’s one change on the client side that must be added for this to work: clients should keep track of the last txsub by the user (they need to save tx envelope somewhere, can be even a cookie) and disallow sending another tx until the one last is resolved.

ire-and-curses commented 3 years ago

@MonsieurNicolas and I discussed this offline. We think the next step is to lay out a flow diagram that captures the different core, Horizon and SDK paths. @MonsieurNicolas is going to take a first pass at that diagram from the core perspective, aiming for something by the end of the month.

MonsieurNicolas commented 3 years ago

OK, so I think we can discuss this some more.

I put together a diagram that describes more or less the existing flow for submitting transactions in a safe way.

This version is supposed to deal with the following failures correctly:

Here is a slightly scary dot file with my findings (there are comments both in the diagram and in the source code).

I had to add a new field returned by core's tx endpoint and by Horizon (called as_of_ledger for now), as I could not make it work reliably without it.

I also don't know if clients have access to "other errors" returned by core's tx endpoint, like DUPLICATE, TRY_AGAIN_LATER, etc

A few observations:

I think that the client side code can be simplified a lot if the transaction submission flow was "first classed" from an API flow point of view.

Here are some ideas:

leighmcculloch commented 3 years ago
* information needed to craft a transaction for an account should be accessible with a single call to Horizon

This is already the case for simple transactions where only the state of a single account is required. Anything beyond a single account or list of accounts the Horizon API doesn't support querying in a single call. Given that the state of the ledger is consistent for a single ledger it would be interesting if Horizon supported querying any data at specific ledgers, but that's way out-of-scope for what Horizon does today I think, and I'm not sure it is required for the majority of cases.


the transaction submission endpoint should

* take an array of transactions as a parameter.

This could complicate Horizon's submission process, and I think tx submission through Horizon is already more complex than it should be.

We need to find a way to simplify it.

Applications that need to deal with tx submission at any non-trivial scale will probably want to submit using an asynchronous submission pattern, e.g. something akin to https://github.com/stellar/go/pull/3564, and use patterns such as:


After considering the above, I'm not sure visibility into the queue would really help me write an application to handle these edge cases. While a transaction is valid, even if it isn't in the queue, and even if I haven't resubmitted it, there is technically nothing stopping it showing up again due to some other actor on the network.

MonsieurNicolas commented 3 years ago

This is already the case for simple transactions where only the state of a single account is required. Anything beyond a single account or list of accounts the Horizon API doesn't support querying in a single call. Given that the state of the ledger is consistent for a single ledger it would be interesting if Horizon supported querying any data at specific ledgers, but that's way out-of-scope for what Horizon does today I think, and I'm not sure it is required for the majority of cases.

This is not the case: today's accounts endpoint only returns information about the account but not about the latest ledger, this introduces a race condition (see the diagram's source code that has some commentary about this).

leighmcculloch commented 3 years ago

returns information about the account but not about the latest ledger

Are you meaning that the information might be delayed due to the time it takes to ingest, or do you mean there isn't information like the ledger seq number in the account response?

tamirms commented 3 years ago

This is not the case: today's accounts endpoint only returns information about the account but not about the latest ledger, this introduces a race condition (see the diagram's source code that has some commentary about this).

there's a last_modified_ledger field in the account response json body which is the last sequence number which modified the account. Horizon also sets a Latest-Ledger header on account responses. The Latest-Ledger header value is the sequence number of the most recently ingested ledger.

MonsieurNicolas commented 3 years ago

Ah cool @tamirms : Latest-ledger is the kind of thing that would be needed yes! I looked at https://developers.stellar.org/api/ (and laboratory) and didn't see this. Not sure if it's documented. Should it be first classed into the main reponse?

Also, we would need to add this information to core's responses note that this is especially important on failure to propagate core's ledger number not Horizon's ingested number (this is when transaction submission errors out).

tamirms commented 3 years ago

@MonsieurNicolas I don't think the header is documented anywhere. We added the header for testing purposes. When we have a release candidate we deploy it to staging and compare responses between staging and production to ensure they have the same response if both responses have the same Latest-ledger value.