> The spec doesn't make it clear that we expect a specific order of events (eg, subscription ID comes back on `call` before any events come back on `follow`). Eg on `chainHead_call` it just says of the "started" response
Perhaps we could make it more explicit. However, I think that anyone reading the spec would immediately understand that the order in which the messages arrive is important, and that a notification that represents an event shouldn't be sent before the client has received its identifier. If we need to make this even more explicit, that is fine by me, but please notice that even @lexnv proposed changing the spec to allow the messages to come out of order. So, he clearly also assumed that the spec was setting this expectation.
IMO the "Bounded Queues" section of the spec in combination with the docs of the chainHead group of functions already make this pretty clear.
When I implemented this stuff in Subxt, I'm pretty sure I didn't assume anything about the ordering and did it to handle race conditions (though in Rust I think that is a possibility regardless, so it was always a safer bet). IMO none of those things make the relationship between the outputs of different calls clear, but I'm not going to argue this point if others think that it was obvious enough.
> I'm pretty sure I didn't assume anything about the ordering and did it to handle race conditions
It's very possible that this is due to the fact that for a looong time the public JSON-RPC of substrate was delivering unordered messages. So, you probably got used to dealing with that anomaly. I also had custom logic back in the days to deal with that anomaly. I could, of course, bring those shenanigans back. However, I'm of the opinion that we have to aim for better standards.
> though in Rust I think that is a possibility regardless
No, it isn't.
> so it was always a safer bet
🤷
Messages over a WebSocket connection are always guaranteed to come ordered due to the fact that the WebSocket protocol works over a single TCP connection.
The statement "in Rust I think that is a possibility regardless, so it was always a safer bet" isn't really logical. Rust, along with all other Turing-complete programming languages, has the ability to create a FIFO data structure. This type of data structure is key to managing how data is handled, particularly in ensuring that messages are processed in the same sequence as they were received.
You implemented a complex process on the consumer side, which is completely unnecessary. My understanding is that this decision was influenced by concerns about the producer side, which your team also developed. Specifically, it appears you were worried that the producer might not reliably maintain the correct order of messages. To compensate for this potential issue, you added extra, complicated logic to the consumer side as a precaution.
This approach might stem from a couple of misunderstandings or gaps in knowledge. First, it's possible that you weren't fully aware that WebSocket connections, which are built on TCP, inherently ensure message order. Alternatively, it could be due to a lack of familiarity with utilizing Rust's features to enforce message order on the producer side.
In essence, this situation reflects a problem where unnecessary complexity is introduced into a system due to a poorly implemented abstraction layer. This complexity could have been avoided with a better understanding of the underlying technologies or a more efficient use of the programming tools at your disposal.
It is blatantly obvious that this JSON-RPC spec is designed to work over a full-duplex connection capable of guaranteeing ordering and correctness of messages, and it is also blatantly obvious that there is no need to:

> clarify that response from this comes back before any events on the follow subscription

Because the documentation of the event reads:

> `operationId` is a string returned by `chainHead_unstable_storage`.

Ergo, the `operationId` must have been returned by the time that the event happens. It doesn't say:

> `operationId` is a string that will eventually arrive in response to the `chainHead_unstable_storage` call that triggered the event.

The spec clearly states that `operationId` is a string **returned**, as in: "it has already been returned".
To add some color to the discussion, it's "technically" possible that these `chainHead_follow` and `chainHead_unstable_storage` calls can be made on different connections, and then we can't enforce such ordering. However, I think the common use-case is to use the API with a single connection, and yes, then such ordering is possible. Thus, it wouldn't hurt to specify such behavior in the spec, because it's not clear to me at least...
> To add some color to the discussion, it's "technically" possible that these `chainHead_follow` and `chainHead_unstable_storage` calls can be made on different connections
It would be wrong to do so. We are in a world with reverse proxies and load balancers and DNS caches, and it is wrong to assume that opening two connections to the same IP will actually connect to the same machine.
We could specify in the spec something along these lines:

_"When the method call is being made from the same connection, it is guaranteed that the `operationId` will be delivered before the response is generated on the `chainHead_follow`."_

It might be technically possible, although unlikely, to have a developer target a local-node / test-node by IP address and mistakenly create a different connection. I expect that the spec clarification should be enough to shed some light on this.

Do you guys think we should restrict the "different connection" edge-case from the spec? Specifying that users must reuse the same connection to make method calls that will be answered by `chainHead`?
> In essence, this situation reflects a problem where unnecessary complexity is introduced into a system due to a poorly implemented abstraction layer. This complexity could have been avoided with a better understanding of the underlying technologies or a more efficient use of the programming tools at your disposal.
Please don't jump to making a lot of assumptions about my understanding of things; it's not at all constructive and not something I will engage you on. I'm actually quite shocked that a simple request to add a little precision/clarity to the spec has devolved into this.
The reason that I said that it is possible for a race to occur in Rust has nothing to do with TCP or the nature of websockets. Rust is a multi-threaded language, and so it's entirely possible that messages are being handled on one thread (ie because receiving messages from a subscription is spawned onto a tokio task), and then:

- a `chainHead_call` request is made, and we wait for the response;
- when the response arrives, we start watching for events with that operation ID on the `follow` subscription.

In JS, nothing can happen between handling the operation ID and creating the subscription because it's single threaded. In Rust, it can. We have to cater for these sorts of things, else we end up with rare but possible problems.
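To make the shape of that race concrete, here's a small, self-contained toy; none of it is Subxt's actual code or its actual fix, and the names (`Registry`, `deliver_event`) are invented for illustration. It replays the problematic interleaving sequentially: if the consumer only registers an operation after the event-handling side could already have processed an event for it, that event is silently lost, while registering before any event can possibly be observed (the "one line moved up" mentioned just below) closes the window.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Shared map from operationId to a buffer of its events (invented for this toy).
type Registry = Arc<Mutex<HashMap<String, Vec<String>>>>;

// Stands in for the task handling the `follow` subscription: events for an
// operationId that nobody has registered yet have nowhere to go and are dropped.
fn deliver_event(registry: &Registry, operation_id: &str, event: &str) {
    if let Some(buf) = registry.lock().unwrap().get_mut(operation_id) {
        buf.push(event.to_string());
    }
}

fn main() {
    let registry: Registry = Arc::new(Mutex::new(HashMap::new()));

    // Racy interleaving: the event is processed before the caller registers the id.
    deliver_event(&registry, "op-1", "operationCallDone");
    registry.lock().unwrap().insert("op-1".to_string(), Vec::new());
    assert!(registry.lock().unwrap()["op-1"].is_empty()); // the event was lost

    // The fix: register a slot for the operation's events *before* any event
    // for it could possibly be processed.
    registry.lock().unwrap().insert("op-2".to_string(), Vec::new());
    deliver_event(&registry, "op-2", "operationCallDone");
    assert_eq!(registry.lock().unwrap()["op-2"].len(), 1); // the event was captured
}
```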
> you added extra, complicated logic to the consumer side as a precaution
Did you look at the logic that we wrote and decide that it was complicated, or did you just want to take a jab at it? Because in reality, all that we needed to do to handle this was move one line of code before the other instead of after.
> We could specify in the spec something along these lines: _"When the method call is being made from the same connection, it is guaranteed that the `operationId` will be delivered before the response is generated on the `chainHead_follow`."_ It might be technically possible, although unlikely, to have a developer target a local-node / test-node by IP address and mistakenly create a different connection. I expect that the spec clarification should be enough to shed some light on this.
>
> Do you guys think we should restrict the "different connection" edge-case from the spec? Specifying that users must reuse the same connection to make method calls that will be answered by `chainHead`?
It is obvious that this spec is designed to work on top of a full-duplex, event-driven and bidirectional connection capable of guaranteeing ordering and correctness of messages.
In other words: This spec can not be implemented on top of a REST API because REST APIs (unlike WebSocket APIs) are meant to be stateless and resource based, while WebSocket connections are stateful and socket-based.
Sure, it is technically possible to build a hacky protocol that enables a similar behaviour on an HTTP-based API by doing long-polling on one HTTP request while sending `POST` requests to another HTTP resource, with the purpose of emulating the properties of an event-driven and bidirectional connection. This is what some Web Applications had to do back in the days before WebSockets were a thing. However, please notice that's just a hacky workaround for emulating the properties of a bidirectional, event-driven and stateful connection. Meaning that HTTP connections were not designed to be used like that.
It's worth reiterating the fact that WebSocket connections -unlike REST APIs- are stateful, and also -unlike REST APIs- they are not resource based. Therefore, the state of a WebSocket connection should be assumed to be relevant only within the context of that connection.
With all this in mind, let's answer the following question: what is the correct behaviour when an opaque-id that was returned inside the context of WebSocket connection A is used in the context of WebSocket connection B?
In other words: since WebSocket connections are stateful, the opaque-ids that are returned are only relevant within the context of those connections.
There is no need for the spec to change anything, because that's how stateful connections work.
> Please don't jump to making a lot of assumptions about my understanding of things;
I don't think that it's fair to say that I'm jumping to conclusions. You made some statements that led me to believe that your understanding of some important topics is not correct. In particular the following comments:
> Even if we make sure that messages are sent in the right order, they may not be received in the right order.

> When I implemented this stuff in Subxt, I'm pretty sure I didn't assume anything about the ordering and did it to handle race conditions (though in Rust I think that is a possibility regardless, so it was always a safer bet)
I am of the opinion that those statements reflect a gap in your understanding of some important topics. If you don't want to engage with me because I'm pointing out these things, well... I'm sorry to hear that's the case. However, please know that I think that it's important that we are all on the same page when it comes to these things.
> The reason that I said that it is possible for a race to occur in Rust has nothing to do with TCP or the nature of websockets. Rust is a multi-threaded language, and so it's entirely possible that messages are being handled on one thread
I insist: the only reason why you have to account for race-conditions in the context of the events of `chainHead_call` has nothing to do with Rust. It has to do with the way that Subxt has been architected, which -based on your comments- did not leverage the fact that messages are guaranteed to arrive in order. Also, based on your previous comments, it seems that the architectural choices of Subxt couldn't have been made taking this into account, because you were not aware of this fact.
I'm sure that you will agree with me that, in Rust, when messages are guaranteed to arrive in a certain order, it is entirely possible to architect things so that the code doesn't have to deal with impossible race-conditions.
If your argument is that -in the context of Subxt- it is irrelevant whether the messages arrive ordered or not, because in the context of Subxt it is "a safer bet" not to architect the library based on those guarantees... Well, then let's agree to disagree on that specific architectural choice. However, let's please at least agree on the fact that it's entirely feasible in Rust to leverage the correct ordering of messages, so that the rest of your code doesn't have to account for impossible race conditions.
> I'm actually quite shocked that a simple request to add a little precision/clarity to the spec has devolved into this.
😅 Let me remind you of the context that led to this:
On November 27th 2023 I reached out to the Subxt team letting you know that we were experiencing what seemed to be a regression of this issue against Substrate nodes, within the context of the nested subscriptions of `chainHead_unstable_follow`. As a response I was told that if this was actually happening, that would certainly be a bug. However, back then, I was told that this was likely happening because of some poorly configured proxies or other issues related to the infrastructure. Which I always thought didn't make a lot of sense.
No one ever told me anything like: "oh, yeah, the spec is not clear enough with regard to the ordering of arrival", or "why isn't your client accounting for the fact that the messages could come out of order, like the Subxt client does?". No, nothing like that; back then we all seemed to be on the same page about the fact that if this was happening, then this had to be considered a bug. Back then, no one seemed to think that the spec was not clear enough.
Also, if I wanted the Subxt team to look into this any further, then it was on me to provide further evidence in the form of a reproducible environment... Sigh...
On January 17th 2024 I bring up this issue again to the attention of the Subxt team. This time I'm constantly seeing messages arrive out of order, and I'm also seeing this happening on the `rpc.polkadot.io` endpoint.
Once again, I'm told that this must be a bug. Up until this moment no one had ever questioned whether the spec was clear enough, or whether the client was supposed to handle messages coming out of order. It is in fact so clear to everyone that this is a bug, that the Subxt team requires me to provide a reproducible environment so that they can investigate. To be honest, I find that a bit annoying because I have quite a lot on my plate. Nevertheless, since we all seem to be on the same page about the fact that this is an annoying bug, I decide to spend a long time setting up that environment and I open this issue. Again, up until this point no one has questioned the clarity of the spec with regard to the ordering.
Once the Subxt team looks into the issue and confirms that this behaviour is in fact occurring, they reach the conclusion that the solution is to change the spec so that the messages are allowed to come out of order, because they seem to be under the impression that the order of arrival can't be guaranteed 🤯. Once again, up until this point no one has ever questioned whether the spec is clear enough with regard to the ordering of the messages, and everyone seems to be on the same page about the fact that the spec clearly sets this expectation. I mean, the initial suggestion is to "change the spec".
I quickly respond explaining why what the Subxt team is proposing is IMO unacceptable and @bkchr backs me up.
Then, in a completely unexpected turn of events, @jsdw opens this issue implying that the spec is not clear enough about the order of the events, and presents as evidence the fact that Subxt was implemented without taking into account the fact that the events come in order 😮.
Believe me @jsdw , you are not the only one who is shocked in this discussion.
> James: Even if we make sure that messages are sent in the right order, they may not be received in the right order.
This was me erroneously thinking about separate connections and had nothing to do with WS characteristics, and in my response I acknowledged that given we use just one WS connection this doesn't matter.
> James: (re Subxt implementation) I'm pretty sure I didn't assume anything about the ordering and did it to handle race conditions.

This was not about websocket message ordering. It was about the order of getting a response from `chainHead_call` and operations coming back on `chainHead_follow`.
> Josep: If you don't want to engage with me because I'm pointing out these things, well... I'm sorry to hear that's the case. However, please know that I think that it's important that we are all on the same page when it comes to these things.
I always welcome having conversations that lead to learning and improving my knowledge. What I don't appreciate are somewhat inflammatory assumptions about my (or others) knowledge or ability.
Things like...
> Josep: You implemented a complex process on the consumer side, which is completely unnecessary.
The "complex process" is just your assumption. In reality, it was one line of code being above, rather than below, another.
> Josep: Specifically, it appears you were worried that the producer might not reliably maintain the correct order of messages...
An assumption. I have never been worried that the correct order of messages wouldn't be maintained, but just that operations could start being sent before `chainHead_call` sends a response, or a multi-threaded race where I process messages on one thread but subscribe on another. (Maybe both; I can't remember exactly my thoughts when I wrote the code).
> Josep: ...To compensate for this potential issue, you added extra, complicated logic to the consumer side as a precaution.
To compensate for such potential issues, I put one line of code above, rather than below, another. It wasn't complicated at all, but you have assumed that it is.
> Josep: In essence, this situation reflects a problem where unnecessary complexity is introduced into a system due to a poorly implemented abstraction layer.
Same again.
And then, have you looked at our code for this? Do you have any specific issues with our code? Or are you just assuming that it's "poorly implemented" (thanks for the vote of confidence)?
> Josep: This complexity could have been avoided with a better understanding of the underlying technologies or a more efficient use of the programming tools at your disposal.
Same again. This is somewhat implying that I have a lot to learn. All of this said about some assumed complexity that doesn't exist.
Reminder: it was one line of code above, rather than below another, to address any potential concerns that I had. Even if there was no issue at all, it is trivial and sensible to write the code in this way.
> Josep: I insist: the only reason why you have to account for race-conditions in the context of the events of `chainHead_call` has nothing to do with Rust. It has to do with the way that Subxt has been architected

> Josep: If your argument is that -in the context of Subxt- it is irrelevant whether the messages arrive ordered or not, because in the context of Subxt it is "a safer bet" not to architect the library based on those guarantees... Well, then let's agree to disagree on that specific architectural choice
Rust, being multi-threaded, exposes the possibility of a race that can't exist in JS. Of course things could be architected in a Rust app to prevent this possibility (ie do everything on one thread).
In my view, Subxt is architected quite well, and the architecture has to take into account a number of things. It is also multi threaded, and allows users to use multiple threads (or tasks) to do things if they wish. It promises to work properly in these cases.
If you have any concrete thoughts on how to improve the Subxt code base and architecture, I'm all ears and we can discuss these things. But here, you are being overly dismissive of it, souring the discussion and ensuring that nobody benefits.
To go back to the beginning of this issue: I asked to make the spec more precise/explicit re the `chainHead_call` response being sent before corresponding messages come back on `chainHead_follow`. None of this needed to go down this path. That is all.
If the consensus is that this is already obvious enough then we can close this and I don't care to argue the point.
> To go back to the beginning of this issue: I asked to make the spec more precise/explicit re the `chainHead_call` response being sent before corresponding messages come back on `chainHead_follow`. None of this needed to go down this path. That is all. If the consensus is that this is already obvious enough then we can close this and I don't care to argue the point.
I personally think that the spec is already perfectly clear, given that:
1) The documentation of all the operation events state that:
> `operationId` is a string **returned** by...
So, IMO it's already clear that `operationId` was already "**returned**" before the event came.
It also states things like:
> The `operationCallDone` event indicates that **an operation started** with `chainHead_unstable_call` was successful.
Which IMO clearly indicates that the operation has already "**started**".
Also, if a clarification was needed, then IMO the clarification wouldn't belong in the docs of `chainHead_call` (et al), but rather in the documentation of the events, because those are the ones that must come in the correct order, after their `operationId` has already been sent. In other words: it's easier to document it on the side of the events, because they are the ones that must get both the timing and the order right.
However, once again, I'm of the opinion that the spec is super clear and self-evident as it is.
2) The payload of the response of `chainHead_call` (et al) is `"result": "started"`, and IMO it's pretty obvious that it doesn't make sense for it to arrive after an event named `operationCallDone`.
That being said, as I already mentioned in my very first comment: if you want to make the spec "more precise/explicit" then, of course, go for it! Just open a PR and make it "more precise/explicit", for sure! I mean, I think that fixing the bug that I reported on Polkadot-SDK should be a higher priority, but... 🤷
So, in a nutshell: the fact that I completely disagree with this statement of yours:
> The spec doesn't make it clear that we expect a specific order of events (eg, subscription ID comes back on `call` before any events come back on `follow`).
Doesn't mean that I'm against improving the spec, of course not.
@jsdw what is the status here? I assume that fixing this should be fairly simple?
@bkchr
When https://github.com/paritytech/jsonrpsee/pull/1281 is merged we can fix it properly. Should be ready this week
Ty for the update @niklasad1!
Raising on behalf of https://github.com/paritytech/polkadot-sdk/issues/2992#issuecomment-1900285167
In our current substrate implementation of these methods, the response from a call to eg `chainHead_call` may arrive back after one or more events associated with it arrive on the corresponding `chainHead_follow` subscription.

The spec doesn't make it clear that we expect a specific order of events (eg, subscription ID comes back on `call` before any events come back on `follow`). Eg on `chainHead_call` it just says of the "started" response:

@tomaka, do you see any issue with us specifying that we'll always see a response to `call` and related before seeing any corresponding events on the `follow` subscription?