paritytech / json-rpc-interface-spec


`chainHead_call` (et al): clarify that response from this comes back before any events on the `follow` subscription. #125

Closed: jsdw closed this issue 9 months ago

jsdw commented 10 months ago

Raising on behalf of https://github.com/paritytech/polkadot-sdk/issues/2992#issuecomment-1900285167

In our current substrate implementation of these methods, the response from a call to eg chainHead_call may arrive back after one or more events associated with it arrive on the corresponding chainHead_follow subscription.

The spec doesn't make it clear that we expect a specific order of events (eg, subscription ID comes back on call before any events come back on follow). Eg on chainHead_call it just says of the "started" response:

This return value indicates that the request has successfully started.

operationId is a string containing an opaque value representing the operation.

@tomaka, do you see any issue with us specifying that we'll always see a response to call and related before seeing any corresponding events on the follow subscription?
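
To make the expectation concrete, here's a rough, hypothetical wire trace (a minimal Rust sketch that just prints it) showing the order I'd like the spec to guarantee on a single connection. The notification method name and payload fields here are simplified/assumed rather than copied from the spec; only the ordering matters:

```rust
// A hypothetical wire trace over a single WebSocket connection, oldest message
// first. Method/notification names are abbreviated and payloads simplified;
// the point is only the ordering: the response carrying `operationId` (4)
// arrives before any event referencing that `operationId` (5).
fn main() {
    let trace = [
        // 1. Client opens the follow subscription.
        r#"--> {"jsonrpc":"2.0","id":1,"method":"chainHead_follow","params":[false]}"#,
        // 2. Server answers with the subscription id.
        r#"<-- {"jsonrpc":"2.0","id":1,"result":"sub-1"}"#,
        // 3. Client starts a runtime call against a pinned block.
        r#"--> {"jsonrpc":"2.0","id":2,"method":"chainHead_call","params":["sub-1","0xblockhash","Core_version","0x"]}"#,
        // 4. Server responds to the call first: operation started, with its id.
        r#"<-- {"jsonrpc":"2.0","id":2,"result":{"result":"started","operationId":"op-1"}}"#,
        // 5. Only afterwards do events for "op-1" appear on the follow subscription.
        r#"<-- {"jsonrpc":"2.0","method":"chainHead_followEvent","params":{"subscription":"sub-1","result":{"event":"operationCallDone","operationId":"op-1","output":"0x..."}}}"#,
    ];
    for line in trace {
        println!("{line}");
    }
}
```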

josepot commented 10 months ago

The spec doesn't make it clear that we expect a specific order of events (eg, subscription ID comes back on call before any events come back on follow). Eg on chainHead_call it just says of the "started" response

Perhaps we could make it more explicit. However, I think that anyone reading the spec would immediately understand that the order in which the messages arrive is important, and that a notification representing an event shouldn't be sent before the client has received its identifier. If we need to make this even more explicit, that is fine by me, but please notice that even @lexnv proposed changing the spec to allow the messages to come out of order. So, he clearly also assumed that the spec was setting this expectation.

IMO the "Bounded Queues" section of the spec in combination with the docs of the chainHead group of functions already make this pretty clear.

jsdw commented 10 months ago

When I implemented this stuff in Subxt, I'm pretty sure I didn't assume anything about the ordering and did it to handle race conditions (though in Rust I think that is a possibility regardless, so it was always a safer bet). IMO none of those things make the relationship between the outputs of different calls clear, but I'm not going to argue this point if others think that it was obvious enough.

josepot commented 10 months ago

I'm pretty sure I didn't assume anything about the ordering and did it to handle race conditions

It's very possible that this is due to the fact that for a looong time the public JSON-RPC API of Substrate was delivering unordered messages. So, you probably got used to dealing with that anomaly. I also had custom logic back in the day to deal with that anomaly. I could, of course, bring those shenanigans back. However, I'm of the opinion that we have to aim for better standards.

josepot commented 10 months ago

though in Rust I think that is a possibility regardless

No, it isn't.

so it was always a safer bet

🤷

Messages over a WebSocket connection are always guaranteed to come ordered due to the fact that the WebSocket protocol works over a single TCP connection.

The statement "in Rust I think that is a possibility regardless, so it was always a safer bet" isn't really logical. Rust, along with all other Turing-complete programming languages, has the ability to create a FIFO data structure. This type of data structure is key to managing how data is handled, particularly in ensuring that messages are processed in the same sequence as they were received.
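
As a minimal sketch of that point (this is not Subxt or jsonrpsee code; an `mpsc` channel simply stands in for the single ordered WebSocket stream):

```rust
use std::collections::HashSet;
use std::sync::mpsc;
use std::thread;

// An mpsc channel stands in for the single, ordered WebSocket stream. Because
// one dispatch loop consumes the messages strictly in FIFO order, the response
// carrying the operation id is always processed before any event that refers
// to that id, so no "unknown operation" race can occur.
fn main() {
    let (tx, rx) = mpsc::channel::<&'static str>();

    // "Server": emits the call response first, then the follow event.
    thread::spawn(move || {
        tx.send("response operationId=op-1").unwrap();
        tx.send("followEvent operationCallDone operationId=op-1").unwrap();
    });

    // "Client": a single loop processing messages in arrival order.
    let mut known_operations: HashSet<&str> = HashSet::new();
    for msg in rx {
        if msg.starts_with("response") {
            known_operations.insert("op-1");
        } else {
            // FIFO processing guarantees the operation is already known here.
            assert!(known_operations.contains("op-1"));
            println!("event for known operation: {msg}");
        }
    }
}
```

The key property is that a single FIFO consumer can never observe an event before the response that introduced its operationId.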

You implemented a complex process on the consumer side, which is completely unnecessary. My understanding is that this decision was influenced by concerns about the producer side, which your team also developed. Specifically, it appears you were worried that the producer might not reliably maintain the correct order of messages. To compensate for this potential issue, you added extra, complicated logic to the consumer side as a precaution.

This approach might stem from a couple of misunderstandings or gaps in knowledge. First, it's possible that you weren't fully aware that WebSocket connections, which are built on TCP, inherently ensure message order. Alternatively, it could be due to a lack of familiarity with utilizing Rust's features to enforce message order on the producer side.

In essence, this situation reflects a problem where unnecessary complexity is introduced into a system due to a poorly implemented abstraction layer. This complexity could have been avoided with a better understanding of the underlying technologies or a more efficient use of the programming tools at your disposal.

It is blatantly obvious that this JSON-RPC spec is designed to work over a full-duplex connection capable of guaranteeing ordering and correctness of messages, and it is also blatantly obvious that there is no need to:

clarify that response from this comes back before any events on the follow subscription

Because the documentation of the event reads:

operationId is a string returned by chainHead_unstable_storage.

Ergo, the operationId must have been returned by the time that the event happens.

It doesn't say:

operationId is a string that will eventually arrive in response to the chainHead_unstable_storage call that triggered the event.

The spec clearly states that operationId is a string returned, as in: "it has already been returned".

niklasad1 commented 10 months ago

To add some color to the discussion it's "technically" possible that these chainHead_follow and chainHead_unstable_storage calls can be made on different connections and then we can't enforce such ordering.

However, I think the common use case is to use the API over a single connection, and then, yes, such ordering is possible.

Thus, it wouldn't hurt to specify such behavior in the spec because it's not clear to me at least...

tomaka commented 10 months ago

To add some color to the discussion it's "technically" possible that these chainHead_follow and chainHead_unstable_storage calls can be made on different connections

It would be wrong to do so. We are in a world with reverse proxies and load balancers and DNS caches, and it is wrong to assume that opening two connections to the same IP will actually connect to the same machine.

lexnv commented 10 months ago

We could specify in the spec something along these lines: _"When the method call is made from the same connection, it is guaranteed that the operationId will be delivered before any corresponding events are generated on the chainHead_follow subscription."_

It might be technically possible, although unlikely, for a developer to target a local node / test node by IP address and mistakenly create a different connection. I expect that the spec clarification should be enough to shed some light on this.

Do you guys think we should rule out the "different connection" edge case in the spec, specifying that users must reuse the same connection to make method calls that will be answered by chainHead?

jsdw commented 10 months ago

In essence, this situation reflects a problem where unnecessary complexity is introduced into a system due to a poorly implemented abstraction layer. This complexity could have been avoided with a better understanding of the underlying technologies or a more efficient use of the programming tools at your disposal.

Please don't jump to making a lot of assumptions about my understanding of things; it's not at all constructive and not something I will engage you on. I'm actually quite shocked that a simple request to add a little precision/clarity to the spec has devolved into this.

The reason that I said that it is possible for a race to occur in Rust has nothing to do with TCP or the nature of WebSockets. Rust is a multi-threaded language, so it's entirely possible that messages are being handled on one thread (ie because receiving messages from a subscription is spawned onto a tokio task), and then:

  1. chainHead_call request made, and we wait for the response.
  2. Operation ID message comes back on the websocket, followed right after by an operation event on the follow subscription.
  3. While this thread is starting the subscription for that operation ID, the second message containing some details for that operation is already being processed by the other thread.
  4. The subscription starts, but only after the other thread has processed the first message related to that operation, and so it is missed.

In JS, nothing can happen between handling operation ID and creating subscription because it's single threaded. In Rust, it can. We have to cater for these sorts of things, else we end up with rare but possible problems.
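
Here is a deliberately simplified, hypothetical sketch of that interleaving (std threads and a sleep standing in for async tasks; this is not Subxt's actual code), showing how the first event for an operation can be missed when notifications are drained on a different thread from the one that registered the operationId:

```rust
use std::collections::HashSet;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;
use std::time::Duration;

// The channel stands in for the chainHead_follow notifications; the sleep
// makes the unlucky interleaving essentially deterministic for the example.
fn main() {
    let registered: Arc<Mutex<HashSet<String>>> = Arc::new(Mutex::new(HashSet::new()));
    let (event_tx, event_rx) = mpsc::channel::<String>();

    // Background thread draining follow notifications.
    let known = Arc::clone(&registered);
    let drainer = thread::spawn(move || {
        for op_id in event_rx {
            if known.lock().unwrap().contains(&op_id) {
                println!("delivered event for {op_id}");
            } else {
                // A naive consumer has nowhere to route this yet, so it's lost.
                println!("missed event for {op_id}");
            }
        }
    });

    // The server emits the operationCallDone event right after the call response.
    event_tx.send("op-1".to_string()).unwrap();
    drop(event_tx);

    // Meanwhile the thread that made chainHead_call has the operationId from the
    // response, but only registers interest a moment later (e.g. after an await).
    thread::sleep(Duration::from_millis(50));
    registered.lock().unwrap().insert("op-1".to_string());
    // Avoiding this amounts to doing the registration (or buffering events for
    // unknown operations) before notifications for that operation can be processed.

    drainer.join().unwrap();
}
```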

you added extra, complicated logic to the consumer side as a precaution

Did you look at the logic that we wrote and decide that it was complicated, or did you just want to take a jab at it? Because in reality, all we needed to do to handle this was move one line of code before the other instead of after.

josepot commented 10 months ago

We could specify in the spec something along these lines: _"When the method call is made from the same connection, it is guaranteed that the operationId will be delivered before any corresponding events are generated on the chainHead_follow subscription."_

It might be technically possible, although unlikely, for a developer to target a local node / test node by IP address and mistakenly create a different connection. I expect that the spec clarification should be enough to shed some light on this.

Do you guys think we should rule out the "different connection" edge case in the spec, specifying that users must reuse the same connection to make method calls that will be answered by chainHead?

It is obvious that this spec is designed to work on top of full-duplex, event-driven and bidirectional connection capable of guaranteeing ordering and correctness of messages.

In other words: this spec cannot be implemented on top of a REST API, because REST APIs (unlike WebSocket APIs) are meant to be stateless and resource-based, while WebSocket connections are stateful and socket-based.

Sure, it is technically possible to build a hacky protocol that enables similar behaviour on top of an HTTP-based API by long-polling on one HTTP request while sending POST requests to another HTTP resource, with the purpose of emulating the properties of an event-driven and bidirectional connection. This is what some web applications had to do back in the day, before WebSockets were a thing. However, please notice that's just a hacky workaround for emulating the properties of a bidirectional, event-driven and stateful connection, meaning that HTTP connections were not designed to be used like that.

It's worth reiterating that WebSocket connections, unlike REST APIs, are stateful and not resource-based. Therefore, the state of a WebSocket connection should be assumed to be relevant only within the context of that connection.

With all this in mind, let's answer the following question: what is the correct behaviour when an opaque id that was returned in the context of WebSocket connection A is used in the context of WebSocket connection B?

In other words: since WebSocket connections are stateful, the opaque-ids that are returned are only relevant within the context of those connections.

There is no need for the spec to change anything, because that's how stateful connections work.

josepot commented 10 months ago

Please don't jump to making a lot of assumptions about my understanding of things;

I don't think that it's fair to say that I'm jumping to conclusions. You made some statements that led me to believe that your understanding of some important topics is not correct. In particular the following comments:

Even if we make sure that messages are sent in the right order, they may not be received in the right order.

When I implemented this stuff in Subxt, I'm pretty sure I didn't assume anything about the ordering and did it to handle race conditions (though in Rust I think that is a possibility regardless, so it was always a safer bet)

I am of the opinion that those statements reflect a gap in your understanding of some important topics. If you don't want to engage with me because I'm pointing out these things, well... I'm sorry to hear that's the case. However, please know that I think it's important that we are all on the same page when it comes to these things.

The reason that I said that it is possible for a race to occur in Rust has nothing to do with TCP or the nature of WebSockets. Rust is a multi-threaded language, so it's entirely possible that messages are being handled on one thread

I insist: the only reason why you have to account for race conditions in the context of the events of chainHead_call has nothing to do with Rust. It has to do with the way that Subxt has been architected, which, based on your comments, does not leverage the fact that messages are guaranteed to arrive in order. Also, based on your previous comments, it seems that the architectural choices of Subxt couldn't have taken this into account, because you were not aware of this fact.

I'm sure that you will agree with me that, in Rust, when messages are guaranteed to arrive in a certain order, it's entirely possible to architect things so that the code doesn't have to deal with impossible race conditions.

If your argument is that, in the context of Subxt, it is irrelevant whether the messages arrive ordered or not, because in the context of Subxt it is "a safer bet" not to architect the library based on those guarantees... well, then let's agree to disagree on that specific architectural choice. However, let's please at least agree on the fact that it's entirely feasible in Rust to leverage the correct ordering of messages, so that the rest of your code doesn't have to account for impossible race conditions.

I'm actually quite shocked that a simple request to add a little precision/clarity to the spec has devolved into this.

😅 Let me remind you of the context that led to this:

Believe me @jsdw , you are not the only one who is shocked in this discussion.

jsdw commented 10 months ago

James: Even if we make sure that messages are sent in the right order, they may not be received in the right order.

This was me erroneously thinking about separate connections; it had nothing to do with WS characteristics, and in my response I acknowledged that, given we use just one WS connection, this doesn't matter.

James: (re Subxt implementation) I'm pretty sure I didn't assume anything about the ordering and did it to handle race conditions.

This was not about websocket message ordering. It was about the order of getting a response from chainHead_call and operations coming back on chainHead_follow.

Josep: If you don't want to engage with me because I'm pointing out these things, well... I'm sorry to hear that's the case. However, please know that I think it's important that we are all on the same page when it comes to these things.

I always welcome having conversations that lead to learning and improving my knowledge. What I don't appreciate are somewhat inflammatory assumptions about my (or others) knowledge or ability.

Things like...

Josep: You implemented a complex process on the consumer side, which is completely unnecessary.

The "complex process" is just your assumption. In reality, it was one line of code being above, rather than below, another.

Josep: Specifically, it appears you were worried that the producer might not reliably maintain the correct order of messages...

An assumption. I have never been worried that the correct order of messages wouldn't be maintained, but just that operations could start being sent before chainHead_call sends a response, or a multi-threaded race where I process messages on one thread but subscribe on another. (Maybe both; I can't remember exactly my thoughts when I wrote the code).

Josep: ...To compensate for this potential issue, you added extra, complicated logic to the consumer side as a precaution.

To compensate for such potential issues, I put one line of code above, rather than below, another. It wasn't complicated at all, but you have assumed that it is.

Josep: In essence, this situation reflects a problem where unnecessary complexity is introduced into a system due to a poorly implemented abstraction layer.

Same again.

And then, have you looked at our code for this? Do you have any specific issues with our code? Or are you just assuming that it's "poorly implemented" (thanks for the vote of confidence)?

Josep: This complexity could have been avoided with a better understanding of the underlying technologies or a more efficient use of the programming tools at your disposal.

Same again. This is somewhat implying that I have a lot to learn. All of this said about some assumed complexity that doesn't exist.

Reminder: it was one line of code above, rather than below, another, to address any potential concerns that I had. Even if there was no issue at all, it is trivial and sensible to write the code in this way.

Josep: I insist: the only reason why you have to account for race conditions in the context of the events of chainHead_call has nothing to do with Rust. It has to do with the way that Subxt has been architected

Josep: If your argument is that, in the context of Subxt, it is irrelevant whether the messages arrive ordered or not, because in the context of Subxt it is "a safer bet" not to architect the library based on those guarantees... well, then let's agree to disagree on that specific architectural choice

Rust, being multi-threaded, exposes the possibility of a race that can't exist in JS. Of course things could be architected in a Rust app to prevent this possibility (ie do everything on one thread).

In my view, Subxt is architected quite well, and the architecture has to take into account a number of things. It is also multi-threaded, and allows users to use multiple threads (or tasks) to do things if they wish. It promises to work properly in these cases.

If you have any concrete thoughts on how to improve the Subxt code base and architecture, I'm all ears and we can discuss these things. But here, you are being overly dismissive of it, souring the discussion and ensuring that nobody benefits.


To go back to the beginning of this issue: I asked to make the spec more precise/explicit re the chainHead_call response being sent before corresponding messages come back on chainHead_follow. That is all; it didn't need to go down this path.

If the consensus is that this is already obvious enough then we can close this and I don't care to argue the point.

josepot commented 10 months ago

To go back to the beginning of this issue: I asked to make the spec more precise/explicit re the chainHead_call response being sent before corresponding messages come back on chainHead_follow. That is all; it didn't need to go down this path.

If the consensus is that this is already obvious enough then we can close this and I don't care to argue the point.

I personally think that the spec is already perfectly clear, given that:

1) The documentation of all the operation events states that:

> `operationId` is a string **returned** by...

So, IMO it's already clear that `operationId` was already "**returned**" before the event came.

It also states things like:

> The `operationCallDone` event indicates that **an operation started** with `chainHead_unstable_call` was successful.

Which IMO clearly indicates that the operation has already "**started**".

Also, if a clarification were needed, then IMO it wouldn't belong in the docs of `chainHead_call` (et al), but rather in the documentation of the events, because those are the ones that must come in the correct order, after their `operationId` has already been sent. In other words: it's easier to document it on the side of the events, because they are the ones that must get both the timing and the order right.

However, once again, I'm of the opinion that the spec is super clear and self-evident as it is.

2) The payload of the response of `chainHead_call` (et al) is `"result": "started"`, and IMO it's pretty obvious that it doesn't make sense for that to arrive after an event named `operationCallDone`.

That being said, as I already mentioned in my very first comment: if you want to make the spec "more precise/explicit" then, of course, go for it! Just open a PR and make it "more precise/explicit", for sure! I mean, I think that fixing the bug that I reported on Polkadot-SDK should be a higher priority, but... 🤷

So, in a nutshell: the fact that I completely disagree with this statement of yours:

The spec doesn't make it clear that we expect a specific order of events (eg, subscription ID comes back on call before any events come back on follow).

doesn't mean that I'm against improving the spec, of course not.

bkchr commented 9 months ago

@jsdw what is the status here? I assume that fixing this should be fairly simple?

niklasad1 commented 9 months ago

@bkchr

When https://github.com/paritytech/jsonrpsee/pull/1281 is merged we can fix it properly. Should be ready this week

bkchr commented 9 months ago

Ty for the update @niklasad1!