Thanks for this report!
Other than spamming the error console, does this interfere with the operation of your app?
Of course. It caused infinite recursion and made CPU usage soar to 100%, so my app cannot run normally. @steveluscher
I was also seeing this issue on v1.47.3 and reverted to v1.43.5 and it went away.
Totally, I get it, @Disperito. Unbounded recursion is super bad news, no matter which way you slice it. What I was trying to get at is that there are two kinds of infinite recursion.
One leads to the JavaScript thread literally locking, and freezing the browser:
```js
while (true) {
  // Browser's dead.
}
```
The other results in code running as often as it can, and high CPU usage, but doesn't lock the thread; other code can continue to run in the period between yields.
```js
while (true) {
  // Between the time this yields and resumes, other work can get done on the JS thread.
  await someAsyncWork();
}
```
In any case, we need a general solution for this.
Can I ask you a question? Can you count the number of subscriptions that you're making to the RPC in total? I've seen situations where RPCs limit the number of open subscriptions you may hold. When you exceed a certain number, they murder your connection abruptly. See https://github.com/solana-labs/solana/issues/25322#issuecomment-1138168590.
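For reference, one rough way to count them (a hypothetical sketch built on the public `onSignature` API; the wrapper and counter are not part of the library):

```ts
import { Connection, TransactionSignature } from '@solana/web3.js';

// Hypothetical helper (not part of the library): wrap Connection#onSignature
// and keep a running count of open signature subscriptions.
let openSignatureSubscriptions = 0;

function onSignatureCounted(
  connection: Connection,
  signature: TransactionSignature,
  callback: Parameters<Connection['onSignature']>[1],
): number {
  openSignatureSubscriptions += 1;
  console.log('open signature subscriptions:', openSignatureSubscriptions);
  return connection.onSignature(signature, (result, context) => {
    // signatureSubscribe notifications fire once and the server then drops
    // the subscription, so decrement when the callback runs.
    openSignatureSubscriptions -= 1;
    callback(result, context);
  });
}
```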
Sorry, I have no definite answer. But I have reviewed the logs from around the time this error occurs, and I think there were fewer than 10 subscriptions open at the same time, perhaps even fewer.
I don't think a large number of subscriptions is the root cause of the bug. It is true that a large number of subscriptions can lead to a poor connection with RPCs, but the point is: your library should recognize this situation and handle it specially, such as by throwing an exception or actively unsubscribing.
Still getting this error with `@solana/web3.js@1.56.2`, @steveluscher. It spammed logs until the server was killed:
```
signatureSubscribe error for argument [
  '{SOME_SIGNATURE}',
  { commitment: 'confirmed' }
] socket not ready
```
@Jac0xb If you have a reliable repro of this, can you debug it a bit to see if this is the root cause? https://github.com/solana-labs/solana/issues/25578
Not sure of the repro steps, but it happens often with `sendConfirmation` and possibly long-lived connection objects. Haven't seen it before. Happened in Node.js. Looks like there is an `onSignature` call in there.
To anyone who's still experiencing this problem, can you try to upgrade to an experimental release by running `yarn add @solana/web3.js@pr-29195`? That's a package built from PR solana-labs/solana#29195.
Let me know if that fixes anything.
@Jac0xb, @Disperito, @gallynaut do you have time to give this another try, if you still have a repro? See the message above.
I am seeing this on the latest 1.73.2. Anything that I can do to help resolve this? It's random and happens frequently after some transactions are made:
```
signatureSubscribe error for argument ["...", {"commitment": "confirmed"}] INVALID_STATE_ERR
```
Update: after some digging, the error is raised on these lines: https://github.com/solana-labs/solana-web3.js/blob/67f3bf855325daec91489763471cdda535b29e25/packages/library-legacy/src/connection.ts#L5990-L5992

There's definitely something strange going on here, because upon debugging uncaught exceptions, `this._rpcWebSocket.socket` has `readyState: 0`, but that somehow doesn't reject the promise on a conditional in `_call`.
Looks like this passage from the PR description might still be happening:

> All of this can lead to corruption of the internal state of the RPC; it can think that it's connected, able to send messages, and is owed responses, when it is not.
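To make the shape of the problem concrete, this is the kind of guard you'd expect to reject the promise when the socket isn't open (a purely hypothetical sketch, not the library's or rpc-websockets' actual `_call` code):

```ts
// Hypothetical sketch of the kind of check being discussed; not actual library
// code. WebSocket readyState values: 0 = CONNECTING, 1 = OPEN, 2 = CLOSING,
// 3 = CLOSED.
function assertSocketReady(socket: { readyState: number }): void {
  if (socket.readyState !== 1 /* OPEN */) {
    // Rejecting here, instead of attempting to send anyway, would surface the
    // INVALID_STATE_ERR / "socket not ready" condition to the caller.
    throw new Error(`socket not ready (readyState: ${socket.readyState})`);
  }
}
```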
I have left some additional comments on the PR https://github.com/solana-labs/solana/pull/29195 to continue the discussion. I hope that with my (unfortunately) stable reproduction we can get this fixed.
Update: this might be a false positive, but after reading this issue again, one thing from the OP immediately caught my attention. We both execute a series of transactions, one after another. In my case, I have an array and, for each item in the array, I execute 3 transactions. I will eventually refactor this into a single transaction (and batch those calls), but I think that could be highly related.
I added a setTimeout of 15 seconds between each call (just to stay on the safe side) and my script executed successfully. Are we dealing with a race condition here?
Update: unfortunately, it errored again, but I have a feeling the error happens less frequently.
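For context, the delay workaround described above amounts to something like this (a sketch; `runTransaction` and the item type are placeholders, not the reporter's actual code):

```ts
// Sketch of the 15-second spacing workaround described above.
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function processItems(
  items: unknown[],
  runTransaction: (item: unknown) => Promise<void>,
) {
  for (const item of items) {
    await runTransaction(item);
    // Pause between items to give the connection time to settle.
    await sleep(15_000);
  }
}
```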
Update: I have managed to find a fix and posted it in the PR https://github.com/solana-labs/solana/pull/29195#issuecomment-1427600481. Hope we can land this quickly!
I'm having the same issue. Application is getting rekt by subscriptions.
@dfl-zeke, do you have a repro that you can share? I'm happy to check out a repository and reproduce the problem over here.
I don't have one, but I can suggest how you can reproduce it: subscribe to 5K random accounts via WebSocket and you will surely get this error (a rough sketch follows below). I was testing how many subscriptions I could handle but couldn't finish the test, because sometimes it starts throwing errors at 2K subs and sometimes at 14K, so it's kind of random. I've even added sleeps between chunks of subscriptions and it did not help at all; same error. If you try to subscribe to all of them at the same time, it starts throwing this error immediately. I even tried using multiple socket connections, but still got this error.
Also, while I'm here and you might see this, I would suggest adding retry limits on the WebSocket (because right now it always retries reconnecting on failure) and a way to turn ping on/off; it would make this package much more comfortable to work with.
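A repro along the lines described above might look roughly like this (a sketch, not the actual repro repository shared further down; the RPC URL is a placeholder):

```ts
import { Connection, Keypair } from '@solana/web3.js';

// Rough sketch of the repro described above: subscribe to thousands of random
// accounts over a single WebSocket connection and watch for
// "socket not ready" errors. The RPC URL is a placeholder.
const connection = new Connection('https://YOUR_RPC_ENDPOINT', 'confirmed');

function main() {
  for (let i = 0; i < 5_000; i++) {
    const randomAccount = Keypair.generate().publicKey;
    connection.onAccountChange(randomAccount, () => {
      // Random accounts will likely never change; the point is only to hold
      // many open subscriptions at once.
    });
    if (i % 100 === 0) {
      console.log(`subscribed to ${i} accounts`);
    }
  }
}

main();
```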
> Subscribe to 5K random accounts via Websocket and you will get this error surely.
I see! Can you drum me up a quick repro of exactly that and post it on GitHub, @dfl-zeke? Also, let me know exactly which RPC URL you were able to reproduce the problem with because, as noted here, different RPCs have different subscription limits, and take different actions when you overflow that limit. I want to avoid a situation where I try to stand up a repro, can't trigger the bug, and have to come back to you for more information[^1].
[^1]: Believe me, I've tried to create a repro before, to no avail!
Will try to create one. I'm using an rpcpool endpoint, so I can't share it with you publicly, at least. Also, if I'm getting rate limited, I should get a 429 (which I did get sometimes, to be honest). I'll share a QuickNode one with a free-tier subscription (I have the same problem with them too).
Here you go: https://github.com/dfl-zeke/solana-web3js-subs. If you have any questions, ping me.
How long do you have to leave this running before you see the problem, @dfl-zeke? I'm at index 8300 and haven't seen it yet.
Nevermind. There it is.
Here's what I discovered about your repro:
First, you're subscribing to the same RPC node 100 times, and fanning out subscriptions to each one of those connections. RPCs really hate when you do this, and will start to limit you and/or drop your connection (with a `1006` ‘abnormal closure’ code) real fast. When I limited this to one RPC node, or even 10, I was able to make all ~20K subscriptions with no problem.
Now, let's talk about the actual problem that results when the RPC does unceremoniously drop you. This code correctly throws an exception when a message heads for a socket that's actually in a bad (non-`OPEN`) state:
And this code is catching it in the subscriptions state machine:
Now, when the socket goes down like this, two things happen: its `readyState` changes, and its `close` event fires. Problem is, sometimes you'll detect one before the other.

- If the `close` event fires before we detect a change in the `readyState`, we're good. The `onClose` handler will increment the `_rpcWebsocketGeneration` and the state machine will abandon processing.
- If the `readyState` is detected not to be `OPEN` before the `close` event fires, we're in trouble. The exception will be caught, but the state machine will immediately try to set the subscription up again. The fact that it tries to set the subscription back up is not, in itself, the problem. It's that it tries to set it up synchronously. The synchronous nature of the loop blocks the `onClose` handler from being able to work. Because the `onClose` handler never fires, the state machine never becomes aware that the socket is dead. That's the problem.

As for possible solutions: I have a few half-baked implementations of this on the go. Give me some time to find the best one.
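In other words, the fix direction is to stop retrying synchronously. A minimal sketch of the idea (not the actual patch; the names and generation-check callback are made up):

```ts
// Sketch of the general idea, not the actual patch: before retrying a failed
// subscription, yield to the event loop so that a pending 'close' event can
// fire and mark the connection generation as stale.
async function retrySubscriptionSetup(
  resubscribe: () => Promise<void>,
  isConnectionStale: () => boolean, // e.g. compares a captured generation counter
): Promise<void> {
  // Yielding here lets the onClose handler run first if the socket just died.
  await new Promise<void>(resolve => setTimeout(resolve, 0));
  if (isConnectionStale()) {
    // The socket is gone; a fresh connection will re-establish subscriptions.
    return;
  }
  await resubscribe();
}
```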
I'm pretty sure that https://github.com/elpheria/rpc-websockets/pull/138 fixes the source of corruption of the subscriptions state machine here.
@steveluscher I understand that 20K subscriptions is not a normal thing to have, but I sometimes get this with 500 subscriptions and it's annoying. If I have too many subs, then okay, I can understand that the host can limit me and return an error on the call, but it should not crash the program.
Yeah, I doubt it has anything to do with the number of subscriptions. It probably has everything to do with which box the load balancer shunts you to, and how many open subscriptions it currently has.
To me, it happens pretty much every time I try to wait for transaction results. I think it has to do with React Native and the way it handles WebSockets. I still haven't had time to provide a reproduction because, for me, the workaround I pasted here temporarily fixes the problem (but introduces other side effects, so it's not really production-ready, lol).
I will try to provide a reproduction soon, but maybe that helps too. I see you also posted another PR, so I will wait till it's shipped to double-check.
Meanwhile, @steveluscher, how do you feel about https://github.com/solana-labs/solana/pull/29195 in general? If that didn't fix the original error, do you still consider it a valid improvement, or would you say it's something that could be reverted? I left some comments here: https://github.com/solana-labs/solana/pull/29195#issuecomment-1427551590
:tada: This issue has been resolved in version 1.73.3 :tada:
The release is available on:
Your semantic-release bot :package::rocket:
Try `@solana/web3.js@1.73.3`.
@dfl-zeke, I upgraded your repro repo to 1.73.3 and can no longer kill it.
Because there has been no activity on this issue for 7 days since it was closed, it has been automatically locked. Please open a new issue if it requires a follow up.
Problem
I'm using version 1.44.1 of @solana/web3.js to develop an application that makes high-frequency calls. The network connection between my machine and the Solana network doesn't seem very stable: it often throws errors about transaction confirmation timeouts. In this situation, it suddenly starts throwing errors rapidly and endlessly, like the following:
Here is the key log:
As you can see, the connection is frantically trying to build up subscriptions on TWO transaction signatures.
Then I found out that the problem is caused in the function `_updateSubscriptions()` in connection.ts:
Proposed Solution
Maybe your team should fix the error-handling logic. Here's my humble opinion: if some subscriptions fail too many times (whatever the reason), they should be removed from the subscriptions list directly.
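A retry cap along the lines the reporter suggests could be sketched like this (a hypothetical shape; the types and field names are made up and do not mirror connection.ts internals):

```ts
// Hypothetical sketch of the suggested retry cap; not actual library code.
type TrackedSubscription = {
  setUp: () => Promise<void>;
  failureCount: number;
};

const MAX_SUBSCRIPTION_FAILURES = 5;

async function updateSubscription(
  id: number,
  subscriptions: Map<number, TrackedSubscription>,
): Promise<void> {
  const subscription = subscriptions.get(id);
  if (!subscription) return;
  try {
    await subscription.setUp();
    subscription.failureCount = 0;
  } catch (error) {
    subscription.failureCount += 1;
    if (subscription.failureCount >= MAX_SUBSCRIPTION_FAILURES) {
      // Give up rather than retrying forever: drop the subscription and
      // surface the error instead of spamming the console.
      subscriptions.delete(id);
      throw error;
    }
  }
}
```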