solana-labs / solana-web3.js

Solana JavaScript SDK
https://solana-labs.github.io/solana-web3.js
MIT License
2.21k stars 877 forks source link

[web3.js] Infinite recursion caused by _updateSubscriptions #1106

Closed Disperito closed 1 year ago

Disperito commented 2 years ago

Problem

I'm using the version 1.44.1 of @solana/web3.js, to develop an application with high-frequency calling. And the connection between network on my machine and Solana' Network seems to be not so stable: it's often to throw error about transaction confirmation timeout In this situation, it suddenly to fastly and infinitely throw errors like follow:

image

Here is the key log:

signatureSubscribe error for argument [
  '5R9WXhnHtz5umHHh9LpXXz9CSXjucTbrUPCJV2iQAkUgBX4jQ4UcQYRyta3LbBFZwuMGcKMaqCPiPkyuPT2JPKSr',
  { commitment: 'processed' }
] WebSocket is not open: readyState 0 (CONNECTING)

As you can see, the connection is frantically trying to build up subscriptions on TWO transaction signatures.

Then I found out, the problem was caused in function _updateSubscriptions() in connection.ts: image

Proposed Solution

Maybe you team should fix the logic of error handling. Here's my humble opinion: if there are some subsciptions failing for too many times(whatever the reason is), they should be directly removed from subscriptions list.

steveluscher commented 2 years ago

Thanks for this report!

Other than spamming the error console, does this interfere with the operation of your app?

Disperito commented 2 years ago

Thanks for this report!

Other than spamming the error console, does this interfere with the operation of your app?

Of course. It caused infinite recursion and made CPU usage soar to 100%, so that my app cannot run as normal. @steveluscher

gallynaut commented 2 years ago

I was also seeing this issue on v1.47.3 and reverted to v1.43.5 and it went away.

steveluscher commented 2 years ago

Totally, I get it, @Disperito. Unbounded recursion is super bad news, no matter which way you slice it. What I was trying to get at is that there are two kinds of infinite recursion.

One leads to the JavaScript thread literally locking, and freezing the browser:

while (true) {
  // Browser's dead.
}

The other results in code running as often as it can, and high CPU usage, but doesn't lock the thread; other code can continue to run in the period between yields.

while (true) {
  // Between the time this yields and resumes, other work can get done on the JS thread.
  await someAsyncWork();
}

In any case, we need a general solution for this.

steveluscher commented 2 years ago

Can I ask you a question? Can you count the number of subscriptions that you're making to the RPC in total? I've seen situations where RPCs limit the number of open subscriptions you may hold. When you exceed a certain number, they murder your connection abruptly. See https://github.com/solana-labs/solana/issues/25322#issuecomment-1138168590.

Disperito commented 2 years ago

@Disperito. Unbounded recursion is super bad news, no matter which way you slice it. What I was trying to get at is that there are two kinds of infinite recursion.

One leads to the JavaScript thread literally locking, and freezing the browser:

Sorry I have no definite answer. But I have reviewed the logs near when this error occurs, I think there were only less than 10 subscriptions at the same time, even less.

I don't think the large number of subscriptions is the root cause of the bug. It is true that a large number of subscriptions can lead to poor connnection with RPCs, but the point is: your library should recognize this situation and do special handling, such as throwing an exception, or actively unsubscribing

Jac0xb commented 2 years ago

Still getting this error with @solana/web3.js@1.56.2 @steveluscher. Spammed logs until server was killed.

image
signatureSubscribe error for argument [
'{SOME_SIGNATURE}',
{ commitment: 'confirmed' }
] socket not ready
steveluscher commented 2 years ago

@Jac0xb If you have a reliable repro of this, can you debug it a bit to see if this is the root cause? https://github.com/solana-labs/solana/issues/25578

Jac0xb commented 2 years ago

Not sure the repro steps but happens often with sendConfirmation and possibly log lived connection objects. Haven't seen it before. Happened in Node.js. Looks like there is an onSignature call in there

steveluscher commented 1 year ago

To anyone who's still experiencing this problem, can you try to upgrade to an experimental release, by running yarn add @solana/web3.js@pr-29195? That's a package built from PR solana-labs/solana#29195.

Let me know if that fixes anything.

steveluscher commented 1 year ago

@Jac0xb, @Disperito, @gallynaut do you have time to give this another try, if you still have a repro? See the message above.

grabbou commented 1 year ago

I am seeing this on latest 1.73.2. Anything that I can do to help resolve this? It's random and happens frequently after some transactions are made

signatureSubscribe error for argument ["...", {"commitment": "confirmed"}] INVALID_STATE_ERR

Update: after some digging, the error is caused on this line: https://github.com/solana-labs/solana-web3.js/blob/67f3bf855325daec91489763471cdda535b29e25/packages/library-legacy/src/connection.ts#L5990-L5992

There's definitely something strange going on here, because upon debugging uncaught exceptions, this._rpcWebSocket.socket has readyState: 0, but that somehow doesn't reject the promise on a conditional in _call.

Looks like this:

All of this can lead to corruption of the internal state of the RPC; it can think that it's connected, able to send messages, and is owed responses, when it is not.

from the PR description might still be happening.

grabbou commented 1 year ago

I have left some additional comments on the PR https://github.com/solana-labs/solana/pull/29195 to further continue discussion. I hope with my unfortunate stable reproduction we can get this fixed.

grabbou commented 1 year ago

Update: this might be false positive, but after reading this issue again, one thing from OP caught my immediate attention. We both execute a series of transactions, one after another. In my case, I have an array and for each item in an array, I execute 3 transactions. I will eventually refactor to a single one (and batch those calls), but I think that could be highly related.

I added a setTimeout of 15 seconds between each call (just to stay on the safe side) and my script executed successfully. Are we dealing with a race condition here?

Update: unfortunately, it errored again, but I have a feeling the error happens less frequently.

grabbou commented 1 year ago

Update: I have managed to find a fix and posted in the PR https://github.com/solana-labs/solana/pull/29195#issuecomment-1427600481. Hope we can land this quick!

dfl-zeke commented 1 year ago

I'm having the same issue. Application is getting rekt by subscriptions.

steveluscher commented 1 year ago

@dfl-zeke, do you have a repro that you can share? I'm happy to check out a repository and reproduce the problem over here.

dfl-zeke commented 1 year ago

@dfl-zeke, do you have a repro that you can share? I'm happy to check out a repository and reproduce the problem over here.

I don't have it, but I can suggest how you can reproduce it. Subscribe to 5K random accounts via Websocket and you will get this error surely. I was testing how much subscriptions I could handle but was unable to test, as sometimes it starts throwing errors from 2K subs, sometimes from 14K so it's kind of random. I've even added sleeps between chunk subs and it did not help at all, same error. If you try to subscribe all of them at the same time, it will start throwing this error immediately. I even tried to have multiple socket connections, but still got this error.

Also, while I'm here and you might see this, I would suggest to add retry limits on websocket (because now it always retry to reconnect in case of fail) and ping turn on/off, it would made this package much more comfortable to work with.

steveluscher commented 1 year ago

Subscribe to 5K random accounts via Websocket and you will get this error surely.

I see! Can you drum me up a quick repro of exactly that and post it on GitHub, @dfl-zeke? Also, let me know exactly which RPC URL you were able to reproduce the problem with because, as noted here, different RPCs have different subscription limits, and take different actions when you overflow that limit. I want to avoid a situation where I try to stand up a repro, can't trigger the bug, and have to come back to you for more information[^1].

[^1]: Believe me, I've tried to create a repro before, to no avail!

dfl-zeke commented 1 year ago

Subscribe to 5K random accounts via Websocket and you will get this error surely.

I see! Can you drum me up a quick repro of exactly that and post it on GitHub, @dfl-zeke? Also, let me know exactly which RPC URL you were able to reproduce the problem with because, as noted here, different RPCs have different subscription limits, and take different actions when you overflow that limit. I want to avoid a situation where I try to stand up a repro, can't trigger the bug, and have to come back to you for more information1.

Footnotes

  1. Believe me, I've tried to create a repro before, to no avail!

Will try to create one. I'm using rpcpool endpoint, so I can't share it with u publicly at least. Also, if I get limited, I should get 429 (which I got sometimes to be honest). I'll share quicknode one with free tier subscription. (I have same problem with them too).

dfl-zeke commented 1 year ago

Subscribe to 5K random accounts via Websocket and you will get this error surely.

I see! Can you drum me up a quick repro of exactly that and post it on GitHub, @dfl-zeke? Also, let me know exactly which RPC URL you were able to reproduce the problem with because, as noted here, different RPCs have different subscription limits, and take different actions when you overflow that limit. I want to avoid a situation where I try to stand up a repro, can't trigger the bug, and have to come back to you for more information1.

Footnotes

  1. Believe me, I've tried to create a repro before, to no avail!

Here you go: https://github.com/dfl-zeke/solana-web3js-subs . If you have any questions, ping me.

steveluscher commented 1 year ago

How long do you have to leave this running before you see the problem, @dfl-zeke? I'm at index 8300 and haven't seen it yet.

steveluscher commented 1 year ago

Nevermind. There it is.

steveluscher commented 1 year ago

Here's what I discovered about your repro:

First, you're subscribing to the same RPC node 100 times, and fanning out subscriptions to each one of those connections. RPCs really hate when you do this, and will start to limit you and/or drop your connection (with a 1006 ‘abnormal closure’ code) real fast. When I limited this to one RPC node, or even 10, I was able to make all ~20K subscriptions with no problem.

Now, let's talk about the actual problem that results when the RPC does unceremoniously drop you. This code is correctly thowing an exception that results when a message heads for a socket that's actually in a bad (non OPEN) state:

https://github.com/solana-labs/solana-web3.js/blob/33429ca4116930d3c2c0704e99867ccd05ac932d/packages/library-legacy/src/rpc-websocket.ts#L49-L60

And this code is catching it in the subscriptions state machine:

https://github.com/solana-labs/solana-web3.js/blob/33429ca4116930d3c2c0704e99867ccd05ac932d/packages/library-legacy/src/connection.ts#L6001-L6018

Now, when the socket goes down like this, two things happen: its readyState changes, and its close event fires. Problem is, sometimes you'll detect one before the other.

  1. If the close event fires before we detect a change in the readyState, we're good. The onClose handler will increment the _rpcWebsocketGeneration and the state machine will abandon processing.
  2. If the readyState is detected not to be OPEN before the close event fires, we're in trouble. The exception will be caught, but the state machine will immediately try to set the subscription up again. The fact that it tries to set the subscription back up is not, in itself, the problem. It's that it tries to set it up synchronously. The synchronous nature of the loop blocks the onClose handler from being able to work. Because the onClose handler never fires, the state machine never becomes aware that the socket is dead. That's the problem.

Some possible solutions:

  1. Make the state machine yield the main thread on exceptions. This breathing room will allow asynchronous events from the socket to be handled, righting the internal state of the state machine. Not great. Adds asynchronicity to an already complex system.
  2. Enable the state machine to infer, from the error, that the socket has died, and reset its own socket-related state.

I have a few half-baked implementations of this on the go. Give me some time to find the best one.

steveluscher commented 1 year ago

I'm pretty sure that https://github.com/elpheria/rpc-websockets/pull/138 fixes the source of corruption of the subscriptions state machine here.

dfl-zeke commented 1 year ago

@steveluscher I understand, that 20K subscription is not a normal thing to have, but I get it on 500 subscriptions sometimes and it's annoying. If I have too many subs, then okay, I can understand that host can limit me and give me error on the call, but do not crash the program.

steveluscher commented 1 year ago

Yeah, I doubt it has anything to do with the number of subscriptions. It probably has everything to do with which box the load balancer shunts you to, and how many open subscriptions it currently has.

grabbou commented 1 year ago

To me, it happens pretty much all the time I try to wait for a transaction results. I think it has to do with React Native and the way it handles web sockets. I still didn't have time to provide a reproduction because for me, the workaround I pasted here temporarily fixes the problem (but introduces other side effects, so not really production ready lol).

I will try to provide a reproduction soon, but maybe that helps too. I see you also posted another PR, so I will wait till its shipped to double check.

Meanwhile, @steveluscher, how do you feel about https://github.com/solana-labs/solana/pull/29195 in general? If that didn't fix the original error, do you still consider it a valid improvement or you'd say it's something that could be reverted? I left some comments here: https://github.com/solana-labs/solana/pull/29195#issuecomment-1427551590

github-actions[bot] commented 1 year ago

:tada: This issue has been resolved in version 1.73.3 :tada:

The release is available on:

Your semantic-release bot :package::rocket:

steveluscher commented 1 year ago

Try @solana/web3.js@1.73.3.

steveluscher commented 1 year ago

@dfl-zeke, I upgraded your repro repo to 1.73.3 and can no longer kill it.

github-actions[bot] commented 1 year ago

Because there has been no activity on this issue for 7 days since it was closed, it has been automatically locked. Please open a new issue if it requires a follow up.