paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

rpc server: Subscription buffer limit exceeded #4184

Open stakeworld opened 5 months ago

stakeworld commented 5 months ago

I'm not sure if it is a bug or intended behaviour; sometimes I'm getting Subscription buffer limit=16 exceeded for subscription=state_storage conn_id=(...); dropping subscription on my public RPC server nodes. I cannot find where the limit=16 comes from. I can find some limits in:

https://github.com/paritytech/jsonrpsee/blob/master/client/ws-client/src/lib.rs

max_concurrent_requests: 256, max_buffer_capacity_per_subscription: 1024,

or https://github.com/paritytech/jsonrpsee/blob/master/client/http-client/src/client.rs

max_concurrent_requests: 256,

where I notice that the http-client version does not have a max_buffer_capacity_per_subscription element, and there is no limit of 16 anywhere.
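For reference, those client-side defaults can be overridden when building the client. A minimal sketch, assuming a recent jsonrpsee release in which WsClientBuilder exposes these knobs (method names may differ between versions):

```rust
// Sketch only: assumes a recent jsonrpsee release; method names may differ.
use jsonrpsee::ws_client::WsClientBuilder;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The client-side defaults quoted above. Raising
    // max_buffer_capacity_per_subscription only lets the *client* hold more
    // unread notifications; it does not change any server-side limit.
    let client = WsClientBuilder::default()
        .max_concurrent_requests(256)
        .max_buffer_capacity_per_subscription(1024)
        .build("ws://127.0.0.1:9944")
        .await?;

    let _ = client; // subscribe to state_storage / chain_newHead here
    Ok(())
}
```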

Does anyone know if I'm missing a setting somewhere, or whether the error is a problem at all?

niklasad1 commented 5 months ago

It comes from https://github.com/paritytech/polkadot-sdk/blob/master/substrate/client/rpc/src/utils.rs#L130. It means that the client couldn't keep up with the server, and the server only allows 16 messages to be stored in memory per subscription (this may be too small).
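Roughly, the pattern looks like this. This is a minimal sketch of the behaviour described above, not the actual utils.rs code, and it assumes tokio:

```rust
// Sketch only: each subscription gets a small bounded buffer, and when a
// slow client lets it fill up the subscription is dropped instead of
// letting memory grow.
use tokio::sync::mpsc;

const BUFFER_LIMIT: usize = 16; // the hardcoded per-subscription limit

async fn pump_subscription(mut items: mpsc::Receiver<String>) {
    let (to_client, _client_rx) = mpsc::channel::<String>(BUFFER_LIMIT);

    while let Some(item) = items.recv().await {
        // try_send fails once BUFFER_LIMIT messages are pending, i.e. the
        // client is not reading fast enough.
        if to_client.try_send(item).is_err() {
            eprintln!(
                "Subscription buffer limit={} exceeded; dropping subscription",
                BUFFER_LIMIT
            );
            return; // drop the subscription, as in the log message above
        }
    }
}
```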

It could be a really slow client, a busy connection, or something similar that causes it. You could increase the server buffer with --rpc-message-buffer-capacity-per-connection <VAL> if this happens very often, but it may increase the memory usage of your server. It's essentially a way to bound the number of items each subscription/connection is allowed to keep in memory.

I don't think this is an issue so much as a slow client, but it would be interesting to know whether this happens often... or whether it's a leaked subscription or something in some client library...

stakeworld commented 5 months ago

@niklasad1, thanks for the explanation! It happens around 1-2 times per day on different nodes (they are load-balanced Polkadot public RPC endpoints). Resources (memory/network etc.) are sufficient. It could of course be a slow client. I will try increasing the buffer capacity to see if it helps.

stakeworld commented 5 months ago

@niklasad1, I put --rpc-message-buffer-capacity-per-connection 32 in the startup options, but I'm still getting Subscription buffer limit=16 exceeded for subscription=chain_newHead conn_id=(...); dropping subscription errors... Do you have any pointers?

niklasad1 commented 5 months ago

You need to increase it to something bigger, because 64 is the default for --rpc-message-buffer-capacity-per-connection (so 32 actually lowered it).

I suggest starting with 128 :)

stakeworld commented 5 months ago

You need to increase it to something bigger, because 64 is the default for --rpc-message-buffer-capacity-per-connection (so 32 actually lowered it).

I got an increase in errors, so that makes sense ;)

Increasing to 128.

stakeworld commented 4 months ago

@niklasad1: With 128 and then 512 I haven't seen the error come back, so it seems like a more stable value, at least for a busy RPC node. Thanks for the pointers.

stakeworld commented 4 months ago

Hi @niklasad1, I've been monitoring for a while now and even bumped the buffer capacity to 512 (--rpc-message-buffer-capacity-per-connection 512), but the buffer limit errors keep coming (and the limit in the error stays at 16). Do you maybe have some insight into where to look?

2024-05-17 07:30:06 Subscription buffer limit=16 exceeded for subscription=state_storage conn_id=905687; dropping subscription

niklasad1 commented 4 months ago

Yes, it's still hardcoded to a maximum of 16 per subscription. You are increasing the global message buffer on the server with --rpc-message-buffer-capacity-per-connection, so it allows more buffered messages per connection overall, but if you get many subscriptions that are slow, that global buffer may not help.
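To make the distinction clearer, these are two separate limits. Illustrative sketch only; the struct and field names are hypothetical, not actual polkadot-sdk types:

```rust
// Hypothetical illustration of the two limits discussed in this thread.
struct RpcBufferLimits {
    /// Hardcoded in the server code; one slow subscription trips this
    /// regardless of any CLI flag.
    per_subscription: usize, // 16
    /// Set with --rpc-message-buffer-capacity-per-connection; shared by
    /// all subscriptions on one connection (default 64).
    per_connection: usize,
}

fn main() {
    let limits = RpcBufferLimits { per_subscription: 16, per_connection: 512 };
    // Raising per_connection does not change per_subscription, which is why
    // the error still reports limit=16 even with the flag set to 512.
    println!(
        "per-subscription={} per-connection={}",
        limits.per_subscription, limits.per_connection
    );
}
```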

Lemme write some Prometheus metrics for this so we can monitor it better; perhaps the hardcoded limit is simply wrong.
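Something along these lines. This is a sketch using the prometheus crate; the metric name and where it would be registered in polkadot-sdk are assumptions:

```rust
// Sketch only: count how often a subscription is dropped because its
// per-subscription buffer limit was exceeded.
use prometheus::{IntCounter, Registry};

fn register_dropped_subscriptions(registry: &Registry) -> prometheus::Result<IntCounter> {
    let counter = IntCounter::new(
        "rpc_subscriptions_dropped_total", // hypothetical metric name
        "Subscriptions dropped because the per-subscription buffer overflowed",
    )?;
    registry.register(Box::new(counter.clone()))?;
    Ok(counter)
}

// At the drop site (see the earlier sketch): counter.inc();
```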

I still think it's just clients that are too slow...

stakeworld commented 4 months ago

Lemme write some Prometheus metrics for this so we can monitor it better; perhaps the hardcoded limit is simply wrong.

It could be that it's fine for a normal node, but maybe an active RPC node needs something higher?

I still think it's just clients that are too slow...

If the conclusion is that it's just a slow-client problem and doesn't have a negative influence on RPC functioning, then I will just filter them out.

Let me know if I can help by providing some Prometheus or log metrics from an active node.