graphprotocol / graph-node

Graph Node indexes data from blockchains such as Ethereum and serves it over GraphQL
https://thegraph.com
Apache License 2.0

Firehose connections hang when more than 100 subgraphs are deployed #3879

Closed leoyvens closed 1 year ago

leoyvens commented 2 years ago

See this commit for a reproduction example. It exposes a few causes that seem to be conspiring:

  1. The log below can be obtained by running the example with the h2 tracing level set to DEBUG. It shows that the Firehose endpoint sets the HTTP/2 setting max_concurrent_streams to 100, so any additional streams are rejected with REFUSED_STREAM. This caps the number of block streams at 100 on a single connection.

    DEBUG Connection{peer=Client}: h2::codec::framed_read: received frame=Settings { flags: (0x0), max_concurrent_streams: 100, initial_window_size: 1048576, max_header_list_size: 65536 }
  2. This setting is per connection, and we were supposed to be using connection pooling, but the example uses a single connection no matter what conn_pool_size is set to. So the pooling implementation must either not be working or not be balancing based on the number of streams (least-loaded, round-robin or random assignment would all work).

  3. Tonic does not seem to retry establishing the stream after receiving REFUSED_STREAM. On retry it logs the lines below and then goes silent. This might be related to https://github.com/hyperium/tonic/issues/515.

    2022-08-26T12:18:13.083821Z TRACE tonic::transport::service::reconnect: poll_ready; connected
    2022-08-26T12:18:13.083864Z TRACE tonic::transport::service::reconnect: poll_ready; not ready
    2022-08-26T12:18:13.083905Z TRACE hyper::client::dispatch: send_when canceled
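To make cause 1 concrete, here is a toy model of a server-enforced max_concurrent_streams cap of 100 on a single connection. The `Connection`/`open_stream` names are illustrative, not h2's actual API; the point is only that every stream past the 100th on one connection gets refused.

```rust
// Hypothetical model of the cap described above: the server refuses
// any stream beyond max_concurrent_streams on a single connection.
const MAX_CONCURRENT_STREAMS: usize = 100;

#[derive(Debug, PartialEq)]
enum StreamResult {
    Opened,
    RefusedStream, // corresponds to h2's REFUSED_STREAM error code
}

struct Connection {
    active_streams: usize,
}

impl Connection {
    fn open_stream(&mut self) -> StreamResult {
        if self.active_streams >= MAX_CONCURRENT_STREAMS {
            StreamResult::RefusedStream
        } else {
            self.active_streams += 1;
            StreamResult::Opened
        }
    }
}

fn main() {
    let mut conn = Connection { active_streams: 0 };
    // Try to start 150 block streams on one connection, as graph-node
    // effectively does when all subgraphs share a single channel.
    let refused = (0..150)
        .filter(|_| conn.open_stream() == StreamResult::RefusedStream)
        .count();
    println!("opened: {}, refused: {}", conn.active_streams, refused);
    // opened: 100, refused: 50
}
```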

On the potential fix: If Firehose could reliably set max_concurrent_streams to a high value, that would be great. But afaik Firehose itself does not set a limit, so proxies with draconian defaults must be to blame (nginx, for example, defaults to 128). Since we don't want to add configuration pitfalls for operators or require specific proxies, we cannot rely on this being set higher than 100, which is the minimum recommended by the HTTP/2 RFC.

The tonic bug seems related but even if tonic retried, the stream would probably be refused again.

So the most reliable fix would be to get connection pooling working with an algorithm that seeks to balance the number of streams per connection.
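A least-loaded balancing pass over a pool could look like the sketch below. This is not graph-node's implementation; `Pool` and `least_loaded` are hypothetical names, and the pool only tracks stream counts to show how new streams spread out so no single connection approaches the 100-stream cap.

```rust
// Sketch of the proposed fix: for each new stream, pick the pooled
// connection that currently carries the fewest streams.
struct Pool {
    // active stream count per pooled connection
    streams_per_conn: Vec<usize>,
}

impl Pool {
    fn new(conn_pool_size: usize) -> Self {
        Pool { streams_per_conn: vec![0; conn_pool_size] }
    }

    /// Returns the index of the least-loaded connection and
    /// records the new stream on it.
    fn least_loaded(&mut self) -> usize {
        let idx = self
            .streams_per_conn
            .iter()
            .enumerate()
            .min_by_key(|&(_, &count)| count)
            .map(|(idx, _)| idx)
            .expect("pool is non-empty");
        self.streams_per_conn[idx] += 1;
        idx
    }
}

fn main() {
    // 250 block streams over a pool of 3 connections stay well under
    // the 100-stream cap on every connection.
    let mut pool = Pool::new(3);
    for _ in 0..250 {
        pool.least_loaded();
    }
    assert!(pool.streams_per_conn.iter().all(|&c| c <= 100));
    println!("{:?}", pool.streams_per_conn); // [84, 83, 83]
}
```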

leoyvens commented 2 years ago

Mystery solved on why connection pooling wasn't working: tonic uses the URI as the key, so repeating the same endpoint doesn't actually create additional connections.

maoueh commented 2 years ago

That is super helpful, great investigation. Thank you very much for sharing all this knowledge.

And it's great you also found why the connection pooling was not working. It's weird that this isn't "reported" somehow; it would have been easily catchable. Have you raised an issue on the tonic repo about that? I would do it if that's not the case.

Is connection pooling also active on Ethereum? It would be good to re-run a shootout to see how it behaves with real connection pooling :)

I assume we shall close this issue now?

leoyvens commented 2 years ago

We can close this issue once the conn_pool_size config added in https://github.com/graphprotocol/graph-node/pull/3833 works as intended, probably by implementing a pool of tonic::Channels rather than relying on tonic for balancing. Meanwhile this can be worked around by configuring multiple providers with the same URL; each provider will correspond to a connection.
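One way a pool of channels could be structured is sketched below. This is not the graph-node code; `ChannelPool` is a hypothetical name, it rotates round-robin rather than least-loaded for brevity, and it is generic over `C` so it compiles without tonic. In graph-node, `C` would be `tonic::transport::Channel`, with each channel connected separately to the same Firehose URL to sidestep tonic's URI-keyed behavior.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Illustrative pool: hold N independently established channels and
// rotate through them, so streams spread across connections instead
// of piling onto one.
struct ChannelPool<C> {
    channels: Vec<C>,
    next: AtomicUsize,
}

impl<C> ChannelPool<C> {
    fn new(channels: Vec<C>) -> Self {
        assert!(!channels.is_empty(), "pool needs at least one channel");
        ChannelPool { channels, next: AtomicUsize::new(0) }
    }

    /// Round-robin: each call hands out the next channel in turn.
    fn get(&self) -> &C {
        let i = self.next.fetch_add(1, Ordering::Relaxed);
        &self.channels[i % self.channels.len()]
    }
}

fn main() {
    // Stand-in channels; a real pool would connect `conn_pool_size`
    // times to the same Firehose endpoint.
    let pool = ChannelPool::new(vec!["conn-0", "conn-1", "conn-2"]);
    let picks: Vec<_> = (0..6).map(|_| *pool.get()).collect();
    println!("{:?}", picks);
    // ["conn-0", "conn-1", "conn-2", "conn-0", "conn-1", "conn-2"]
}
```

The atomic counter makes `get` usable from concurrent tasks without a lock, which matters when many block streams are being opened at once.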