solana-labs / solana

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://solanalabs.com
Apache License 2.0

Bigtable timeouts leads to permanent disconnection ( Continued ) #29692

Closed: brandon-j-roberts closed this issue 11 months ago

brandon-j-roberts commented 1 year ago

Problem

The Bigtable authentication token occasionally fails to refresh, causing nodes to disconnect from Bigtable; the node must then be restarted in order to reconnect.

This is a continuation of an issue previously closed.

This PR mostly prevented the issue, but some of our nodes that receive significant RPC load still occasionally disconnect from bigtable.

I have found that this disconnect happens because the existing timeout fails to actually resolve, so the refresh_active variable stays true. Any future attempt to refresh the token is then skipped, because the code believes a refresh is still active.
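For context, here is a minimal sketch of the guard pattern being described, with simplified names rather than the exact code in storage-bigtable/src/access_token.rs. The refresh is bounded by a timeout, and the flag is only cleared after that timeout future completes, so if the timeout itself never resolves, the flag stays set:

use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;
use tokio::time;

// Illustrative guard; in the real code the flag lives on the access-token struct.
static REFRESH_ACTIVE: AtomicBool = AtomicBool::new(false);

// Stand-in for goauth::get_token(): here it never completes, emulating the
// hang observed in production.
async fn get_token() -> Result<String, String> {
    std::future::pending().await
}

async fn refresh() {
    // Only one refresh may be in flight at a time.
    if REFRESH_ACTIVE
        .compare_exchange(false, true, Ordering::Relaxed, Ordering::Relaxed)
        .is_err()
    {
        return; // a refresh is (believed to be) already active
    }

    // The refresh is wrapped in a timeout, but if this timeout future itself
    // never resolves, the store() below never runs, REFRESH_ACTIVE stays
    // true, and every later refresh attempt returns early above.
    match time::timeout(Duration::from_secs(5), get_token()).await {
        Ok(Ok(token)) => println!("got token: {token}"),
        Ok(Err(err)) => eprintln!("token error: {err}"),
        Err(_) => eprintln!("token refresh timed out"),
    }

    REFRESH_ACTIVE.store(false, Ordering::Relaxed);
}

#[tokio::main]
async fn main() {
    refresh().await; // with the stand-in above, this logs a timeout after 5s
}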

Proposed Solution

A possible solution, which I have implemented and tested in this PR, is an additional check after 2x the timeout period that sets refresh_active back to false. This allows future token-refresh attempts to go through, and signals that the timeout failed to stop the get_token call.

If the 1st (thought-to-be-failing) timeout ends up succeeding after the 2nd get_token call starts, you simply end up with a 2nd, newer token once the 2nd call returns.
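A hedged sketch of that fallback, with illustrative names rather than the actual patch: record when a refresh started, and if a later caller finds refresh_active still set after more than 2x the refresh timeout, assume the earlier attempt is stuck and let the new caller proceed.

use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::time::{Duration, SystemTime, UNIX_EPOCH};

const REFRESH_TIMEOUT: Duration = Duration::from_secs(5);

struct RefreshGuard {
    refresh_active: AtomicBool,
    refresh_started_unix_secs: AtomicU64,
}

impl RefreshGuard {
    fn new() -> Self {
        Self {
            refresh_active: AtomicBool::new(false),
            refresh_started_unix_secs: AtomicU64::new(0),
        }
    }

    /// Returns true if the caller is allowed to start a refresh.
    fn try_begin(&self) -> bool {
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .as_secs();

        // Normal path: nobody is refreshing, claim the flag.
        if self
            .refresh_active
            .compare_exchange(false, true, Ordering::Relaxed, Ordering::Relaxed)
            .is_ok()
        {
            self.refresh_started_unix_secs.store(now, Ordering::Relaxed);
            return true;
        }

        // Fallback: the flag has been set for more than 2x the timeout, so the
        // earlier attempt is presumed stuck; take over and refresh anyway.
        let started = self.refresh_started_unix_secs.load(Ordering::Relaxed);
        if now.saturating_sub(started) > 2 * REFRESH_TIMEOUT.as_secs() {
            self.refresh_started_unix_secs.store(now, Ordering::Relaxed);
            return true;
        }
        false
    }

    fn end(&self) {
        self.refresh_active.store(false, Ordering::Relaxed);
    }
}

fn main() {
    let guard = RefreshGuard::new();
    assert!(guard.try_begin());  // first caller may refresh
    assert!(!guard.try_begin()); // concurrent caller is blocked
    guard.end();
    assert!(guard.try_begin());  // allowed again after the refresh ends
}

If the stuck attempt eventually completes anyway, it simply writes a token again, which matches the behaviour described above.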

Applying this patch to 4 troubled nodes has prevented 20+ disconnects. We ran this test with the patch applied on top of 1.13.5 on mainnet.

RdeWilde commented 1 year ago

[..] due to the existing timeout failing to actually resolve [..]

Any clue how this is possible? As this might be the root cause, it would make sense to try to fix that instead of relying on the fallback mechanism.

RdeWilde commented 1 year ago

I think the closed MR should be merged nevertheless, as there's no other solution yet and this is high priority for heavily used RPC nodes.

maestroi commented 1 year ago

At Blockdaemon, we've encountered a recurring issue that has forced us to run a custom build, which combines the version detailed here with PR #29685.

[2023-06-17T14:05:46.087754187Z WARN  solana_storage_bigtable::access_token] Requesting token for BigTableDataReadOnly scope
[2023-06-17T14:05:46.087786118Z WARN  solana_storage_bigtable::access_token] Creating JWT claim
[2023-06-17T14:05:46.090215850Z WARN  solana_storage_bigtable::access_token] Requesting token from endpoint

The issue concerns token acquisition for BigTable. Occasionally, the request for a new token starts as expected but then stalls, so the connection to BigTable is never established, and the process does not retry after this failure.

A normal request looks like:

[2023-06-17T13:35:46.840716176Z WARN  solana_storage_bigtable::access_token] Requesting token for BigTableDataReadOnly scope
[2023-06-17T13:35:46.849892464Z WARN  solana_storage_bigtable::access_token] Creating JWT claim
[2023-06-17T13:35:46.858238558Z WARN  solana_storage_bigtable::access_token] Requesting token from endpoint
[2023-06-17T13:35:46.987365809Z WARN  solana_storage_bigtable::access_token] Token expires in 3599 seconds
[2023-06-17T13:35:46.987402028Z WARN  solana_storage_bigtable::access_token] Token write lock acquired:

For comparison, a typical token request proceeds smoothly; the relevant code is here: https://github.com/solana-labs/solana/blob/4d19f1d01930e5508851354dea9be5e802d38622/storage-bigtable/src/access_token.rs#L85

It is unknown what is blocking the goauth::get_token call; it never returns any result, successful or erroneous. Unfortunately, we are unable to insert custom logging into the call itself, since it is a third-party library. A blocking variant might be an option, but it carries some risk.
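One way to observe a third-party async call from outside the library is to race its future against a periodic ticker and log while it is still pending; the wrapper below is only an illustrative sketch, not code from the validator:

use std::future::Future;
use std::time::Duration;
use tokio::time;

// Polls `fut` to completion while logging once per second that it is still
// pending, without touching the library that produced the future.
async fn with_progress_logging<F: Future>(fut: F) -> F::Output {
    tokio::pin!(fut);
    let mut ticker = time::interval(Duration::from_secs(1));
    ticker.tick().await; // the first tick completes immediately
    let mut waited = 0u64;
    loop {
        tokio::select! {
            result = &mut fut => return result,
            _ = ticker.tick() => {
                waited += 1;
                log::warn!("token request still pending after {} s", waited);
            }
        }
    }
}

Such a wrapper could be applied to the goauth::get_token(...) future before awaiting it. In our case, rather than wrapping the call, we raised the log verbosity for the relevant crates: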

RUST_LOG="solana=warn,reqwest=debug,hyper=debug,solana_storage_bigtable::access_token=debug,solana::solana_metrics=error,solana_runtime::message_processor=error,solana_rpc::rpc_health=error,solana_core::warm_quic_cache_service=error"

With this adjustment, we get a bit more insight into what happens during the token request. With the extra logging enabled, a hung request looks like this:

[2023-06-18T15:13:56.381735288Z WARN  solana_storage_bigtable::access_token] Requesting token for BigTableDataReadOnly scope
[2023-06-18T15:13:56.381789903Z WARN  solana_storage_bigtable::access_token] Creating JWT claim
[2023-06-18T15:13:56.384592197Z WARN  solana_storage_bigtable::access_token] Building client
[2023-06-18T15:13:56.391540226Z WARN  solana_storage_bigtable::access_token] Requesting token
[2023-06-18T15:13:56.393855912Z TRACE hyper::client::pool] checkout waiting for idle connection: ("https", oauth2.googleapis.com)
[2023-06-18T15:13:56.393898884Z DEBUG reqwest::connect] starting new connection: https://oauth2.googleapis.com/
[2023-06-18T15:13:56.393949441Z TRACE hyper::client::connect::http] Http::connect; scheme=Some("https"), host=Some("oauth2.googleapis.com"), port=None
[2023-06-18T15:13:56.394127139Z TRACE hyper::client::pool] checkout dropped for ("https", oauth2.googleapis.com)

The hung request mirrors the start of a typical request. A successful request, however, proceeds further, with additional steps and output:

[2023-06-18T14:43:56.806509500Z WARN  solana_storage_bigtable::access_token] Refreshing token: Bearer ya29.c.b<data>
[2023-06-18T14:43:56.806881550Z WARN  solana_storage_bigtable::access_token] Requesting token for BigTableDataReadOnly scope
[2023-06-18T14:43:56.807559283Z WARN  solana_storage_bigtable::access_token] Creating JWT claim
[2023-06-18T14:43:56.808976967Z WARN  solana_storage_bigtable::access_token] Building client
[2023-06-18T14:43:56.817984006Z WARN  solana_storage_bigtable::access_token] Requesting token
[2023-06-18T14:43:56.819288734Z TRACE hyper::client::pool] checkout waiting for idle connection: ("https", oauth2.googleapis.com)
[2023-06-18T14:43:56.822628384Z DEBUG reqwest::connect] starting new connection: https://oauth2.googleapis.com/
[2023-06-18T14:43:56.822663931Z TRACE hyper::client::connect::http] Http::connect; scheme=Some("https"), host=Some("oauth2.googleapis.com"), port=None
[2023-06-18T14:43:56.822792877Z DEBUG hyper::client::connect::dns] resolving host="oauth2.googleapis.com"
[2023-06-18T14:43:56.823688405Z DEBUG hyper::client::connect::http] connecting to [2607:f8b0:4023:140d::5f]:443
[2023-06-18T14:43:56.849190175Z DEBUG hyper::client::connect::http] connected to [2607:f8b0:4023:140d::5f]:443
[2023-06-18T14:43:56.875990070Z TRACE hyper::client::conn] client handshake Http1
[2023-06-18T14:43:56.876001061Z TRACE hyper::client::client] handshake complete, spawning background dispatcher task
[2023-06-18T14:43:56.876026580Z TRACE hyper::client::pool] checkout dropped for ("https", oauth2.googleapis.com)
[2023-06-18T14:43:56.876051457Z TRACE reqwest::connect::verbose] 4b3a2478 write: b"POST /token HTTP/1.1\r\<data>
[2023-06-18T14:43:56.937929195Z TRACE reqwest::connect::verbose] 4b3a2478 read: b"HTTP/1.1 200 OK\r\nContent-Type: application/json; <data>
[2023-06-18T14:43:56.938142061Z WARN  solana_storage_bigtable::access_token] Token expires in 3599 seconds
[2023-06-18T14:43:56.938169413Z WARN  solana_storage_bigtable::access_token] Token write lock acquired
[2023-06-18T14:43:56.938171928Z WARN  solana_storage_bigtable::access_token] Token written: ya29.c<data>
[2023-06-18T14:43:56.938218267Z TRACE reqwest::connect::verbose] 4b3a2478 read: b"0\r\n\r\n"
[2023-06-18T14:43:56.938242413Z TRACE hyper::client::pool] pool dropped, dropping pooled (("https", oauth2.googleapis.com))
[2023-06-18T14:43:58.893502742Z TRACE hyper::client::connect::http] Http::connect; scheme=Some("https"), host=Some("bigtable.googleapis.com"), port=None
[2023-06-18T14:43:58.893544892Z DEBUG hyper::client::connect::dns] resolving host="bigtable.googleapis.com"
[2023-06-18T14:43:58.894669177Z DEBUG hyper::client::connect::http] connecting to [2607:f8b0:4023:1401::5f]:443
[2023-06-18T14:43:58.920182358Z DEBUG hyper::client::connect::http] connected to [2607:f8b0:4023:1401::5f]:443
[2023-06-18T14:43:58.946487479Z TRACE hyper::client::conn] client handshake Http2

And after a few hours:

[2023-06-18T17:46:56.240413238Z WARN  solana_storage_bigtable::access_token] Requesting token for BigTableDataReadOnly scope
[2023-06-18T17:46:56.247280847Z WARN  solana_storage_bigtable::access_token] Creating JWT claim
[2023-06-18T17:46:56.251116497Z WARN  solana_storage_bigtable::access_token] Building client
[2023-06-18T17:46:56.257533657Z WARN  solana_storage_bigtable::access_token] Requesting token
[2023-06-18T17:46:56.261613813Z TRACE hyper::client::pool] checkout waiting for idle connection: ("https", oauth2.googleapis.com)
[2023-06-18T17:46:56.263867168Z DEBUG reqwest::connect] starting new connection: https://oauth2.googleapis.com/
[2023-06-18T17:46:56.263906563Z TRACE hyper::client::connect::http] Http::connect; scheme=Some("https"), host=Some("oauth2.googleapis.com"), port=None
[2023-06-18T17:46:56.264025800Z DEBUG hyper::client::connect::dns] resolving host="oauth2.googleapis.com"
[2023-06-18T17:46:56.264047952Z TRACE hyper::client::pool] checkout dropped for ("https", oauth2.googleapis.com)
[2023-06-18T17:47:04.589392869Z TRACE hyper::client::connect::http] Http::connect; scheme=Some("https"), host=Some("bigtable.googleapis.com"), port=None
[2023-06-18T17:47:04.597589260Z DEBUG hyper::client::connect::dns] resolving host="bigtable.googleapis.com"
[2023-06-18T17:47:04.598939443Z DEBUG hyper::client::connect::http] connecting to [2607:f8b0:4023:1407::5f]:443
[2023-06-18T17:47:04.624507893Z DEBUG hyper::client::connect::http] connected to [2607:f8b0:4023:1407::5f]:443
[2023-06-18T17:47:04.650880576Z TRACE hyper::client::conn] client handshake Http2
[2023-06-18T17:47:06.168594464Z WARN  solana_core::replay_stage] PARTITION DETECTED waiting to join heaviest fork: 194633601 last vote: 194633592, reset slot: 194633593
[2023-06-18T17:47:06.702022422Z WARN  solana_core::replay_stage] PARTITION resolved heaviest fork: 194633594 last vote: 194633594, reset slot: 194633594
[2023-06-18T17:47:08.952291637Z TRACE hyper::client::dispatch] send_when canceled

After a few more hours, the logging again shows the token request starting as usual, but it never completes and never recovers:

[2023-06-19T16:41:41.021936319Z WARN  solana_storage_bigtable::access_token] Requesting token for BigTableDataReadOnly scope
[2023-06-19T16:41:41.029164180Z WARN  solana_storage_bigtable::access_token] Creating JWT claim
[2023-06-19T16:41:41.036498164Z WARN  solana_storage_bigtable::access_token] Building client
[2023-06-19T16:41:41.044716393Z WARN  solana_storage_bigtable::access_token] Requesting token
[2023-06-19T16:41:41.049209821Z DEBUG reqwest::connect] starting new connection: https://oauth2.googleapis.com/
[2023-05-19T16:41:41.049253705Z TRACE hyper::client::connect::http] Http::connect; scheme=Some("https"), host=Some("oauth2.googleapis.com"), port=None
<dead>

Potential underlying issues that might contribute to this problem are being investigated; they relate to these upstream issues (a mitigation sketch follows the list):

  1. https://github.com/seanmonstar/reqwest/issues/1323
  2. https://github.com/seanmonstar/reqwest/issues/1276
  3. https://github.com/seanmonstar/reqwest/issues/1256
  4. https://github.com/durch/rust-goauth/blob/ae52647f8f1a71c46d064fd3ff259a76cdbeec1d/src/lib.rs#L130
  5. https://github.com/durch/rust-goauth/blob/ae52647f8f1a71c46d064fd3ff259a76cdbeec1d/src/lib.rs#L174
  6. https://github.com/durch/rust-goauth/blob/ae52647f8f1a71c46d064fd3ff259a76cdbeec1d/src/lib.rs#L180
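The reqwest issues above describe requests that can hang when a bad pooled connection is reused. As a general mitigation sketch, assuming a locally built client (rust-goauth constructs its own client internally, so this is not a drop-in patch), explicit client-level timeouts ensure no single request can stall forever:

use std::time::Duration;

// Sketch only: shows the general mitigation (client-level timeouts), not a
// change to rust-goauth itself.
fn build_oauth_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        // Upper bound on the whole request: connect, write, and read.
        .timeout(Duration::from_secs(10))
        // Separate bound on TCP/TLS connection establishment.
        .connect_timeout(Duration::from_secs(5))
        // Drop idle pooled connections quickly so a dead socket is not reused.
        .pool_idle_timeout(Duration::from_secs(30))
        .build()
}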

Fixes:

  1. https://github.com/solana-labs/solana/pull/29685

Hopefully we can get this resolved so no one has to deal with this issue anymore.

joeaba commented 1 year ago

Hey @maestroi, I've reopened https://github.com/solana-labs/solana/pull/29685. Feel free to take it over so we can get it landed.

PeaStew commented 1 year ago

I created https://github.com/solana-labs/solana/pull/26217 long ago and have started patching my nodes with my fix again. It absolutely does work, and it works because of the Google auth issues listed above. It may be 'inelegant', but it works and has not caused any issues with sync, despite all words to the contrary. YMMV; sometimes a nail is just a nail and needs a hammer.

fanatid commented 11 months ago

We can close this issue; #34213 has been merged.