Closed Juanito87 closed 2 years ago
@mvines , should BigTableConnection
attempt to reconnect if a client receives a request and the connection has dropped or timed out?
It is quite onerous to have to reboot an rpc node to reconnect.
Yeah for sure, restarting in this case is terrible
@CriesofCarrots is this one addressed - likely to come in 1.8 or some backported 1.7 release? It's a regular occurrence now (at minimum 1-2 nodes per day, and some regions seem more prone to it than others) and causes quite a lot of headache. We've tested both with the foundation's bigtable and our own. Same issues.
@linuskendall, not yet addressed, but hoping to work on it today/tomorrow. Should come in v1.8 for sure, but may also backport to v1.7, depending on cluster upgrade timing.
Looking at our logs currently, it seems that it is actually the token that expires and isn't renewed:
Dec 04 09:57:48 ...: [2021-12-04T09:57:48.215614219Z WARN solana_rpc::rpc] BigTableError(Rpc(Status { code: Unauthenticated, message: "Request had invalid authentication credentials. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.", details: b"\x08\x10\x12\xd0\x01Request had invalid authentication credentials. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.\x1a\xa6\x01\n(type.googleapis.com/google.rpc.ErrorInfo\x12z\n\x14ACCESS_TOKEN_EXPIRED\x12\x0egoogleapis.com\x1a.\n\x06method\x12$google.bigtable.v2.Bigtable.ReadRows\x1a\"\n\x07service\x12\x17bigtable.googleapis.com", metadata: MetadataMap { headers: {"grpc-server-stats-bin": "AAD7LQUAAAAAAA", "google.rpc.errorinfo-bin": "ChRBQ0NFU1NfVE9LRU5fRVhQSVJFRBIOZ29vZ2xlYXBpcy5jb20aLgoGbWV0aG9kEiRnb29nbGUuYmlndGFibGUudjIuQmlndGFibGUuUmVhZFJvd3MaIgoHc2VydmljZRIXYmlndGFibGUuZ29vZ2xlYXBpcy5jb20"} }, source: None }))
Any updates on this issue? @CriesofCarrots
@stakeconomy , it looks like this is caused by a stale access token. The access token should be automatically refreshed -- it's checked on every query that would hit bigtable -- so there's something amiss. If you can provide a full log, that will enable me to find out more about when the access token is expiring and why it isn't being refreshed.
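For context, the staleness check that gates a refresh can be sketched roughly like this (a simplified standalone sketch with hypothetical types; the real logic lives in storage-bigtable/src/access_token.rs): a token is considered due for refresh once more than half of its lifetime has elapsed.

```rust
use std::time::{Duration, Instant};

// Hypothetical standalone sketch: decide whether a token should be
// refreshed based on how much of its lifetime has elapsed.
fn needs_refresh(obtained_at: Instant, expires_in: Duration) -> bool {
    // Refresh once the token is past half of its lifetime
    obtained_at.elapsed() >= expires_in / 2
}

fn main() {
    let expires_in = Duration::from_secs(60 * 60); // 1-hour lifetime
    let old = Instant::now() - Duration::from_secs(40 * 60); // obtained 40 min ago
    assert!(needs_refresh(old, expires_in)); // past half-life: refresh
    assert!(!needs_refresh(Instant::now(), expires_in)); // fresh token: skip
    println!("ok");
}
```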
@CriesofCarrots this is still a regular occurrence, did you ever figure out what might be causing this?
No, still a mystery!
I have solved this issue, in the sense that I have logging and code that will fix it. Here is what happens: the call to refresh happens, and the request fails or doesn't return, in such a way that the value of refresh_active remains true, so the node never attempts another token refresh. I have written code to solve this, but I'm not a Rust dev, so it's probably not the most elegant solution.
My logging indicates that the Auth service can be down for over 30 seconds at times, and the node is simply unable to get a token during that window. My fix definitely solves this issue once auth is available again, but the entire logic really needs to be examined in the block at https://github.com/solana-labs/solana/blob/a5cf72e446f6391ee4a2f0373c08240df1d26074/storage-bigtable/src/access_token.rs#L93
My fix:

pub async fn refresh(&self) {
    // Check if it's time to try a token refresh
    {
        let token_r = self.token.read().unwrap();
        if token_r.1.elapsed().as_secs() < token_r.0.expires_in() as u64 / 2 {
            info!("Token not ready to be refreshed");
            return;
        }
        warn!("Token ready to be refreshed");
        warn!("Current Token: {:#?}", self.token);
    }
    // The read lock is released before the wait below, so the sleep
    // doesn't hold the lock
    #[allow(deprecated)]
    if self
        .refresh_active
        .compare_and_swap(false, true, Ordering::Relaxed)
    {
        // Refresh already pending: wait, then clear the flag so a refresh
        // that never returned can't block future refresh attempts forever
        let wait_time: u64 = 2;
        let wait_time_millis = std::time::Duration::from_millis(wait_time * 1000);
        warn!(
            "Refresh already pending... waiting {} seconds before trying again...",
            wait_time
        );
        thread::sleep(wait_time_millis);
        self.refresh_active.store(false, Ordering::Relaxed);
        return;
    }
    warn!("Refreshing token");
    let new_token = Self::get_token(&self.credentials, &self.scope).await;
    {
        let mut token_w = self.token.write().unwrap();
        match new_token {
            Ok(new_token) => *token_w = new_token,
            Err(err) => warn!("{}", err),
        }
        // Clear the in-progress flag whether the refresh succeeded or failed
        self.refresh_active.store(false, Ordering::Relaxed);
    }
    warn!("New Token: {:#?}", self.token);
}
Happy to submit this as a PR, but maybe someone has a more elegant way to solve it. My thought was to record a Unix timestamp when the refresh starts, and then test whether 30 seconds had passed before clearing that flag, but I think the issue is that the task spawned for the refresh just never returns.
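The timestamp idea above could be sketched like this (a hypothetical, self-contained sketch, not the solana code: `RefreshGuard`, `REFRESH_TIMEOUT_SECS`, and the field names are made up for illustration). Instead of a plain boolean, the guard records when a refresh attempt began; a marker older than the timeout is treated as abandoned and can be reclaimed, so a refresh task that never returns cannot block refreshes forever.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical sketch: how long before an in-progress refresh marker
// is considered abandoned
const REFRESH_TIMEOUT_SECS: u64 = 30;

struct RefreshGuard {
    // 0 means "no refresh in progress"; otherwise the unix timestamp
    // at which the current refresh attempt began
    refresh_started: AtomicU64,
}

impl RefreshGuard {
    fn new() -> Self {
        Self { refresh_started: AtomicU64::new(0) }
    }

    fn now() -> u64 {
        SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs()
    }

    /// Try to claim the right to refresh. Returns true if this caller
    /// should perform the refresh.
    fn try_begin(&self) -> bool {
        let now = Self::now();
        let current = self.refresh_started.load(Ordering::Acquire);
        // Claimable if idle (0) or the previous attempt looks abandoned
        if current == 0 || now.saturating_sub(current) > REFRESH_TIMEOUT_SECS {
            self.refresh_started
                .compare_exchange(current, now, Ordering::AcqRel, Ordering::Acquire)
                .is_ok()
        } else {
            false
        }
    }

    /// Mark the refresh finished (success or failure) so the next
    /// caller may attempt again
    fn end(&self) {
        self.refresh_started.store(0, Ordering::Release);
    }
}

fn main() {
    let guard = RefreshGuard::new();
    assert!(guard.try_begin()); // first caller claims the refresh
    assert!(!guard.try_begin()); // concurrent caller is rejected
    guard.end(); // refresh completed (or failed)
    assert!(guard.try_begin()); // next caller can try again
    println!("ok");
}
```

The key difference from a plain bool is that a crashed or hung refresh attempt self-heals after the timeout rather than wedging the flag permanently.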
Problem
While running the solana service, it can happen that the service loses its connection to the bigtable service.
Sample outputs from the log when connection is lost:
WARN solana_core::rpc] BigTableError(RpcError(Status { code: Cancelled, message: "Timeout expired", source: None }))
Sample output for normal operation when starting solana:
After losing connection, the service needs to be restarted to recover the connection to bigtable.
Proposed Solution
The service needs to reconnect automatically when connection to bigtable is lost.
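One shape the reconnect logic could take is a retry wrapper with exponential backoff around fallible BigTable calls (a hedged sketch only: `with_retries`, the attempt count, and the delays are hypothetical and not the solana implementation):

```rust
use std::thread::sleep;
use std::time::Duration;

// Hypothetical sketch: retry a fallible call with exponential backoff
// instead of requiring a process restart. `call` stands in for any
// client operation that may fail with a dropped connection.
fn with_retries<T, E: std::fmt::Debug>(
    max_attempts: u32,
    mut call: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = Duration::from_millis(100);
    let mut attempt = 1;
    loop {
        match call() {
            Ok(v) => return Ok(v),
            Err(e) if attempt < max_attempts => {
                eprintln!("attempt {} failed: {:?}; retrying in {:?}", attempt, e, delay);
                sleep(delay);
                delay *= 2; // exponential backoff
                attempt += 1;
            }
            // Out of attempts: surface the final error to the caller
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    // Simulate a call that fails twice, then succeeds
    let mut failures_left = 2;
    let result = with_retries(5, || {
        if failures_left > 0 {
            failures_left -= 1;
            Err("Timeout expired")
        } else {
            Ok(42)
        }
    });
    assert_eq!(result, Ok(42));
    println!("recovered: {:?}", result);
}
```

In practice the wrapper would also need to re-establish the gRPC channel (or force a token refresh) between attempts, not just re-issue the same request.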