solana-labs / solana

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://solanalabs.com
Apache License 2.0

Bigtable timeouts leads to permanent disconnection #20336

Closed Juanito87 closed 2 years ago

Juanito87 commented 3 years ago

Problem

While running the solana service, the service can lose its connection to the bigtable service.

Sample outputs from the log when connection is lost:

WARN solana_core::rpc] BigTableError(RpcError(Status { code: Cancelled, message: "Timeout expired", source: None }))

Sample output for normal operation when starting solana:

INFO  solana_storage_bigtable::access_token] Requesting token for BigTableDataReadOnly scope
INFO  solana_storage_bigtable::access_token] Token expires in 3599 seconds

After the connection is lost, the service must be restarted to reconnect to bigtable.

Proposed Solution

The service should reconnect to bigtable automatically when the connection is lost.
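
A minimal sketch of the shape such a retry could take, assuming a tokio runtime; with_retry and its parameters are hypothetical illustrations, not part of the solana codebase:

use std::future::Future;
use std::time::Duration;

// Hypothetical sketch, not solana code: retry an async operation with
// exponential backoff instead of failing permanently until a restart.
async fn with_retry<T, E, F, Fut>(mut op: F, max_attempts: u32) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, E>>,
{
    let mut delay = Duration::from_millis(500);
    let mut attempt = 1;
    loop {
        match op().await {
            Ok(value) => return Ok(value),
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                // Back off and try again; a real fix would also rebuild
                // the bigtable channel here and only retry errors known
                // to be transient, like the "Timeout expired" status above
                tokio::time::sleep(delay).await;
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
        }
    }
}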

CriesofCarrots commented 3 years ago

@mvines, should BigTableConnection attempt to reconnect if a client receives a request and the connection has dropped or timed out? It is quite onerous to have to reboot an rpc node to reconnect.

mvines commented 3 years ago

Yeah for sure, restarting in this case is terrible

linuskendall commented 3 years ago

@CriesofCarrots is this one addressed - likely to come in 1.8 or some backported 1.7 release? It's a regular occurrence now (minimum 1-2 nodes per day; some regions seem more prone to it than others) and causes quite a lot of headache. We've tested both with the foundation's bigtable and our own. Same issues.

CriesofCarrots commented 3 years ago

@linuskendall, not yet addressed, but hoping to work on it today/tomorrow. Should come in v1.8 for sure, but may also backport to v1.7, depending on cluster upgrade timing.

linuskendall commented 3 years ago

Looking at our logs currently, it seems that it is actually the token that expires and isn't renewed:

Dec 04 09:57:48 ...: [2021-12-04T09:57:48.215614219Z WARN  solana_rpc::rpc] BigTableError(Rpc(Status { code: Unauthenticated, message: "Request had invalid authentication credentials. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.", details: b"\x08\x10\x12\xd0\x01Request had invalid authentication credentials. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.\x1a\xa6\x01\n(type.googleapis.com/google.rpc.ErrorInfo\x12z\n\x14ACCESS_TOKEN_EXPIRED\x12\x0egoogleapis.com\x1a.\n\x06method\x12$google.bigtable.v2.Bigtable.ReadRows\x1a\"\n\x07service\x12\x17bigtable.googleapis.com", metadata: MetadataMap { headers: {"grpc-server-stats-bin": "AAD7LQUAAAAAAA", "google.rpc.errorinfo-bin": "ChRBQ0NFU1NfVE9LRU5fRVhQSVJFRBIOZ29vZ2xlYXBpcy5jb20aLgoGbWV0aG9kEiRnb29nbGUuYmlndGFibGUudjIuQmlndGFibGUuUmVhZFJvd3MaIgoHc2VydmljZRIXYmlndGFibGUuZ29vZ2xlYXBpcy5jb20"} }, source: None }))

stakeconomy commented 2 years ago

Any updates on this issue? @CriesofCarrots

CriesofCarrots commented 2 years ago

@stakeconomy, it looks like this is caused by a stale access token. The access token should be automatically refreshed -- it's checked on every query that would hit bigtable -- so there's something amiss. If you can provide a full log, that will enable me to find out more about when the access token is expiring and why it isn't being refreshed.
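
For reference, the freshness check being described amounts to a half-lifetime test; a hedged sketch with illustrative types, not the actual solana-storage-bigtable code:

use std::time::{Duration, Instant};

// Illustrative types, not the real ones: a token is due for refresh
// once more than half of its lifetime has elapsed, the same guard as
// the elapsed() < expires_in / 2 test in access_token.rs.
struct CachedToken {
    expires_in: Duration,
    obtained_at: Instant,
}

impl CachedToken {
    fn needs_refresh(&self) -> bool {
        self.obtained_at.elapsed() > self.expires_in / 2
    }
}

Since every query runs this check first, a healthy refresh path should never let the token lapse as long as queries keep arriving.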

linuskendall commented 2 years ago

@CriesofCarrots this is still a regular occurrence, did you ever figure out what might be causing this?

CriesofCarrots commented 2 years ago

No, still a mystery!

PeaStew commented 2 years ago

I have solved this issue, in the sense that I have logging and code that will fix it. This is what happens: the call to refresh happens, and the request fails or doesn't return, in such a way that the value of refresh_active remains true, so the node never does another token refresh. I have written code to solve this, but I'm not a rust dev, so it's probably not the most elegant solution.

My logging indicates that the Auth service can be down for over 30 seconds at times, and the node is simply unable to get a token during that time. My fix definitely solves this issue once auth is available again, but the entire logic really needs to be examined in the block at https://github.com/solana-labs/solana/blob/a5cf72e446f6391ee4a2f0373c08240df1d26074/storage-bigtable/src/access_token.rs#L93
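
Given that the auth service can stall for 30+ seconds, one direction (an assumption on my part, not code from the repo) is to bound the token request itself, so a hung call can never leave refresh_active wedged:

use std::future::Future;
use std::time::Duration;

// Sketch under assumptions, not repo code: cap how long a token
// request may run, so a hung auth call fails fast instead of leaving
// the refresh flag stuck forever.
async fn get_token_with_timeout<F, T, E>(request: F) -> Result<T, String>
where
    F: Future<Output = Result<T, E>>,
    E: std::fmt::Display,
{
    match tokio::time::timeout(Duration::from_secs(30), request).await {
        Ok(Ok(token)) => Ok(token),
        Ok(Err(err)) => Err(format!("token request failed: {}", err)),
        Err(_elapsed) => Err("token request timed out after 30s".to_string()),
    }
}

refresh() would then treat a timeout like any other failed attempt: log it, clear refresh_active, and let a later query retry.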

My fix:

pub async fn refresh(&self) {
    // Check whether it's time to try a token refresh: only once more
    // than half of the token's lifetime has elapsed
    {
        let token_r = self.token.read().unwrap();

        if token_r.1.elapsed().as_secs() < token_r.0.expires_in() as u64 / 2 {
            info!("Token not ready to be refreshed");
            return;
        }
        warn!("Token ready to be refreshed");
        warn!("Current Token: {:#?}", self.token);
        // The read lock is dropped here, before any waiting below, so a
        // concurrent refresh can still take the write lock
    }

    // compare_exchange replaces the deprecated compare_and_swap; it
    // fails if refresh_active was already true
    if self
        .refresh_active
        .compare_exchange(false, true, Ordering::Relaxed, Ordering::Relaxed)
        .is_err()
    {
        // Refresh already pending: wait briefly, then clear the flag so
        // a refresh that never returned cannot block renewal forever
        let wait_time = std::time::Duration::from_secs(2);
        warn!(
            "Refresh already pending... waiting {:?} before trying again...",
            wait_time
        );
        // Async sleep; thread::sleep here would block the executor thread
        tokio::time::sleep(wait_time).await;
        self.refresh_active.store(false, Ordering::Relaxed);
        return;
    }

    warn!("Refreshing token");

    let new_token = Self::get_token(&self.credentials, &self.scope).await;
    {
        let mut token_w = self.token.write().unwrap();
        match new_token {
            Ok(new_token) => *token_w = new_token,
            Err(err) => warn!("{}", err),
        }
        self.refresh_active.store(false, Ordering::Relaxed);
    }
    warn!("New Token: {:#?}", self.token);
}

Happy to submit this as a PR, but maybe someone has a more elegant way to solve it. My thought was to set a timestamp when the refresh starts, and then test whether 30 seconds had passed before clearing the flag, but I think the issue is that the task spawned for the refresh just never returns and is eventually dropped.
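
One possibly more elegant shape (my sketch, not anything from the repo): an RAII guard that clears refresh_active when dropped, so the flag is released on every exit path, including an early return, an error, a panic, or the refresh future being dropped:

use std::sync::atomic::{AtomicBool, Ordering};

// Sketch, not repo code: claiming the flag returns a guard; dropping
// the guard always releases the flag again.
struct RefreshGuard<'a>(&'a AtomicBool);

impl<'a> RefreshGuard<'a> {
    // Returns None if another refresh already holds the flag
    fn try_claim(flag: &'a AtomicBool) -> Option<Self> {
        flag.compare_exchange(false, true, Ordering::Relaxed, Ordering::Relaxed)
            .ok()
            .map(|_| RefreshGuard(flag))
    }
}

impl Drop for RefreshGuard<'_> {
    fn drop(&mut self) {
        self.0.store(false, Ordering::Relaxed);
    }
}

With this, refresh() would claim the guard up front and simply hold it for the duration of the call; no manual store(false, ...) calls are needed, and a refresh that dies can never block the next one.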