mozilla-services / syncstorage-rs

Sync Storage server in Rust
Mozilla Public License 2.0

Investigate Spanner's reported session count #1493

Open data-sync-user opened 1 year ago

data-sync-user commented 1 year ago

The Spanner Session metric (the number of sessions reported from Spanner itself) radically differs from syncstorage’s own internal count of sessions. We’ve previously had an issue where the session count was high enough to degrade Spanner’s performance, so it’s important that we’re confident in these numbers.

We should contact GCP support for advice and to look into why these metrics don’t line up.

┆Issue is synchronized with this Jira Task

data-sync-user commented 4 months ago

➤ Philip Jenvey commented:

Per https://mozilla-hub.atlassian.net/browse/SYNC-3350

The support case for when we had too many open sessions (from 2020-10):

https://console.cloud.google.com/support/cases/detail/v2/25438594?authuser=0&cloudshell=false&organizationId=442341870013&project=moz-fx-sync-prod-3f0c

excerpts from it:

question from us:

“Our metrics and Spanner's session metrics mostly track pretty well, but not always. Can you give us some more information on what the Spanner session metrics represent? Yesterday evening, for instance, we showed Spanner sessions in the 5k to 6k range, while our connection pool metrics showed 1100 to 1200 connections.

Also, how long does a closed out gRPC connection take to be reflected in the Spanner session metrics? What about an abandoned / improperly closed gRPC connection? We have theorized that we may not be closing them correctly in pod eviction events.”

answer:

“The sessions metric counts each "communication channel" regardless of the number of session caches (connection pools) they have. Generally, Stackdriver metrics use a delta window of 1 minute to capture incremental values.

The answer to the second question is more open ended - and depends on a variety of factors (networking, latency, how the client handles connections and session pools, etc) including how the client is implemented. Correct me if I'm wrong, I see that you're using a Rust client [1]. Based on other client implementations, the following logic is used for managing session pools:

Use BatchCreateSessions to init the pool with min sessions.

For subsequent session increases (when min is not enough), increase these in batches as well. In some clients, these are increased in batches of 10. By doing this, you also have to ensure sessions are roughly equally distributed across the number of channels (gRPC channels).”
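The pool logic described in the support answer could be sketched roughly as below. This is only an illustration under stated assumptions: `Session`, `SessionPool`, and all of their methods are hypothetical names, not syncstorage-rs code or the API of any Spanner client, and `batch_create` just fabricates local records where a real client would issue a `BatchCreateSessions` request. The batch size of 10 comes from the answer above; the other numbers are whatever the caller passes in.

```rust
/// Hypothetical stand-in for a pooled Spanner session (illustration only).
#[derive(Debug, Clone, PartialEq)]
pub struct Session {
    pub id: u64,
    /// Index of the gRPC channel this session is bound to.
    pub channel: usize,
}

/// Hypothetical session pool that grows in batches and spreads sessions
/// round-robin across a fixed number of gRPC channels.
pub struct SessionPool {
    max_sessions: usize,
    batch_size: usize,
    num_channels: usize,
    next_id: u64,
    sessions: Vec<Session>,
}

impl SessionPool {
    /// Initialise the pool with `min_sessions` created in one batch.
    pub fn new(
        min_sessions: usize,
        max_sessions: usize,
        batch_size: usize,
        num_channels: usize,
    ) -> Self {
        assert!(num_channels > 0, "at least one channel is required");
        let mut pool = SessionPool {
            max_sessions,
            batch_size,
            num_channels,
            next_id: 0,
            sessions: Vec::new(),
        };
        pool.batch_create(min_sessions);
        pool
    }

    /// Create up to `count` sessions without exceeding `max_sessions`.
    /// A real client would send a BatchCreateSessions request here; this
    /// sketch only records the sessions locally. Returns how many were added.
    fn batch_create(&mut self, count: usize) -> usize {
        let room = self.max_sessions.saturating_sub(self.sessions.len());
        let to_create = count.min(room);
        for _ in 0..to_create {
            // Round-robin assignment keeps the channels roughly balanced.
            let channel = self.sessions.len() % self.num_channels;
            self.sessions.push(Session { id: self.next_id, channel });
            self.next_id += 1;
        }
        to_create
    }

    /// Grow the pool by one batch when the minimum is not enough.
    pub fn grow(&mut self) -> usize {
        self.batch_create(self.batch_size)
    }

    /// Total number of sessions the pool believes it holds.
    pub fn session_count(&self) -> usize {
        self.sessions.len()
    }

    /// Number of sessions assigned to each channel.
    pub fn per_channel(&self) -> Vec<usize> {
        let mut counts = vec![0; self.num_channels];
        for session in &self.sessions {
            counts[session.channel] += 1;
        }
        counts
    }
}
```

For example, `SessionPool::new(100, 400, 10, 4)` would start with 100 sessions, 25 per channel, and each `grow()` call would add 10 more until the cap of 400 is reached. The value of `session_count()` is the client-side figure that would be compared against the session metric Spanner reports.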