googleapis / google-cloud-go

Google Cloud Client Libraries for Go.
https://cloud.google.com/go/docs/reference
Apache License 2.0
3.78k stars 1.3k forks source link

auth: Significant performance degradation with new authentication library in Spanner client (100x slower) #10927

Closed toga4 closed 1 month ago

toga4 commented 1 month ago

Client

Spanner. probably all clients which uses new auth lib.

Environment

Apple M3

% go version
go version go1.23.1 darwin/arm64
% go env
GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/user/Library/Caches/go-build'
GOENV='/Users/user/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/user/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='darwin'
GOPATH='/Users/user/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/opt/homebrew/Cellar/go/1.23.1/libexec'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/opt/homebrew/Cellar/go/1.23.1/libexec/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.23.1'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/Users/user/Library/Application Support/go/telemetry'
GCCGO='gccgo'
GOARM64='v8.0'
AR='ar'
CC='cc'
CXX='c++'
CGO_ENABLED='1'
GOMOD='/Users/user/work/20240927_spanner-bench/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/09/jr07h_m17kvbbwjgzdckhrfc0000gn/T/go-build4164472710=/tmp/go-build -gno-record-gcc-switches -fno-common'

Code and Dependencies

package main

import (
    "context"
    "log"
    "os"
    "testing"

    "cloud.google.com/go/spanner"
)

func init() {
    os.Setenv("SPANNER_EMULATOR_HOST", "localhost:9010")
}

func Benchmark_WithAuth(b *testing.B) {
    ctx := context.Background()

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        test(ctx)
    }
}

func Benchmark_WithoutAuth(b *testing.B) {
    ctx := context.Background()

    b.Setenv("GOOGLE_API_GO_EXPERIMENTAL_DISABLE_NEW_AUTH_LIB", "true")
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        test(ctx)
    }
}

func test(ctx context.Context) {
    client, err := spanner.NewClient(ctx, "projects/foo/instances/bar/databases/baz") // Doesn't matter if it doesn't exist.
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()
    _, _ = client.Apply(ctx, nil)
}
go.mod ```text module m go 1.23.1 require cloud.google.com/go/spanner v1.68.0 require ( cel.dev/expr v0.16.0 // indirect cloud.google.com/go v0.115.1 // indirect cloud.google.com/go/auth v0.9.3 // indirect cloud.google.com/go/auth/oauth2adapt v0.2.4 // indirect cloud.google.com/go/compute/metadata v0.5.0 // indirect cloud.google.com/go/monitoring v1.21.0 // indirect github.com/GoogleCloudPlatform/grpc-gcp-go/grpcgcp v1.5.0 // indirect github.com/GoogleCloudPlatform/opentelemetry-operations-go/detectors/gcp v1.24.1 // indirect github.com/census-instrumentation/opencensus-proto v0.4.1 // indirect github.com/cespare/xxhash/v2 v2.3.0 // indirect github.com/cncf/xds/go v0.0.0-20240822171458-6449f94b4d59 // indirect github.com/envoyproxy/go-control-plane v0.13.0 // indirect github.com/envoyproxy/protoc-gen-validate v1.1.0 // indirect github.com/felixge/httpsnoop v1.0.4 // indirect github.com/go-logr/logr v1.4.2 // indirect github.com/go-logr/stdr v1.2.2 // indirect github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da // indirect github.com/google/s2a-go v0.1.8 // indirect github.com/google/uuid v1.6.0 // indirect github.com/googleapis/enterprise-certificate-proxy v0.3.4 // indirect github.com/googleapis/gax-go/v2 v2.13.0 // indirect github.com/planetscale/vtprotobuf v0.6.1-0.20240319094008-0393e58bdf10 // indirect go.opencensus.io v0.24.0 // indirect go.opentelemetry.io/contrib/detectors/gcp v1.29.0 // indirect go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc v0.54.0 // indirect go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.54.0 // indirect go.opentelemetry.io/otel v1.29.0 // indirect go.opentelemetry.io/otel/metric v1.29.0 // indirect go.opentelemetry.io/otel/sdk v1.29.0 // indirect go.opentelemetry.io/otel/sdk/metric v1.29.0 // indirect go.opentelemetry.io/otel/trace v1.29.0 // indirect golang.org/x/crypto v0.27.0 // indirect golang.org/x/net v0.29.0 // indirect golang.org/x/oauth2 v0.23.0 // indirect golang.org/x/sync v0.8.0 // indirect golang.org/x/sys v0.25.0 // indirect golang.org/x/text v0.18.0 // indirect golang.org/x/time v0.6.0 // indirect google.golang.org/api v0.197.0 // indirect google.golang.org/genproto v0.0.0-20240903143218-8af14fe29dc1 // indirect google.golang.org/genproto/googleapis/api v0.0.0-20240903143218-8af14fe29dc1 // indirect google.golang.org/genproto/googleapis/rpc v0.0.0-20240903143218-8af14fe29dc1 // indirect google.golang.org/grpc v1.66.2 // indirect google.golang.org/protobuf v1.34.2 // indirect ) ```

Expected behavior

equivalent performance.

Actual behavior

% go test -bench . -benchtime 3s
goos: darwin
goarch: arm64
pkg: m
cpu: Apple M3
Benchmark_WithAuth-8                  19        1008749257 ns/op
Benchmark_WithoutAuth-8             3176           1136428 ns/op
PASS
ok      m       26.054s

Additional context

After upgrading to a recent version of the Google Cloud Go SDK, we observed a significant slowdown in the execution time of our end-to-end tests using Cloud Spanner Emulator. The slowdown occurs at the point of the first communication after creating the Client, which led us to suspect that the issue might be related to authentication.

Upon investigating the differences between versions, we discovered that a new authentication library is being used in the updated SDK. The same issue occurs with the DatabaseAdminClient and InstanceAdminClient. We think that the problem likely affects all clients that rely on the new authentication library.

By switching back to the previous authentication library via environment variables GOOGLE_API_GO_EXPERIMENTAL_DISABLE_NEW_AUTH_LIB, we were able to restore the previous execution times.

While we understand that in actual Google Cloud environments, factors like network latency and the frequency of authentication may make this performance difference less impactful, the issue becomes significant in cases where multiple Client instances are generated, such as in test code using the Emulator. In such scenarios, the slowdown can have a substantial impact, and we believe addressing this would be a meaningful improvement.

rahul2393 commented 1 month ago

cc: @quartzmo

toga4 commented 1 month ago

Upon further investigation, we found that the latency is due to differences in how the grpc.ClientConn is created between the old and new authentication libraries.

In the previous authentication library, grpc.DialContext is used to create the connection: https://github.com/googleapis/google-api-go-client/blob/1cd11321a77d6d42ce745edc35c1b9b81a28381f/transport/grpc/dial.go#L354

On the other hand, in the new authentication library, grpc.NewClient is used: https://github.com/googleapis/google-cloud-go/blob/7f61cd579f7e4ed4f1ac161f2c2a28e931406f16/auth/grpctransport/grpctransport.go#L300

The key difference between grpc.DialContext and grpc.NewClient is default schemes of resolver. The grpc.DialContext sets the default scheme to passthrough, the grpc.NewClient uses dns as the default.

By analyzing the runtime/trace tool, we discovered that in the new authentication library, takes longer time for resolving DNS. When we tried setting the default scheme to passthrough before running the benchmarks using grpc/resolver.SetDefaultScheme("passthrough"), the latency improved, even with the new authentication library.

I hope this additional investigation is helpful for addressing the issue.

quartzmo commented 1 month ago

@toga4 Thank you for reporting this issue, and for your thorough and detailed analysis. We will prioritize a fix.

quartzmo commented 1 month ago

@toga4 After investigating this issue a bit, I am not certain that the current implementation default (using the dns scheme instead of passthrough) is detrimental to all users in all environments. As I understand it, the passthrough scheme, being quite literal, prevents users from benefiting from client-side load balancing across all DNS results for an address. The dns scheme, despite the initial slowdown observed, supports such load balancing.

Will it work for you to use grpc/resolver.SetDefaultScheme("passthrough") to improve latency in your test (emulator) environment, and take a wait-and-see approach to evaluating performance in production environments such as GKE?

toga4 commented 1 month ago

@quartzmo Thank you for considering about this issue.

You’re absolutely right that some environments could benefit from the client-side load balancing benefits using the dns scheme.

We can take the approach you suggested. Additionaly, I’ve confirmed that using passthrough:///localhost:9010 instead of localhost:9010 in the SPANNER_EMULATOR_HOST environment variable ensures that the passthrough scheme is applied. We’ll proceed with this solution for now.

When using the emulator, I think the benefits of client-side load balancing are limited in most cases. It might be worth considering a change where the passthrough scheme is used by default when the emulator is detected and no scheme is specified in the endpoint configuration.

https://github.com/googleapis/google-cloud-go/blob/ece7426a331d0528e0b01003569228359170869c/spanner/client.go#L405-L413

That said, I understand that there are several places in the SDK source code where the emulator is detected, so implementing this change might require updates across multiple locations...

quartzmo commented 1 month ago

It might be worth considering a change where the passthrough scheme is used by default when the emulator is detected and no scheme is specified in the endpoint configuration.

Thanks for this suggestion! We will investigate it and leave this issue open to track it.

quartzmo commented 1 month ago

@harshachinta Per the decision above to address this issue by changing the default scheme of the host to passthrough ONLY for the Spanner library emulator support, I believe this work will be best handled by your team. I will communicate the suggestion to maintainers of the other vertical client libraries that also support emulators with gRPC.

quartzmo commented 1 month ago

Reassigning to @rahul2393 since @harshachinta hasn't responded.