googleapis / google-cloud-go

Google Cloud Client Libraries for Go.
https://cloud.google.com/go/docs/reference
Apache License 2.0

bigquery: `AppendRows` eventually deadlocks / high rate of context deadline exceeded #9660

Closed benjaminRomano closed 6 months ago

benjaminRomano commented 7 months ago

Client: BigQuery v1.60.0

Environment: GKE

Go Environment: go v1.22

Code


// Stream creation
featureSetStream, err := writeClient.NewManagedStream(ctx,
   managedwriter.WithDestinationTable(conf.FeatureSetTableID),
   managedwriter.WithType(managedwriter.CommittedStream),
   managedwriter.WithSchemaDescriptor(featureProtoDescriptor),
   managedwriter.EnableWriteRetries(true),
)

// We poll from a Pub/Sub queue and do some processing that generates the `FeatureSets` protos.

// Convert the protos to serialized rows (~100-200 small rows produced)
var serializedRows [][]byte
for _, featureSet := range results.FeatureSets {
    item := &inferno.FeatureBQ{
        TraceId:      traceID,
        Properties:   featureSet.Properties,
        Features:     featureSet.Features,
        FeatureQuery: featureSet.FeatureQuery,
        TraceIdStr:   ptr.String(id_generator.ToString(traceID)),
    }

    serialized, err := proto.Marshal(item)
    if err != nil {
        return err
    }

    serializedRows = append(serializedRows, serialized)
}

// Append the rows to BigQuery
//
// After a couple of hours this stream degrades and we consistently hit
// context deadline timeouts (the context deadline is ~3 minutes).
_, err := featureSetStream.AppendRows(ctx, serializedRows)

Expected behavior

We'd expect the AppendRows call to complete quickly and not eventually get stuck.

Actual behavior

AppendRows degrades after a couple of hours and eventually gets stuck; we need to restart the process to recover.

Screenshots: Below is a screenshot of the latency of the AppendRows call.

[Screenshot: AppendRows latency, 2024-03-28]

Other Context

In the interim, we are restarting the process when we get alerted on this issue.

What's interesting is that this process writes to another BigQuery stream in parallel and never hits this degradation issue. There is a 1:1 relationship between the AppendRows calls made to each stream: for each message processed from Pub/Sub, we write to both streams. The number of serialized rows appended per AppendRows call for the other stream is substantially larger (~1k to 5k rows), while each row is slightly smaller.

I suspect there is a remaining deadlock somewhere within the managedwriter. I have Google Cloud Profiler enabled for this service, but I didn't see anything noticeable from the CPU / threads profile view.

Another thing I tried was switching from a committed stream to the default stream, but that had no impact.

Let me know if there are any debugging logs I can provide, @shollyman.

benjaminRomano commented 7 months ago

Another thing I noticed is that we do have high rates of EOF errors on the connection, which is what led to the detection of this prior issue: https://github.com/googleapis/google-cloud-go/issues/9540

[Screenshot: EOF error rate, 2024-03-28]

shollyman commented 6 months ago

Starting to look at this.

EOF on send should trigger us to mark the connection as unhealthy, so the next send that uses the connection should cause a reconnect. Possibly we're doing something unusual with a received response? Are there any other noteworthy events leading up to the spike in latency?

One way to get more information here might be to add a grpc streaming interceptor that logs more detail about the stream interactions. I've got some old code that does this, but it likely needs to be refreshed. I'll work on that next.
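
For reference, a rough sketch of such a logging interceptor, assuming the stock google.golang.org/grpc interceptor API (the helper names below are illustrative and not the old code mentioned above):

import (
    "context"
    "log"

    "google.golang.org/grpc"
)

// loggedStream wraps a grpc.ClientStream and logs per-message errors,
// such as the EOFs on send discussed above.
type loggedStream struct {
    grpc.ClientStream
    method string
}

func (s *loggedStream) SendMsg(m interface{}) error {
    err := s.ClientStream.SendMsg(m)
    if err != nil {
        log.Printf("SendMsg on %s failed: %v", s.method, err)
    }
    return err
}

func (s *loggedStream) RecvMsg(m interface{}) error {
    err := s.ClientStream.RecvMsg(m)
    if err != nil {
        log.Printf("RecvMsg on %s failed: %v", s.method, err)
    }
    return err
}

// loggingStreamInterceptor logs stream opens and wraps each stream so
// send/recv errors are visible.
func loggingStreamInterceptor() grpc.StreamClientInterceptor {
    return func(ctx context.Context, desc *grpc.StreamDesc, cc *grpc.ClientConn,
        method string, streamer grpc.Streamer, opts ...grpc.CallOption) (grpc.ClientStream, error) {
        log.Printf("opening stream: %s", method)
        cs, err := streamer(ctx, desc, cc, method, opts...)
        if err != nil {
            log.Printf("stream open failed: %s: %v", method, err)
            return nil, err
        }
        return &loggedStream{ClientStream: cs, method: method}, nil
    }
}

It could then be attached when constructing the writer client, e.g. via option.WithGRPCDialOption(grpc.WithStreamInterceptor(loggingStreamInterceptor())).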

benjaminRomano commented 6 months ago

Ah, I forgot to update this ticket. I root-caused this issue last week with GCP Support.

So the issue was that we were sending a malformed proto descriptor and receiving the following error from upstream:

Invalid proto schema: BqMessage.proto: FeatureBQ.properties: ".snap.inferno.Property" is not defined. BqMessage.proto: FeatureBQ.features: ".snap.inferno.Feature" is not defined. Entity: projects/sc-inferno/datasets/inferno/tables/feature-sets/streams/Cic3ZDhiOWI4NC0wMDAwLTI2NmEtYjc0OC1mNDAzMDQ1ZDJiYjA6czQ

In our client code, we had the following:

// Example proto
m := &myprotopackage.MyCompiledMessage{}
descriptorProto := protodesc.ToDescriptorProto(m.ProtoReflect().Descriptor())

instead of this:

m := &myprotopackage.MyCompiledMessage{}
// Normalize the descriptor to collect nested messages, fix up oneOfs, etc.
descriptorProto, err := adapt.NormalizeDescriptor(m.ProtoReflect().Descriptor())

if err != nil {
    // TODO: Handle error.
}

The NormalizeDescriptor call was accidentally removed during a refactor.
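
For completeness, the corrected wiring in our case looks roughly like this (same names as the snippet at the top of the issue):

// Normalize the compiled message's descriptor so nested messages are included.
featureProtoDescriptor, err := adapt.NormalizeDescriptor((&inferno.FeatureBQ{}).ProtoReflect().Descriptor())
if err != nil {
    return err
}

// Pass the normalized descriptor (not the raw protodesc.ToDescriptorProto output)
// when creating the stream.
featureSetStream, err := writeClient.NewManagedStream(ctx,
    managedwriter.WithDestinationTable(conf.FeatureSetTableID),
    managedwriter.WithType(managedwriter.CommittedStream),
    managedwriter.WithSchemaDescriptor(featureProtoDescriptor),
    managedwriter.EnableWriteRetries(true),
)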

Ultimately, we were sending batches of bad data to the backend and that would get rejected and lead to the EOF errors. We did not see the underlying bad proto error because we were not calling result.GetResult(ctx) which would've returned the error message to us.
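
For anyone who hits this: a minimal sketch, based on the managedwriter docs, of checking the per-append result so backend errors like the one above surface in the client (error handling is illustrative):

result, err := featureSetStream.AppendRows(ctx, serializedRows)
if err != nil {
    return err // enqueue-time failure
}
// GetResult blocks until the append is acknowledged by the backend and
// returns any error, e.g. the "Invalid proto schema" message above.
if _, err := result.GetResult(ctx); err != nil {
    return err
}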

Couple thoughts:

shollyman commented 6 months ago

Thanks for the additional details.

Detecting malformed DescriptorProto messages at instantiation time would require us to mimic the same descriptor validation and building logic in both the clients and the service, which is problematic. That said, I've opened a potential FR (internally b/333755890) against the service to expose some kind of validation RPC that would allow us to check compatibility before sending data.

The managedwriter package currently produces opencensus metrics (I've got some feature work to enable the same on the opentelemetry side), so that's another avenue for monitoring ingestion health. The current metrics/views can be seen in the package docs: https://pkg.go.dev/cloud.google.com/go/bigquery/storage/managedwriter#pkg-variables
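
A rough sketch of registering those views; the exact exported view variable names below are assumptions and should be checked against the package docs linked above:

import (
    "log"

    "cloud.google.com/go/bigquery/storage/managedwriter"
    "go.opencensus.io/stats/view"
)

func registerManagedWriterViews() {
    // Register the append request/error views so an opencensus exporter
    // (Prometheus, Stackdriver, ...) can ship them.
    if err := view.Register(
        managedwriter.AppendRequestsView,      // count of AppendRows requests sent
        managedwriter.AppendRequestErrorsView, // count of AppendRows requests that errored
    ); err != nil {
        log.Fatalf("failed to register managedwriter views: %v", err)
    }
}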

shollyman commented 6 months ago

Given there are no immediately actionable elements for this issue, I'm going to go ahead and close it out at this time. However, if you have more feedback, please let me know here or via another issue.