Another thing I notice is that we do have high rates of EOF
errors on the connection, which is what led to the detection of this prior issue: https://github.com/googleapis/google-cloud-go/issues/9540
Starting to look at this.
EOF on send should trigger us to mark the connection as unhealthy, so the next send that uses the connection should cause a reconnect. Possibly we're doing something unusual with a received response? Are there any other noteworthy events leading up to the spike in latency?
One way to get more information here might be to add a gRPC streaming interceptor that logs more detail about the stream interactions. I've got some old code that does this, but it likely needs to be refreshed. I'll work on that next.
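In case it helps others chasing similar EOF patterns, here is a minimal sketch of such an interceptor; this is an illustrative assumption of the approach, not the actual debugging code referenced above, and wiring it in via option.WithGRPCDialOption is just one possible way to attach it to the managedwriter client:

```go
package debugstream

import (
	"context"
	"log"

	"cloud.google.com/go/bigquery/storage/managedwriter"
	"google.golang.org/api/option"
	"google.golang.org/grpc"
)

// loggedStream wraps a grpc.ClientStream and logs send/receive errors,
// including the io.EOF returned when the server tears down the stream.
type loggedStream struct {
	grpc.ClientStream
	method string
}

func (s *loggedStream) SendMsg(m interface{}) error {
	if err := s.ClientStream.SendMsg(m); err != nil {
		log.Printf("SendMsg on %s failed: %v", s.method, err)
		return err
	}
	return nil
}

func (s *loggedStream) RecvMsg(m interface{}) error {
	if err := s.ClientStream.RecvMsg(m); err != nil {
		log.Printf("RecvMsg on %s failed: %v", s.method, err)
		return err
	}
	return nil
}

// logStreams is a grpc.StreamClientInterceptor that records stream opens and
// per-message errors for every streaming method on the connection.
func logStreams(ctx context.Context, desc *grpc.StreamDesc, cc *grpc.ClientConn, method string, streamer grpc.Streamer, opts ...grpc.CallOption) (grpc.ClientStream, error) {
	cs, err := streamer(ctx, desc, cc, method, opts...)
	if err != nil {
		log.Printf("opening stream %s failed: %v", method, err)
		return nil, err
	}
	return &loggedStream{ClientStream: cs, method: method}, nil
}

// NewInstrumentedClient builds a managedwriter client whose underlying gRPC
// connection runs every stream through the logging interceptor.
func NewInstrumentedClient(ctx context.Context, projectID string) (*managedwriter.Client, error) {
	return managedwriter.NewClient(ctx, projectID,
		option.WithGRPCDialOption(grpc.WithChainStreamInterceptor(logStreams)))
}
```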
Ah, forgot to update this ticket. I root-caused this issue last week with GCP Support.
So the issue was that we were sending a malformed proto descriptor and receiving the following error from upstream:

```
Invalid proto schema: BqMessage.proto: FeatureBQ.properties: ".snap.inferno.Property" is not defined. BqMessage.proto: FeatureBQ.features: ".snap.inferno.Feature" is not defined. Entity: projects/sc-inferno/datasets/inferno/tables/feature-sets/streams/Cic3ZDhiOWI4NC0wMDAwLTI2NmEtYjc0OC1mNDAzMDQ1ZDJiYjA6czQ
```
In our client code, we had the following:
```go
// Example proto
m := &myprotopackage.MyCompiledMessage{}
descriptorProto := protodesc.ToDescriptorProto(m.ProtoReflect().Descriptor())
```
instead of this:
```go
m := &myprotopackage.MyCompiledMessage{}
// Normalize the descriptor to collect nested messages, fix up oneofs, etc.
descriptorProto, err := adapt.NormalizeDescriptor(m.ProtoReflect().Descriptor())
if err != nil {
	// TODO: Handle error.
}
```
The NormalizeDescriptor call was accidentally removed during a refactor.
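For context, the normalized descriptor is what gets handed to the managed stream. A rough sketch of that wiring is below; the project/dataset/table names, the default stream type, and the helper name are placeholders, not our actual configuration:

```go
import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery/storage/managedwriter"
	"google.golang.org/protobuf/types/descriptorpb"
)

// newFeatureStream wires the (normalized) descriptor into the managed stream.
func newFeatureStream(ctx context.Context, client *managedwriter.Client, descriptorProto *descriptorpb.DescriptorProto, projectID, datasetID, tableID string) (*managedwriter.ManagedStream, error) {
	table := fmt.Sprintf("projects/%s/datasets/%s/tables/%s", projectID, datasetID, tableID)
	return client.NewManagedStream(ctx,
		managedwriter.WithDestinationTable(table),
		// The server validates this descriptor; with the un-normalized
		// descriptor it rejected our appends with the schema error above.
		managedwriter.WithSchemaDescriptor(descriptorProto),
		managedwriter.WithType(managedwriter.DefaultStream),
	)
}
```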
Ultimately, we were sending batches of bad data to the backend; those appends were rejected, which led to the EOF errors. We did not see the underlying bad-proto error because we were not calling result.GetResult(ctx), which would have returned the error message to us.
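For completeness, the pattern that would have surfaced the rejection looks roughly like this (appendAndVerify is a hypothetical helper name; ms and rows stand in for our actual stream and serialized proto rows):

```go
import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery/storage/managedwriter"
)

// appendAndVerify sends one batch and then waits for the server's verdict,
// so server-side rejections such as the "Invalid proto schema" error above
// are returned to the caller instead of being silently dropped.
func appendAndVerify(ctx context.Context, ms *managedwriter.ManagedStream, rows [][]byte) error {
	result, err := ms.AppendRows(ctx, rows)
	if err != nil {
		// Send-side failure (e.g. the stream could not be written to).
		return fmt.Errorf("AppendRows: %w", err)
	}
	// GetResult blocks until the append is acknowledged or rejected by the
	// server, which is where the schema error would have shown up.
	if _, err := result.GetResult(ctx); err != nil {
		return fmt.Errorf("append rejected by server: %w", err)
	}
	return nil
}
```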
Couple thoughts:

- Could the malformed proto descriptor be detected client-side at stream construction time (i.e. when it is passed via WithSchemaDescriptor) instead of when we send the proto descriptor up to the server?
- Ideally, the invalid proto schema message would be logged by the server. Without calling GetResult, we were dependent on GCP Support to send us the underlying BQ server error message.

Thanks for the additional details.
Detecting malformed DescriptorProto messages at instantiation time would require us to mimic the same descriptor validation and building logic in both the clients and the service, which is problematic. That said, I've opened a potential FR (internally b/333755890) against the service to expose some kind of validation RPC that would allow us to check compatibility before sending data.
The managedwriter package currently produces OpenCensus metrics (I've got some feature work underway to enable the same on the OpenTelemetry side), so that's another avenue for monitoring ingestion health. The current metrics/views can be seen in the package docs: https://pkg.go.dev/cloud.google.com/go/bigquery/storage/managedwriter#pkg-variables
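As an illustration, registering a few of those views might look roughly like the sketch below; the view names are taken from the package docs linked above, so double-check them against your client version, and the log-based exporter is just a stand-in for a real metrics backend:

```go
package monitoring

import (
	"log"
	"time"

	"cloud.google.com/go/bigquery/storage/managedwriter"
	"go.opencensus.io/stats/view"
)

// logExporter is a trivial view.Exporter that dumps aggregated view data;
// in practice you would export to Prometheus, Cloud Monitoring, etc.
type logExporter struct{}

func (logExporter) ExportView(vd *view.Data) {
	log.Printf("view %s: %d rows", vd.View.Name, len(vd.Rows))
}

// RegisterManagedWriterViews subscribes to the append request/error views so
// spikes in errors (like the EOFs in this issue) show up as metrics.
func RegisterManagedWriterViews() error {
	view.RegisterExporter(logExporter{})
	view.SetReportingPeriod(time.Minute)
	return view.Register(
		managedwriter.AppendRequestsView,
		managedwriter.AppendRequestErrorsView,
		managedwriter.AppendResponseErrorsView,
	)
}
```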
Given there's no immediately actionable elements for this current issue, I'm going to go ahead and close it out at this time. However, if you have more feedback please let me know here or via another issue.
Client: Bigquery v1.60.0
Environment: GKE
Go Environment: go v1.22
Code
Expected behavior
We'd expect the AppendRows call to complete quickly and not eventually get stuck.
Actual behavior
AppendRows degrades and gets stuck. We need to restart the process to recover. The degradation sets in after a couple of hours.

Screenshots
Below is a screenshot of the latency of the AppendRows call.

Other Context
In the interim, we are restarting the process when we get alerted on this issue.
What's interesting is that this process writes to another BigQuery stream in parallel and never hits this degradation issue. There should be a 1:1 relationship in how many AppendRows calls are made to each stream: for each message processed from Pub/Sub, we write to both streams. The number of serialized rows appended per AppendRows call for that other stream is substantially larger (~1k to 5k rows), while each row is slightly smaller.
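To make that concrete, the per-message dual write looks roughly like the sketch below; the names and structure are illustrative rather than our exact code, and note that, per the root cause described above, we were not waiting on the append results here:

```go
import (
	"context"

	"cloud.google.com/go/bigquery/storage/managedwriter"
	"golang.org/x/sync/errgroup"
)

// handleMessage appends the rows derived from one Pub/Sub message to both
// streams, so the two streams see a 1:1 ratio of AppendRows calls.
func handleMessage(ctx context.Context, featureStream, otherStream *managedwriter.ManagedStream, featureRows, otherRows [][]byte) error {
	g, ctx := errgroup.WithContext(ctx)
	g.Go(func() error {
		// Only the send is checked; the returned AppendResult (and any
		// server-side schema error) was not being inspected.
		_, err := featureStream.AppendRows(ctx, featureRows)
		return err
	})
	g.Go(func() error {
		_, err := otherStream.AppendRows(ctx, otherRows)
		return err
	})
	return g.Wait()
}
```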
I suspect there is a remaining deadlock somewhere within the managedwriter. I have Google Cloud Profiler enabled for this service, but I didn't see anything noticeable in the CPU / threads profile view. Another thing I tried was switching from a committed stream to the default stream, but that had no impact.
Let me know if there are any debugging logs I can provide, @shollyman.