numaproj / numaflow

Kubernetes-native platform to run massively parallel data/streaming jobs
https://numaflow.numaproj.io/
Apache License 2.0

Encountered error in sinkFn - CANCELLED: client cancelled #1652

Open nagarajatantry opened 5 months ago

nagarajatantry commented 5 months ago

After updating the Numaflow controller from rc1 to rc4, I see this error message in the sink vertex. The sink pods remained in the Running state.

Error in numa container

{"level":"error","ts":"2024-04-08T18:38:21.170233545Z","logger":"numaflow.Sink-processor","caller":"forward/forward.go:415","msg":"Retrying failed messages","pipeline":"kafka-test-pipeline-1","vertex":"custom-out","errors":{"gRPC client.SinkFn failed, failed to execute stream.Send(value:\"..."  event_time:{seconds:1712601333  nanos:777000000}  watermark:{seconds:-62135596800}  id:\"\\x00\\x00\\x00\\x00\\x00\\xe25\\xc5-input-0\"): rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8":98},"pipeline":"kafka-test-pipeline-1","vertex":"custom-out","partition_name":"custom-out","stacktrace":"github.com/numaproj/numaflow/pkg/sinks/forward.(*DataForward).writeToBuffer\n\t/Users/yhl01/Documents/numaproj/numaflow/pkg/sinks/forward/forward.go:415\ngithub.com/numaproj/numaflow/pkg/sinks/forward.(*DataForward).forwardAChunk\n\t/Users/yhl01/Documents/numaproj/numaflow/pkg/sinks/forward/forward.go:271\ngithub.com/numaproj/numaflow/pkg/sinks/forward.(*DataForward).Start.func1\n\t/Users/yhl01/Documents/numaproj/numaflow/pkg/sinks/forward/forward.go:133"}

Error in custom sink container

2024-04-08T18:38:21,173+0000-ERROR-"grpc-default-executor-0" -i.n.n.sinker.Service-68-Encountered error in sinkFn - CANCELLED: client cancelled

vigith commented 5 months ago

This is caused by stale messages in the ISB. I assume the error count should have spiked and alerted the user. Should we think about a better user experience?

nagarajatantry commented 5 months ago

This was in a non-prod environment with very low TPS, so it would have been difficult to catch with an alert. We may need a better way to detect this from the platform's perspective, since the id field is managed internally by the platform.
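
To illustrate why the `stream.Send` call in the numa container log fails: the `id` shown there contains bytes that are not valid UTF-8, and protobuf string fields reject such values at marshal time. Below is a minimal sketch of the kind of check that could flag such an ID before it is sent. The function name `isSafeStringID` is hypothetical and not part of Numaflow; only the standard library's `unicode/utf8` package is used, and the ID literal is copied from the log above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isSafeStringID is a hypothetical helper (not a Numaflow API). It reports
// whether id is valid UTF-8, which a protobuf string field requires.
func isSafeStringID(id string) bool {
	return utf8.ValidString(id)
}

func main() {
	// ID bytes as they appear in the numa container log: five zero bytes,
	// 0xe2, '5', 0xc5, then "-input-0". 0xe2 and 0xc5 start multi-byte
	// sequences but are not followed by continuation bytes.
	staleID := "\x00\x00\x00\x00\x00\xe25\xc5-input-0"

	// A plain ASCII ID for comparison (illustrative value only).
	textID := "custom-out-42-input-0"

	fmt.Printf("stale ID %q safe for a string field: %v\n", staleID, isSafeStringID(staleID))
	fmt.Printf("text ID %q safe for a string field: %v\n", textID, isSafeStringID(textID))
}
```

A check like this only detects the condition; how the platform should react (drop, re-encode, or surface a dedicated metric) is a separate design question.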