twmb / franz-go

franz-go contains a feature complete, pure Go library for interacting with Kafka from 0.8.0 through 3.6+. Producing, consuming, transacting, administrating, etc.
BSD 3-Clause "New" or "Revised" License
1.6k stars 158 forks source link

ProduceSync returning "context canceled" error when parent context hasn't been cancelled #756

Open saholman opened 1 month ago

saholman commented 1 month ago

Many times throughout the day the following sequence of logs occur:

These ProduceSync errors are impacting our system. Is there an internal context that is being cancelled? Is there a reason these particular errors aren't being retried?

twmb commented 4 weeks ago

Do you have debug logs of this happening?

saholman commented 2 weeks ago

"2024-06-18T20:37:48.051Z DEBUG kzap@v0.1.0/kzap.go:105 sasl expiry limit reached, reauthenticating {""broker"": ""1""}" "2024-06-18T20:37:48.051Z DEBUG kzap@v0.1.0/kzap.go:105 issuing SASLHandshakeRequest {""broker"": ""1""}" "2024-06-18T20:37:48.051Z DEBUG kzap@v0.1.0/kzap.go:105 wrote SASLHandshake v1 {""broker"": ""1"", ""bytes_written"": 30, ""write_wait"": ""12.03µs"", ""time_to_write"": ""50.431µs"", ""err"": null}" "2024-06-18T20:37:48.052Z DEBUG kzap@v0.1.0/kzap.go:105 read SASLHandshake v1 {""broker"": ""1"", ""bytes_read"": 40, ""read_wait"": ""42.972µs"", ""time_to_read"": ""1.091305ms"", ""err"": null}" "2024-06-18T20:37:48.052Z DEBUG kzap@v0.1.0/kzap.go:105 beginning sasl authentication {""broker"": ""1"", ""addr"": ""b-1-public.REDACTED.amazonaws.com:9198"", ""mechanism"": ""AWS_MSK_IAM"", ""authenticate"": true}" "2024-06-18T20:37:48.052Z DEBUG kzap@v0.1.0/kzap.go:105 issuing SASLAuthenticate {""broker"": ""1"", ""version"": 2, ""step"": 0}" "2024-06-18T20:37:48.053Z DEBUG kzap@v0.1.0/kzap.go:105 wrote SASLAuthenticate v2 {""broker"": ""1"", ""bytes_written"": 1755, ""write_wait"": ""8.298µs"", ""time_to_write"": ""45.558µs"", ""err"": null}" "2024-06-18T20:37:48.065Z DEBUG kzap@v0.1.0/kzap.go:105 read SASLAuthenticate v2 {""broker"": ""1"", ""bytes_read"": 98, ""read_wait"": ""31.063µs"", ""time_to_read"": ""12.384996ms"", ""err"": null}" "2024-06-18T20:37:48.065Z DEBUG kzap@v0.1.0/kzap.go:105 sasl has a limited lifetime {""broker"": ""1"", ""session_lifetime"": ""936ms"", ""lifetime_pessimism"": ""1s"", ""reauthenticate_in"": ""-64ms""}" "2024-06-18T20:37:48.065Z DEBUG kzap@v0.1.0/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 643, ""write_wait"": ""14.047337ms"", ""time_to_write"": ""47.9µs"", ""err"": null}" "2024-06-18T20:37:48.069Z DEBUG kzap@v0.1.0/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""48.54µs"", ""time_to_read"": ""4.039864ms"", ""err"": null}" "2024-06-18T20:37:48.069Z DEBUG kzap@v0.1.0/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778238=>778239}]""}" "2024-06-18T20:37:48.088Z DEBUG kzap@v0.1.0/kzap.go:105 sasl expiry limit reached, reauthenticating {""broker"": ""1""}" "2024-06-18T20:37:48.088Z DEBUG kzap@v0.1.0/kzap.go:105 issuing SASLHandshakeRequest {""broker"": ""1""}" "2024-06-18T20:37:48.088Z DEBUG kzap@v0.1.0/kzap.go:105 wrote SASLHandshake v1 {""broker"": ""1"", ""bytes_written"": 30, ""write_wait"": ""12.261µs"", ""time_to_write"": ""67.692µs"", ""err"": null}" "2024-06-18T20:37:48.106Z DEBUG kzap@v0.1.0/kzap.go:105 read SASLHandshake v1 {""broker"": ""1"", ""bytes_read"": 14, ""read_wait"": ""57.922µs"", ""time_to_read"": ""17.648534ms"", ""err"": null}" "2024-06-18T20:37:48.106Z WARN kzap@v0.1.0/kzap.go:111 random error while producing, requeueing unattempted request {""broker"": ""1"", ""err"": ""ILLEGAL_SASL_STATE: Request is not valid given the current SASL state.""}" github.com/twmb/franz-go/plugin/kzap.(Logger).Log /go/pkg/mod/github.com/twmb/franz-go/plugin/kzap@v0.1.0/kzap.go:111 github.com/twmb/franz-go/pkg/kgo.(wrappedLogger).Log /go/pkg/mod/github.com/twmb/franz-go@v1.16.1/pkg/kgo/logger.go:123 github.com/twmb/franz-go/pkg/kgo.(sink).handleReqClientErr /go/pkg/mod/github.com/twmb/franz-go@v1.16.1/pkg/kgo/sink.go:531 github.com/twmb/franz-go/pkg/kgo.(sink).handleReqResp /go/pkg/mod/github.com/twmb/franz-go@v1.16.1/pkg/kgo/sink.go:581 github.com/twmb/franz-go/pkg/kgo.(sink).produce.func3 /go/pkg/mod/github.com/twmb/franz-go@v1.16.1/pkg/kgo/sink.go:369 github.com/twmb/franz-go/pkg/kgo.(sink).handleSeqResps /go/pkg/mod/github.com/twmb/franz-go@v1.16.1/pkg/kgo/sink.go:411 "2024-06-18T20:37:48.106Z INFO kzap@v0.1.0/kzap.go:109 produce request failed, triggering metadata update {""broker"": ""1"", ""err"": ""ILLEGAL_SASL_STATE: Request is not valid given the current SASL state.""}" "2024-06-18T20:37:48.106Z DEBUG kzap@v0.1.0/kzap.go:105 retry batches processed {""wanted_metadata_update"": true, ""triggering_metadata_update"": true, ""should_backoff"": false}" "2024-06-18T20:37:48.106Z INFO kzap@v0.1.0/kzap.go:109 metadata update triggered {""why"": ""failed produce request triggered metadata update""}" "2024-06-18T20:37:48.191Z DEBUG kzap@v0.1.0/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 777, ""write_wait"": ""33.683µs"", ""time_to_write"": ""61.676µs"", ""err"": null}" "2024-06-18T20:37:48.192Z DEBUG kzap@v0.1.0/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""95.167µs"", ""time_to_read"": ""1.037898ms"", ""err"": null}" "2024-06-18T20:37:48.192Z DEBUG kzap@v0.1.0/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778240=>778241}]""}" "2024-06-18T20:37:48.389Z DEBUG kzap@v0.1.0/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 780, ""write_wait"": ""36.824µs"", ""time_to_write"": ""45.222µs"", ""err"": null}" "2024-06-18T20:37:48.390Z DEBUG kzap@v0.1.0/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""69.801µs"", ""time_to_read"": ""768.197µs"", ""err"": null}" "2024-06-18T20:37:48.390Z DEBUG kzap@v0.1.0/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778243=>778244}]""}" "2024-06-18T20:37:48.591Z DEBUG kzap@v0.1.0/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 780, ""write_wait"": ""23.634µs"", ""time_to_write"": ""46.818µs"", ""err"": null}" "2024-06-18T20:37:48.592Z DEBUG kzap@v0.1.0/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""61.724µs"", ""time_to_read"": ""786.672µs"", ""err"": null}" "2024-06-18T20:37:48.592Z DEBUG kzap@v0.1.0/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778246=>778247}]""}" "2024-06-18T20:37:48.650Z DEBUG kzap@v0.1.0/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 644, ""write_wait"": ""27.785µs"", ""time_to_write"": ""46.696µs"", ""err"": null}" "2024-06-18T20:37:48.650Z DEBUG kzap@v0.1.0/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""86.61µs"", ""time_to_read"": ""694.89µs"", ""err"": null}" "2024-06-18T20:37:48.650Z DEBUG kzap@v0.1.0/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778248=>778249}]""}" "2024-06-18T20:37:48.893Z DEBUG kzap@v0.1.0/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 780, ""write_wait"": ""30.491µs"", ""time_to_write"": ""53.437µs"", ""err"": null}" "2024-06-18T20:37:48.894Z DEBUG kzap@v0.1.0/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""80.824µs"", ""time_to_read"": ""780.472µs"", ""err"": null}" "2024-06-18T20:37:48.894Z DEBUG kzap@v0.1.0/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778251=>778252}]""}" "2024-06-18T20:37:48.953Z DEBUG kzap@v0.1.0/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 644, ""write_wait"": ""24.65µs"", ""time_to_write"": ""44.14µs"", ""err"": null}" "2024-06-18T20:37:48.954Z DEBUG kzap@v0.1.0/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""69.11µs"", ""time_to_read"": ""775.095µs"", ""err"": null}" "2024-06-18T20:37:48.954Z DEBUG kzap@v0.1.0/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778253=>778254}]""}" "2024-06-18T20:37:49.203Z DEBUG kzap@v0.1.0/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 777, ""write_wait"": ""32.71µs"", ""time_to_write"": ""48.865µs"", ""err"": null}" "2024-06-18T20:37:49.204Z DEBUG kzap@v0.1.0/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""83.708µs"", ""time_to_read"": ""851.99µs"", ""err"": null}" "2024-06-18T20:37:49.204Z DEBUG kzap@v0.1.0/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778256=>778257}]""}" "2024-06-18T20:37:49.267Z DEBUG kzap@v0.1.0/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 641, ""write_wait"": ""25.185µs"", ""time_to_write"": ""45.943µs"", ""err"": null}" "2024-06-18T20:37:49.268Z DEBUG kzap@v0.1.0/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""75.01µs"", ""time_to_read"": ""752.085µs"", ""err"": null}" "2024-06-18T20:37:49.268Z DEBUG kzap@v0.1.0/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778258=>778259}]""}" "2024-06-18T20:37:49.490Z DEBUG kzap@v0.1.0/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 780, ""write_wait"": ""25.02µs"", ""time_to_write"": ""42.43µs"", ""err"": null}" "2024-06-18T20:37:49.491Z DEBUG kzap@v0.1.0/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""70.766µs"", ""time_to_read"": ""848.823µs"", ""err"": null}" "2024-06-18T20:37:49.491Z DEBUG kzap@v0.1.0/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778260=>778261}]""}" "2024-06-18T20:37:49.549Z DEBUG kzap@v0.1.0/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 644, ""write_wait"": ""25.806µs"", ""time_to_write"": ""45.988µs"", ""err"": null}" "2024-06-18T20:37:49.549Z DEBUG kzap@v0.1.0/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""86.452µs"", ""time_to_read"": ""694.985µs"", ""err"": null}" "2024-06-18T20:37:49.549Z DEBUG kzap@v0.1.0/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778262=>778263}]""}" "2024-06-18T20:37:49.800Z DEBUG kzap@v0.1.0/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 780, ""write_wait"": ""24.503µs"", ""time_to_write"": ""44.782µs"", ""err"": null}" "2024-06-18T20:37:49.801Z DEBUG kzap@v0.1.0/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""76.126µs"", ""time_to_read"": ""818.682µs"", ""err"": null}" "2024-06-18T20:37:49.801Z DEBUG kzap@v0.1.0/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778266=>778267}]""}" "2024-06-18T20:37:49.859Z DEBUG kzap@v0.1.0/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 644, ""write_wait"": ""32.997µs"", ""time_to_write"": ""47.074µs"", ""err"": null}" "2024-06-18T20:37:49.860Z DEBUG kzap@v0.1.0/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""85.297µs"", ""time_to_read"": ""915.296µs"", ""err"": null}" ""2024-06-18T20:37:50.052Z DEBUG kzap@v0.1.0/kzap.go:105 sasl expiry limit reached, reauthenticating {""broker"": ""3""}" "2024-06-18T20:37:50.052Z DEBUG kzap@v0.1.0/kzap.go:105 issuing SASLHandshakeRequest {""broker"": ""3""}" "2024-06-18T20:37:50.052Z DEBUG kzap@v0.1.0/kzap.go:105 wrote SASLHandshake v1 {""broker"": ""3"", ""bytes_written"": 30, ""write_wait"": ""11.888µs"", ""time_to_write"": ""54.789µs"", ""err"": null}" "2024-06-18T20:37:50.052Z DEBUG kzap@v0.1.0/kzap.go:105 read SASLHandshake v1 {""broker"": ""3"", ""bytes_read"": 40, ""read_wait"": ""96.019µs"", ""time_to_read"": ""568.061µs"", ""err"": null}" "2024-06-18T20:37:50.052Z DEBUG kzap@v0.1.0/kzap.go:105 beginning sasl authentication {""broker"": ""3"", ""addr"": ""b-3-public.REDACTED.amazonaws.com:9198"", ""mechanism"": ""AWS_MSK_IAM"", ""authenticate"": true}" "2024-06-18T20:37:50.092Z DEBUG kzap@v0.1.0/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 777, ""write_wait"": ""22.265µs"", ""time_to_write"": ""44.407µs"", ""err"": null}" "2024-06-18T20:37:50.092Z DEBUG kzap@v0.1.0/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""58.667µs"", ""time_to_read"": ""877.818µs"", ""err"": null}" "2024-06-18T20:37:50.093Z DEBUG kzap@v0.1.0/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778271=>778272}]""}" "2024-06-18T20:37:50.133Z DEBUG kzap@v0.1.0/kzap.go:105 issuing SASLAuthenticate {""broker"": ""3"", ""version"": 2, ""step"": 0}" "2024-06-18T20:37:50.134Z DEBUG kzap@v0.1.0/kzap.go:105 wrote SASLAuthenticate v2 {""broker"": ""3"", ""bytes_written"": 1755, ""write_wait"": ""9.053µs"", ""time_to_write"": ""95.662µs"", ""err"": null}" "2024-06-18T20:37:50.151Z DEBUG kzap@v0.1.0/kzap.go:105 wrote Produce v9 {""broker"": ""1"", ""bytes_written"": 641, ""write_wait"": ""26.78µs"", ""time_to_write"": ""41.639µs"", ""err"": null}" "2024-06-18T20:37:50.152Z DEBUG kzap@v0.1.0/kzap.go:105 read Produce v9 {""broker"": ""1"", ""bytes_read"": 94, ""read_wait"": ""61.357µs"", ""time_to_read"": ""733.487µs"", ""err"": null}" "2024-06-18T20:37:50.152Z DEBUG kzap@v0.1.0/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778273=>778274}]""}" "2024-06-18T20:37:50.258Z DEBUG kzap@v0.1.0/kzap.go:105 read SASLAuthenticate v2 {""broker"": ""3"", ""bytes_read"": 294, ""read_wait"": ""50.968µs"", ""time_to_read"": ""124.369308ms"", ""err"": null}" "2024-06-18T20:37:50.258Z DEBUG kzap@v0.1.0/kzap.go:105 sasl reauth failed, retrying once on new connection {""broker"": ""3"", ""err"": ""Cannot change principals during re-authentication from IAM.arn:aws:sts::REDACTED:assumed-role/REDACTED/REDACTED: IAM.arn:aws:sts::REDACTED:assumed-role/REDACTED/REDACTED: SASL_AUTHENTICATION_FAILED: SASL Authentication failed.""}" "2024-06-18T20:37:50.258Z DEBUG kzap@v0.1.0/kzap.go:105 opening connection to broker {""addr"": ""b-3-public.REDACTED.amazonaws.com:9198"", ""broker"": ""3""}" "2024-06-18T20:37:50.267Z DEBUG kzap@v0.1.0/kzap.go:105 connection opened to broker {""addr"": ""b-3-public.REDACTED.amazonaws.com:9198"", ""broker"": ""3""}" "2024-06-18T20:37:50.267Z DEBUG kzap@v0.1.0/kzap.go:105 issuing SASLHandshakeRequest {""broker"": ""3""}" "2024-06-18T20:37:50.267Z DEBUG kzap@v0.1.0/kzap.go:105 wrote SASLHandshake v1 {""broker"": ""3"", ""bytes_written"": 30, ""write_wait"": ""8.466µs"", ""time_to_write"": ""32.194µs"", ""err"": null}" "2024-06-18T20:37:50.267Z DEBUG kzap@v0.1.0/kzap.go:105 read SASLHandshake v1 {""broker"": ""3"", ""bytes_read"": 40, ""read_wait"": ""36.113µs"", ""time_to_read"": ""458.292µs"", ""err"": null}" "2024-06-18T20:37:50.267Z DEBUG kzap@v0.1.0/kzap.go:105 beginning sasl authentication {""broker"": ""3"", ""addr"": ""b-3-public.REDACTED.amazonaws.com:9198"", ""mechanism"": ""AWS_MSK_IAM"", ""authenticate"": true}" "2024-06-18T20:37:50.267Z DEBUG kzap@v0.1.0/kzap.go:105 issuing SASLAuthenticate {""broker"": ""3"", ""version"": 2, ""step"": 0}" "2024-06-18T20:37:50.267Z DEBUG kzap@v0.1.0/kzap.go:105 wrote SASLAuthenticate v2 {""broker"": ""3"", ""bytes_written"": 1755, ""write_wait"": ""6.579µs"", ""time_to_write"": ""42.996µs"", ""err"": null}" "2024-06-18T20:37:50.279Z DEBUG kzap@v0.1.0/kzap.go:105 read SASLAuthenticate v2 {""broker"": ""3"", ""bytes_read"": 98, ""read_wait"": ""30.854µs"", ""time_to_read"": ""11.105484ms"", ""err"": null}" "2024-06-18T20:37:50.279Z DEBUG kzap@v0.1.0/kzap.go:105 sasl has a limited lifetime {""broker"": ""3"", ""session_lifetime"": ""59m59.722s"", ""lifetime_pessimism"": ""1m23.054s"", ""reauthenticate_in"": ""58m36.668s""}" "2024-06-18T20:37:50.279Z DEBUG kzap@v0.1.0/kzap.go:105 connection initialized successfully {""addr"": ""b-3-public.REDACTED.amazonaws.com:9198"", ""broker"": ""3""}" "2024-06-18T20:37:50.279Z DEBUG kzap@v0.1.0/kzap.go:105 wrote Metadata v12 {""broker"": ""3"", ""bytes_written"": 83, ""write_wait"": ""226.997748ms"", ""time_to_write"": ""47.72µs"", ""err"": null}" "2024-06-18T20:37:50.291Z DEBUG kzap@v0.1.0/kzap.go:105 read Metadata v12 {""broker"": ""3"", ""bytes_read"": 418, ""read_wait"": ""41.567µs"", ""time_to_read"": ""12.729919ms"", ""err"": null}" "2024-06-18T20:37:50.292Z DEBUG kzap@v0.1.0/kzap.go:105 metadata refresh has identical topic partition data {""topic"": ""REDACTED"", ""partition"": 0, ""leader"": 1, ""leader_epoch"": 150}" "2024-06-18T20:37:50.292Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": null}" "2024-06-18T20:37:50.293Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}" "2024-06-18T20:37:50.293Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}" "2024-06-18T20:37:50.293Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}" "2024-06-18T20:37:50.293Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}" "2024-06-18T20:37:50.293Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}" "2024-06-18T20:37:50.293Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}" "2024-06-18T20:37:50.293Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}" "2024-06-18T20:37:50.294Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}" "2024-06-18T20:37:50.294Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}" "2024-06-18T20:37:50.294Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}" "2024-06-18T20:37:50.294Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}" "2024-06-18T20:37:50.294Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}" "2024-06-18T20:37:50.294Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": ""context deadline exceeded""}" "2024-06-18T20:37:50.294Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": null}" "2024-06-18T20:37:50.294Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": null}" "2024-06-18T20:37:50.294Z WARN producer/producer.go:286 failed to produce messages {""error"": ""context deadline exceeded"", ""contextCause"": null}" "2024-06-18T20:37:50.303Z DEBUG kzap@v0.1.0/kzap.go:105 opening connection to broker {""addr"": ""b-1-public.REDACTED.amazonaws.com:9198"", ""broker"": ""1""}" "2024-06-18T20:37:50.353Z DEBUG kzap@v0.1.0/kzap.go:105 connection opened to broker {""addr"": ""b-1-public.REDACTED.amazonaws.com:9198"", ""broker"": ""1""}" "2024-06-18T20:37:50.353Z DEBUG kzap@v0.1.0/kzap.go:105 issuing SASLHandshakeRequest {""broker"": ""1""}" "2024-06-18T20:37:50.353Z DEBUG kzap@v0.1.0/kzap.go:105 wrote SASLHandshake v1 {""broker"": ""1"", ""bytes_written"": 30, ""write_wait"": ""10.279µs"", ""time_to_write"": ""33.919µs"", ""err"": null}" "2024-06-18T20:37:50.354Z DEBUG kzap@v0.1.0/kzap.go:105 read SASLHandshake v1 {""broker"": ""1"", ""bytes_read"": 40, ""read_wait"": ""36.204µs"", ""time_to_read"": ""953.121µs"", ""err"": null}" "2024-06-18T20:37:50.354Z DEBUG kzap@v0.1.0/kzap.go:105 beginning sasl authentication {""broker"": ""1"", ""addr"": ""b-1-public.REDACTED.amazonaws.com:9198"", ""mechanism"": ""AWS_MSK_IAM"", ""authenticate"": true}" "2024-06-18T20:37:50.354Z DEBUG kzap@v0.1.0/kzap.go:105 issuing SASLAuthenticate {""broker"": ""1"", ""version"": 2, ""step"": 0}" "2024-06-18T20:37:50.354Z DEBUG kzap@v0.1.0/kzap.go:105 wrote SASLAuthenticate v2 {""broker"": ""1"", ""bytes_written"": 1755, ""write_wait"": ""7.747µs"", ""time_to_write"": ""47.471µs"", ""err"": null}" "2024-06-18T20:37:50.381Z DEBUG kzap@v0.1.0/kzap.go:105 read SASLAuthenticate v2 {""broker"": ""1"", ""bytes_read"": 98, ""read_wait"": ""28.946µs"", ""time_to_read"": ""26.771741ms"", ""err"": null}" "2024-06-18T20:37:50.381Z DEBUG kzap@v0.1.0/kzap.go:105 sasl has a limited lifetime {""broker"": ""1"", ""session_lifetime"": ""59m59.62s"", ""lifetime_pessimism"": ""2m44.771s"", ""reauthenticate_in"": ""57m14.849s""}" "2024-06-18T20:37:50.381Z DEBUG kzap@v0.1.0/kzap.go:105 connection initialized successfully {""addr"": ""b-1-public.REDACTED.amazonaws.com:9198"", ""broker"": ""1""}" "2024-06-18T20:37:49.860Z DEBUG kzap@v0.1.0/kzap.go:105 produced {""broker"": ""1"", ""to"": ""REDACTED[0{778268=>778269}]""}"

You can see that they are now returning a "context deadline exceeded" error instead of a "context canceled" error because we added a parent context with a timeout of 500ms. It was taking several seconds to get the "context canceled" so we added this to fail faster so we can retry.

But the flow is the same. I can't tell in this scenario if the produce request that are failing were going to broker 1 or broker 3. If it's broker 3, then we appear to cancel and not retry produce requests for the original broker connection. See https://github.com/twmb/franz-go/issues/249 for the change that creates a new connection when SASL authentication fails. If it's broker 1, then it seems more related to the ILLEGAL_SASL_STATE due to the short session.

twmb commented 1 week ago

There is no canceling in the produce path of the codebase:

@m4x3r:~/src/twmb/franz-go/pkg/kgo
$ rg 'ancel\('
client.go
457:    ctx, cancel := context.WithCancel(context.Background())
1031:   sessCloseCancel()
1036:   cl.ctxCancel()
1318:   ctx, cancel := context.WithCancel(ctx)
1322:       defer cancel()
2031:   ctx, cancel := context.WithCancel(ctx)
2032:   defer cancel()

group_test.go
184:        cancel()

txn_test.go
173:        cancel()

consumer.go
1448:   ctx, cancel := context.WithCancel(c.cl.ctx)
1614:   session.cancel()

source.go
788:        ctx, cancel = context.WithCancel(consumerSession.ctx)
790:    defer cancel()

consumer_group.go
240:        ctx, cancel = context.WithCancel(context.Background())
241:        cancel()
276:    ctx, cancel := context.WithCancel(c.cl.ctx)
497:    g.cancel()
846:    ctx, cancel := context.WithCancel(g.ctx)
848:        defer cancel() // potentially kill offset fetching
2812:   commitCtx, commitCancel := context.WithCancel(ctx) // enable ours to be canceled and waited for
2829:               commitCancel()
2837:       defer commitCancel()
2843:               priorCancel()

consumer_direct_test.go
34: defer cancel()
108:    defer cancel()
156:    cancel()
163:    defer cancel()
232:        cancel()
328:    ctx, cancel := context.WithCancel(context.Background())
344:    defer cancel()
374:            cancel()
406:    ctx, cancel := context.WithCancel(context.Background())
422:    defer cancel()
452:            cancel()
499:        cancel()
523:    cancel()

txn.go
1163:       g.commitCancel() // cancel any prior commit
1173:   commitCtx, commitCancel := context.WithCancel(g.ctx) // enable ours to be canceled and waited for
1183:               commitCancel()
1191:       defer commitCancel()
1197:               priorCancel()

The cancel calls in client.go are unrelated or irrelevant here. Everything else is related to consuming.

Without looking at your code, my guess is that you are using the same context across multiple Produce calls and are canceling it after some messages return, which causes buffered records to be canceled too.

The logs above look normal, minus the SASL problems that we've discussed in a few issues. If you really want to see where the cancel is coming from, I think you could write a wrapper context.Context interface that calls debug.PrintStack on cancel before calling the inner (original) cancel func. I'd be pretty curious to know what the debug logs say.

I'm not really sure what to do in the client at this point for the SASL problems, considering this is almost the same behavior as the Kafka Java client itself. IMO, AWS needs to have some answer to the SASL problems -- like, propose a solution, or change what they allow. Given what I saw when I implemented the AWS_MSK_IAM auth 3yr ago, the broker side seemed a bit odd at the time.