knative / eventing

Event-driven application platform for Kubernetes
https://knative.dev/docs/eventing
Apache License 2.0
1.4k stars 590 forks source link

An event sent during upgrade test shutdown might be actually delivered #7446

Closed mgencur closed 9 months ago

mgencur commented 9 months ago

Describe the bug When event sending fails at the end of an upgrade test and the test suite is re-sending the event while the sender is being shut down, it can happen that the receiver gets the event but sender thinks it was not delivered (because the sender is interrupted). The test than fails with an error similar to this one: "expecting to have 16690 unique events received, but received 16691 unique events". The code is here. The upgrade test framework should check if the last step was interrupted and in that case allow the number of received event to be "number of sent events plus 1".

Test log:

    logger.go:130: 2023-11-13T09:02:40.170-0500 INFO    WARNING: Duplicate:event #16691 received 2 times, but should be received only once
    logger.go:130: 2023-11-13T09:02:40.169-0500 INFO    WARNING: Duplicate:event #16690 received 2 times, but should be received only once
    prober.go:78: There were 16691 events propagated, but 1 errors occurred. Listing them below.
    prober.go:106: expecting to have 16690 unique events received, but received 16691 unique events
    logger.go:130: 2023-11-13T09:02:40.172-0500 DEBUG   Finished "ChannelContinualTest"

Sender:

{"level":"warn","ts":"2023-11-13T14:00:03.816696676Z","caller":"sender/services.go:102","msg":"Could not send step event 16690, retrying (0): Post \"http://sut-kn-channel.eventing-e2e0.svc.cluster.local\": EOF"}
{"level":"info","ts":"2023-11-13T14:00:03.816770975Z","caller":"sender/services.go:207","msg":"Sending step event #16691 to \"http://sut-kn-channel.eventing-e2e0.svc.cluster.local\""}
{"level":"warn","ts":"2023-11-13T14:00:03.821102525Z","caller":"sender/services.go:102","msg":"Could not send step event 16690, retrying (1): Post \"http://sut-kn-channel.eventing-e2e0.svc.cluster.local\": dial tcp 172.30.99.91:80: connect: connection refused"}
{"level":"info","ts":"2023-11-13T14:00:03.821158301Z","caller":"sender/services.go:207","msg":"Sending step event #16691 to \"http://sut-kn-channel.eventing-e2e0.svc.cluster.local\""}
{"level":"warn","ts":"2023-11-13T14:00:03.825665798Z","caller":"sender/services.go:102","msg":"Could not send step event 16690, retrying (2): Post \"http://sut-kn-channel.eventing-e2e0.svc.cluster.local\": dial tcp 172.30.99.91:80: connect: connection refused"}

Note: There's a bug in logging the exact step that failed. The message "Could not send step event 16690" should actually be "Could not send step event 16691" because the message include a number of already sent events, not the index of the current message (which is +1).

Receiver:

{"level":"error","ts":"2023-11-13T14:00:04.955956723Z","caller":"event/services.go:210","msg":"event #16690 received 2 times, but should be received only once","stacktrace":"knative.dev/eventing/test/upgrade/prober/wathola/event.(*ErrorStore).appendThrown\n\tknative.dev/eventing/test/upgrade/prober/wathola/event/services.go:210\nknative.dev/eventing/test/upgrade/prober/wathola/event.(*ErrorStore).throwDuplicated\n\tknative.dev/eventing/test/upgrade/prober/wathola/event/services.go:190\nknative.dev/eventing/test/upgrade/prober/wathola/event.(*stepStore).RegisterStep\n\tknative.dev/eventing/test/upgrade/prober/wathola/event/services.go:79\nknative.dev/eventing/test/upgrade/prober/wathola/receiver.receiver.receiveEvent\n\tknative.dev/eventing/test/upgrade/prober/wathola/receiver/services.go:73\nreflect.Value.call\n\treflect/value.go:556\nreflect.Value.Call\n\treflect/value.go:339\ngithub.com/cloudevents/sdk-go/v2/client.(*receiverFn).invoke\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/receiver.go:91\ngithub.com/cloudevents/sdk-go/v2/client.(*receiveInvoker).Invoke.func2\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/invoker.go:85\ngithub.com/cloudevents/sdk-go/v2/client.(*receiveInvoker).Invoke\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/invoker.go:88\ngithub.com/cloudevents/sdk-go/v2/client.(*ceClient).StartReceiver.func2.1\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/client.go:253\ngithub.com/cloudevents/sdk-go/v2/client.(*ceClient).StartReceiver.func2.2\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/client.go:265"}
{"level":"error","ts":"2023-11-13T14:00:05.77001412Z","caller":"event/services.go:210","msg":"event #16691 received 2 times, but should be received only once","stacktrace":"knative.dev/eventing/test/upgrade/prober/wathola/event.(*ErrorStore).appendThrown\n\tknative.dev/eventing/test/upgrade/prober/wathola/event/services.go:210\nknative.dev/eventing/test/upgrade/prober/wathola/event.(*ErrorStore).throwDuplicated\n\tknative.dev/eventing/test/upgrade/prober/wathola/event/services.go:190\nknative.dev/eventing/test/upgrade/prober/wathola/event.(*stepStore).RegisterStep\n\tknative.dev/eventing/test/upgrade/prober/wathola/event/services.go:79\nknative.dev/eventing/test/upgrade/prober/wathola/receiver.receiver.receiveEvent\n\tknative.dev/eventing/test/upgrade/prober/wathola/receiver/services.go:73\nreflect.Value.call\n\treflect/value.go:556\nreflect.Value.Call\n\treflect/value.go:339\ngithub.com/cloudevents/sdk-go/v2/client.(*receiverFn).invoke\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/receiver.go:91\ngithub.com/cloudevents/sdk-go/v2/client.(*receiveInvoker).Invoke.func2\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/invoker.go:85\ngithub.com/cloudevents/sdk-go/v2/client.(*receiveInvoker).Invoke\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/invoker.go:88\ngithub.com/cloudevents/sdk-go/v2/client.(*ceClient).StartReceiver.func2.1\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/client.go:253\ngithub.com/cloudevents/sdk-go/v2/client.(*ceClient).StartReceiver.func2.2\n\tgithub.com/cloudevents/sdk-go/v2@v2.13.0/client/client.go:265"}

Expected behavior When the last step on sender side failed and the test was shutdown at that moment, the receiver should tolerate one more event to be received because there's no way to verify if the event was actually delivered or not.

To Reproduce This happens only rarely and can be reproduced by running the upgrade tests.

Knative release version

Additional context Add any other context about the problem here such as proposed priority

mgencur commented 9 months ago

/assign