kyma-project / kyma

Kyma is an opinionated set of Kubernetes-based modular building blocks, including all necessary capabilities to develop and run enterprise-grade cloud-native applications.
https://kyma-project.io
Apache License 2.0

Fix core-test-external-solution-event-mesh test on Gardener #8283

Closed · marcobebway closed this issue 4 years ago

marcobebway commented 4 years ago

Description

The kyma-integration-gardener-azure prow job is failing.

Job

https://status.build.kyma-project.io/view/gcs/kyma-prow-logs/logs/kyma-integration-gardener-azure/1255029318887673856

Logs

time="2020-04-28T07:51:33Z" level=info msg="Step: 'Send Cloud Event to Mesh'"
time="2020-04-28T07:51:34Z" level=info msg="Step: 'Check counter pod'"
time="2020-04-28T07:58:01Z" level=error msg="Error in 'Check counter pod': the counter pod is not updated: All attempts fail:
#1: counter different then expected value: Got: 0 but expected 1
#120: counter different then expected value: Got: 0 but expected 1"

AC

sayanh commented 4 years ago

Cluster creation was broken due to missing components, which was fixed here. The test was successful after the fix. Let's keep an eye on this issue.

sayanh commented 4 years ago

As it turns out, this test is flaky. It failed again https://storage.googleapis.com/kyma-prow-logs/logs/kyma-integration-gardener-azure/1255401508933996544/build-log.txt.

sayanh commented 4 years ago

I did a quick test; it seems the timeout is too short for the event to traverse from Azure Event Hubs to Kyma. The counter pod does get its counter incremented once the event arrives, but only after the test has already timed out. I think increasing the timeout for the CheckCounterPod step would make this test more robust.
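A minimal sketch of what a longer CheckCounterPod poll could look like, assuming a retry helper such as avast/retry-go (the "All attempts fail: #1: ..." format in the logs above suggests something similar); checkCounterPod, getCounter, and the attempt counts are illustrative, not the actual test code:

    package testsuite

    import (
        "fmt"
        "time"

        "github.com/avast/retry-go"
    )

    // checkCounterPod polls the counter pod until it reports the expected
    // value or all attempts are exhausted.
    func checkCounterPod(getCounter func() (int, error), expected int) error {
        return retry.Do(
            func() error {
                got, err := getCounter()
                if err != nil {
                    return err
                }
                if got != expected {
                    return fmt.Errorf("counter different than expected value: Got: %d but expected %d", got, expected)
                }
                return nil
            },
            // 240 attempts x 5s is roughly 20 minutes, instead of the ~6.5
            // minutes (120 attempts) visible in the failing logs above.
            retry.Attempts(240),
            retry.Delay(5*time.Second),
            retry.DelayType(retry.FixedDelay),
        )
    }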

a-thaler commented 4 years ago

The test still has a high error rate; see https://status.build.kyma-project.io/?job=kyma-integration-gardener-azure

ghost commented 4 years ago

During my tests, I've found other errors too:

Tomasz-Smelcerz-SAP commented 4 years ago

This test also failed twice (on consecutive days) on the weekly cluster, which is deployed on GKE.

ghost commented 4 years ago

The following step should be done in #8409

anishj0shi commented 4 years ago

Comments for @radufa

This should be the second step to make the e2e tests more reliable; see #8281 for the first step.

Instead of using waiting loops and timers to wait until the system is in a stable state, the test should use the "status" information of each "component" before sending a request (using this component). Each test step should be started only if its "preconditions" are fulfilled.
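A minimal sketch of such a precondition check, assuming the knative eventing v1alpha1 client the test already uses (the helper name and the manual condition walk are illustrative):

    package testsuite

    import (
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        eventingv1alpha1 "knative.dev/eventing/pkg/client/clientset/versioned/typed/eventing/v1alpha1"
    )

    // triggerIsReady reports whether the Trigger's "Ready" condition is True.
    // Test steps that send events through the Trigger would run only once
    // this precondition holds.
    func triggerIsReady(triggers eventingv1alpha1.TriggersGetter, namespace, name string) (bool, error) {
        trigger, err := triggers.Triggers(namespace).Get(name, metav1.GetOptions{})
        if err != nil {
            return false, fmt.Errorf("getting trigger %s/%s: %v", namespace, name, err)
        }
        for _, cond := range trigger.Status.Conditions {
            if cond.Type == "Ready" {
                return cond.Status == "True", nil
            }
        }
        return false, nil
    }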

ghost commented 4 years ago

It seems that we could lose at least one event. I am working to reproduce the "Check counter pod" error, which appears multiple times with Event Hub if s.waitTime = 0 (default 10s):

        // Create the Knative Trigger subscribing the lambda to the default broker.
        testsuite.NewCreateKnativeTrigger(s.testID, defaultBrokerName, lambdaEndpoint, knativeEventingClientSet.EventingV1alpha1().Triggers(s.testID)),
        // Wait s.waitTime before sending; the error reproduces when this is 0.
        testsuite.NewSleep(s.waitTime),
        // Send the test event into the event mesh.
        testsuite.NewSendEventToMesh(s.testID, helpers.LambdaPayload, state),

Even though the trigger is in "Ready" status, the event is not received in Event Hub. Here are the results of 3 consecutive tests; the first 2 are OK and the last one is NOK. From the "knative-kafka-channel" istio-proxy log file:

-- Test 1 -- OK --
[2020-05-15T07:02:49.431Z] "POST / HTTP/1.1" 202 - "-" "-" 7 0 84 81 "-" "Go-http-client/1.1" "511b73f5-93a8-463d-9476-2a0dc5f96fe9" "e2e-event-mesh-kafkachannel.kyma-integration.svc.cluster.local" "127.0.0.1:8080" inbound|80|http|knative-kafka-channel.knative-eventing.svc.cluster.local - 100.96.1.33:8080 100.96.2.48:38860 - default
[2020-05-15T07:02:50.616Z] "POST / HTTP/1.1" 202 - "-" "-" 7 0 76 74 "-" "Go-http-client/1.1" "f93b7f4b-c430-4341-9031-bba558de2e1a" "default-kne-trigger-kafkachannel.e2e-event-mesh.svc.cluster.local" "127.0.0.1:8080" inbound|80|http|knative-kafka-channel.knative-eventing.svc.cluster.local - 100.96.1.33:8080 100.96.0.165:51806 - default
[2020-05-15T07:02:53.578Z] "POST / HTTP/1.1" 202 - "-" "-" 9 0 31 30 "-" "Go-http-client/1.1" "79a697ba-998f-4c36-9f73-b9d7004be58e" "e2e-event-mesh-kafkachannel.kyma-integration.svc.cluster.local" "127.0.0.1:8080" inbound|80|http|knative-kafka-channel.knative-eventing.svc.cluster.local - 100.96.1.33:8080 100.96.2.48:38860 - default
[2020-05-15T07:02:55.671Z] "POST / HTTP/1.1" 202 - "-" "-" 9 0 59 59 "-" "Go-http-client/1.1" "5f14175b-30aa-4600-a99d-d3320afe6765" "default-kne-trigger-kafkachannel.e2e-event-mesh.svc.cluster.local" "127.0.0.1:8080" inbound|80|http|knative-kafka-channel.knative-eventing.svc.cluster.local - 100.96.1.33:8080 100.96.0.165:51806 - default

-- Test 2 -- OK --
[2020-05-15T07:10:27.675Z] "POST / HTTP/1.1" 202 - "-" "-" 7 0 67 64 "-" "Go-http-client/1.1" "b083e0ab-6618-437b-bcce-7aeacd7b96fa" "e2e-event-mesh-kafkachannel.kyma-integration.svc.cluster.local" "127.0.0.1:8080" inbound|80|http|knative-kafka-channel.knative-eventing.svc.cluster.local - 100.96.1.33:8080 100.96.2.53:44220 - default
[2020-05-15T07:10:29.618Z] "POST / HTTP/1.1" 202 - "-" "-" 7 0 76 74 "-" "Go-http-client/1.1" "82087c86-7173-4556-9069-9c21a3bd448f" "default-kne-trigger-kafkachannel.e2e-event-mesh.svc.cluster.local" "127.0.0.1:8080" inbound|80|http|knative-kafka-channel.knative-eventing.svc.cluster.local - 100.96.1.33:8080 100.96.1.218:37142 - default
[2020-05-15T07:10:31.808Z] "POST / HTTP/1.1" 202 - "-" "-" 9 0 61 61 "-" "Go-http-client/1.1" "c79d2ae9-75ce-44e9-a95b-1059cf7b2561" "e2e-event-mesh-kafkachannel.kyma-integration.svc.cluster.local" "127.0.0.1:8080" inbound|80|http|knative-kafka-channel.knative-eventing.svc.cluster.local - 100.96.1.33:8080 100.96.2.53:44220 - default
[2020-05-15T07:10:33.707Z] "POST / HTTP/1.1" 202 - "-" "-" 9 0 59 58 "-" "Go-http-client/1.1" "af5d9060-9330-4826-8403-0a21508c5c49" "default-kne-trigger-kafkachannel.e2e-event-mesh.svc.cluster.local" "127.0.0.1:8080" inbound|80|http|knative-kafka-channel.knative-eventing.svc.cluster.local - 100.96.1.33:8080 100.96.1.218:37142 - default

[Envoy (Epoch 0)] [2020-05-15 07:13:55.002][24][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,

-- Test 3 -- NOK --
[2020-05-15T07:23:44.588Z] "POST / HTTP/1.1" 202 - "-" "-" 7 0 62 59 "-" "Go-http-client/1.1" "78d00e79-675f-40cd-b5bd-e38ce452d958" "e2e-event-mesh-kafkachannel.kyma-integration.svc.cluster.local" "127.0.0.1:8080" inbound|80|http|knative-kafka-channel.knative-eventing.svc.cluster.local - 100.96.1.33:8080 100.96.0.173:39420 - default
[2020-05-15T07:23:45.741Z] "POST / HTTP/1.1" 202 - "-" "-" 7 0 192 190 "-" "Go-http-client/1.1" "0c27c2a8-d24e-4cc4-a511-ad99757ee8b8" "default-kne-trigger-kafkachannel.e2e-event-mesh.svc.cluster.local" "127.0.0.1:8080" inbound|80|http|knative-kafka-channel.knative-eventing.svc.cluster.local - 100.96.1.33:8080 100.96.2.62:36006 - default

[Envoy (Epoch 0)] [2020-05-15 07:41:12.892][24][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
[Envoy (Epoch 0)] [2020-05-15 08:03:01.617][24][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,

The corresponding Messaging status of Event Hub is shown in the attached screenshot (LatstTestWithError.PNG).

It seems that we have an issue here with the "knative-kafka-channel" connected to Event Hub.

travis-minke-sap commented 4 years ago

A couple of scenarios were uncovered while investigating potential "event loss" issues related to the e2e tests. I will attempt to describe them in detail, starting with the simpler of the two...

Issue 1 - Knative Filter 503 Errors

A scenario was witnessed whereby the dispatcher for the namespace-specific KafkaChannel / Topic logged successful transmission of the event to the Knative Trigger, having received a 202 back. The subscriber, however, never received the event. A quick look at the Knative Filter logs via kubectl logs -f -n e2e-event-mesh default-broker-filter-766bb5bf5f-tqc58 filter showed the following 503 errors...

{"level":"warn","ts":1589909169.2971025,"logger":"fallback","caller":"http/transport.go:508","msg":"got an error from receiver fn","error":"error sending cloudevent: 503 Service Unavailable"}
{"level":"warn","ts":1589909169.2971194,"logger":"fallback","caller":"http/transport.go:600","msg":"error returned from invokeReceiver","error":"error sending cloudevent: 503 Service Unavailable"}
{"level":"error","ts":"2020-05-19T17:26:09.768Z","logger":"broker_filter","caller":"filter/filter_handler.go:209","msg":"Error sending the event","commit":"8a92a8c","error":"error sending cloudevent: 503 Service Unavailable","stacktrace":"knative.dev/eventing/pkg/broker/filter.(*Handler).serveHTTP\n\t/home/prow/go/src/knative.dev/eventing/pkg/broker/filter/filter_handler.go:209\nreflect.Value.call\n\t/usr/local/go/src/reflect/value.go:460\nreflect.Value.Call\n\t/usr/local/go/src/reflect/value.go:321\nknative.dev/eventing/vendor/github.com/cloudevents/sdk-go/pkg/cloudevents/client.(*receiverFn).invoke\n\t/home/prow/go/src/knative.dev/eventing/vendor/github.com/cloudevents/sdk-go/pkg/cloudevents/client/receiver.go:93\nknative.dev/eventing/vendor/github.com/cloudevents/sdk-go/pkg/cloudevents/client.(*ceClient).obsReceive\n\t/home/prow/go/src/knative.dev/eventing/vendor/github.com/cloudevents/sdk-go/pkg/cloudevents/client/client.go:133\nknative.dev/eventing/vendor/github.com/cloudevents/sdk-go/pkg/cloudevents/client.(*ceClient).Receive\n\t/home/prow/go/src/knative.dev/eventing/vendor/github.com/cloudevents/sdk-go/pkg/cloudevents/client/client.go:121\nknative.dev/eventing/vendor/github.com/cloudevents/sdk-go/pkg/cloudevents/transport/http.(*Transport).obsInvokeReceiver\n\t/home/prow/go/src/knative.dev/eventing/vendor/github.com/cloudevents/sdk-go/pkg/cloudevents/transport/http/transport.go:506\nknative.dev/eventing/vendor/github.com/cloudevents/sdk-go/pkg/cloudevents/transport/http.(*Transport).invokeReceiver\n\t/home/prow/go/src/knative.dev/eventing/vendor/github.com/cloudevents/sdk-go/pkg/cloudevents/transport/http/transport.go:490\nknative.dev/eventing/vendor/github.com/cloudevents/sdk-go/pkg/cloudevents/transport/http.(*Transport).ServeHTTP\n\t/home/prow/go/src/knative.dev/eventing/vendor/github.com/cloudevents/sdk-go/pkg/cloudevents/transport/http/transport.go:598\nnet/http.(*ServeMux).ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2387\nknative.dev/eventing/vendor/go.opencensus.io/plugin/ochttp.(*Handler).ServeHTTP\n\t/home/prow/go/src/knative.dev/eventing/vendor/go.opencensus.io/plugin/ochttp/server.go:86\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2802\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1890"}

Further investigation to determine the root cause of the 503 errors is warranted, but as far as the dispatcher was concerned, the broker/trigger/filter returned a successful 202.

Issue 2 - Knative-Kafka Dispatcher Initialization

Before we get into the specific use case implemented by the e2e test, it would be helpful to provide some background...

There are two KafkaChannels / Topics involved in sending an event through the Kyma eventing runtime: one for ingress and another for the namespace relay. The sender of the event receives a 202/success once the event has been written to the first Topic. After that point, all handling of the event is inherently asynchronous by virtue of producing/consuming to/from the Kafka Topics.
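As background for how that 202 should be interpreted, here is a minimal sender sketch using the CloudEvents Go SDK v2 purely for illustration (the event type, source, and ingress URL are placeholders, not Kyma's actual values):

    package main

    import (
        "context"
        "log"

        cloudevents "github.com/cloudevents/sdk-go/v2"
    )

    func main() {
        c, err := cloudevents.NewClientHTTP()
        if err != nil {
            log.Fatalf("failed to create client: %v", err)
        }

        event := cloudevents.NewEvent()
        event.SetType("order.created")    // placeholder event type
        event.SetSource("e2e-event-mesh") // placeholder source
        _ = event.SetData(cloudevents.ApplicationJSON, map[string]string{"id": "1"})

        // Placeholder ingress URL; in Kyma this would be the event mesh endpoint.
        ctx := cloudevents.ContextWithTarget(context.Background(), "http://example-ingress.local")

        // An ACK (202) here only means the event was written to the first
        // (ingress) Kafka Topic; everything downstream is asynchronous.
        if result := c.Send(ctx, event); cloudevents.IsUndelivered(result) {
            log.Fatalf("failed to send: %v", result)
        }
        log.Println("event accepted by ingress; downstream delivery is async")
    }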

This initial KafkaChannel / Topic is created when an Application is added to the Kyma system. The secondary KafkaChannel / Topic is created when an Application is bound to a namespace. Finally, a Trigger is created when a lambda is created. When the dispatcher associated with the Trigger / Subscription starts up for the first time, it begins at the latest event offset and starts consuming NEW events only. These actions will be performed manually by customers at human speed (clicks on various admin console pages). More importantly, there is NO relationship between these actions, meaning that an application might be bound to 0 or more namespaces, with or without any number of active subscribers.
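To make the "latest offset" behavior concrete, here is a minimal consumer-group sketch; it uses the Shopify/sarama library purely for illustration (the actual knative-kafka dispatcher may be built on a different Kafka client), with placeholder broker, group, and topic names:

    package main

    import (
        "context"
        "log"

        "github.com/Shopify/sarama"
    )

    // handler is a near-no-op sarama consumer-group handler, for illustration.
    type handler struct{}

    func (handler) Setup(sarama.ConsumerGroupSession) error   { return nil }
    func (handler) Cleanup(sarama.ConsumerGroupSession) error { return nil }
    func (handler) ConsumeClaim(sess sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
        for msg := range claim.Messages() {
            log.Printf("consumed event at offset %d", msg.Offset)
            sess.MarkMessage(msg, "")
        }
        return nil
    }

    func main() {
        config := sarama.NewConfig()
        config.Version = sarama.V2_0_0_0 // consumer groups need Kafka >= 0.10.2
        // Start at the LATEST offset: a group with no committed offset yet
        // (e.g. a dispatcher starting for the first time) only sees events
        // produced AFTER it joined; earlier events are never delivered to it.
        config.Consumer.Offsets.Initial = sarama.OffsetNewest

        group, err := sarama.NewConsumerGroup([]string{"kafka:9092"}, "example-dispatcher", config)
        if err != nil {
            log.Fatalf("failed to create consumer group: %v", err)
        }
        defer group.Close()

        for {
            if err := group.Consume(context.Background(), []string{"example-topic"}, handler{}); err != nil {
                log.Fatalf("consume error: %v", err)
            }
        }
    }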

Therefore, it is important to understand that sending an event to an application, and receiving a 202 response back, in NO WAY guarantees delivery to specific triggers/subscribers that might be about to start up, in the process of starting up, or have just started up. The "at least once" guarantee only applies to triggers / subscriptions / dispatchers that are currently up and running. As a reminder, the retry is configurable but defaults to a five-minute exponential backoff before dropping the event and moving on. No guarantees are made regarding WHEN a newly starting trigger / subscription / dispatcher will START receiving events. In practice it should be a very short amount of time, on the order of seconds, but that depends on cluster load, K8S scheduling, and myriad other factors. The Channel pod (Kafka Producer) does NOT perform any checking of downstream state (for example, Channels / Subscriptions / Triggers being in READY state) before producing events to a Kafka Topic.
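As a sketch of the retry semantics described above (not the dispatcher's actual code or configuration), an exponential backoff that gives up after a few minutes could look like this, using the k8s.io/apimachinery wait package:

    package main

    import (
        "log"
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
    )

    // deliver stands in for the dispatcher's attempt to POST an event to a
    // subscriber; it returns true once the subscriber accepts the event.
    func deliver() (bool, error) {
        // ... HTTP POST to the subscriber would go here ...
        return false, nil
    }

    func main() {
        // Illustrative backoff: up to 9 attempts with sleeps of 1s, 2s, ...
        // 128s between them (a bit over four minutes in total), in the
        // spirit of the five-minute default mentioned above.
        backoff := wait.Backoff{
            Duration: 1 * time.Second,
            Factor:   2.0,
            Steps:    9,
        }
        if err := wait.ExponentialBackoff(backoff, deliver); err != nil {
            // Once the retries are exhausted, the event is dropped and the
            // dispatcher moves on -- the "retry, then give up" behavior
            // described above.
            log.Printf("dropping event after retries: %v", err)
        }
    }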

Now back to the e2e tests...

The tests were failing on a race condition in the startup of the dispatcher. They rely on the Trigger being READY as an indication that the dispatcher has finished reconciling the subscription and has created consumers that are initialized and actively consuming. Adding a sleep in the test before sending events gives the dispatcher enough time to be up and ready. While this feels wrong (tests should generally not be timing-sensitive), we are choreographing a bunch of unrelated initialization processes in a way that doesn't map to production usage (as described above).
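If a dispatcher-side readiness signal ever becomes observable, the fixed sleep could be replaced by an upper-bounded poll on it; a minimal sketch using k8s.io/apimachinery's wait.PollImmediate, where the readiness predicate is a placeholder:

    package testsuite

    import (
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
    )

    // waitForDispatcherReady polls an arbitrary readiness predicate until it
    // reports true, replacing a fixed sleep with a bounded wait. The predicate
    // is a placeholder for whatever dispatcher-side signal becomes observable.
    func waitForDispatcherReady(isReady func() (bool, error)) error {
        // Check immediately, then every 2s, giving up after 2 minutes.
        return wait.PollImmediate(2*time.Second, 2*time.Minute, isReady)
    }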

Possible Changes...

Conclusions...

I would argue that while it's possible to tighten up the implementation a bit, the fundamental expectations of how the system ought to behave were simply misplaced. In a production scenario, a human user will configure these things and then start sending events, which allows plenty of time for the dispatcher / consumers to be running. Further, the user has no way to see/know that a Trigger is READY, and if they send events very quickly after creating a lambda and the first couple don't flow through, that ought to be within the advertised operating parameters.

ghost commented 4 years ago

This was closed automatically because the related PR #8489 was merged; reopening it.