
knative kafka eventing - huge delay #9699

Closed. p4p4 closed this issue 3 years ago.

p4p4 commented 3 years ago

Description

After resolving the OOM issue of the kne trigger dispatcher (#9698), we figured out that there is still an issue with a huge delay when dispatching events of another event type. We are also still noticing many duplicates (varying from 2, through 10, up to 40 duplicates per event).

event type: order.shipped
publishing system: CCv2
subscribers on this event type: 2 (named ODS + AFS)
expected load: 1000s of events per day

Some example delays: [screenshot showing example delays]

Received traffic at the subscriber: [screenshot of subscriber traffic]. Note that the data is received with the mentioned delay and in duplicates. It can also be seen that the traffic is bursty and that there are periods where we don't receive any events at all.
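For reference, here is a minimal sketch of one way to quantify the delay on the subscriber side (this is an assumption about how such a measurement could be done, not what the ODS/AFS subscribers actually run): compare the CloudEvents `time` attribute (the `ce-time` header in binary content mode) with the wall-clock time at which the event arrives.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// Sketch only: log the difference between the CloudEvents "time" attribute
// (ce-time header) and the arrival time for every incoming event.
func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		sent, err := time.Parse(time.RFC3339, r.Header.Get("ce-time"))
		if err != nil {
			http.Error(w, "missing or invalid ce-time header", http.StatusBadRequest)
			return
		}
		log.Printf("type=%s id=%s delay=%s",
			r.Header.Get("ce-type"), r.Header.Get("ce-id"), time.Since(sent))
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```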

Expected result

We are using the default parameters for the Kafka-based eventing:

➜  ~ k -n knative-eventing describe deployment knative-eventing-kafka-channel-controller 
Name:                   knative-eventing-kafka-channel-controller
Namespace:              knative-eventing
CreationTimestamp:      Fri, 24 Jul 2020 11:53:37 +0200
Labels:                 app=knative-kafka
                        app.kubernetes.io/instance=knative-eventing-kafka
                        app.kubernetes.io/managed-by=Tiller
                        app.kubernetes.io/name=knative-kafka
                        helm.sh/chart=knative-kafka-0.0.1
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               app.kubernetes.io/instance=knative-eventing-kafka,app.kubernetes.io/name=knative-kafka
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app=knative-kafka
                    app.kubernetes.io/instance=knative-eventing-kafka
                    app.kubernetes.io/managed-by=Tiller
                    app.kubernetes.io/name=knative-kafka
                    helm.sh/chart=knative-kafka-0.0.1
  Service Account:  knative-eventing-kafka-channel-controller
  Containers:
   channel-controller:
    Image:      eu.gcr.io/kyma-project/incubator/kafka-channel-controller:v0.12.2
    Port:       8081/TCP
    Host Port:  0/TCP
    Limits:
      cpu:     100m
      memory:  50Mi
    Requests:
      cpu:     20m
      memory:  25Mi
    Environment:
      SYSTEM_NAMESPACE:                           (v1:metadata.namespace)
      SERVICE_ACCOUNT:                            (v1:spec.serviceAccountName)
      METRICS_PORT:                              8081
      KAFKA_PROVIDER:                            azure
      KAFKA_OFFSET_COMMIT_MESSAGE_COUNT:         50
      KAFKA_OFFSET_COMMIT_DURATION_MILLIS:       5000
      KAFKA_OFFSET_COMMIT_ASYNC:                 false
      CHANNEL_IMAGE:                             eu.gcr.io/kyma-project/incubator/knative-kafka-channel:v0.12.2
      CHANNEL_REPLICAS:                          1
      DISPATCHER_IMAGE:                          eu.gcr.io/kyma-project/incubator/knative-kafka-dispatcher:v0.12.2
      DEFAULT_NUM_PARTITIONS:                    4
      DEFAULT_REPLICATION_FACTOR:                1
      DEFAULT_RETENTION_MILLIS:                  604800000
      DEFAULT_KAFKA_CONSUMERS:                   4
      DISPATCHER_REPLICAS:                       1
      DISPATCHER_RETRY_INITIAL_INTERVAL_MILLIS:  500
      DISPATCHER_RETRY_TIME_MILLIS:              300000
      DISPATCHER_RETRY_EXPONENTIAL_BACKOFF:      true
      DISPATCHER_CPU_REQUEST:                    300m
      DISPATCHER_CPU_LIMIT:                      500m
      DISPATCHER_MEMORY_REQUEST:                 50Mi
      DISPATCHER_MEMORY_LIMIT:                   128Mi
      CHANNEL_MEMORY_REQUEST:                    50Mi
      CHANNEL_MEMORY_LIMIT:                      100Mi
      CHANNEL_CPU_REQUEST:                       100m
      CHANNEL_CPU_LIMIT:                         200m
    Mounts:
      /etc/knative-kafka from logging-config (rw)
  Volumes:
   logging-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      knative-kafka-logging
    Optional:  false
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      True    MinimumReplicasAvailable
OldReplicaSets:  <none>
NewReplicaSet:   knative-eventing-kafka-channel-controller-6b7bb469b5 (1/1 replicas created)
Events:          <none>
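Based on the retry settings above (DISPATCHER_RETRY_INITIAL_INTERVAL_MILLIS=500, DISPATCHER_RETRY_TIME_MILLIS=300000, exponential backoff enabled), a single undeliverable event can keep the dispatcher busy for several minutes. A back-of-the-envelope sketch, assuming the interval simply doubles per attempt and retrying stops once the accumulated wait time reaches the retry budget (the actual dispatcher behaviour may differ):

```go
package main

import "fmt"

// Rough sketch (not the actual knative-kafka code): estimate how long the
// dispatcher can spend retrying a single undeliverable event, assuming the
// interval starts at DISPATCHER_RETRY_INITIAL_INTERVAL_MILLIS, doubles on
// every attempt (DISPATCHER_RETRY_EXPONENTIAL_BACKOFF=true), and retrying
// stops once the accumulated wait time reaches DISPATCHER_RETRY_TIME_MILLIS.
func main() {
	const (
		initialIntervalMillis = 500    // DISPATCHER_RETRY_INITIAL_INTERVAL_MILLIS
		maxRetryTimeMillis    = 300000 // DISPATCHER_RETRY_TIME_MILLIS (5 minutes)
	)

	interval, total, attempts := initialIntervalMillis, 0, 0
	for total+interval <= maxRetryTimeMillis {
		total += interval
		attempts++
		fmt.Printf("attempt %2d: wait %6d ms (cumulative %6d ms)\n", attempts, interval, total)
		interval *= 2
	}
	fmt.Printf("~%d retries, up to ~%.1f min spent on one bad delivery\n",
		attempts, float64(total)/60000)
}
```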
anishj0shi commented 3 years ago

From the preliminary investigation (thanks to @travis-minke-sap) it was observed that the broker filter pod was constantly trying to reach one of the subscribers (colorblobservice), which has a systemic issue. Since the broker filter traffic goes through the same knative channel dispatcher, this is a probable cause of the huge delay in processing events of other event types as well. The investigation is still in progress.
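To illustrate the suspected head-of-line blocking, here is a minimal sketch (not the actual knative-kafka dispatcher code; subscriber names are taken from this issue, everything else is hypothetical): if events from a partition are dispatched in order and each failing delivery is retried with blocking back-off, one unreachable subscriber stalls every event type queued behind it.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical delivery function: only "colorblobservice" is unreachable.
func deliver(subscriber string) error {
	if subscriber == "colorblobservice" {
		return errors.New("connection refused")
	}
	return nil
}

func main() {
	// In-order events waiting on one partition; the first one blocks the rest.
	queue := []string{"colorblobservice", "ods", "afs", "ods"}

	for _, sub := range queue {
		start := time.Now()
		backoff := 500 * time.Millisecond
		for deliver(sub) != nil && backoff <= 8*time.Second { // shortened retry budget for the example
			time.Sleep(backoff) // blocking retry: nothing behind this event moves
			backoff *= 2
		}
		fmt.Printf("event for %-16s dispatched (or given up on) after %v\n",
			sub, time.Since(start).Round(time.Millisecond))
	}
}
```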

k15r commented 3 years ago

Closed as agreed with @p4p4.