kyma-project / kyma

Kyma is an opinionated set of Kubernetes-based modular building blocks, including all necessary capabilities to develop and run enterprise-grade cloud-native applications.
https://kyma-project.io
Apache License 2.0

Verify dispatching behaviour to Jetstream #14965

Closed · k15r closed this issue 2 years ago

k15r commented 2 years ago

Description

Verify that EPP rejects messages in the following scenario:

Acceptance

k15r commented 2 years ago

Investigate the context deadline exceeded message that occurred during the outage.

$ nats stream info
nats: error: could not pick a Stream to operate on: context deadline exceeded
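The context deadline exceeded here is a client-side timeout: the JetStream API request for the stream info gets no reply in time. A minimal sketch that reproduces the same kind of timeout with the nats.go client, assuming a port-forwarded NATS service and the sap stream name that shows up in the server logs further down:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// URL is an assumption for a port-forwarded eventing-nats service.
	nc, err := nats.Connect("nats://localhost:4222")
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer nc.Close()

	// Bound every JetStream API call to 2s, similar to the CLI's request timeout.
	js, err := nc.JetStream(nats.MaxWait(2 * time.Second))
	if err != nil {
		log.Fatalf("jetstream context: %v", err)
	}

	// While the cluster has no JetStream meta leader, this request gets no
	// reply and fails with a timeout ("context deadline exceeded" in the CLI).
	info, err := js.StreamInfo("sap")
	if err != nil {
		log.Fatalf("stream info: %v", err)
	}
	fmt.Printf("stream %s: %d messages\n", info.Config.Name, info.State.Msgs)
}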
k15r commented 2 years ago

Potential options to reproduce:

k15r commented 2 years ago

TB: 3d

vpaskar commented 2 years ago

Test infrastructure

Scenario:

maxReplicas: 1
minReplicas: 1
resources:
  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 50m
    memory: 64Mi
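
For illustration, a rough load generator for this kind of scenario could look as follows (the port-forwarded EPP publish URL and the payload are assumptions; the event source and type match the events in the logs below):

package main

import (
	"bytes"
	"log"
	"net/http"
	"strconv"
	"time"
)

func main() {
	// Assumed port-forwarded EPP publish endpoint; the path is not taken from this issue.
	const url = "http://localhost:8080/publish"
	client := &http.Client{Timeout: 5 * time.Second}

	for i := 0; ; i++ {
		body := bytes.NewBufferString(`{"orderCode":"123"}`)
		req, err := http.NewRequest(http.MethodPost, url, body)
		if err != nil {
			log.Fatalf("build request: %v", err)
		}
		// Binary-mode CloudEvents headers; source and type match the events in the logs below.
		req.Header.Set("Content-Type", "application/json")
		req.Header.Set("ce-specversion", "1.0")
		req.Header.Set("ce-id", strconv.Itoa(i))
		req.Header.Set("ce-source", "noapp")
		req.Header.Set("ce-type", "sap.kyma.custom.noapp.order.created.v1")

		resp, err := client.Do(req)
		if err != nil {
			log.Printf("publish %d failed: %v", i, err)
			continue
		}
		resp.Body.Close()
		if resp.StatusCode >= 300 {
			log.Printf("publish %d: unexpected status %d", i, resp.StatusCode)
		}
	}
}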

Results: After around 30-40 minutes (939 MiB of storage consumed), EPP stopped accepting new events and was returning:

{"level":"INFO","timestamp":"2022-08-17T07:23:21Z","logger":"nats-handler","caller":"nats/handler.go:195","message":"Event dispatched","context":{"id":"A2111-1111-1111","source":"noapp","before":"sap.kyma.custom.noapp.order.created.v1","after":"sap.kyma.custom.noapp.order.created.v1","statusCode":500,"duration":0.504081429,"responseBody":"nats: no response from stream"}}

or in short:

statusCode: 500
nats: no response from stream

The NATS CLI command nats stream ls returned: nats: error: could not list streams: JetStream system temporarily unavailable (10008), try --help

The last time the CLI was still responding, this was the output:

(image: screenshot of the CLI output)

I checked all three replicas:

eventing-nats-0: used 100% of storage:

Filesystem                Size      Used Available Use% Mounted on
overlay                  48.1G      7.7G     38.4G  17% /
tmpfs                    64.0M         0     64.0M   0% /dev
tmpfs                    14.7G         0     14.7G   0% /sys/fs/cgroup
/dev/sdg                973.4M    957.4M         0 100% /data

logs:

[32] 2022/08/16 06:07:41.318092 [INF] 100.64.0.55:54452 - rid:49 - Route connection created
[32] 2022/08/16 06:07:47.393696 [INF] JetStream cluster new metadata leader: eventing-nats-1/eventing-nats
[32] 2022/08/16 06:07:55.759515 [INF] 100.64.0.55:6222 - rid:71 - Route connection created
[32] 2022/08/16 06:07:55.759789 [INF] 100.64.0.55:6222 - rid:71 - Router connection closed: Duplicate Route
[32] 2022/08/16 06:07:58.934800 [INF] 100.64.0.56:33360 - rid:72 - Route connection created
[32] 2022/08/16 06:08:09.207150 [WRN] JetStream cluster consumer '$G > sap > 7dc03259c404acadba157faf767ecf83' has NO quorum, stalled.
[32] 2022/08/16 06:08:40.976160 [INF] 100.64.0.56:6222 - rid:74 - Route connection created
[32] 2022/08/16 06:08:40.976390 [INF] 100.64.0.56:6222 - rid:74 - Router connection closed: Duplicate Route
[32] 2022/08/16 07:58:31.069085 [ERR] JetStream failed to store a msg on stream '$G > sap': write /data/jetstream/$G/streams/sap/msgs/381.blk: no space left on device
[32] 2022/08/16 07:58:31.069109 [WRN] Error applying entries to '$G > sap': write /data/jetstream/$G/streams/sap/msgs/381.blk: no space left on device
[32] 2022/08/16 07:58:31.069120 [ERR] JetStream out of File resources, will be DISABLED
[32] 2022/08/16 07:58:31.070137 [INF] Initiating JetStream Shutdown...
[32] 2022/08/16 07:58:31.086371 [INF] JetStream Shutdown
[32] 2022/08/16 11:59:27.307609 [INF] 100.64.0.55:54452 - rid:49 - Router connection closed: Client Closed

eventing-nats-1 (leader): used 96% of storage:

/ # df -h
Filesystem                Size      Used Available Use% Mounted on
overlay                  48.1G      7.7G     38.4G  17% /
tmpfs                    64.0M         0     64.0M   0% /dev
tmpfs                    14.7G         0     14.7G   0% /sys/fs/cgroup
/dev/sdj                973.4M    914.7M     42.8M  96% /data

logs:
[32] 2022/08/16 13:23:47.331671 [INF] JetStream cluster no metadata leader
[32] 2022/08/16 13:23:48.351057 [WRN] JetStream cluster consumer '$G > sap > d4bc1887b4ec2f3950a463970e636a90' has NO quorum, stalled.
[32] 2022/08/16 13:23:48.435390 [WRN] JetStream cluster consumer '$G > sap > c74c20756af53b592f87edebff67bdf8' has NO quorum, stalled.
[32] 2022/08/16 13:23:49.609318 [WRN] JetStream cluster consumer '$G > sap > cd5f48da5ce5a301f47665227286d69c' has NO quorum, stalled.
[32] 2022/08/16 13:23:50.568794 [WRN] JetStream cluster consumer '$G > sap > a8abb8e8322b06ac9922a1222341de71' has NO quorum, stalled.
[32] 2022/08/16 13:23:51.513825 [WRN] JetStream cluster consumer '$G > sap > d0a08552e41db68d73d54f2e19c4c118' has NO quorum, stalled

eventing-nats-2: used 100% of storage

Filesystem                Size      Used Available Use% Mounted on
overlay                  48.1G      7.7G     38.4G  17% /
tmpfs                    64.0M         0     64.0M   0% /dev
tmpfs                    14.7G         0     14.7G   0% /sys/fs/cgroup
/dev/sdj                973.4M    914.7M     42.8M  96% /data
[32] 2022/08/16 13:23:47.331671 [INF] JetStream cluster no metadata leader
[32] 2022/08/16 13:23:48.351057 [WRN] JetStream cluster consumer '$G > sap > d4bc1887b4ec2f3950a463970e636a90' has NO quorum, stalled.
[32] 2022/08/16 13:23:48.435390 [WRN] JetStream cluster consumer '$G > sap > c74c20756af53b592f87edebff67bdf8' has NO quorum, stalled.
[32] 2022/08/16 13:23:49.609318 [WRN] JetStream cluster consumer '$G > sap > cd5f48da5ce5a301f47665227286d69c' has NO quorum, stalled.
[32] 2022/08/16 13:23:50.568794 [WRN] JetStream cluster consumer '$G > sap > a8abb8e8322b06ac9922a1222341de71' has NO quorum, stalled.
[32] 2022/08/16 13:23:51.513825 [WRN] JetStream cluster consumer '$G > sap > d0a08552e41db68d73d54f2e19c4c118' has NO quorum, stalled

So the problem was indeed the full storage, even though the NATS SDK was reporting no response from stream instead of no space left on device.

NATS Health checks were ok:

{
    "status": "ok"
}

but the JetStream got disabled:

{
    "server_id": "NAB3IMUKTXEIZ7QG3CRHYE3KXXZ3RFXHNWVZZ55DQ4P6ZIZTIIFRNAVB",
    "now": "2022-08-16T13:26:58.286323634Z",
    "disabled": true,
    "config": {
        "max_memory": 0,
        "max_storage": 0
    },
    "memory": 0,
    "storage": 0,
    "reserved_memory": 0,
    "reserved_storage": 0,
    "accounts": 0,
    "ha_assets": 0,
    "api": {
        "total": 0,
        "errors": 0
    },
    "streams": 0,
    "consumers": 0,
    "messages": 0,
    "bytes": 0
}
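
Since /healthz stays ok while JetStream is disabled, the disabled flag above (this output looks like it comes from the NATS monitoring endpoint, presumably /jsz on port 8222) is the more useful signal. A small sketch that polls it from Go; the URL assumes a local port-forward of the monitoring port:

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// jszInfo captures only the field we care about from the monitoring response.
type jszInfo struct {
	Disabled bool `json:"disabled"`
}

func main() {
	// Assumed port-forward of the NATS monitoring port (8222).
	resp, err := http.Get("http://localhost:8222/jsz")
	if err != nil {
		log.Fatalf("request failed: %v", err)
	}
	defer resp.Body.Close()

	var info jszInfo
	if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
		log.Fatalf("decode failed: %v", err)
	}

	if info.Disabled {
		fmt.Println("JetStream is DISABLED even though /healthz reports ok")
	} else {
		fmt.Println("JetStream is enabled")
	}
}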

Both Subscriptions were in the false state because of:

conditions:
  - lastTransitionTime: "2022-08-16T09:10:19Z"
    message: 'nats: JetStream system temporarily unavailable'
    reason: NATS Subscription not active
    status: "False"
    type: Subscription active

and the stream reported the following cluster information and state:


                 Name: eventing-nats
               Leader: #
              Replica: eventing-nats-0, outdated, seen 7m5s ago, 1,851,676 operations behind
              Replica: eventing-nats-1, outdated, seen 3h8m7s ago, 1,851,676 operations behind
              Replica: eventing-nats-2, outdated, seen 5m44s ago, 1,851,676 operations behind

State:

             Messages: 1,436,508
                Bytes: 941 MiB
             FirstSeq: 29,997 @ 2022-08-17T12:48:52 UTC
              LastSeq: 1,851,674 @ 2022-08-17T14:59:25 UTC
     Deleted Messages: 385170
     Active Consumers: 20

We see that the pods are outdated, no leader is elected, and there are 1,400,000+ undelivered messages.

Here is what the dashboard displayed: (image)

Trying to revive the JetStream

One way to revive the JetStream is to restart the nodes. I deleted all three pods sequentially. Then JetStream got enabled again and started retrying to dispatch the messages from the stream. To test the redelivery, I turned the sinks on. One thing I observed is that sinks with their current resources

maxReplicas: 1
minReplicas: 1
resources:
  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 50m
    memory: 64Mi

cannot keep up with a redelivery rate of 1000 requests per second, so they kept shutting down under the load and were constantly restarting. So the sinks were the bottleneck during redelivery:

Eventing-controller logs:

{"level":"ERROR","timestamp":"2022-08-17T07:41:04Z","logger":"jetstream-handler","caller":"handlers/jetstream.go:473","message":"Failed to dispatch the CloudEvent","context":{"id":"ae34bd3a-b3b1-48d9-a1b7-38d59b80638b","source":"kyma","type":"sap.kyma.custom.commerce.order.created.v030","sink":"http://sink-0.tunas.svc.cluster.local"}}
{"level":"INFO","timestamp":"2022-08-17T08:41:02Z","logger":"nats-handler","caller":"nats/handler.go:195","message":"Event dispatched","context":{"id":"1c9fe2ee-b070-4338-8a25-23cd95df121b","source":"kyma","before":"sap.kyma.custom.commerce.order.updated.v020","after":"sap.kyma.custom.commerce.order.updated.v020","statusCode":500,"duration":5.000078098,"responseBody":"context deadline exceeded"}}
{"level":"ERROR","timestamp":"2022-08-17T07:41:04Z","logger":"jetstream-handler","caller":"handlers/jetstream.go:473","message":"Failed to dispatch the CloudEvent","context":{"id":"36335f7f-2882-46f7-888a-f4dac2ce6cb9","source":"kyma","type":"sap.kyma.custom.commerce.order.created.v050","sink":"http://sink-0.tunas.svc.cluster.local"}}
{"level":"ERROR","timestamp":"2022-08-17T07:41:04Z","logger":"jetstream-handler","caller":"handlers/jetstream.go:473","message":"Failed to dispatch the CloudEvent","context":{"id":"91cb4e9a-0a38-491a-914c-1b664d5118ba","source":"kyma","type":"sap.kyma.custom.commerce.order.created.v050","sink":"http://sink-0.tunas.svc.cluster.local"}}
{"level":"ERROR","timestamp":"2022-08-17T07:41:04Z","logger":"jetstream-handler","caller":"handlers/jetstream.go:473","message":"Failed to dispatch the CloudEvent","context":{"id":"e56ec64b-bb4d-4413-97a0-4c5a8bcb1efc","source":"kyma","type":"sap.kyma.custom.commerce.order.created.v050","sink":"http://sink-0.tunas.svc.cluster.local"}}
{"level":"ERROR","timestamp":"2022-08-17T07:41:04Z","logger":"jetstream-handler","caller":"handlers/jetstream.go:473","message":"Failed to dispatch the CloudEvent","context":{"id":"8eddd144-49b6-4077-b5dc-e7c5a33118ca","source":"kyma","type":"sap.kyma.custom.commerce.order.created.v080","sink":"http://sink-0.tunas.svc.cluster.local"}}

Some events are dispatched, but at a very slow rate, because the sink started getting:

Warning  Unhealthy  57m (x2 over 57m)     kubelet            Startup probe failed: HTTP probe failed with statuscode: 500
Warning  Unhealthy  56m (x4 over 56m)     kubelet            Liveness probe failed: Get "http://100.64.0.86:15020/app-health/function/livez": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy  17m (x98 over 56m)    kubelet            Readiness probe failed: Get "http://100.64.0.86:15020/app-health/function/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Scaling up the sinks to:

maxReplicas: 5
minReplicas: 1
resources:
  limits:
    cpu: 2000m
    memory: 256Mi
  requests:
    cpu: 100m
    memory: 128Mi

did the trick, and the sinks were able to receive the events without any loss.

Conclusions

  1. When the PVC storage gets full, the JetStream gets disabled. NATS doesn't expose the right error message: instead of propagating the no space left on device error, NATS returns no response from stream, which causes confusion. As per the NATS core maintainers, this is the expected behaviour (reference).
  2. When JetStream is disabled, no events can be published to the stream; the publish fails with either:
{"level":"INFO","timestamp":"2022-08-17T07:23:21Z","logger":"nats-handler","caller":"nats/handler.go:195","message":"Event dispatched","context":{"id":"A2111-1111-1111","source":"noapp","before":"sap.kyma.custom.noapp.order.created.v1","after":"sap.kyma.custom.noapp.order.created.v1","statusCode":500,"duration":0.504081429,"responseBody":"nats: no response from stream"}}

or 

{"level":"INFO","timestamp":"2022-08-17T08:41:03Z","logger":"nats-handler","caller":"nats/handler.go:195","message":"Event dispatched","context":{"id":"c7fa3968-2610-4a40-baf9-95e8c57f731a","source":"kyma","before":"sap.kyma.custom.commerce.order.updated.v030","after":"sap.kyma.custom.commerce.order.updated.v030","statusCode":500,"duration":5.00035585,"responseBody":"context deadline exceeded"}}

According to the current EPP code:

_, err = jsCtx.PublishMsg(msg)
if err != nil {
    ...
    return http.StatusInternalServerError, err
}

the error gets propagated back to the publisher, and no further messages are accepted into the stream.
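
For illustration, here is a minimal, self-contained sketch of such a publish path using the nats.go client (a simplified stand-in, not the actual EPP handler): both failure modes seen above, the no response from stream error while JetStream is disabled and the plain publish timeout, end up as HTTP 500 for the publisher.

package main

import (
	"context"
	"errors"
	"log"
	"net/http"
	"time"

	"github.com/nats-io/nats.go"
)

// publish forwards a single message to JetStream and maps failures to an HTTP
// status code, mirroring the propagation described above in a simplified form.
func publish(js nats.JetStreamContext, msg *nats.Msg) (int, error) {
	if _, err := js.PublishMsg(msg); err != nil {
		// Both failure modes observed in this issue land here and surface as 500:
		// "nats: no response from stream" while JetStream is disabled, and a
		// timeout ("context deadline exceeded") when no ack arrives in time.
		if errors.Is(err, nats.ErrNoStreamResponse) {
			log.Printf("no response from stream (JetStream likely disabled): %v", err)
		} else if errors.Is(err, nats.ErrTimeout) || errors.Is(err, context.DeadlineExceeded) {
			log.Printf("publish timed out: %v", err)
		}
		return http.StatusInternalServerError, err
	}
	// Success status code is arbitrary for this sketch.
	return http.StatusNoContent, nil
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL, nats.Timeout(5*time.Second))
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatalf("jetstream context: %v", err)
	}

	status, err := publish(js, &nats.Msg{
		Subject: "sap.kyma.custom.noapp.order.created.v1",
		Data:    []byte(`{"orderCode":"123"}`),
	})
	log.Printf("status=%d err=%v", status, err)
}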

Tasks:

Follow-up questions: