kyma-project / kyma

Kyma is an opinionated set of Kubernetes-based modular building blocks, including all necessary capabilities to develop and run enterprise-grade cloud-native applications.
https://kyma-project.io
Apache License 2.0
1.52k stars 405 forks source link

Losing events caused by endless subscription reconciliation with NATS backend #12891

Closed ghost closed 2 years ago

ghost commented 2 years ago

Description Eventing on Kyma 1.24.8 with multiple active subscriptions (both profiles), but it could happen in Kyma 2.0 too.

In our Kyma Clusters (Stage, Production) we see events  not getting sended to the subscription functions (e.g. 1725 events arrive through the proxy to NATS in 15 minutes but only 273 arrive at the subscribed lambda). Our suspicion is, that the controller is not forwarding all events properly.
As we are running on open source kyma 1.24.8 we do not think that the following bugs apply to us.
https://github.com/kyma-project/kyma/issues/11719
https://github.com/kyma-project/kyma/issues/11989
By searching for the issue we've realized that the eventing controller does permanent reconciliations (see video attached). We are unsure if this is proper operator loop behavior or not. (kubectl logs -f deployment/eventing-controller -n kyma-system -c controller --tail 10).
Do you have any idea regarding the lost events or direction for us to look into?
Logs from controller reconsiliation:
{"level":"INFO","timestamp":"2021-12-14T14:23:44Z","logger":"reconciler.Subscription","caller":"handlers/nats.go:129","message":"connection status","context":{"status":1}}
{"level":"INFO","timestamp":"2021-12-14T14:23:44Z","logger":"reconciler.Subscription","caller":"handlers/nats.go:145","message":"successfully unsubscribed/deleted subscription","context":{}}
{"level":"INFO","timestamp":"2021-12-14T14:23:44Z","logger":"reconciler.Subscription","caller":"eventtype/clean.go:53","message":"clean event-type","context":{"before":"sap.kyma.custom.sapcrm.account.updated.v1","after":"sap.kyma.custom.sapcrm.account.updated.v1"}}
{"level":"INFO","timestamp":"2021-12-14T14:23:44Z","logger":"reconciler.Subscription","caller":"subscription-nats/reconciler.go:221","message":"successfully created Nats subscriptions","context":{"kind":"Subscription","name":"dcp-webhook-sapcrm.account.updated.v1","namespace":"dcp-webhook","version":2}}
{"level":"INFO","timestamp":"2021-12-14T14:23:44Z","logger":"reconciler.Subscription","caller":"subscription-nats/reconciler.go:266","message":"successfully updated subscription status","context":{}}
(edited)
Bildschirmaufnahme 2021-12-14 um 15.23.16.mov 
#11719 The "multiplexer" eventing pattern doesn't work with the NATS backend
<https://github.com/kyma-project/kyma|kyma-project/kyma>kyma-project/kyma | Jul 20th | Added by a bot
#11989 Event delivery for NATS is broken with multiple subscriptions
<https://github.com/kyma-project/kyma|kyma-project/kyma>kyma-project/kyma | Aug 30th | Added by a bot

Radu Fantaziu  9:04 AM
OK, I’ll have a look.
Is this happen only on the “production profile” or in  “evaluation profile” too?

Martin Fischer  9:05 AM
In both.

Radu Fantaziu  9:47 AM
Thanks! It seems that we have a new bug in 1.24.8 :white_frowning_face:
FYI: we are focused now to get Kyma 2.0 released… I’ll install 1.24.8 and try to reproduce it.

Martin Fischer  9:50 AM
In a 5 minute timeframe we see as example in NATS 565 messages but the controller only sends 152 to the sink.
NATS
2021-12-15 09:05:00 
2021-12-15T08:05:00.769656466Z stderr F [1] 2021/12/15 08:05:00.769482 [TRC] 10.244.9.116:56616 - cid:15 - <<- MSG_PAYLOAD: ["{\"specversion\":\"1.0\",\"id\":\"0894ef23-ff11-1eec-97af-b269948681b8\",\"source\":\"sapcrm\",\"type\":\"sap.kyma.custom.sapcrm.account.updated.v1\",\"datacontenttype\":\"application/json\",\"data\":{\"AccountId\":\"0029892582\"},\"eventtypeversion\":\"v1\"}"]Show context
2021-12-15 09:05:00 
2021-12-15T08:05:00.482387524Z stderr F [1] 2021/12/15 08:05:00.482157 [TRC] 10.244.9.116:56616 - cid:15 - <<- MSG_PAYLOAD: ["{\"specversion\":\"1.0\",\"id\":\"0894ef23-ff11-1eec-97af-b269948661b8\",\"source\":\"sapcrm\",\"type\":\"sap.kyma.custom.sapcrm.account.updated.v1\",\"datacontenttype\":\"application/json\",\"data\":{\"AccountId\":\"0029880017\"},\"eventtypeversion\":\"v1\"}"]
Controller
2021-12-15 09:05:00 
2021-12-15T08:05:00.576500454Z stderr F {"level":"INFO","timestamp":"2021-12-15T08:05:00Z","logger":"reconciler.Subscription","caller":"handlers/nats.go:186","message":"Successfully dispatched event id: 0894ef23-ff11-1eec-97af-b269948541b8 to sink: http://dcp-lambda-reload-customer-cache.dcp-customer.svc.cluster.local","context":{}}Show context
2021-12-15 09:04:59 
2021-12-15T08:04:59.932974425Z stderr F {"level":"INFO","timestamp":"2021-12-15T08:04:59Z","logger":"reconciler.Subscription","caller":"handlers/nats.go:186","message":"Successfully dispatched event id: 0894ef23-ff11-1eec-97af-b2699484a1b8 to sink: http://dcp-lambda-reload-customer-cache.dcp-customer.svc.cluster.local","context":{}}

Radu Fantaziu  10:59 AM
It could happen, cause during the reconcilation of a susbcription, the event flow to the sink is broken…
And NATS doesn’t resent the event…
This is why, on Kyma 2.0 we are supporting JetStreams.

Martin Fischer  11:15 AM
So the permanent reconciliation seems to be a problem and is not "normal" behavior?

Radu Fantaziu  11:15 AM
yes, true!
11:16
The reconcilation should be triggered only if a subscription CR was changed
11:18
or it was a “touch” on it, so that K8S calls our controller to reconcile it…

Martin Fischer  11:37 AM
Do you have any idea what causes the permanent reconcilaton and how to stop it? (edited) 

Radu Fantaziu  5:48 PM
I’ve looked into your movie…
A lot of reconciliation callbacks are made by K8S. You can find the entry point for a reconciliation call for a subscription in the log file, searching for:
{"level":"INFO","timestamp":"2021-12-15T16:35:48Z","logger":"reconciler.Subscription","caller":"subscription-nats/reconciler.go:135","message":"received subscription reconciliation request","context":{"namespace":"test","name":"test"}}
where the namespace is your subscription namespace ( dcp-webhook ) and the name is the subscription name (dcp-webhook-sapcrm.account.updated.v1)
The question is why K8S generates these callbacks? Usually, only if the yaml file of a subscription is changed, the callback will be triggered .

Martin Fischer  8:33 AM
Ok.
I see these events all over in the logs. Trying to check.
{"level":"INFO","timestamp":"2021-12-16T07:24:44Z","logger":"reconciler.Subscription","caller":"subscription-nats/reconciler.go:135","message":"received subscription reconciliation request","context":{"namespace":"dcp-customer","name":"func-customer-fetch-deviatingaccount.updated"}}
{"level":"INFO","timestamp":"2021-12-16T07:24:44Z","logger":"reconciler.Subscription","caller":"subscription-nats/reconciler.go:135","message":"received subscription reconciliation request","context":{"namespace":"dcp-customer","name":"func-customer-fetch-account.created"}}
{"level":"INFO","timestamp":"2021-12-16T07:24:44Z","logger":"reconciler.Subscription","caller":"subscription-nats/reconciler.go:135","message":"received subscription reconciliation request","context":{"namespace":"dcp-webhook","name":"dcp-webhook-ldm.leadstatus.updated.v1"}}
{"level":"INFO","timestamp":"2021-12-16T07:24:44Z","logger":"reconciler.Subscription","caller":"subscription-nats/reconciler.go:135","message":"received subscription reconciliation request","context":{"namespace":"ps-customer","name":"ps-lambda-customer-selection-customer.selection"}}
{"level":"INFO","timestamp":"2021-12-16T07:24:44Z","logger":"reconciler.Subscription","caller":"subscription-nats/reconciler.go:135","message":"received subscription reconciliation request","context":{"namespace":"dcp-customer","name":"dcp-lambda-reload-user-cache-user.updated"}}
{"level":"INFO","timestamp":"2021-12-16T07:24:44Z","logger":"reconciler.Subscription","caller":"subscription-nats/reconciler.go:135","message":"received subscription reconciliation request","context":{"namespace":"dcp-transaction","name":"func-export-cards-export.carddetails"}}
{"level":"INFO","timestamp":"2021-12-16T07:24:44Z","logger":"reconciler.Subscription","caller":"subscription-nats/reconciler.go:135","message":"received subscription reconciliation request","context":{"namespace":"dcp-order","name":"dcp-lambda-reload-contract-cache-contract.updated"}}
{"level":"INFO","timestamp":"2021-12-16T07:24:44Z","logger":"reconciler.Subscription","caller":"subscription-nats/reconciler.go:135","message":"received subscription reconciliation request","context":{"namespace":"dcp-webhook","name":"dcp-webhook-ldm.deal.created.v1"}}
{"level":"INFO","timestamp":"2021-12-16T07:24:45Z","logger":"reconciler.Subscription","caller":"subscription-nats/reconciler.go:135","message":"received subscription reconciliation request","context":{"namespace":"dcp-webhook","name":"dcp-webhook-slm.product.updated.v1"}}
{"level":"INFO","timestamp":"2021-12-16T07:24:45Z","logger":"reconciler.Subscription","caller":"subscription-nats/reconciler.go:135","message":"received subscription reconciliation request","context":{"namespace":"slm-integration","name":"func-unpublish-scpi-product.unpublished"}}
{"level":"INFO","timestamp":"2021-12-16T07:24:45Z","logger":"reconciler.Subscription","caller":"subscription-nats/reconciler.go:135","message":"received subscription reconciliation request","context":{"namespace":"dcp-webhook","name":"dcp-webhook-saps4.user.updated.v1"}}
{"level":"INFO","timestamp":"2021-12-16T07:24:45Z","logger":"reconciler.Subscription","caller":"subscription-nats/reconciler.go:135","message":"received subscription reconciliation request","context":{"namespace":"dcp-webhook","name":"dcp-webhook-sapcrm.contract.updated.v1"}}
{"level":"INFO","timestamp":"2021-12-16T07:24:45Z","logger":"reconciler.Subscription","caller":"subscription-nats/reconciler.go:135","message":"received subscription reconciliation request","context":{"namespace":"dcp-transaction","name":"dcp-lambda-store-sap-order-consignment.process.s
ghost commented 2 years ago

Reproduced on Kyma 1.24.8, Azure/K8S 1.20.12

ghost commented 2 years ago

Reproduced on Kyma 2.0.0, AWS/K8S 1.20.12 Steps to reproduce it: