vmware-samples / vcenter-event-broker-appliance

The VMware Event Broker Appliance Fling enables customers to unlock the hidden potential of events in their SDDC to easily create event-driven automation.

[BUG] event router stops responding after a couple of days #1054

Closed jm66 closed 4 months ago

jm66 commented 1 year ago

Describe the bug

vmware-event-router stops working after two days. No errors are registered in the logs.

To Reproduce N/A

Expected behavior vmware-event-router to continuously run without interruption.

Screenshots N/A

Version (please complete the following information):

Additional context

This behaviour only shows up in our prod instance, whose event volume is greater than that of our testing instance.

github-actions[bot] commented 1 year ago

Howdy 👋 jm66! Thank you for your interest in this project. We value your feedback and will respond soon.

lamw commented 1 year ago

Hi @jm66 - When you say the vmware-event-router stops working, could you elaborate on what you're seeing? Is the pod running, or has it gone into a crash/restart state? Does this happen immediately after deployment or after a period of time? How large is your vCenter inventory? We'll need a bit more information to understand the issue.

Also, if you're on the VEBA Slack channel, we can chat further to diagnose the issue.

jm66 commented 1 year ago

Our env is around 3-4K VMs. The pod looks healthy, but no events are shown in the pod logs or the /events resource. I tried joining months ago but had no luck. Would you mind sending an invite for the Slack channel to Jm.Lopez (at) utoronto.ca?

lamw commented 1 year ago

OK, let's take a look at the setup to see what's going on. I've added your email to the invite list; you should get an email to complete the signup, and then you can join https://vmwarecode.slack.com/archives/CQLT9B5AA

jm66 commented 1 year ago

Just joined. Thanks. Any particular channel?

lamw commented 1 year ago

Yes, the one I linked above (that should take you to our VEBA channel)

embano1 commented 1 year ago

I wonder if it could be related to this? https://github.com/vmware-samples/vcenter-event-broker-appliance/issues/809

rguske commented 1 year ago

I wonder if it could be related to this? https://github.com/vmware-samples/vcenter-event-broker-appliance/issues/809

Most likely 👍🏻

lamw commented 1 year ago

@embano1 That's a good point! I was actually thinking it could be that but forgot I had a blog post about it.

@jm66 Could you check https://williamlam.com/2022/07/heads-up-potential-missing-vcenter-server-events-due-to-sequence-id-overflow.html and see if this is what you're observing?

jm66 commented 1 year ago

Hey @lamw I checked the post and the chainId is positive:

chainId int 310662297

Is it possible we have experienced the issue in the past and vCenter self heals?

embano1 commented 1 year ago

vCenter doesn't "heal", but this int32 wraps over time, so it could be that the underlying event collector used by the router is stuck. Could you please try restarting the router/VEBA? Also, enabling DEBUG logging can reveal some insights. All assuming the router is actually running correctly.
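For illustration, the wraparound described above can be reproduced in a few lines of Go (a sketch only: the vCenter event key/chainId is a signed 32-bit integer, and in a busy environment the sequence eventually passes the maximum and wraps negative):

```go
package main

import (
	"fmt"
	"math"
)

// nextKey simulates the vCenter event key/chainId, which is stored as a
// signed 32-bit integer. Go's signed integers wrap on overflow, which is
// also what happens to the sequence ID on the vCenter side over time.
func nextKey(key int32) int32 {
	return key + 1
}

func main() {
	fmt.Println(nextKey(310662297))     // normal case: 310662298
	fmt.Println(nextKey(math.MaxInt32)) // overflow: wraps to -2147483648
}
```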

jm66 commented 1 year ago

@embano1 yeah, actually to work around this we cron'd /usr/bin/kubectl rollout restart deployment vmware-event-router-vcenter -n vmware-system to run daily. So far the router hasn't gotten stuck.

Got it. Enabled debug and disabled the cron job.
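As a side note, the external cron entry above could also run in-cluster as a Kubernetes CronJob. The manifest below is a hypothetical sketch: the `router-restarter` ServiceAccount and its RBAC binding are assumptions and would need to be created with permission to patch deployments in `vmware-system`.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: restart-event-router
  namespace: vmware-system
spec:
  schedule: "0 4 * * *"   # daily at 04:00
  jobTemplate:
    spec:
      template:
        spec:
          # Hypothetical ServiceAccount; needs RBAC allowing
          # "patch" on deployments in this namespace.
          serviceAccountName: router-restarter
          restartPolicy: Never
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - kubectl
            - rollout
            - restart
            - deployment/vmware-event-router-vcenter
            - -n
            - vmware-system
```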

jm66 commented 1 year ago

@embano1 the event-router is stuck now:

kubectl get pods -n vmware-system
NAME                                          READY   STATUS    RESTARTS        AGE
cadvisor-rhdw8                                1/1     Running   9 (7d1h ago)    31d
fluent-bit-rzbd4                              1/1     Running   15 (7d1h ago)   31d
tinywww-669b487769-pr76s                      1/1     Running   6 (7d1h ago)    31d
veba-rabbit-server-0                          1/1     Running   9 (7d1h ago)    31d
veba-ui-846bb59f69-n7ln7                      1/1     Running   6 (7d1h ago)    31d
vmware-event-router-vcenter-ffb58667c-pwj9z   1/1     Running   0               2d
vmware-event-router-webhook-755654855-dw7mc   1/1     Running   0               2d

Last event UserLogoutSessionEvent with key 310792526 was processed at 2023-05-18T17:15:14.667Z.

embano1 commented 1 year ago

Is that from the DEBUG log? Anything else in there?

jm66 commented 1 year ago

Nothing relevant to my eyes, just the last event, but I could share the logs if you'd like.

jm66 commented 1 year ago

2023-05-18T17:15:14.667Z   DEBUG [KNATIVE]   knative/knative.go:184  got response   {"eventID": "1f98ca23-2fe3-4376-8606-3c1c965b9a23", "response": "202: "}
2023-05-18T17:15:14.667Z   INFO  [KNATIVE]   knative/knative.go:194  successfully sent event {"eventID": "1f98ca23-2fe3-4376-8606-3c1c965b9a23"}
2023-05-18T17:15:14.667Z   INFO  [VCENTER]   vcenter/vcenter.go:343  invoking processor   {"eventID": "2971841d-73cd-48b0-a68e-bd7065b59dc9"}
2023-05-18T17:15:14.667Z   DEBUG [KNATIVE]   knative/knative.go:170  processing event  {"eventID": "2971841d-73cd-48b0-a68e-bd7065b59dc9", "event": "Context Attributes,\n  specversion: 1.0\n  type: com.vmware.event.router/event\n  source: https://vcdomain/sdk\n  subject: UserLogoutSessionEvent\n  id: 2971841d-73cd-48b0-a68e-bd7065b59dc9\n  time: 2023-05-18T17:15:13.103999Z\n  datacontenttype: application/json\nExtensions,\n  vsphereapiversion: 7.0.3.0\nData,\n  {\n    \"Key\": 310792526,\n    \"ChainId\": 310792526,\n    \"CreatedTime\": \"2023-05-18T17:15:13.103999Z\",\n    \"UserName\": \"VSKEY5\\\\vsswebservices\",\n    \"Datacenter\": null,\n    \"ComputeResource\": null,\n    \"Host\": null,\n    \"Vm\": null,\n    \"Ds\": null,\n    \"Net\": null,\n    \"Dvs\": null,\n    \"FullFormattedMessage\": \"User ****** logged out (login time: Thursday, 18 May, 2023 05:15:01 PM, number of API invocations: 4, user agent: Go-http-client/1.1)\",\n    \"ChangeTag\": \"\",\n    \"IpAddress\": \"*****\",\n    \"UserAgent\": \"Go-http-client/1.1\",\n    \"CallCount\": 4,\n    \"SessionId\": \"5219944b-3cd6-66e9-4ff8-4e662962c304\",\n    \"LoginTime\": \"2023-05-18T17:15:01.773485Z\"\n  }\n"}
2023-05-18T17:15:14.667Z   INFO  [KNATIVE]   knative/knative.go:182  sending event  {"eventID": "2971841d-73cd-48b0-a68e-bd7065b59dc9", "subject": "UserLogoutSessionEvent"}
jm66 commented 1 year ago

Hey @embano1 , got this after submitting kubectl rollout restart deployment vmware-event-router-vcenter -n vmware-system:

root@veba [ ~ ]# 2023-05-19T15:47:25.339Z   INFO    [MAIN]  router/main.go:212  initiating shutdown {"commit": "73f72694", "version": "v0.7.5"}
2023-05-19T15:47:25.339Z    INFO    [VCENTER]   vcenter/vcenter.go:370  attempting graceful shutdown
2023-05-19T15:47:25.339Z    DEBUG   [KNATIVE]   knative/knative.go:184  got response    {"eventID": "2971841d-73cd-48b0-a68e-bd7065b59dc9", "response": "Post \"http://default-broker-ingress.vmware-functions.svc.cluster.local/\": context canceled"}
2023-05-19T15:47:25.339Z    ERROR   [VCENTER]   vcenter/vcenter.go:347  could not process event {"event": "Context Attributes,\n  specversion: 1.0\n  type: com.vmware.event.router/event\n  source: https://****/sdk\n  subject: UserLogoutSessionEvent\n  id: 2971841d-73cd-48b0-a68e-bd7065b59dc9\n  time: 2023-05-18T17:15:13.103999Z\n  datacontenttype: application/json\nExtensions,\n  vsphereapiversion: 7.0.3.0\nData,\n  {\n    \"Key\": 310792526,\n    \"ChainId\": 310792526,\n    \"CreatedTime\": \"2023-05-18T17:15:13.103999Z\",\n    \"UserName\": \"********\",\n    \"Datacenter\": null,\n    \"ComputeResource\": null,\n    \"Host\": null,\n    \"Vm\": null,\n    \"Ds\": null,\n    \"Net\": null,\n    \"Dvs\": null,\n    \"FullFormattedMessage\": \"User ******* logged out (login time: Thursday, 18 May, 2023 05:15:01 PM, number of API invocations: 4, user agent: Go-http-client/1.1)\",\n    \"ChangeTag\": \"\",\n    \"IpAddress\": \"******\",\n    \"UserAgent\": \"Go-http-client/1.1\",\n    \"CallCount\": 4,\n    \"SessionId\": \"5219944b-3cd6-66e9-4ff8-4e662962c304\",\n    \"LoginTime\": \"2023-05-18T17:15:01.773485Z\"\n  }\n", "error": "knative: send event 2971841d-73cd-48b0-a68e-bd7065b59dc9: Post \"http://default-broker-ingress.vmware-functions.svc.cluster.local/\": context canceled"}
github.com/vmware-samples/vcenter-event-broker-appliance/vmware-event-router/internal/provider/vcenter.(*EventStream).processEvents
    github.com/vmware-samples/vcenter-event-broker-appliance/vmware-event-router/internal/provider/vcenter/vcenter.go:347
github.com/vmware-samples/vcenter-event-broker-appliance/vmware-event-router/internal/provider/vcenter.(*EventStream).stream
    github.com/vmware-samples/vcenter-event-broker-appliance/vmware-event-router/internal/provider/vcenter/vcenter.go:318
github.com/vmware-samples/vcenter-event-broker-appliance/vmware-event-router/internal/provider/vcenter.(*EventStream).Stream
    github.com/vmware-samples/vcenter-event-broker-appliance/vmware-event-router/internal/provider/vcenter/vcenter.go:231
main.main.func3
    github.com/vmware-samples/vcenter-event-broker-appliance/vmware-event-router/cmd/router/main.go:206
golang.org/x/sync/errgroup.(*Group).Go.func1
    golang.org/x/sync@v0.1.0/errgroup/errgroup.go:75
embano1 commented 1 year ago

That error is fine: during a rollout the current instance is terminated and the code reacts gracefully to it (context cancelled). The current in-flight event is then not processed, and, depending on whether checkpointing is enabled, the code will resume from the last checkpoint or from "now" when it restarts.

Even right before the shutdown it was processing an event, so I'm wondering whether the code/event stream was working as expected after all?

laugrean commented 1 year ago

Hey, unfortunately we're encountering the same problem. Helm deployment:
Chart: event-router-v0.7.6
App version: v0.7.5

On our test environment, everything has been working for months already.

With INFO log level active, it just stops logging anything at some point:

2023-06-13T06:13:05.216Z INFO [KNATIVE] knative/knative.go:194 successfully sent event {"eventID": "4ab869ed-b2e0-4f90-9fad-00a5c1d83f82"}
2023-06-13T06:13:05.941Z INFO [VCENTER] vcenter/vcenter.go:343 invoking processor {"eventID": "06d60432-b9c9-4a3e-9f04-51bc8a875d10"}
2023-06-13T06:13:05.941Z INFO [KNATIVE] knative/knative.go:182 sending event {"eventID": "06d60432-b9c9-4a3e-9f04-51bc8a875d10", "subject": "com.vmware.vc.HA.NotAllHostAddrsPingable"}
2023-06-13T06:13:05.942Z INFO [KNATIVE] knative/knative.go:194 successfully sent event {"eventID": "06d60432-b9c9-4a3e-9f04-51bc8a875d10"}

In DEBUG mode, these are the only entries we still see:

2023-06-13T06:35:50.917Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 5}
2023-06-13T06:35:55.973Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 5}
2023-06-13T06:36:01.026Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 5}
2023-06-13T06:36:06.072Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 5}

A quick pod deletion fixes the problem and everything is back to normal for at least a few minutes, sometimes hours...


2023-06-13T06:41:13.196Z INFO [KNATIVE] knative/knative.go:182 sending event {"eventID": "80e3ab5d-9dd3-47ef-8ad2-efe54ec86872", "subject": "com.vmware.vcIntegrity.ClusterMembershipChange"}
2023-06-13T06:41:13.197Z DEBUG [KNATIVE] knative/knative.go:184 got response {"eventID": "80e3ab5d-9dd3-47ef-8ad2-efe54ec86872", "response": "202: "}
2023-06-13T06:41:13.197Z INFO [KNATIVE] knative/knative.go:194 successfully sent event {"eventID": "80e3ab5d-9dd3-47ef-8ad2-efe54ec86872"}
2023-06-13T06:41:14.051Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 1}
2023-06-13T06:41:15.108Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 2}
2023-06-13T06:41:17.160Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 4}
2023-06-13T06:41:21.237Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 5}
2023-06-13T06:41:26.298Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 5}
2023-06-13T06:41:31.414Z INFO [VCENTER] vcenter/vcenter.go:343 invoking processor {"eventID": "24358daa-9903-4929-bceb-96d617560f1c"}
lamw commented 1 year ago

@embano1 would you also recommend folks here give Tanzu Sources for Knative a try to see if that helps?

embano1 commented 1 year ago

@embano1 would you also recommend folks here give Tanzu Sources for Knative a try to see if that helps?

You mean regarding the VC DB overflow issue? Sources won't help here because it's using the same EventHistoryCollector mechanism as the router - and this is a server-side issue.

lamw commented 1 year ago

You mean regarding the VC DB overflow issue?

No, I meant for @jm66 issue https://github.com/vmware-samples/vcenter-event-broker-appliance/issues/1054#issuecomment-1554782948 are we saying this is related to VC DB overflow?

but there's also @laugrean's report, and it isn't clear to me whether that's VC DB overflow either, unless you're referring to this issue?

embano1 commented 1 year ago

No, I meant for @jm66 issue #1054 (comment) are we saying this is related to VC DB overflow?

See my response above: https://github.com/vmware-samples/vcenter-event-broker-appliance/issues/1054#issuecomment-1555717307

The initial issue description though seems to be a deadlock in the router.

but there's also @laugrean's report, and it isn't clear to me whether that's VC DB overflow either, unless you're referring to this issue?

IIRC, this is also a deadlock in the router.

lamw commented 1 year ago

@embano1 OK, since both of the reported issues seem to be a deadlock in the router ... then per my initial comment, testing Tanzu Sources on VEBA would at least show whether that helps with the issue? If so, I'll put something quick/dirty together that'll undeploy the router and set up Sources ...

embano1 commented 1 year ago

@embano1 OK, since both of the reported issues seem to be a deadlock in the router ... then per my initial comment, testing Tanzu Sources on VEBA would at least show whether that helps with the issue? If so, I'll put something quick/dirty together that'll undeploy the router and set up Sources ...

Yup, we should definitely cross-check with Sources. There's a lot of code both share; I'm hoping it's related to event processing/invocation, where the code differs.

lamw commented 1 year ago

@jm66 @laugrean To summarize the next steps, we would like to try un-deploying the event-router from within VEBA and deploy the Tanzu Sources for Knative which includes vSphere as a source (similar code which was ported from VEBA to Tanzu Sources) to see if this issue is resolved.

The instructions below assume the VEBA appliance can go outbound to pull down some additional packages. A change is also required for your function.yaml deployment, as the schema for subscribing to an event is slightly different, which I will detail below after the setup.

Step 0 - SSH to VEBA appliance

Step 1 - Undeploy Event Router

kubectl -n vmware-system delete -f /root/config/event-router/vmware-event-router-k8s-vcenter.yaml

Step 2 - Install Tanzu Sources for Knative

kubectl apply -f https://github.com/vmware-tanzu/sources-for-knative/releases/latest/download/release.yaml

Step 3 - Install Knative CLI & vSphere Sources

curl -L https://github.com/knative/client/releases/download/knative-v1.10.0/kn-linux-amd64 -o /usr/local/bin/kn
chmod +x /usr/local/bin/kn
curl -L https://github.com/vmware-tanzu/sources-for-knative/releases/download/v0.37.0/kn-vsphere_0.37.0_Linux_x86_64.tar.gz -o /root/kn-vsphere_0.37.0_Linux_x86_64.tar.gz
tar -zxvf /root/kn-vsphere_0.37.0_Linux_x86_64.tar.gz
mv /root/kn-vsphere_0.37.0_Linux_x86_64/kn-vsphere /usr/local/bin/kn-vsphere
chmod +x /usr/local/bin/kn-vsphere
rm -rf /root/kn-vsphere_0.37.0_Linux_x86_64*

Step 4 - Export VC Creds

export VCENTER_USERNAME="FILL_ME"
export VCENTER_PASSWORD="FILL_ME"
export VCENTER_HOSTNAME="FILL_ME"

Step 5 - Create vSphere Secret

kn vsphere auth create \
    --namespace vmware-functions \
    --username ${VCENTER_USERNAME} \
    --password ${VCENTER_PASSWORD} \
    --name vcenter-creds \
    --verify-url https://${VCENTER_HOSTNAME} \
    --verify-insecure

Step 6 - Create vSphere Source

kn vsphere source create \
    --namespace vmware-functions \
    --name vcsa-source \
    --vc-address https://${VCENTER_HOSTNAME} \
    --skip-tls-verify \
    --secret-ref vcenter-creds \
    --sink-uri http://default-broker-ingress.vmware-functions.svc.cluster.local \
    --encoding json

If everything was set up successfully, you should see the following pods running:

# kubectl get pods -A
NAMESPACE            NAME                                                  READY   STATUS      RESTARTS      AGE
cert-manager         cert-manager-99bb69456-dhxcv                          1/1     Running     2 (23m ago)   89d
cert-manager         cert-manager-cainjector-ffb4747bb-8x7ck               1/1     Running     2 (23m ago)   89d
cert-manager         cert-manager-webhook-545bd5d7d8-5fxtr                 1/1     Running     2 (23m ago)   89d
contour-external     contour-685f87dc74-ccdcn                              1/1     Running     2 (23m ago)   89d
contour-external     contour-685f87dc74-q24fx                              1/1     Running     2 (23m ago)   89d
contour-external     contour-certgen-v1.22.0-42c87                         0/1     Completed   0             89d
contour-external     envoy-jqzft                                           2/2     Running     4 (23m ago)   89d
contour-internal     contour-c4478d89b-cvdk5                               1/1     Running     2 (23m ago)   89d
contour-internal     contour-c4478d89b-ws4tw                               1/1     Running     2 (23m ago)   89d
contour-internal     contour-certgen-v1.22.0-nn2r2                         0/1     Completed   0             89d
contour-internal     envoy-pkb7d                                           2/2     Running     4 (23m ago)   89d
knative-eventing     eventing-controller-fdc4dd6bb-6zttl                   1/1     Running     2 (23m ago)   89d
knative-eventing     eventing-webhook-676dfb6c4f-hmnwz                     1/1     Running     2 (23m ago)   89d
knative-eventing     rabbitmq-broker-controller-54c85d4f98-b5vrv           1/1     Running     2 (23m ago)   89d
knative-eventing     rabbitmq-broker-webhook-877b8d7df-bqd4t               1/1     Running     2 (23m ago)   89d
knative-serving      activator-7cbbfbc85-4zg95                             1/1     Running     2 (23m ago)   89d
knative-serving      autoscaler-8f986cff-jgngg                             1/1     Running     2 (23m ago)   89d
knative-serving      controller-58dfb45d74-bz88p                           1/1     Running     2 (23m ago)   89d
knative-serving      domain-mapping-5d8db49bf6-8z96x                       1/1     Running     2 (23m ago)   89d
knative-serving      domainmapping-webhook-584476fd67-cdm5s                1/1     Running     2 (23m ago)   89d
knative-serving      net-contour-controller-6768758c67-wb5jn               1/1     Running     2 (23m ago)   89d
knative-serving      webhook-6d5c55fd8c-zl5zg                              1/1     Running     2 (23m ago)   89d
kube-system          antrea-agent-9xthc                                    2/2     Running     7 (23m ago)   89d
kube-system          antrea-controller-6db8bb65cf-k8ltr                    1/1     Running     3 (22m ago)   89d
kube-system          coredns-565d847f94-dvrdf                              1/1     Running     2 (23m ago)   89d
kube-system          coredns-565d847f94-zmt46                              1/1     Running     2 (23m ago)   89d
kube-system          etcd-veba.primp-industries.local                      1/1     Running     2 (23m ago)   89d
kube-system          kube-apiserver-veba.primp-industries.local            1/1     Running     2 (23m ago)   89d
kube-system          kube-controller-manager-veba.primp-industries.local   1/1     Running     2 (23m ago)   89d
kube-system          kube-proxy-n22lb                                      1/1     Running     2 (23m ago)   89d
kube-system          kube-scheduler-veba.primp-industries.local            1/1     Running     2 (23m ago)   89d
local-path-storage   local-path-provisioner-5646477f4b-pql69               1/1     Running     2 (23m ago)   89d
rabbitmq-system      messaging-topology-operator-74c896bb55-5h66m          1/1     Running     2 (23m ago)   89d
rabbitmq-system      rabbitmq-cluster-operator-586b7547f8-7vr9x            1/1     Running     2 (23m ago)   89d
vmware-functions     default-broker-ingress-bbf774754-xjzpz                1/1     Running     2 (23m ago)   89d
vmware-functions     sockeye-79b7fc7c55-tmjqk                              1/1     Running     2 (23m ago)   89d
vmware-functions     sockeye-trigger-dispatcher-5b96bdd8d-mjth6            1/1     Running     2 (23m ago)   89d
vmware-functions     vcsa-source-adapter-7d474dbbfb-xd7ms                  1/1     Running     0             7m2s
vmware-sources       horizon-source-controller-754d465b5b-ptbmw            1/1     Running     0             17m
vmware-sources       horizon-source-webhook-6d7cbd8dd-mtz2t                1/1     Running     0             17m
vmware-sources       vsphere-source-webhook-55c56cf48-x7fpw                1/1     Running     0             17m
vmware-system        cadvisor-jt8rk                                        1/1     Running     2 (23m ago)   89d
vmware-system        tinywww-5b795ddd75-pjds5                              1/1     Running     2 (23m ago)   89d
vmware-system        veba-rabbit-server-0                                  1/1     Running     4 (21m ago)   89d
vmware-system        veba-ui-846bb59f69-rk994                              1/1     Running     2 (23m ago)   89d

Specifically, we want to make sure the vSphere Sources adapter is running, as shown in the example below:

kubectl -n vmware-functions get deploy/vcsa-source-adapter
NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
vcsa-source-adapter   1/1     1            1           11m

We can check the logs of the vSphere source and ensure the last entry or so states it was able to log in by retrieving the VC time:

kubectl -n vmware-functions logs deploy/vcsa-source-adapter
{"level":"warn","ts":"2023-06-18T14:09:14.176Z","logger":"vsphere-source-adapter","caller":"v2/config.go:197","msg":"Tracing configuration is invalid, using the no-op default{error 26 0  empty json tracing config}","commit":"8fda92a-dirty"}
{"level":"warn","ts":"2023-06-18T14:09:14.177Z","logger":"vsphere-source-adapter","caller":"v2/config.go:190","msg":"Sink timeout configuration is invalid, default to -1 (no timeout)","commit":"8fda92a-dirty"}
{"level":"info","ts":"2023-06-18T14:09:14.261Z","logger":"vsphere-source-adapter","caller":"kvstore/kvstore_cm.go:54","msg":"Initializing configMapKVStore...","commit":"8fda92a-dirty"}
{"level":"info","ts":"2023-06-18T14:09:14.271Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:92","msg":"configuring checkpointing","commit":"8fda92a-dirty","ReplayWindow":"5m0s","Period":"10s"}
{"level":"warn","ts":"2023-06-18T14:09:14.271Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:131","msg":"could not retrieve checkpoint configuration","commit":"8fda92a-dirty","error":"key checkpoint does not exist"}
{"level":"info","ts":"2023-06-18T14:09:14.274Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:311","msg":"no valid checkpoint found","commit":"8fda92a-dirty"}
{"level":"info","ts":"2023-06-18T14:09:14.274Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:312","msg":"setting begin of event stream","commit":"8fda92a-dirty","beginTimestamp":"2023-06-18 14:09:14.260972 +0000 UTC"}
{"level":"info","ts":"2023-06-18T14:14:14.261Z","logger":"vsphere-source-adapter","caller":"vsphere/client.go:115","msg":"Executing SOAP keep-alive handler","commit":"8fda92a-dirty","rpc":"keepalive"}
{"level":"info","ts":"2023-06-18T14:14:14.265Z","logger":"vsphere-source-adapter","caller":"vsphere/client.go:121","msg":"vCenter current time: 2023-06-18 14:14:14.31874 +0000 UTC","commit":"8fda92a-dirty","rpc":"keepalive"}

At this point, you should also see events flowing into Sockeye by opening a browser to https://VEBA_FQDN/events

Lastly, to deploy or re-deploy your functions, you need to edit the function.yaml manifest and update the type field to something like: com.vmware.vsphere.<EVENTID>.v0 and remove the subject field.

....snip ....
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: veba-ps-slack-trigger
  labels:
    app: veba-ui
spec:
  broker: default
  filter:
    attributes:
      type: com.vmware.vsphere.DrsVmPoweredOnEvent.v0
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: kn-ps-slack

Let me know if you have any questions and hopefully this yields better results ...

JFYI - Tanzu Sources does not log all events at its default INFO level; you'll have to enable DEBUG if you wish to see vSphere events in the Tanzu Sources logs. But for now, let's see if this setup resolves the issue.

laugrean commented 1 year ago

Hey,

In my scenario, VEBA is running as a Helm chart deployment on Red Hat OpenShift. I have multiple vCenters connected to the same broker within the same namespace.

The installation comes directly from your github repo: https://github.com/vmware-samples/vcenter-event-broker-appliance.git

Should I uninstall the helm chart in step 1?

Is the release.yaml mentioned in step 2 based on Tanzu specifics, or does it work on any Kubernetes? Currently my Helm deployment and all triggers + functions are installed in the same namespace. If I use it, is it sufficient to delete the namespace creation part and replace all namespaces with the one I want it installed in?

embano1 commented 1 year ago

Should I uninstall the helm chart in step 1?

Without touching your prod environment, you could also deploy a separate VEBA instance/Kubernetes environment and just deploy the sources to a broker without additional triggers to see if the sources continue to run when you see the router stopping in the other appliance.

But you can also run Sources in parallel to your existing setup by installing as described above, without having to uninstall your router. Depending on your configured triggers, this can lead to duplicate events though, so just be careful.

laugrean commented 1 year ago

Did it on my test environment. First problem: we always need OpenShift cluster-admin permissions. But here we go:

{"level":"warn","ts":"2023-06-19T07:31:42.597Z","logger":"vsphere-source-adapter","caller":"v2/config.go:197","msg":"Tracing configuration is invalid, using the no-op default{error 26 0 empty json tracing config}","commit":"8fda92a-dirty"}
{"level":"warn","ts":"2023-06-19T07:31:42.597Z","logger":"vsphere-source-adapter","caller":"v2/config.go:190","msg":"Sink timeout configuration is invalid, default to -1 (no timeout)","commit":"8fda92a-dirty"}
{"level":"info","ts":"2023-06-19T07:31:42.701Z","logger":"vsphere-source-adapter","caller":"kvstore/kvstore_cm.go:54","msg":"Initializing configMapKVStore...","commit":"8fda92a-dirty"}
{"level":"info","ts":"2023-06-19T07:31:42.715Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:92","msg":"configuring checkpointing","commit":"8fda92a-dirty","ReplayWindow":"5m0s","Period":"10s"}
{"level":"warn","ts":"2023-06-19T07:31:42.715Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:131","msg":"could not retrieve checkpoint configuration","commit":"8fda92a-dirty","error":"key checkpoint does not exist"}
{"level":"info","ts":"2023-06-19T07:31:42.725Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:311","msg":"no valid checkpoint found","commit":"8fda92a-dirty"}
{"level":"info","ts":"2023-06-19T07:31:42.725Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:312","msg":"setting begin of event stream","commit":"8fda92a-dirty","beginTimestamp":"2023-06-19 07:31:42.72101 +0000 UTC"}

It's connected, but I cannot see any events within sockeye.

embano1 commented 1 year ago

To keep things simple for now, try enabling DEBUG log level to see if it's receiving/doing something: https://github.com/vmware-tanzu/sources-for-knative#source-adapter-log-level

laugrean commented 1 year ago

Finally I made it work; it did not work with --sink-uri, but with --sink-name:

kn vsphere source create \
    --namespace xxx\
    --name vcsa-source \
    --vc-address https://${VCENTER_HOSTNAME} \
    --skip-tls-verify \
    --secret-ref vcenter-creds \
    --sink-api-version eventing.knative.dev/v1 \
    --sink-kind Broker \
    --sink-name default \
    --encoding json

Now the vSphere event source and two routers are running against the same vCenter.

I'll let you know when I see the event router problem again, and whether the vcsa-source survives.

laugrean commented 1 year ago

Both routers connected to my vCenter got stuck again at the same point in time, but the vcsa-source still receives events.

embano1 commented 1 year ago

Both routers connected to my vCenter got stuck again at the same point in time, but the vcsa-source still receives events.

Wait, sources and router are not sending anything anymore? Can you please enable DEBUG logging in the sources and dump output here when this happens?

laugrean commented 1 year ago

The source is still working; the routers stopped working.

My setup is:

router with INFO mode against vCenter -> stopped receiving events (vmware-router-74f7c779f4-cv82r)
router with DEBUG mode against vCenter -> stopped receiving events** (vmware-router-debug-668b6dfd75-mdm7)
event source with DEBUG mode against vCenter -> still working (vcsa-source-adapter-84c75c57f6-kq4x2)

I've set up two routers to check whether they get stuck at the same time -> yes, they do. Your suggested solution to switch to Tanzu Sources has now been running for more than one day without any problems.

embano1 commented 1 year ago

stopped receiving events**

What do these ** mean? Do you have some logs here again please?

Your suggested solution to switch to Tanzu Sources has now been running for more than one day without any problems.

This is good news!

laugrean commented 1 year ago

What do these ** mean? Do you have some logs here again please?

I just missed removing the bold markup at the end.

Logs from the DEBUG node:

2023-06-21T06:03:21.132Z INFO [KNATIVE] knative/knative.go:194 successfully sent event {"eventID": "724e457d-28f1-428a-a0c5-a419358362bd"}
2023-06-21T06:03:21.910Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 1}
2023-06-21T06:03:22.960Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 2}
2023-06-21T06:03:25.011Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 4}
2023-06-21T06:03:29.066Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 5}
2023-06-21T06:03:34.120Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 5}
... and so on

Last entry for the INFO node; after that, nothing else:

2023-06-21T06:03:21.568Z INFO [KNATIVE] knative/knative.go:194 successfully sent event {"eventID": "f0ec150b-2e4e-45e4-9452-53a31284779b"}

Source logs:

{"level":"debug","ts":"2023-06-21T06:03:17.000Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:203","msg":"backing off retrieving events: no new events received","commit":"8fda92a-dirty","backoffSeconds":5}
{"level":"debug","ts":"2023-06-21T06:03:22.096Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:208","msg":"got 3 events","commit":"8fda92a-dirty"}
{"level":"debug","ts":"2023-06-21T06:03:22.096Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:268","msg":"sending event","commit":"8fda92a-dirty","ID":"395017269","type":"com.vmware.vsphere.DrsSoftRuleViolationEvent.v0","data":"xxx"}
{"level":"debug","ts":"2023-06-21T06:03:22.100Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:268","msg":"sending event","commit":"8fda92a-dirty","ID":"395017270","type":"com.vmware.vsphere.DrsSoftRuleViolationEvent.v0","data":"xxx"}
{"level":"debug","ts":"2023-06-21T06:03:22.101Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:268","msg":"sending event","commit":"8fda92a-dirty","ID":"395017271","type":"com.vmware.vsphere.DrsSoftRuleViolationEvent.v0","data":"xxx"}
{"level":"debug","ts":"2023-06-21T06:03:22.153Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:203","msg":"backing off retrieving events: no new events received","commit":"8fda92a-dirty","backoffSeconds":1}
{"level":"debug","ts":"2023-06-21T06:03:23.208Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:203","msg":"backing off retrieving events: no new events received","commit":"8fda92a-dirty","backoffSeconds":2}

During a short investigation I could see this problem in Aria Logs; not sure if it's related. I didn't have time to investigate further.

vpxd-main - - - 2023-06-21T08:03:21.399+02:00 warning vpxd[15384] [Originator@6876 sub=vmomi.soapStub[826] opID=TaskLoop-host-8441071] SOAP request returned HTTP failure; <SSL

The last time it happened, we could see:

info vpxd[13645] [Originator@6876 sub=HttpSvc.HTTPService.HttpConnection] HTTP Connection read failed while waiting for further requests; <io_obj p:0x00007f9594563e60, h:-1, <UNIX '/var/run/vmware/vpxd-webserver-pipe'>, <UNIX ''>>, N7Vmacore16TimeoutExceptionE(Operation timed out: Stream: <io_obj p:0x00007f9594563e60, h:-1, <UNIX '/var/run/vmware/vpxd-webserver-pipe'>, <UNIX ''>>, duration: 00:00:48.994720 (hh:mm:ss.us))\n--> [context]zKq7AVECAQAAAH1IQAEMdnB4ZAAA7tg3bGlidm1hY29yZS5zbwAAp3gsACVsLQAL6TIANyQtAFQEIwAjqCMAp0kjAE+fIwCsZDcBh38AbGlicHRocmVhZC5zby4wAAIvNg9saWJjLnNvLjYA[/context]
embano1 commented 1 year ago

Good news: we don't have a deadlock :) My suspicion is that, given Sources are working, the router runs into session timeout issues. It looks like session keep-alive is not implemented in the router's code base, but I have done it for Sources. That would explain why there are no new events, and why a restart fixes it.

lamw commented 1 year ago

1075

laugrean commented 1 year ago

Unfortunately, our Kubernetes admins declined the workaround mentioned above for our production environments.

Currently we have everything installed within a single namespace, but several ClusterRoleBindings are needed. The problem is cluster-wide permissions like these:

  configmaps                                                    []                 []              [get list create update delete patch watch]
  events                                                        []                 []              [get list create update delete patch watch]
  secrets                                                       []                 []              [get list create update delete patch watch]
  serviceaccounts                                               []                 []              [get list create update delete patch watch]
  services                                                      []                 []              [get list create update delete patch watch]
  mutatingwebhookconfigurations.admissionregistration.k8s.io    []                 []              [get list create update delete patch watch]
  validatingwebhookconfigurations.admissionregistration.k8s.io  []                 []              [get list create update delete patch watch]
  customresourcedefinitions.apiextensions.k8s.io                []                 []              [get list create update delete patch watch]
  deployments.apps/finalizers                                   []                 []              [get list create update delete patch watch]
  deployments.apps                                              []                 []              [get list create update delete patch watch]
  leases.coordination.k8s.io                                    []                 []              [get list create update delete patch watch]
  rolebindings.rbac.authorization.k8s.io                        []                 []              [get list create update delete patch watch]

Any idea if this can be changed to dedicated namespaces only?
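For comparison, the namespaced subset of those permissions could in principle be granted via a Role/RoleBinding instead of a ClusterRole (a hedged sketch only; the name and namespace are placeholders, cluster-scoped resources such as CRDs and webhook configurations cannot be namespaced this way, and whether the controllers actually function with namespace-scoped permissions is exactly the open question here):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: sources-namespaced    # placeholder name
  namespace: vmware-sources   # placeholder namespace
rules:
  - apiGroups: [""]
    resources: [configmaps, events, secrets, serviceaccounts, services]
    verbs: [get, list, create, update, delete, patch, watch]
  - apiGroups: [apps]
    resources: [deployments, deployments/finalizers]
    verbs: [get, list, create, update, delete, patch, watch]
  - apiGroups: [coordination.k8s.io]
    resources: [leases]
    verbs: [get, list, create, update, delete, patch, watch]
  - apiGroups: [rbac.authorization.k8s.io]
    resources: [rolebindings]
    verbs: [get, list, create, update, delete, patch, watch]
```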

embano1 commented 1 year ago

@laugrean that is understandable. I'll have @lamw and the VMware team jump in to comment on the topic of RBAC here.

lamw commented 1 year ago

@gabo1208 See above ^

laugrean commented 1 year ago

I'm now seeing the same problem again with the vcsa-source-adapter. After a few days I no longer receive any events. Sockeye is empty for the affected source adapter. These are the entries in the log:

{"level":"info","ts":"2023-09-05T05:26:19.279Z","logger":"vsphere-source-adapter","caller":"vsphere/client.go:115","msg":"Executing SOAP keep-alive handler","commit":"8fda92a-dirty","rpc":"keepalive"}
{"level":"info","ts":"2023-09-05T05:26:19.289Z","logger":"vsphere-source-adapter","caller":"vsphere/client.go:121","msg":"vCenter current time: 2023-09-05 05:26:19.285489 +0000 UTC","commit":"8fda92a-dirty","rpc":"keepalive"}
{"level":"info","ts":"2023-09-05T05:31:19.291Z","logger":"vsphere-source-adapter","caller":"vsphere/client.go:115","msg":"Executing SOAP keep-alive handler","commit":"8fda92a-dirty","rpc":"keepalive"}
{"level":"info","ts":"2023-09-05T05:31:19.302Z","logger":"vsphere-source-adapter","caller":"vsphere/client.go:121","msg":"vCenter current time: 2023-09-05 05:31:19.298007 +0000 UTC","commit":"8fda92a-dirty","rpc":"keepalive"}

When I kill the vcsa-source-adapter pod I see this message:

{"level":"info","ts":"2023-09-05T05:32:49.857Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:296","msg":"found existing checkpoint","commit":"8fda92a-dirty"}
{"level":"warn","ts":"2023-09-05T05:32:49.857Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:299","msg":"potential data loss: last event timestamp in checkpoint is older than configured maximum","commit":"8fda92a-dirty","maxHistory":"5m0s","checkpointTimestamp":"2023-08-25 06:09:09.535 +0000 UTC"}
{"level":"warn","ts":"2023-09-05T05:32:49.857Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:303","msg":"setting begin of event stream","commit":"8fda92a-dirty","beginTimestamp":"2023-09-05 05:27:49.853811 +0000 UTC"}

And everything is back to normal... for at least a few days

embano1 commented 1 year ago

IIUC, when you enable debug logging in the adapter, it tells you that it's not receiving new events, correct? If that's the case, and given the info logs above showing keep-alives executing correctly, I'm afraid it's definitely a server-side issue, as nothing is coming through the EventHistoryCollector stream anymore. The server might close/cancel that stream after some time, or there is a network issue and the session is not correctly re-established. Can't tell from this side. We'd definitely need someone from engineering to look into the server-side behavior, and someone from govmomi (Go SDK) to look into what happens with the stream under the covers, e.g. whether it is cleanly terminated by the server, or whether a network interruption leaves a dangling EventHistoryCollector session where the client believes it's still intact but no data is received anymore. IIUC, there's also no heartbeat mechanism in the protocol through which the client could detect that the stream is effectively gone.

BTW: it seems you're running 8fda92a-dirty, which is from before the URL fix. You might want to update your version of Sources.

laugrean commented 1 year ago

Hey,

Sorry for the late reply.

I have now seen the following scenario:

  1. killed the vSphere source pod (because of the no-events problem)
  2. it restarted
  3. it received 100 events
  4. it stopped working
{"level":"warn","ts":"2023-09-29T06:24:07.287Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:303","msg":"setting begin of event stream","commit":"8fda92a-dirty","beginTimestamp":"2023-09-29 06:19:07.283263 +0000 UTC"}
{"level":"debug","ts":"2023-09-29T06:24:07.515Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:208","msg":"got 100 events","commit":"8fda92a-dirty"}
<all events listed>
{"level":"debug","ts":"2023-09-29T06:24:07.671Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:203","msg":"backing off retrieving events: no new events received","commit":"8fda92a-dirty","backoffSeconds":1}
{"level":"debug","ts":"2023-09-29T06:24:08.731Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:203","msg":"backing off retrieving events: no new events received","commit":"8fda92a-dirty","backoffSeconds":2}
{"level":"debug","ts":"2023-09-29T06:24:10.797Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:203","msg":"backing off retrieving events: no new events received","commit":"8fda92a-dirty","backoffSeconds":4}
{"level":"debug","ts":"2023-09-29T06:24:14.870Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:203","msg":"backing off retrieving events: no new events received","commit":"8fda92a-dirty","backoffSeconds":5}
{"level":"debug","ts":"2023-09-29T06:24:19.873Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:185","msg":"creating checkpoint","commit":"8fda92a-dirty","checkpoint":{"vCenter":"vcenter.xxx","lastEventKey":425952999,"lastEventType":"com.vmware.vc.HardwareSensorGroupStatus","lastEventKeyTimestamp":"2023-09-29T06:21:29.574999Z","createdTimestamp":"2023-09-29T06:24:07.60603884Z"}}
{"level":"debug","ts":"2023-09-29T06:24:19.952Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:203","msg":"backing off retrieving events: no new events received","commit":"8fda92a-dirty","backoffSeconds":5}
{"level":"debug","ts":"2023-0

It was only running for a few seconds...

I've updated kn-vsphere to the latest release and will give it another try.

rguske commented 9 months ago

@laugrean any updates on this? A user recently reported that after introducing the VSphereSource (replacing the event-router), the issue (it stops working) was solved. See: https://vmwarecode.slack.com/archives/CQLT9B5AA/p1701877325178139

rguske commented 8 months ago

@jm66 @laugrean Any updates on this? Is the issue solved by using the Tanzu Sources for Knative (VSphereSource)?

jm66 commented 5 months ago

@rguske Finally got a chance to deploy the latest version after a few issues during the setup.sh script. I will provide an update soon.

Something I noticed is that I cannot see the following events: BadUsernameSessionEvent and com.vmware.sso.LoginFailure.

Also, is the filter type no longer needed?

rguske commented 5 months ago

Hi @jm66, thanks for your answer.

Something I noticed is that I cannot see the following events: BadUsernameSessionEvent and com.vmware.sso.LoginFailure.

You can see all available events here: EVENTS

Also, is the filter type no longer needed?

type: is still valid, see:

Reference: Docs

apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: veba-ps-echo-trigger
  labels:
    app: veba-ui
spec:
  broker: default
  filter:
    attributes:
      type: com.vmware.vsphere.VmPoweredOffEvent.v0
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: kn-ps-echo

Looking forward to reading your updates.

jm66 commented 5 months ago

Thanks. Yes, I see the event listed in the reference docs, but it doesn't show up in Sockeye at all.

In v0.7.5, type was something like event or eventx and subject was the actual event type; now type contains the v0.7.5 subject attribute value, with a .v0 suffix?

jm66 commented 5 months ago

You can see all available events here: EVENTS

Got it working. The key is that all events need to use the format com.vmware.vsphere.{event}.v0.
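In other words, the mapping from the old v0.7.5 subject to the new CloudEvent type attribute is just a fixed prefix and suffix around the event name. A hedged one-liner (the helper name is mine):

```go
package main

import "fmt"

// eventType converts a bare vSphere event name (the old v0.7.5 "subject",
// e.g. "VmPoweredOffEvent") into the new CloudEvent type format used for
// Trigger filters: com.vmware.vsphere.{event}.v0.
func eventType(subject string) string {
	return fmt.Sprintf("com.vmware.vsphere.%s.v0", subject)
}

func main() {
	fmt.Println(eventType("VmPoweredOffEvent")) // com.vmware.vsphere.VmPoweredOffEvent.v0
}
```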

Also the filtering type is not needed anymore?

I stand corrected. I meant to refer to subject. It looks like it isn't required anymore.

All my functions have been migrated to the new spec and appear to be working.