Closed jm66 closed 4 months ago
Howdy 👋 jm66 ! Thank you for your interest in this project. We value your feedback and will respond soon.
Hi @jm66 - when you say the `vmware-event-router` stops working, could you elaborate on what you're seeing? Is the pod running, or has it gone into a crash/restart state? Does this happen immediately after deployment or after a period of time? How large is your vCenter inventory? We'll need a bit more information to understand the issue.
Also, if you're on the VEBA Slack channel, we can chat further to diagnose the issue.
Our env is around 3-4K VMs. The pod looks healthy, but no events are shown in the pod logs nor in the `/events` resource. I tried joining months ago but had no luck. Would you mind sending an invite for the Slack channel to Jm.Lopez (at) utoronto.ca?
OK, let's take a look at the setup to see what's going on. I've added your email to the invite; you should get an email to complete the signup, and then you can join https://vmwarecode.slack.com/archives/CQLT9B5AA
Just joined. Thanks. Any particular channel?
Yes, the one I linked above (that should take you to our VEBA channel)
I wonder if it could be related to this? https://github.com/vmware-samples/vcenter-event-broker-appliance/issues/809
> I wonder if it could be related to this? https://github.com/vmware-samples/vcenter-event-broker-appliance/issues/809
Most likely 👍🏻
@embano1 That's a good point! I was actually thinking it could be that but forgot I had a blog post about it.
@jm66 Could you check https://williamlam.com/2022/07/heads-up-potential-missing-vcenter-server-events-due-to-sequence-id-overflow.html and see if this is what you're observing?
Hey @lamw, I checked the post and the `chainId` is positive:

| chainId | int | 310662297 |
| --- | --- | --- |
Is it possible we experienced the issue in the past and vCenter self-healed?
vCenter doesn't "heal", but this `int32` wraps over time, so it could be that the underlying event collector used by the router is stuck. Could you please try restarting the router/VEBA? Enabling DEBUG logging can also reveal some insights. All assuming the router is actually running correctly.
@embano1 yeah, actually, to work around this we cron'd `/usr/bin/kubectl rollout restart deployment vmware-event-router-vcenter -n vmware-system` to run daily. So far the router hasn't gotten stuck.
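For anyone wanting to replicate that workaround, the cron entry might look like this (the 03:00 schedule is an assumption; the comment above only says daily):

```
# illustrative crontab entry; schedule assumed, adjust to taste
0 3 * * * /usr/bin/kubectl rollout restart deployment vmware-event-router-vcenter -n vmware-system
```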
Got it. Enabled `debug` and disabled the cron job.
@embano1 the `event-router` is stuck now:
```
kubectl get pods -n vmware-system
NAME READY STATUS RESTARTS AGE
cadvisor-rhdw8 1/1 Running 9 (7d1h ago) 31d
fluent-bit-rzbd4 1/1 Running 15 (7d1h ago) 31d
tinywww-669b487769-pr76s 1/1 Running 6 (7d1h ago) 31d
veba-rabbit-server-0 1/1 Running 9 (7d1h ago) 31d
veba-ui-846bb59f69-n7ln7 1/1 Running 6 (7d1h ago) 31d
vmware-event-router-vcenter-ffb58667c-pwj9z 1/1 Running 0 2d
vmware-event-router-webhook-755654855-dw7mc 1/1 Running 0 2d
```
Last event `UserLogoutSessionEvent` with key `310792526` was processed at `2023-05-18T17:15:14.667Z`.
Is that from the DEBUG log? Anything else in there?
Nothing relevant to my eyes, just the last event, but I could share the logs if you'd like.
```
2023-05-18T17:15:14.667Z DEBUG [KNATIVE] knative/knative.go:184 got response {"eventID": "1f98ca23-2fe3-4376-8606-3c1c965b9a23", "response": "202: "}
2023-05-18T17:15:14.667Z INFO [KNATIVE] knative/knative.go:194 successfully sent event {"eventID": "1f98ca23-2fe3-4376-8606-3c1c965b9a23"}
2023-05-18T17:15:14.667Z INFO [VCENTER] vcenter/vcenter.go:343 invoking processor {"eventID": "2971841d-73cd-48b0-a68e-bd7065b59dc9"}
2023-05-18T17:15:14.667Z DEBUG [KNATIVE] knative/knative.go:170 processing event {"eventID": "2971841d-73cd-48b0-a68e-bd7065b59dc9", "event": "Context Attributes,\n specversion: 1.0\n type: com.vmware.event.router/event\n source: https://vcdomain/sdk\n subject: UserLogoutSessionEvent\n id: 2971841d-73cd-48b0-a68e-bd7065b59dc9\n time: 2023-05-18T17:15:13.103999Z\n datacontenttype: application/json\nExtensions,\n vsphereapiversion: 7.0.3.0\nData,\n {\n \"Key\": 310792526,\n \"ChainId\": 310792526,\n \"CreatedTime\": \"2023-05-18T17:15:13.103999Z\",\n \"UserName\": \"VSKEY5\\\\vsswebservices\",\n \"Datacenter\": null,\n \"ComputeResource\": null,\n \"Host\": null,\n \"Vm\": null,\n \"Ds\": null,\n \"Net\": null,\n \"Dvs\": null,\n \"FullFormattedMessage\": \"User ****** logged out (login time: Thursday, 18 May, 2023 05:15:01 PM, number of API invocations: 4, user agent: Go-http-client/1.1)\",\n \"ChangeTag\": \"\",\n \"IpAddress\": \"*****\",\n \"UserAgent\": \"Go-http-client/1.1\",\n \"CallCount\": 4,\n \"SessionId\": \"5219944b-3cd6-66e9-4ff8-4e662962c304\",\n \"LoginTime\": \"2023-05-18T17:15:01.773485Z\"\n }\n"}
2023-05-18T17:15:14.667Z INFO [KNATIVE] knative/knative.go:182 sending event {"eventID": "2971841d-73cd-48b0-a68e-bd7065b59dc9", "subject": "UserLogoutSessionEvent"}
```
Hey @embano1, got this after submitting `kubectl rollout restart deployment vmware-event-router-vcenter -n vmware-system`:
```
root@veba [ ~ ]# 2023-05-19T15:47:25.339Z INFO [MAIN] router/main.go:212 initiating shutdown {"commit": "73f72694", "version": "v0.7.5"}
2023-05-19T15:47:25.339Z INFO [VCENTER] vcenter/vcenter.go:370 attempting graceful shutdown
2023-05-19T15:47:25.339Z DEBUG [KNATIVE] knative/knative.go:184 got response {"eventID": "2971841d-73cd-48b0-a68e-bd7065b59dc9", "response": "Post \"http://default-broker-ingress.vmware-functions.svc.cluster.local/\": context canceled"}
2023-05-19T15:47:25.339Z ERROR [VCENTER] vcenter/vcenter.go:347 could not process event {"event": "Context Attributes,\n specversion: 1.0\n type: com.vmware.event.router/event\n source: https://****/sdk\n subject: UserLogoutSessionEvent\n id: 2971841d-73cd-48b0-a68e-bd7065b59dc9\n time: 2023-05-18T17:15:13.103999Z\n datacontenttype: application/json\nExtensions,\n vsphereapiversion: 7.0.3.0\nData,\n {\n \"Key\": 310792526,\n \"ChainId\": 310792526,\n \"CreatedTime\": \"2023-05-18T17:15:13.103999Z\",\n \"UserName\": \"********\",\n \"Datacenter\": null,\n \"ComputeResource\": null,\n \"Host\": null,\n \"Vm\": null,\n \"Ds\": null,\n \"Net\": null,\n \"Dvs\": null,\n \"FullFormattedMessage\": \"User ******* logged out (login time: Thursday, 18 May, 2023 05:15:01 PM, number of API invocations: 4, user agent: Go-http-client/1.1)\",\n \"ChangeTag\": \"\",\n \"IpAddress\": \"******\",\n \"UserAgent\": \"Go-http-client/1.1\",\n \"CallCount\": 4,\n \"SessionId\": \"5219944b-3cd6-66e9-4ff8-4e662962c304\",\n \"LoginTime\": \"2023-05-18T17:15:01.773485Z\"\n }\n", "error": "knative: send event 2971841d-73cd-48b0-a68e-bd7065b59dc9: Post \"http://default-broker-ingress.vmware-functions.svc.cluster.local/\": context canceled"}
github.com/vmware-samples/vcenter-event-broker-appliance/vmware-event-router/internal/provider/vcenter.(*EventStream).processEvents
github.com/vmware-samples/vcenter-event-broker-appliance/vmware-event-router/internal/provider/vcenter/vcenter.go:347
github.com/vmware-samples/vcenter-event-broker-appliance/vmware-event-router/internal/provider/vcenter.(*EventStream).stream
github.com/vmware-samples/vcenter-event-broker-appliance/vmware-event-router/internal/provider/vcenter/vcenter.go:318
github.com/vmware-samples/vcenter-event-broker-appliance/vmware-event-router/internal/provider/vcenter.(*EventStream).Stream
github.com/vmware-samples/vcenter-event-broker-appliance/vmware-event-router/internal/provider/vcenter/vcenter.go:231
main.main.func3
github.com/vmware-samples/vcenter-event-broker-appliance/vmware-event-router/cmd/router/main.go:206
golang.org/x/sync/errgroup.(*Group).Go.func1
golang.org/x/sync@v0.1.0/errgroup/errgroup.go:75
```
That error is fine: during a rollout the current instance is terminated, and the code reacts gracefully to it (`context canceled`). The current in-flight event is then not correctly processed from the source (vCenter) and, depending on whether checkpointing is enabled, the code will resume from the last checkpoint or from "now" when it starts.
Even before the shutdown it was processing an event, so I was wondering whether the code/event stream works as expected then?
Hey,
Unfortunately, we're encountering the same problem.
Helm deployment:
- Chart: event-router-v0.7.6
- App version: v0.7.5
On our test environment everything has already been working for months.
With the info log level, it just stops logging anything:
```
2023-06-13T06:13:05.216Z INFO [KNATIVE] knative/knative.go:194 successfully sent event {"eventID": "4ab869ed-b2e0-4f90-9fad-00a5c1d83f82"}
2023-06-13T06:13:05.941Z INFO [VCENTER] vcenter/vcenter.go:343 invoking processor {"eventID": "06d60432-b9c9-4a3e-9f04-51bc8a875d10"}
2023-06-13T06:13:05.941Z INFO [KNATIVE] knative/knative.go:182 sending event {"eventID": "06d60432-b9c9-4a3e-9f04-51bc8a875d10", "subject": "com.vmware.vc.HA.NotAllHostAddrsPingable"}
2023-06-13T06:13:05.942Z INFO [KNATIVE] knative/knative.go:194 successfully sent event {"eventID": "06d60432-b9c9-4a3e-9f04-51bc8a875d10"}
```
In debug mode, we now only see these entries:
```
2023-06-13T06:35:50.917Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 5}
2023-06-13T06:35:55.973Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 5}
2023-06-13T06:36:01.026Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 5}
2023-06-13T06:36:06.072Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 5}
```
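The `delaySeconds: 1, 2, 4, 5, 5, ...` pattern visible in these debug logs is a doubling backoff capped at 5 seconds; a small sketch of that behavior (illustrative, not the router's actual code):

```go
package main

import "fmt"

// nextDelay doubles the polling delay whenever no new events arrive,
// capped at maxDelay. This reproduces the 1, 2, 4, 5, 5, ... pattern
// in the "backing off" debug logs.
func nextDelay(current, maxDelay int) int {
	if d := current * 2; d < maxDelay {
		return d
	}
	return maxDelay
}

func main() {
	delay := 1
	for i := 0; i < 5; i++ {
		fmt.Print(delay, " ")
		delay = nextDelay(delay, 5)
	}
	fmt.Println() // full sequence: 1 2 4 5 5
}
```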
A brief pod deletion fixes the problem, and everything is back to normal for at least a few minutes, sometimes hours...
```
2023-06-13T06:41:13.196Z INFO [KNATIVE] knative/knative.go:182 sending event {"eventID": "80e3ab5d-9dd3-47ef-8ad2-efe54ec86872", "subject": "com.vmware.vcIntegrity.ClusterMembershipChange"}
2023-06-13T06:41:13.197Z DEBUG [KNATIVE] knative/knative.go:184 got response {"eventID": "80e3ab5d-9dd3-47ef-8ad2-efe54ec86872", "response": "202: "}
2023-06-13T06:41:13.197Z INFO [KNATIVE] knative/knative.go:194 successfully sent event {"eventID": "80e3ab5d-9dd3-47ef-8ad2-efe54ec86872"}
2023-06-13T06:41:14.051Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 1}
2023-06-13T06:41:15.108Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 2}
2023-06-13T06:41:17.160Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 4}
2023-06-13T06:41:21.237Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 5}
2023-06-13T06:41:26.298Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 5}
2023-06-13T06:41:31.414Z INFO [VCENTER] vcenter/vcenter.go:343 invoking processor {"eventID": "24358daa-9903-4929-bceb-96d617560f1c"}
```
@embano1 would you also recommend folks here give Tanzu Sources for Knative a try to see if that helps?
> @embano1 would you also recommend folks here give Tanzu Sources for Knative a try to see if that helps?
You mean regarding the VC DB overflow issue? Sources won't help here, because it's using the same `EventHistoryCollector` mechanism as the router, and this is a server-side issue.
> You mean regarding the VC DB overflow issue?
No, I meant for @jm66's issue https://github.com/vmware-samples/vcenter-event-broker-appliance/issues/1054#issuecomment-1554782948. Are we saying this is related to the VC DB overflow? But there's also @laugrean's report, which isn't clearly VC DB overflow either, unless you're referring to this issue?
> No, I meant for @jm66 issue #1054 (comment) are we saying this is related to VC DB overflow?
See my response above: https://github.com/vmware-samples/vcenter-event-broker-appliance/issues/1054#issuecomment-1555717307. The initial issue description, though, seems to be a deadlock in the `router`.
> but there's also @laugrean report which isn't clear to me if it's VC DB overflow either unless you're referring to this issue?
IIRC, this is also a deadlock in the `router`.
@embano1 OK, since both of the reported issues seem to be a deadlock in the router ... then my initial comment about testing Tanzu Sources on VEBA would at least show whether that may help with the issue? If so, I'll put something quick/dirty together that'll go ahead and undeploy the router and set up Sources ...
> @embano1 OK, since both of the issues reported seems to be a deadlock in router ... then my initial comment about testing Tanzu Sources on VEBA would at least see if that may help with issue? If so, I'll put something quick/dirty together that'll go ahead and undeploy router and setup sources ...
Yup, we should definitely cross-check with Sources. They share a lot of code, so hopefully the issue is related to event processing/invocation, where the code differs.
@jm66 @laugrean To summarize the next steps: we would like to try un-deploying the event-router from within VEBA and deploying the Tanzu Sources for Knative, which include vSphere as a source (similar code, ported from VEBA to Tanzu Sources), to see if this issue is resolved.
The instructions below assume the VEBA appliance can go outbound to pull down some additional packages. A change is also required for your `function.yaml` deployment, as it uses a slightly different schema for how you subscribe to an event, which I will detail further below after the setup.
Step 0 - SSH to VEBA appliance
Step 1 - Undeploy Event Router
```
kubectl -n vmware-system delete -f /root/config/event-router/vmware-event-router-k8s-vcenter.yaml
```
Step 2 - Install Tanzu Sources for Knative
```
kubectl apply -f https://github.com/vmware-tanzu/sources-for-knative/releases/latest/download/release.yaml
```
Step 3 - Install Knative CLI & vSphere Sources
```
curl -L https://github.com/knative/client/releases/download/knative-v1.10.0/kn-linux-amd64 -o /usr/local/bin/kn
chmod +x /usr/local/bin/kn
curl -L https://github.com/vmware-tanzu/sources-for-knative/releases/download/v0.37.0/kn-vsphere_0.37.0_Linux_x86_64.tar.gz -o /root/kn-vsphere_0.37.0_Linux_x86_64.tar.gz
tar -zxvf /root/kn-vsphere_0.37.0_Linux_x86_64.tar.gz
mv /root/kn-vsphere_0.37.0_Linux_x86_64/kn-vsphere /usr/local/bin/kn-vsphere
chmod +x /usr/local/bin/kn-vsphere
rm -rf /root/kn-vsphere_0.37.0_Linux_x86_64*
```
Step 4 - Export VC Creds
```
export VCENTER_USERNAME="FILL_ME"
export VCENTER_PASSWORD="FILL_ME"
export VCENTER_HOSTNAME="FILL_ME"
```
Step 5 - Create vSphere Secret
```
kn vsphere auth create \
  --namespace vmware-functions \
  --username ${VCENTER_USERNAME} \
  --password ${VCENTER_PASSWORD} \
  --name vcenter-creds \
  --verify-url https://${VCENTER_HOSTNAME} \
  --verify-insecure
```
Step 6 - Create vSphere Source
```
kn vsphere source create \
  --namespace vmware-functions \
  --name vcsa-source \
  --vc-address https://${VCENTER_HOSTNAME} \
  --skip-tls-verify \
  --secret-ref vcenter-creds \
  --sink-uri http://default-broker-ingress.vmware-functions.svc.cluster.local \
  --encoding json
```
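For reference, the `kn vsphere source create` command above just creates a `VSphereSource` custom resource. A roughly equivalent manifest would look like the sketch below; treat the field names as an approximation of the sources-for-knative CRD and verify against the version you installed:

```yaml
apiVersion: sources.tanzu.vmware.com/v1alpha1
kind: VSphereSource
metadata:
  name: vcsa-source
  namespace: vmware-functions
spec:
  address: https://VCENTER_HOSTNAME   # placeholder, as in the export step above
  skipTLSVerify: true
  secretRef:
    name: vcenter-creds
  payloadEncoding: application/json
  sink:
    uri: http://default-broker-ingress.vmware-functions.svc.cluster.local
```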
If everything was set up successfully, you should see the following pods running:
```
# kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
cert-manager cert-manager-99bb69456-dhxcv 1/1 Running 2 (23m ago) 89d
cert-manager cert-manager-cainjector-ffb4747bb-8x7ck 1/1 Running 2 (23m ago) 89d
cert-manager cert-manager-webhook-545bd5d7d8-5fxtr 1/1 Running 2 (23m ago) 89d
contour-external contour-685f87dc74-ccdcn 1/1 Running 2 (23m ago) 89d
contour-external contour-685f87dc74-q24fx 1/1 Running 2 (23m ago) 89d
contour-external contour-certgen-v1.22.0-42c87 0/1 Completed 0 89d
contour-external envoy-jqzft 2/2 Running 4 (23m ago) 89d
contour-internal contour-c4478d89b-cvdk5 1/1 Running 2 (23m ago) 89d
contour-internal contour-c4478d89b-ws4tw 1/1 Running 2 (23m ago) 89d
contour-internal contour-certgen-v1.22.0-nn2r2 0/1 Completed 0 89d
contour-internal envoy-pkb7d 2/2 Running 4 (23m ago) 89d
knative-eventing eventing-controller-fdc4dd6bb-6zttl 1/1 Running 2 (23m ago) 89d
knative-eventing eventing-webhook-676dfb6c4f-hmnwz 1/1 Running 2 (23m ago) 89d
knative-eventing rabbitmq-broker-controller-54c85d4f98-b5vrv 1/1 Running 2 (23m ago) 89d
knative-eventing rabbitmq-broker-webhook-877b8d7df-bqd4t 1/1 Running 2 (23m ago) 89d
knative-serving activator-7cbbfbc85-4zg95 1/1 Running 2 (23m ago) 89d
knative-serving autoscaler-8f986cff-jgngg 1/1 Running 2 (23m ago) 89d
knative-serving controller-58dfb45d74-bz88p 1/1 Running 2 (23m ago) 89d
knative-serving domain-mapping-5d8db49bf6-8z96x 1/1 Running 2 (23m ago) 89d
knative-serving domainmapping-webhook-584476fd67-cdm5s 1/1 Running 2 (23m ago) 89d
knative-serving net-contour-controller-6768758c67-wb5jn 1/1 Running 2 (23m ago) 89d
knative-serving webhook-6d5c55fd8c-zl5zg 1/1 Running 2 (23m ago) 89d
kube-system antrea-agent-9xthc 2/2 Running 7 (23m ago) 89d
kube-system antrea-controller-6db8bb65cf-k8ltr 1/1 Running 3 (22m ago) 89d
kube-system coredns-565d847f94-dvrdf 1/1 Running 2 (23m ago) 89d
kube-system coredns-565d847f94-zmt46 1/1 Running 2 (23m ago) 89d
kube-system etcd-veba.primp-industries.local 1/1 Running 2 (23m ago) 89d
kube-system kube-apiserver-veba.primp-industries.local 1/1 Running 2 (23m ago) 89d
kube-system kube-controller-manager-veba.primp-industries.local 1/1 Running 2 (23m ago) 89d
kube-system kube-proxy-n22lb 1/1 Running 2 (23m ago) 89d
kube-system kube-scheduler-veba.primp-industries.local 1/1 Running 2 (23m ago) 89d
local-path-storage local-path-provisioner-5646477f4b-pql69 1/1 Running 2 (23m ago) 89d
rabbitmq-system messaging-topology-operator-74c896bb55-5h66m 1/1 Running 2 (23m ago) 89d
rabbitmq-system rabbitmq-cluster-operator-586b7547f8-7vr9x 1/1 Running 2 (23m ago) 89d
vmware-functions default-broker-ingress-bbf774754-xjzpz 1/1 Running 2 (23m ago) 89d
vmware-functions sockeye-79b7fc7c55-tmjqk 1/1 Running 2 (23m ago) 89d
vmware-functions sockeye-trigger-dispatcher-5b96bdd8d-mjth6 1/1 Running 2 (23m ago) 89d
vmware-functions vcsa-source-adapter-7d474dbbfb-xd7ms 1/1 Running 0 7m2s
vmware-sources horizon-source-controller-754d465b5b-ptbmw 1/1 Running 0 17m
vmware-sources horizon-source-webhook-6d7cbd8dd-mtz2t 1/1 Running 0 17m
vmware-sources vsphere-source-webhook-55c56cf48-x7fpw 1/1 Running 0 17m
vmware-system cadvisor-jt8rk 1/1 Running 2 (23m ago) 89d
vmware-system tinywww-5b795ddd75-pjds5 1/1 Running 2 (23m ago) 89d
vmware-system veba-rabbit-server-0 1/1 Running 4 (21m ago) 89d
vmware-system veba-ui-846bb59f69-rk994 1/1 Running 2 (23m ago) 89d
```
Specifically, we want to make sure the vSphere Sources adapter is running, as shown in the example below:
```
kubectl -n vmware-functions get deploy/vcsa-source-adapter
NAME READY UP-TO-DATE AVAILABLE AGE
vcsa-source-adapter 1/1 1 1 11m
```
We can check the logs of the vSphere source and ensure the last few entries show it was able to log in by retrieving the VC time:
```
kubectl -n vmware-functions logs deploy/vcsa-source-adapter
{"level":"warn","ts":"2023-06-18T14:09:14.176Z","logger":"vsphere-source-adapter","caller":"v2/config.go:197","msg":"Tracing configuration is invalid, using the no-op default{error 26 0 empty json tracing config}","commit":"8fda92a-dirty"}
{"level":"warn","ts":"2023-06-18T14:09:14.177Z","logger":"vsphere-source-adapter","caller":"v2/config.go:190","msg":"Sink timeout configuration is invalid, default to -1 (no timeout)","commit":"8fda92a-dirty"}
{"level":"info","ts":"2023-06-18T14:09:14.261Z","logger":"vsphere-source-adapter","caller":"kvstore/kvstore_cm.go:54","msg":"Initializing configMapKVStore...","commit":"8fda92a-dirty"}
{"level":"info","ts":"2023-06-18T14:09:14.271Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:92","msg":"configuring checkpointing","commit":"8fda92a-dirty","ReplayWindow":"5m0s","Period":"10s"}
{"level":"warn","ts":"2023-06-18T14:09:14.271Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:131","msg":"could not retrieve checkpoint configuration","commit":"8fda92a-dirty","error":"key checkpoint does not exist"}
{"level":"info","ts":"2023-06-18T14:09:14.274Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:311","msg":"no valid checkpoint found","commit":"8fda92a-dirty"}
{"level":"info","ts":"2023-06-18T14:09:14.274Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:312","msg":"setting begin of event stream","commit":"8fda92a-dirty","beginTimestamp":"2023-06-18 14:09:14.260972 +0000 UTC"}
{"level":"info","ts":"2023-06-18T14:14:14.261Z","logger":"vsphere-source-adapter","caller":"vsphere/client.go:115","msg":"Executing SOAP keep-alive handler","commit":"8fda92a-dirty","rpc":"keepalive"}
{"level":"info","ts":"2023-06-18T14:14:14.265Z","logger":"vsphere-source-adapter","caller":"vsphere/client.go:121","msg":"vCenter current time: 2023-06-18 14:14:14.31874 +0000 UTC","commit":"8fda92a-dirty","rpc":"keepalive"}
```
At this point, you should also see events flowing into Sockeye by opening a browser to https://VEBA_FQDN/events
Lastly, to deploy or re-deploy your functions, you need to edit the `function.yaml` manifest, update the `type` field to something like `com.vmware.vsphere.<EVENTID>.v0`, and remove the `subject` field.
```yaml
# ....snip ....
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: veba-ps-slack-trigger
  labels:
    app: veba-ui
spec:
  broker: default
  filter:
    attributes:
      type: com.vmware.vsphere.DrsVmPoweredOnEvent.v0
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: kn-ps-slack
```
Let me know if you have any questions and hopefully this yields better results ...
JFYI - by default, Tanzu Sources does not log all events at its INFO logging level; you'll have to enable DEBUG if you wish to see vSphere events in the Tanzu Sources logs. For now, though, let's see if this setup resolves the issue.
Hey,
In my scenario, VEBA is running as a Helm chart deployment on Red Hat OpenShift. I have multiple vCenters connected to the same broker within the same namespace.
The installation comes directly from your GitHub repo: https://github.com/vmware-samples/vcenter-event-broker-appliance.git
Should I uninstall the helm chart in step 1?
Is the release.yaml mentioned in step 2 based on Tanzu specifics, or does it work on any Kubernetes? Currently, my Helm deployment and all triggers + functions are installed in the same namespace. If I use it, is it sufficient to delete the namespace-creation part and replace all namespaces with the one I want it installed in?
> Should I uninstall the helm chart in step 1?
Without touching your prod environment, you could also deploy a separate VEBA instance/Kubernetes environment and just deploy the sources to a broker without additional triggers, to see if the sources continue to run when you see the `router` stopping in the other appliance.
But you can also run Sources in parallel to your existing setup by installing as described above, without having to uninstall your router. Depending on your configured triggers, this can lead to duplicate events though, so just be careful.
Did it on my test environment. First problem: we always need OpenShift cluster-admin permissions. But here we go:
```
{"level":"warn","ts":"2023-06-19T07:31:42.597Z","logger":"vsphere-source-adapter","caller":"v2/config.go:197","msg":"Tracing configuration is invalid, using the no-op default{error 26 0 empty json tracing config}","commit":"8fda92a-dirty"}
{"level":"warn","ts":"2023-06-19T07:31:42.597Z","logger":"vsphere-source-adapter","caller":"v2/config.go:190","msg":"Sink timeout configuration is invalid, default to -1 (no timeout)","commit":"8fda92a-dirty"}
{"level":"info","ts":"2023-06-19T07:31:42.701Z","logger":"vsphere-source-adapter","caller":"kvstore/kvstore_cm.go:54","msg":"Initializing configMapKVStore...","commit":"8fda92a-dirty"}
{"level":"info","ts":"2023-06-19T07:31:42.715Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:92","msg":"configuring checkpointing","commit":"8fda92a-dirty","ReplayWindow":"5m0s","Period":"10s"}
{"level":"warn","ts":"2023-06-19T07:31:42.715Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:131","msg":"could not retrieve checkpoint configuration","commit":"8fda92a-dirty","error":"key checkpoint does not exist"}
{"level":"info","ts":"2023-06-19T07:31:42.725Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:311","msg":"no valid checkpoint found","commit":"8fda92a-dirty"}
{"level":"info","ts":"2023-06-19T07:31:42.725Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:312","msg":"setting begin of event stream","commit":"8fda92a-dirty","beginTimestamp":"2023-06-19 07:31:42.72101 +0000 UTC"}
```
It's connected, but I cannot see any events within Sockeye.
To keep things simple for now, try enabling the `DEBUG` log level to see if it's receiving/doing something:
https://github.com/vmware-tanzu/sources-for-knative#source-adapter-log-level
Finally I made it work; it did not work with the sink URI, but with the sink name:
```
kn vsphere source create \
  --namespace xxx \
  --name vcsa-source \
  --vc-address https://${VCENTER_HOSTNAME} \
  --skip-tls-verify \
  --secret-ref vcenter-creds \
  --sink-api-version eventing.knative.dev/v1 \
  --sink-kind Broker \
  --sink-name default \
  --encoding json
```
Now the vSphere event source and two routers are running against the same vCenter.
I'll let you know when I see the event-router problem again, and whether the vcsa source survives.
Both routers connected to my vCenter got stuck again at the same point in time, but the vcsa source still receives events.
> Both routers connected to my vCenter got stuck again at the same point in time, but the vcsa source still receives events
Wait, `sources` and `router` are not sending anything anymore? Can you please enable `DEBUG` logging in the `sources` and dump the output here when this happens?
The source is still working; the routers stopped working.
My setup is:
- router with info mode against vCenter -> stopped receiving events (vmware-router-74f7c779f4-cv82r)
- router with debug mode against vCenter -> stopped receiving events** (vmware-router-debug-668b6dfd75-mdm7)
- event source with debug mode against vCenter -> still working (vcsa-source-adapter-84c75c57f6-kq4x2)
I've set up two routers to check if they get stuck at the same time -> yes, they do. Your suggested solution to switch to Tanzu Sources has now been running for more than one day without any problems.
> stopped receiving events**
What do these `**` mean? And do you have some logs again, please?
> Your provided solution to switch to Tanzu Sources is now running for more than one day without any problems.
This is good news!
> What do these ** mean? Do you have some logs here again please?
I just forgot to remove the bold markup at the end.
Logs from the debug node:
```
2023-06-21T06:03:21.132Z INFO [KNATIVE] knative/knative.go:194 successfully sent event {"eventID": "724e457d-28f1-428a-a0c5-a419358362bd"}
2023-06-21T06:03:21.910Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 1}
2023-06-21T06:03:22.960Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 2}
2023-06-21T06:03:25.011Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 4}
2023-06-21T06:03:29.066Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 5}
2023-06-21T06:03:34.120Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 5}
```
... and so on
Last entry for the info node; after that, nothing else:
```
2023-06-21T06:03:21.568Z INFO [KNATIVE] knative/knative.go:194 successfully sent event {"eventID": "f0ec150b-2e4e-45e4-9452-53a31284779b"}
```
Source logs:
```
{"level":"debug","ts":"2023-06-21T06:03:17.000Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:203","msg":"backing off retrieving events: no new events received","commit":"8fda92a-dirty","backoffSeconds":5}
{"level":"debug","ts":"2023-06-21T06:03:22.096Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:208","msg":"got 3 events","commit":"8fda92a-dirty"}
{"level":"debug","ts":"2023-06-21T06:03:22.096Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:268","msg":"sending event","commit":"8fda92a-dirty","ID":"395017269","type":"com.vmware.vsphere.DrsSoftRuleViolationEvent.v0","data":"xxx"}
{"level":"debug","ts":"2023-06-21T06:03:22.100Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:268","msg":"sending event","commit":"8fda92a-dirty","ID":"395017270","type":"com.vmware.vsphere.DrsSoftRuleViolationEvent.v0","data":"xxx"}
{"level":"debug","ts":"2023-06-21T06:03:22.101Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:268","msg":"sending event","commit":"8fda92a-dirty","ID":"395017271","type":"com.vmware.vsphere.DrsSoftRuleViolationEvent.v0","data":"xxx"}
{"level":"debug","ts":"2023-06-21T06:03:22.153Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:203","msg":"backing off retrieving events: no new events received","commit":"8fda92a-dirty","backoffSeconds":1}
{"level":"debug","ts":"2023-06-21T06:03:23.208Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:203","msg":"backing off retrieving events: no new events received","commit":"8fda92a-dirty","backoffSeconds":2}
```
During a short investigation I saw this problem in Aria Logs; not sure if it's related. I did not have time to investigate further.
```
vpxd-main - - - 2023-06-21T08:03:21.399+02:00 warning vpxd[15384] [Originator@6876 sub=vmomi.soapStub[826] opID=TaskLoop-host-8441071] SOAP request returned HTTP failure; <SSL
```
The last entry we could see:
```
info vpxd[13645] [Originator@6876 sub=HttpSvc.HTTPService.HttpConnection] HTTP Connection read failed while waiting for further requests; <io_obj p:0x00007f9594563e60, h:-1, <UNIX '/var/run/vmware/vpxd-webserver-pipe'>, <UNIX ''>>, N7Vmacore16TimeoutExceptionE(Operation timed out: Stream: <io_obj p:0x00007f9594563e60, h:-1, <UNIX '/var/run/vmware/vpxd-webserver-pipe'>, <UNIX ''>>, duration: 00:00:48.994720 (hh:mm:ss.us))\n--> [context]zKq7AVECAQAAAH1IQAEMdnB4ZAAA7tg3bGlidm1hY29yZS5zbwAAp3gsACVsLQAL6TIANyQtAFQEIwAjqCMAp0kjAE+fIwCsZDcBh38AbGlicHRocmVhZC5zby4wAAIvNg9saWJjLnNvLjYA[/context]
```
Good news: we don't have a deadlock :) My suspicion here is that, given `Sources` are working, the `router` runs into session timeout issues. It looks like session keep-alive handling is not implemented in the router code base, but I have done it for `Sources`. That would explain why there are no new events, which is fixed after a restart.
Unfortunately, our Kubernetes admins declined the above-mentioned workaround for our production environments.
Currently we have everything installed within one namespace, but several ClusterRoleBindings are needed. The problem is cluster-wide permissions like:
```
configmaps [] [] [get list create update delete patch watch]
events [] [] [get list create update delete patch watch]
secrets [] [] [get list create update delete patch watch]
serviceaccounts [] [] [get list create update delete patch watch]
services [] [] [get list create update delete patch watch]
mutatingwebhookconfigurations.admissionregistration.k8s.io [] [] [get list create update delete patch watch]
validatingwebhookconfigurations.admissionregistration.k8s.io [] [] [get list create update delete patch watch]
customresourcedefinitions.apiextensions.k8s.io [] [] [get list create update delete patch watch]
deployments.apps/finalizers [] [] [get list create update delete patch watch]
deployments.apps [] [] [get list create update delete patch watch]
leases.coordination.k8s.io [] [] [get list create update delete patch watch]
rolebindings.rbac.authorization.k8s.io [] [] [get list create update delete patch watch]
```
Any idea if this can be changed to dedicated namespaces only?
@laugrean that is understandable. I'll have @lamw and the VMware team jump in to comment on the topic of RBAC here.
@gabo1208 See above ^
I now have the same problem again, with the vcsa-source-adapter. After a few days, I no longer receive any events. Sockeye is empty for the affected source adapter. These are the entries in the log:
```
{"level":"info","ts":"2023-09-05T05:26:19.279Z","logger":"vsphere-source-adapter","caller":"vsphere/client.go:115","msg":"Executing SOAP keep-alive handler","commit":"8fda92a-dirty","rpc":"keepalive"}
{"level":"info","ts":"2023-09-05T05:26:19.289Z","logger":"vsphere-source-adapter","caller":"vsphere/client.go:121","msg":"vCenter current time: 2023-09-05 05:26:19.285489 +0000 UTC","commit":"8fda92a-dirty","rpc":"keepalive"}
{"level":"info","ts":"2023-09-05T05:31:19.291Z","logger":"vsphere-source-adapter","caller":"vsphere/client.go:115","msg":"Executing SOAP keep-alive handler","commit":"8fda92a-dirty","rpc":"keepalive"}
{"level":"info","ts":"2023-09-05T05:31:19.302Z","logger":"vsphere-source-adapter","caller":"vsphere/client.go:121","msg":"vCenter current time: 2023-09-05 05:31:19.298007 +0000 UTC","commit":"8fda92a-dirty","rpc":"keepalive"}
```
When I kill the vcsa-source-adapter pod I see these messages:
{"level":"info","ts":"2023-09-05T05:32:49.857Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:296","msg":"found existing checkpoint","commit":"8fda92a-dirty"}
{"level":"warn","ts":"2023-09-05T05:32:49.857Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:299","msg":"potential data loss: last event timestamp in checkpoint is older than configured maximum","commit":"8fda92a-dirty","maxHistory":"5m0s","checkpointTimestamp":"2023-08-25 06:09:09.535 +0000 UTC"}
{"level":"warn","ts":"2023-09-05T05:32:49.857Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:303","msg":"setting begin of event stream","commit":"8fda92a-dirty","beginTimestamp":"2023-09-05 05:27:49.853811 +0000 UTC"}
And everything is back to normal... at least for a few days.
IIUC, when you enable `debug` logging in the adapter it will tell you that it's not receiving new events, correct? If that's the case, and given the `info` logs above showing the keep-alives executing correctly, I'm afraid it's definitely a server-side issue: nothing is coming through the `EventHistoryCollector` stream anymore. The server might close/cancel that stream after some time, or there could be a network issue where the session is not correctly re-established. Can't tell from this side. We'd definitely need someone from engineering to look into the server-side behavior, and someone from `govmomi` (the Go SDK) to look into what happens with the stream under the covers, e.g. whether it is cleanly terminated by the server or a network interruption leaves a dangling `EventHistoryCollector` session where the client believes it's still intact but no data is received anymore. IIUC, there's also no heartbeat mechanism in the protocol that would let the client detect that the stream is basically gone.
BTW: it seems you're running `8fda92a-dirty`, which is before the URL fix. You might want to update to the latest version of the Sources.
Hey,
sorry for the late reply.
I now had the following scenario:
{"level":"warn","ts":"2023-09-29T06:24:07.287Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:303","msg":"setting begin of event stream","commit":"8fda92a-dirty","beginTimestamp":"2023-09-29 06:19:07.283263 +0000 UTC"}
{"level":"debug","ts":"2023-09-29T06:24:07.515Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:208","msg":"got 100 events","commit":"8fda92a-dirty"}
<all events listed>
{"level":"debug","ts":"2023-09-29T06:24:07.671Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:203","msg":"backing off retrieving events: no new events received","commit":"8fda92a-dirty","backoffSeconds":1}
{"level":"debug","ts":"2023-09-29T06:24:08.731Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:203","msg":"backing off retrieving events: no new events received","commit":"8fda92a-dirty","backoffSeconds":2}
{"level":"debug","ts":"2023-09-29T06:24:10.797Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:203","msg":"backing off retrieving events: no new events received","commit":"8fda92a-dirty","backoffSeconds":4}
{"level":"debug","ts":"2023-09-29T06:24:14.870Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:203","msg":"backing off retrieving events: no new events received","commit":"8fda92a-dirty","backoffSeconds":5}
{"level":"debug","ts":"2023-09-29T06:24:19.873Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:185","msg":"creating checkpoint","commit":"8fda92a-dirty","checkpoint":{"vCenter":"vcenter.xxx","lastEventKey":425952999,"lastEventType":"com.vmware.vc.HardwareSensorGroupStatus","lastEventKeyTimestamp":"2023-09-29T06:21:29.574999Z","createdTimestamp":"2023-09-29T06:24:07.60603884Z"}}
{"level":"debug","ts":"2023-09-29T06:24:19.952Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:203","msg":"backing off retrieving events: no new events received","commit":"8fda92a-dirty","backoffSeconds":5}
{"level":"debug","ts":"2023-0
it was running for a few seconds only...
I've updated kn-vsphere to the latest release and will give it another try.
@laugrean any updates on this? We recently had a user report that after introducing the `VSphereSource` (replacing the `event-router`), the issue (events no longer flowing) was resolved.
See: https://vmwarecode.slack.com/archives/CQLT9B5AA/p1701877325178139
@jm66 @laugrean Any updates on this? Is the issue solved by using the Tanzu Sources for Knative (`VSphereSource`)?
@rguske Finally got a chance to deploy the latest version after a few issues during the `setup.sh` script. I will provide an update soon.
Something I noticed is that I cannot see the following events: `BadUsernameSessionEvent` and `com.vmware.sso.LoginFailure`.
Also, is the filtering `type` no longer needed?
Hi @jm66, thanks for your answer.
> Something I noticed is that I cannot see the following events: BadUsernameSessionEvent and com.vmware.sso.LoginFailure.
You can see all available events here: EVENTS
> Also the filtering type is not needed anymore?

`type:` is still valid, see:
Reference: Docs
```yaml
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: veba-ps-echo-trigger
  labels:
    app: veba-ui
spec:
  broker: default
  filter:
    attributes:
      type: com.vmware.vsphere.VmPoweredOffEvent.v0
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: kn-ps-echo
```
Looking forward to reading your updates.
Thanks. Yes, I see the event listed in the reference docs, but it is not showing up in Sockeye at all.
In v0.7.5, `type` was something like `event` or `eventex`, and `subject` was the actual event type; now `type` contains the v0.7.5 `subject` attribute value with a `.v0` suffix?
> You can see all available events here: EVENTS
Got it working. The key is that all events need to use the format `com.vmware.vsphere.{event}.v0`.
> Also the filtering type is not needed anymore?

I stand corrected. I meant to refer to `subject`. It looks like it isn't required anymore.
All my functions have been migrated to the new spec and appear to be working.
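For anyone else migrating: if I understand the change correctly, with the old `vmware-event-router` our Triggers filtered on `subject` (the raw event name), whereas with the `VSphereSource` the filter moves to `type` using the `com.vmware.vsphere.{event}.v0` format. An illustrative fragment (attribute values are examples, not a definitive mapping):

```yaml
# Old (event-router, v0.7.5): filter on the subject attribute
# filter:
#   attributes:
#     subject: VmPoweredOffEvent
#
# New (VSphereSource): filter on type with the .v0 suffix
filter:
  attributes:
    type: com.vmware.vsphere.VmPoweredOffEvent.v0
```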
Describe the bug
`vmware-event-router` stops working after two days. No errors are registered in the logs.

To Reproduce
N/A

Expected behavior
`vmware-event-router` runs continuously without interruption.

Screenshots
N/A

Version (please complete the following information):

Additional context
This behaviour only shows in our prod instance, whose event volume is greater than that of our testing instance.