vmware-samples / vcenter-event-broker-appliance

The VMware Event Broker Appliance Fling enables customers to unlock the hidden potential of events in their SDDC to easily create event-driven automation.

[BUG] Deployment error #1176

Open rumart opened 8 months ago

rumart commented 8 months ago

Describe the bug: The VEBA deployment doesn't finish and throws an error when deploying the RabbitMQ cluster.

To Reproduce: Steps to reproduce the behavior:
1. I deployed the OVA as described in the docs.
2. I waited for around 20 minutes, but none of the web endpoints work (Connection refused).

Expected behavior: The deployment finishes and the endpoints work.

Screenshots: Screenshot of bootstrap-debug.log

image

Version (please complete the following information):

Additional context: When troubleshooting, I saw that the deployment stopped in what seems to be the setup-05-knative.sh script.

I commented out scripts 1 through 4 in setup.sh and reran setup.sh

After a short while the script stopped with this message:

image

I checked the setup-05-knative.sh script and found that the VEBA_BOM_FILE variable is defined after it is used in the file:

image

The ytt command on line 44 uses $VEBA_BOM_FILE, but the variable is first defined on line 51.

I moved that line above line 44 and reran setup.sh

Now the deployment could finish and I can access the web endpoints
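A defensive pattern for the ordering problem described above, as a minimal sketch: assign the variable near the top of the script with a fallback, so that a stage re-run on its own does not trip over `set -u`. The default path used here is an assumption for illustration; the actual location of the BOM file on the appliance may differ.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed default path, for illustration only.
DEFAULT_BOM_FILE="/root/config/veba-bom.json"

# Keep a value set by an earlier stage (e.g. setup-04) if there is one,
# otherwise fall back to the default. The ":-" expansion is safe under "set -u".
VEBA_BOM_FILE="${VEBA_BOM_FILE:-$DEFAULT_BOM_FILE}"

echo "Using BOM file: ${VEBA_BOM_FILE}"
```

With this at the top of a stage script, every later reference to `$VEBA_BOM_FILE` resolves whether or not the previous stage ran in the same shell.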

github-actions[bot] commented 8 months ago

Howdy 🖐 rumart! Thank you for your interest in this project. We value your feedback and will respond soon.

rumart commented 8 months ago

Here's a screenshot of kubectl get pods -A before re-running the setup file

image

rguske commented 8 months ago

Hi @rumart, the VEBA_BOM_FILE variable is already set for the first time in setup-04-kubernetes.sh - HERE. I can see in your screenshot that the installation didn't finish successfully. For example, the vmware-sources namespace is missing. We've faced this issue before, and it should actually have been fixed with #1170. We have to dig into it.

rumart commented 8 months ago

Yeah, so when I comment out setup-04 it doesn't pick up the BOM variable. Nevertheless, since it gets defined in setup-05, could it just be moved up a bit? Or should it be removed altogether?

Thanks for looking into it

rguske commented 8 months ago

I don't think that the issue is caused by not setting the VEBA_BOM_FILE variable. We suspect that it is timing-related. Have you tried deploying it again? What kind of environment are you deploying VEBA to?

rumart commented 8 months ago

I agree, the VEBA_BOM_FILE issue occurs because I re-ran the script without running setup-04, which sets it the first time. I was thinking more of fixing that setup-05 file separately.

Anyways, I'm running it on a small home lab vSAN cluster. Have tried redeploying a few times, all stopping on the same error message.

I'll try to run it on a different env later tonight to see if that changes anything

rumart commented 8 months ago

I've tried on a single ESXi host not running anything else, with storage on NVMe. I've added more CPU and RAM to the appliance. It still errors out on the same step.

I SSH'd into the appliance as soon as it was available and tailed bootstrap-debug.log. The error failed calling webhook "defaulting.webhook.rabbitmq.eventing.knative.dev" happens after just a couple of minutes. As far as I understand, there's a 10-minute timeout on most of the commands?

rguske commented 8 months ago

IIRC, the 10 minutes are the default for the kubectl wait command if you don't specify --timeout separately. I really wonder about this issue. I deployed it in my homelab (2-node vSAN cluster) as well and it worked like a charm. Anyway, like I said, William had this issue before as well, but reordering the command executions did the trick. When I have time, I'll try to add another wait condition to the script(s) (if necessary!). Thanks @rumart
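For reference, the kubectl documentation lists 30s as the default for `kubectl wait --timeout`, so a 10-minute limit would have to be passed explicitly by the script. Where no built-in wait condition fits, a deadline can also be enforced in the script itself. A minimal sketch, assuming a hypothetical helper named `wait_until` (it is not part of the VEBA scripts):

```shell
#!/usr/bin/env bash

# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it exits 0 or TIMEOUT_SECONDS have elapsed.
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  # SECONDS is a bash built-in counting seconds since the shell started.
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Example call (illustrative only): poll for up to 10 minutes, every 5 seconds.
# wait_until 600 5 kubectl get namespace vmware-sources
```

Unlike a fixed sleep, this returns as soon as the condition holds and fails loudly when it never does.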

lamw commented 8 months ago

I suspect that the current "wait" conditions are actually passing, unless you log in and it looks to be waiting for the default 10m as mentioned by Robert. If it truly is a timing issue, we can always enhance the OVF properties to make that customizable, but I'm not sure that's actually the case, and we may need some other wait condition. If we can debug this further, Robert, then we can spin up a custom build to verify for @rumart

jm66 commented 6 months ago

Just like @rumart, the first error I got was:

Error from server (InternalError): error when creating "/root/config/knative/rabbit.yaml": Internal error occurred: failed calling webhook "defaulting.webhook.rabbitmq.eventing.knative.dev": failed to call webhook: Post "https://rabbitmq-broker-webhook.knative-eventing.svc:443/defaulting?timeout=2s": dial tcp 10.109.26.244:443: connect: connection refused

Second try, I increased the timeout value and kept going.

Third try stumbled upon the following:

/root/setup/setup-05-knative.sh: line 44: VEBA_BOM_FILE: unbound variable

Which I had to work around to keep the installation going.

rguske commented 4 months ago

@rumart I owe you a deep apology for not getting back to you earlier. Would you be open to troubleshooting your issue further? I've just added another wait condition to the setup-05-knative.sh script and have built a new appliance (test) version. I'd love to follow the deployment in your test environment. Maybe we could run a Zoom session? What really helps to get started is the following approach:

From there you can perfectly follow the progress.

image

The new build can be downloaded for testing purposes here: DOWNLOAD

rumart commented 4 months ago

Thanks @rguske. I've been busy with other things so haven't had the time myself. I'm very interested in troubleshooting further and getting this up and running.

rguske commented 4 months ago

Sure, just let me know when you have the time and ping me on Discord or Slack (CNCF Workspace). Looking forward to finding the root cause.

rumart commented 4 months ago

It seems I cannot download the test version.

rguske commented 4 months ago

I've authorized you now 👍🏻

benwa commented 4 months ago

Just to add in, yesterday, we were on vCenter 7.0.3 and I was able to deploy. Today, after an update to vCenter 8.0.2, I get the same error as @rumart.

rabbitmqcluster.rabbitmq.com/veba-rabbit created
Error from server (InternalError): error when creating "/root/config/knative/rabbit.yaml": Internal error occurred: failed calling webhook "defaulting.webhook.rabbitmq.eventing.knative.dev": failed to call webhook: Post "https://rabbitmq-broker-webhook.knative-eventing.svc:443/defaulting?timeout=2s": dial tcp 10.98.98.40:443: connect: connection refused

rguske commented 4 months ago

Thanks a lot for your input @benwa. I don't think this issue is related to the vSphere version, since the first "real" interaction with the vCenter Server is at line 22 in script 06, when the VSphereSource gets created. It really seems to be a timing issue. I'm still trying to find out which component probably needs a dedicated wait condition.

benwa commented 4 months ago

Welp, I redownloaded the OVA from the Flings site and ran a checksum. It was different. Redeployed and I'm all good now.
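A quick way to rule out a corrupted download before deploying is to compare the file's SHA-256 against a known-good value. This is a minimal sketch, assuming a reference checksum is available from the download source and that `sha256sum` (GNU coreutils) is installed; the helper name `verify_sha256` and the file name in the usage line are hypothetical.

```shell
#!/usr/bin/env bash

# verify_sha256 FILE EXPECTED_HEX
# Prints OK and returns 0 if the file's SHA-256 matches EXPECTED_HEX,
# otherwise prints the actual digest to stderr and returns 1.
verify_sha256() {
  local file="$1" expected="$2" actual
  actual="$(sha256sum "$file" | awk '{print $1}')"
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file (got $actual)" >&2
    return 1
  fi
}

# Example call (illustrative only; substitute the real file and checksum):
# verify_sha256 ./veba.ova "<expected sha256 in lowercase hex>"
```

On macOS, `shasum -a 256` produces output in the same format and can be swapped in for `sha256sum`.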

rumart commented 4 months ago

Eh… I still can't deploy it, even with a new test version provided by @rguske.

On 26 Jun 2024, at 18:29, William Lam wrote: Closed #1176 as completed.

rguske commented 4 months ago

Issue still exists.

rguske commented 4 months ago

@rumart I've now added a sleep 30 to setup-05-knative.sh. I haven't found the problematic part yet. Could you give this version a try? DOWNLOAD.

Screenshot 2024-06-28 at 21 26 42

Thy

rumart commented 4 months ago

Now I'm able to deploy successfully. Tested several times without issues

rguske commented 4 months ago

Interesting! Thanks a lot for verifying, Rudi. However, I will try to narrow it down. There must be a different way. We'd really appreciate it if you'd be open to testing further builds. Thy :)

royiversen78 commented 3 months ago

First-time VEBA user eager to get this working, but I'm also experiencing this issue.

VEBA 0.8.0

vCenter 8.0.3

/var/log/bootstrap-debug.log

Error from server (InternalError): error when creating "/root/config/knative/rabbit.yaml": Internal error occurred: failed calling webhook "defaulting.webhook.rabbitmq.eventing.knative.dev": failed to call webhook: Post "https://rabbitmq-broker-webhook.knative-eventing.svc:443/defaulting?timeout=2s": dial tcp 10.105.248.31:443: connect: connection refused

rguske commented 3 months ago

Thanks for reporting it. Could you please try the version provided in this comment HERE? Thy

royiversen78 commented 3 months ago

That link doesn't work anymore. Google Drive says:

Sorry, the file you have requested does not exist.

Make sure that you have the correct URL and the file exists.

rguske commented 3 months ago

I will provide a new link in a bit. I was on vacation and back on the issue now. The issue looks similar to what is described here: https://cert-manager.io/docs/troubleshooting/webhook/

So, it looks to me that the Kubernetes API server is trying to call the rabbitmq-broker-webhook when we are installing the RabbitMQ cluster via kubectl apply -f ${RABBITMQ_CONFIG}.

This happens even though the following is included in our script, which should ensure that everything is in READY state:

kubectl wait --for=condition=available deploy/rabbitmq-broker-webhook --timeout=${KUBECTL_WAIT} -n knative-eventing
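One possible explanation, offered as an assumption rather than a confirmed root cause: a Deployment can report Available slightly before its Service has ready endpoints that are reachable through the cluster IP, and a call in that window would fail with "connection refused". A minimal sketch of an extra check, using the hypothetical helper name `service_has_endpoints` and the standard `kubectl get endpoints` query:

```shell
#!/usr/bin/env bash

# service_has_endpoints NAMESPACE SERVICE
# Returns 0 once the Service lists at least one ready endpoint address.
service_has_endpoints() {
  local namespace="$1" service="$2" ips
  ips="$(kubectl get endpoints "$service" -n "$namespace" \
    -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)" || return 1
  [ -n "$ips" ]
}

# Example use (illustrative only): poll before applying the RabbitMQ config.
# until service_has_endpoints knative-eventing rabbitmq-broker-webhook; do
#   sleep 2
# done
# kubectl apply -f "${RABBITMQ_CONFIG}"
```

Even with endpoints listed, the webhook process may need a moment before it accepts connections, so retrying the `kubectl apply` a few times on failure may still be worthwhile.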

rguske commented 3 months ago

@royiversen78 use this LINK temporarily.

royiversen78 commented 3 months ago

I'm getting the same issue with this version

Error from server (InternalError): error when creating "/root/config/knative/rabbit.yaml": Internal error occurred: failed calling webhook "defaulting.webhook.rabbitmq.eventing.knative.dev": failed to call webhook: Post "https://rabbitmq-broker-webhook.knative-eventing.svc:443/defaulting?timeout=2s": dial tcp 10.108.11.231:443: connect: connection refused

rguske commented 1 week ago

@rumart @royiversen78 we added a pause to the installation so that dependent services are available before later steps run. The changes just got merged: https://github.com/vmware-samples/vcenter-event-broker-appliance/pull/1268

If you'd like to test its functionality, please DM me (preferably on CNCF Slack) and I will provide you with a download link to the OVA. Thanks