osism / cloud-in-a-box

Cloud in a box
https://osism.github.io/docs/guides/deploy-guide/examples/cloud-in-a-box
Apache License 2.0

bootstrap failed: Wait for an healthy netbox service #317

Open scpcom opened 11 hours ago

scpcom commented 11 hours ago

CPU: 16 cores
RAM: 64GB
HDD: 1TB
NIC: 1
Network: VLAN with OPNsense providing DHCP and Internet, no other machines.

Today I tried to start a sandbox environment with this guide: https://docs.scs.community/docs/iaas/guides/other-guides/cloud-in-a-box/

First I tried the manual steps. Bootstrap stopped very early at:

bootstrap | PLAY [Make ssh pipelining working] *********************************************
bootstrap |
bootstrap | TASK [Do not require tty for all users] ****************************************
bootstrap | fatal: [manager.systems.in-a-box.cloud]: UNREACHABLE! => {"changed": false, "msg": "Invalid/incorrect password: Warning: Permanently added 'localhost' (ED25519) to the list of known hosts.\r\nPermission denied, please try again.", "unreachable": true}
bootstrap |
bootstrap | PLAY RECAP *********************************************************************
bootstrap | manager.systems.in-a-box.cloud : ok=0    changed=0    unreachable=1    failed=0    skipped=0    rescued=0    ignored=0

Then I started over again and downloaded and used the autoinstall image:

ubuntu-autoinstall-cloud-in-a-box-1.iso
Date: 23. Sep 21:55
Size: 2136942592
sha256: 9214910ae60e3119a1cf96287d7e55995d61a856d62bf48b5f13279789691816

Bootstrap ran for 20 minutes and stopped at:

bootstrap | RUNNING HANDLER [osism.services.netbox : Wait for an healthy netbox service] ***
bootstrap | FAILED - RETRYING: [manager.systems.in-a-box.cloud]: Wait for an healthy netbox service (60 retries left).
...
bootstrap | FAILED - RETRYING: [manager.systems.in-a-box.cloud]: Wait for an healthy netbox service (1 retries left).
bootstrap | fatal: [manager.systems.in-a-box.cloud]: FAILED! => {"attempts": 60, "changed": true, "cmd": "set -o pipefail\ndocker compose --project-directory /opt/netbox    ps --all --format json |    jq '. | select(.State==\"created\" or .State==\"exited\" or .Health==\"starting\" or .Health==\"unhealthy\") | .Name'\n", "delta": "0:00:00.104984", "end": "2024-11-27 22:32:27.926267", "msg": "", "rc": 0, "start": "2024-11-27 22:32:27.821283", "stderr": "", "stderr_lines": [], "stdout": "\"netbox-postgres-1\"\n\"netbox-redis-1\"", "stdout_lines": ["\"netbox-postgres-1\"", "\"netbox-redis-1\""]}
bootstrap |
bootstrap | PLAY RECAP *********************************************************************
bootstrap | manager.systems.in-a-box.cloud : ok=34   changed=17   unreachable=0    failed=1    skipped=11   rescued=0    ignored=0

I am able to log in as dragon via SSH.
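For readers following the log above: the failing handler runs `docker compose ps --all --format json` and pipes it through a jq filter that selects containers whose State is "created" or "exited", or whose Health is "starting" or "unhealthy". The task keeps retrying as long as that list is non-empty. Below is a minimal Python sketch of the same selection, assuming the command prints one JSON object per line (as the jq filter implies). The function name `not_ready` and the sample values are made up for illustration; they are not taken from the OSISM role or from the reporter's system.

```python
import json

# Values the handler's jq filter treats as "not ready".
BAD_STATES = {"created", "exited"}
BAD_HEALTH = {"starting", "unhealthy"}


def not_ready(ps_output: str) -> list:
    """Return the names of containers the jq filter would select.

    Assumes one JSON object per line in ps_output.
    """
    names = []
    for line in ps_output.splitlines():
        line = line.strip()
        if not line:
            continue
        container = json.loads(line)
        if (container.get("State") in BAD_STATES
                or container.get("Health") in BAD_HEALTH):
            names.append(container["Name"])
    return names


# Hypothetical sample input; the State/Health values are illustrative only.
sample = "\n".join([
    json.dumps({"Name": "netbox-netbox-1", "State": "running", "Health": "healthy"}),
    json.dumps({"Name": "netbox-postgres-1", "State": "running", "Health": "starting"}),
    json.dumps({"Name": "netbox-redis-1", "State": "running", "Health": "unhealthy"}),
])

for name in not_ready(sample):
    print(name)
```

An empty result means every container has left the "not ready" states. In the log above, the list still contained netbox-postgres-1 and netbox-redis-1 after the last attempt.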

garloff commented 3 hours ago

You can run deploy.sh in /opt/configuration to retry the deployment (and even edit the script to ignore a failed netbox; things work without it on a CiaB). Not a solution to the problem, obviously, but a way to proceed.

berendt commented 2 hours ago

Can you share the logs of the netbox-netbox-1 and netbox-postgres-1 containers?

docker logs netbox-netbox-1
docker logs netbox-postgres-1

scpcom commented 41 minutes ago

> You can run deploy.sh in /opt/configuration to retry the deployment (and even edit the script to ignore a failed netbox; things work without it on a CiaB). Not a solution to the problem, obviously, but a way to proceed.

This fails with "/opt/configuration/deploy.sh: line 19: osism: command not found". Since I want to test SCS "as-is", I would prefer not to modify any script at the moment.

> Can you share the logs of the netbox-netbox-1 and netbox-postgres-1 containers?
>
> docker logs netbox-netbox-1
> docker logs netbox-postgres-1

Attached:
docker logs netbox-netbox-1.txt
docker logs netbox-postgres-1.txt

berendt commented 15 minutes ago

It seems that the local storage of your test node is slow. On slow local storage the Netbox initialisation takes a pretty long time. I will increase the timeout in the Cloud in a Box configuration repository.
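To illustrate what "increasing the timeout" means for a retry-based health check like the one in the log: the total wait is roughly the number of retries times the delay between them, so raising either value gives slow storage more time. The following is only a sketch of that polling pattern, not the actual change to the configuration repository. The function `wait_until_healthy` is hypothetical, the retry count of 60 comes from the log, and the 5-second delay is an assumption.

```python
import time
from typing import Callable, List


def wait_until_healthy(
    check: Callable[[], List[str]],
    retries: int = 60,        # 60 attempts, as seen in the log
    delay: float = 5.0,       # assumed value; the real delay is not shown in the thread
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it returns no unready containers or retries run out."""
    for attempt in range(1, retries + 1):
        if not check():
            return True
        if attempt < retries:
            sleep(delay)  # wait before the next attempt
    return False


# Simulated slow start: three polls report an unready container, the fourth is clean.
pending = [["netbox-postgres-1"], ["netbox-postgres-1"], ["netbox-postgres-1"], []]
slept: List[float] = []  # record delays instead of actually sleeping
ok = wait_until_healthy(lambda: pending.pop(0), retries=60, delay=5.0, sleep=slept.append)
print(ok, sum(slept))
```

With these assumed defaults the check gives up after about 60 × 5 s = 5 minutes of waiting; doubling `retries` or `delay` would double that budget.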