Root cause is actually likely to do with Helm and how it handles Jobs...
While you can avoid the timeout error by adding --timeout 900 or the like to the helm install command (which sets a timeout of that many seconds per step, rather than for the overall deployment; the default is 300), you'll still be blocked from returning to the command line until the install completes. This seems to be the expected behaviour of Helm, though...
Also it's documented here (including a suggestion on monitoring the "actual" status): https://github.com/odpi/egeria/tree/master/open-metadata-resources/open-metadata-deployment#deploying-the-demonstration
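For reference, while helm install blocks, the "actual" status can be watched from a second shell with plain kubectl (assuming the egeria namespace used in the output further down):

$ kubectl get pods,jobs -n egeria -w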
Yes, I think it's because of the post-install hook, which kicks off the jobs and waits for their completion. It makes some sense, but may be a usability concern. I also don't know how it might affect installing a helm chart from within a catalog on some cloud platforms.
Not urgent, but will leave this open as we consider alternatives.
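For context, a Helm post-install hook Job looks roughly like the sketch below, and Helm blocks until such a Job completes. This is illustrative only -- the name, image, and command are placeholders, not the actual vdc chart templates:

apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-post-install-config   # placeholder name
  annotations:
    "helm.sh/hook": post-install
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: configure
          image: curlimages/curl                  # placeholder image
          # placeholder: poll a service until it responds
          command: ["sh", "-c", "until curl -sf http://egeria-service:8080; do sleep 5; done"]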
One suggestion I received was to use an initialization container instead (the original plan), ensuring idempotency - though jobs do seem a neater approach, and are explicitly run once, in sequence, which is nice. Stateful sets also offer richer control of ordering, but I worry the binding is too tight.
Needs more research into other options, so leaving the issue open for now.
Perhaps a hybrid solution might be possible.
initContainers are not really an option -- the normal (non-init) container is blocked from starting until all initContainers have completed: so you'd be trying to configure something through the initContainer that isn't yet running as an actual container... (Unless you're talking about a series of initContainers on some "dummy" Pod that does nothing but run through a set of initContainers?)
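That "dummy" Pod pattern would look something like the following -- purely illustrative; the names, images, and endpoints are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: vdc-config-runner             # placeholder name
spec:
  restartPolicy: Never
  initContainers:                     # run strictly in order, each to completion
    - name: wait-for-egeria
      image: curlimages/curl          # placeholder image
      command: ["sh", "-c", "until curl -sf http://egeria-service:8080; do sleep 5; done"]
    - name: configure-egeria
      image: curlimages/curl
      # placeholder configuration call
      command: ["sh", "-c", "curl -sf -X POST http://egeria-service:8080/open-metadata/..."]
  containers:
    - name: done                      # the "real" container does nothing
      image: busybox
      command: ["true"]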
Yes, that was the suggestion on one of the helm chats, but I'm unconvinced. Will discuss with some colleagues in our local teams with more experience - no need to rush into a change, but it's something we should address if we can (delay, then timeout error, reported as failed - even though it does initialize, and kubectl status looks good).
As a point of reference, I just had helm complete successfully - for the first time, on cloud:
➜  charts git:(helm32) helm install vdc -f ~/cloud.yaml
NAME:   yellow-quail
LAST DEPLOYED: Thu Apr 18 08:09:32 2019
NAMESPACE: egeria
STATUS: DEPLOYED

RESOURCES:
==> v1/ConfigMap
NAME                        DATA  AGE
yellow-quail-openldap-env   6     5m43s
yellow-quail-vdc-configmap  21    5m43s

==> v1/Deployment
NAME                               READY  UP-TO-DATE  AVAILABLE  AGE
yellow-quail-vdc-atlas             1/1    1           1          5m43s
yellow-quail-vdc-egeria            1/1    1           1          5m43s
yellow-quail-vdc-gaian-deployment  1/1    1           1          5m43s
yellow-quail-vdc-ibm-igc           1/1    1           1          5m43s
yellow-quail-vdc-omrsmonitor       1/1    1           1          5m43s
yellow-quail-vdc-postgresql        1/1    1           1          5m43s
yellow-quail-vdc-rangeradmin       1/1    1           1          5m43s
yellow-quail-vdc-ui                1/1    1           1          5m43s

==> v1/Pod(related)
NAME                                                READY  STATUS   RESTARTS  AGE
yellow-quail-openldap-674b4db598-xq92c              1/1    Running  0         5m43s
yellow-quail-vdc-atlas-5b8f558946-ls45b             1/1    Running  0         5m43s
yellow-quail-vdc-egeria-54fcbfbbfb-gwqkp            1/1    Running  0         5m43s
yellow-quail-vdc-gaian-deployment-7dfdd86956-zllxg  1/1    Running  0         5m43s
yellow-quail-vdc-ibm-igc-66897d944b-gd8vb           1/1    Running  0         5m43s
yellow-quail-vdc-omrsmonitor-7769cdd46b-kqzsx       1/1    Running  0         5m43s
yellow-quail-vdc-postgresql-cdc5964b6-btmrp         1/1    Running  0         5m43s
yellow-quail-vdc-rangeradmin-6f5d58578c-bt4tm       2/2    Running  0         5m42s
yellow-quail-vdc-ui-7b78b49ddb-pnjgf                1/1    Running  0         5m42s

==> v1/Role
NAME                       AGE
yellow-quail-vdc-api-role  5m43s

==> v1/RoleBinding
NAME                               AGE
yellow-quail-vdc-api-role-binding  5m43s

==> v1/Secret
NAME                   TYPE    DATA  AGE
yellow-quail-openldap  Opaque  2     5m43s

==> v1/Service
NAME                                  TYPE       CLUSTER-IP      EXTERNAL-IP  PORT(S)                                                                      AGE
yellow-quail-openldap                 ClusterIP  172.21.219.139  <none>       389/TCP,636/TCP                                                              5m43s
yellow-quail-vdc-atlas-service        NodePort   172.21.195.234  <none>       21000:31000/TCP                                                              5m43s
yellow-quail-vdc-egeria-service       NodePort   172.21.167.42   <none>       8080:30080/TCP                                                               5m43s
yellow-quail-vdc-gaian-service        NodePort   172.21.164.7    <none>       6414:30414/TCP                                                               5m43s
yellow-quail-vdc-ibm-igc-service      NodePort   172.21.119.180  <none>       8080:30081/TCP                                                               5m43s
yellow-quail-vdc-omrsmonitor-service  NodePort   172.21.31.173   <none>       58080:31080/TCP                                                              5m43s
yellow-quail-vdc-postgresql-service   NodePort   172.21.151.152  <none>       5432:30432/TCP                                                               5m43s
yellow-quail-vdc-ranger-service       NodePort   172.21.43.204   <none>       6080:32080/TCP,6182:30182/TCP,6083:32299/TCP,6183:31210/TCP,3306:31763/TCP  5m43s
yellow-quail-vdc-ui-service           NodePort   172.21.101.218  <none>       8443:30443/TCP                                                               5m43s

==> v1beta2/Deployment
NAME                   READY  UP-TO-DATE  AVAILABLE  AGE
yellow-quail-openldap  1/1    1           1          5m43s
Worth mentioning, as this is 'just' a timeout. If only helm were a little more flexible in this area...
Have you tried this?
$ helm install vdc -f ~/cloud.yaml --timeout 900
Helm does support overriding the timeout; you just need to tell it what value to override it with 😉
I think, from prior experiments/reading, that flag governs how long Helm waits for individual commands to complete, not how long it waits for jobs etc. to complete. There have been some discussions in that area, some changes, and additional timeouts in the specs themselves that affect this behaviour.
Correct, it's about timeouts of individual steps (which jobs are part of) rather than the deployment in its entirety -- but that just means it will only time out if an individual step takes longer than that value(?). So you could in theory set it to a lower value and still have it work despite the overall deployment taking a long time (with each step completing in under the timeout)... Adding it works for me 🤷‍♂️
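To make the per-step semantics concrete (a worked illustration of the behaviour described above, not a measured result): with the default --timeout 300, ten steps of 2 minutes each would all pass even though the whole install takes ~20 minutes, while a single job needing 6 minutes would fail. Raising the ceiling covers the slow step:

$ helm install vdc -f ~/cloud.yaml --timeout 900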
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.
The VDC environment will need significant rework to adapt to the many changes made in Egeria to better support metadata integration. Additionally, the helm charts used for the lab and a simple base config have evolved to better support different types of services, exposing of ports, persistent storage, etc.
As such, specific incremental changes to the current - now old - charts do not really add value, so closing for now.
Initial versions of the vdc chart returned quickly, though ongoing configuration continued in the background.
The current chart can take a long time to return at the command line, which is confusing. This is probably whilst waiting for initialization, and could relate to a misconfiguration.
Around 6 minutes in my test on IBM Cloud using internal kafka (cp):
➜  charts git:(helm22) ✗ helm install vdc -f ~/cloud.yaml
2019/03/07 15:02:12 Warning: Building values map for chart 'cp-kafka'. Skipped value (map[]) for 'image', as it is not a table.
Error: timed out waiting for the condition
Need to consider whether this is reasonable, and potentially look at configurable timeouts. Slow kafka initialization is likely the root cause.
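One option for "configurable timeouts" at the chart level (a sketch only -- the values key below is hypothetical, not something the current chart exposes) would be a values-driven deadline on the hook Job itself, using the Kubernetes-native activeDeadlineSeconds field:

spec:
  # per-Job timeout in seconds; .Values.jobs.timeoutSeconds is a hypothetical values key
  activeDeadlineSeconds: {{ .Values.jobs.timeoutSeconds | default 900 }}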