odpi / egeria-samples

Various samples that can be useful either for learning or as initial starting points for working with Egeria
Apache License 2.0

vdc helm chart install hangs for several minutes before timing out. #14

Closed · planetf1 closed this issue 3 years ago

planetf1 commented 5 years ago

Initial versions of the vdc chart returned quickly, though ongoing configuration continued in the background.

The current chart can take a long time to return at the command line, which is confusing. This is probably whilst waiting for initialization and could relate to a misconfiguration.

It took around 6 minutes in my test on IBM Cloud using internal Kafka (cp):

➜ charts git:(helm22) ✗ helm install vdc -f ~/cloud.yaml
2019/03/07 15:02:12 Warning: Building values map for chart 'cp-kafka'. Skipped value (map[]) for 'image', as it is not a table.
Error: timed out waiting for the condition

We need to consider whether this is reasonable, and potentially look at configurable timeouts. Slow Kafka initialization is likely the root cause.

cmgrote commented 5 years ago

Root cause is actually more likely to do with Helm and how it handles Jobs...

While you can avoid the timeout causing an error by adding a --timeout 900 or the like to the helm install command (this sets a timeout of that many seconds per step, rather than for the overall deployment; the default is 300), you'll still be blocked from returning to the command line until the install command completes. This seems to be the expected behaviour of Helm, though...

It's also documented here (including a suggestion on monitoring the "actual" status): https://github.com/odpi/egeria/tree/master/open-metadata-resources/open-metadata-deployment#deploying-the-demonstration
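
While helm is blocked, the actual progress can be followed from a second terminal with standard kubectl commands, for example `kubectl get pods -n egeria -w` to watch the pods come up and `kubectl get jobs -n egeria` for the configuration jobs (the egeria namespace here is taken from the deployment shown later in this thread).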

planetf1 commented 5 years ago

Yes, I think it's because of the post-install hook, which kicks off the jobs and waits for their completion. It makes some sense, but may be a usability concern. I also don't know how it might affect installing a helm chart from within a catalog on some cloud platforms.
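
For anyone following along, the pattern in question is a Job carrying Helm's hook annotations - Helm treats it as part of the install step and waits for it to succeed before returning. A minimal sketch (the name, image and command are illustrative, not the chart's actual values):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: vdc-post-install-config   # hypothetical name
  annotations:
    # Helm's documented hook annotations: run this Job after install
    # and delete it once it succeeds. Helm waits on hook resources
    # like this one before `helm install` returns.
    "helm.sh/hook": post-install
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: configure
          image: alpine:3.9   # illustrative; the real chart runs its own configuration scripts
          command: ["sh", "-c", "echo configuring Egeria servers"]
```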

Not urgent, but will leave this open as we consider alternatives.

One suggestion I received was to use an initialization container instead (the original plan), ensuring idempotency - though jobs do seem a neater approach, and are explicitly run once, in sequence. Nice. StatefulSets also offer richer control of ordering (see the sketch below), but I worry the binding is too tight.
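
For reference, the ordering control referred to is the StatefulSet pod management policy - with the default OrderedReady, pod N+1 only starts once pod N is Running and Ready. A minimal, purely illustrative sketch (not taken from the vdc chart):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ordered-demo   # hypothetical
spec:
  serviceName: ordered-demo
  replicas: 3
  # OrderedReady is the default policy: pod 1 starts only after pod 0
  # is Running and Ready, pod 2 after pod 1, and so on.
  podManagementPolicy: OrderedReady
  selector:
    matchLabels:
      app: ordered-demo
  template:
    metadata:
      labels:
        app: ordered-demo
    spec:
      containers:
        - name: server
          image: nginx:1.15
```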

Needs more research into other options, so leaving the issue open for now.

Perhaps a hybrid solution might be

cmgrote commented 5 years ago

initContainers are not really an option - the normal (non-init) containers are blocked from starting until all initContainers have completed, so you'd be trying to configure something through the initContainer that isn't yet running as an actual container... (Unless you're talking about a series of initContainers on some "dummy" Pod that does nothing but run through a set of initContainers?)
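
To illustrate the ordering constraint with a minimal, hypothetical Pod (names and images are illustrative): every initContainer must run to completion, in sequence, before any of the Pod's main containers start:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: init-ordering-demo   # hypothetical
spec:
  initContainers:
    # Runs to completion before ANY container in 'containers' starts,
    # so it cannot configure the 'server' container - nothing in this
    # Pod is serving yet while this step runs.
    - name: configure
      image: alpine:3.9
      command: ["sh", "-c", "echo server container has not started yet"]
  containers:
    - name: server
      image: nginx:1.15
```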

planetf1 commented 5 years ago

Yes, that was the suggestion on one of the Helm chats, but I'm unconvinced. I'll discuss with some colleagues in our local teams with more experience - no need to rush into a change, but it's something we should address if we can (delay, timeout error, reported as failed - though the deployment does initialize, and kubectl status looks good).

planetf1 commented 5 years ago

As a point of reference, I just had helm complete successfully - for the first time, on cloud:

➜  charts git:(helm32) helm install vdc -f ~/cloud.yaml
NAME:   yellow-quail
LAST DEPLOYED: Thu Apr 18 08:09:32 2019
NAMESPACE: egeria
STATUS: DEPLOYED

RESOURCES:
==> v1/ConfigMap
NAME                        DATA  AGE
yellow-quail-openldap-env   6     5m43s
yellow-quail-vdc-configmap  21    5m43s

==> v1/Deployment
NAME                               READY  UP-TO-DATE  AVAILABLE  AGE
yellow-quail-vdc-atlas             1/1    1           1          5m43s
yellow-quail-vdc-egeria            1/1    1           1          5m43s
yellow-quail-vdc-gaian-deployment  1/1    1           1          5m43s
yellow-quail-vdc-ibm-igc           1/1    1           1          5m43s
yellow-quail-vdc-omrsmonitor       1/1    1           1          5m43s
yellow-quail-vdc-postgresql        1/1    1           1          5m43s
yellow-quail-vdc-rangeradmin       1/1    1           1          5m43s
yellow-quail-vdc-ui                1/1    1           1          5m43s

==> v1/Pod(related)
NAME                                                READY  STATUS   RESTARTS  AGE
yellow-quail-openldap-674b4db598-xq92c              1/1    Running  0         5m43s
yellow-quail-vdc-atlas-5b8f558946-ls45b             1/1    Running  0         5m43s
yellow-quail-vdc-egeria-54fcbfbbfb-gwqkp            1/1    Running  0         5m43s
yellow-quail-vdc-gaian-deployment-7dfdd86956-zllxg  1/1    Running  0         5m43s
yellow-quail-vdc-ibm-igc-66897d944b-gd8vb           1/1    Running  0         5m43s
yellow-quail-vdc-omrsmonitor-7769cdd46b-kqzsx       1/1    Running  0         5m43s
yellow-quail-vdc-postgresql-cdc5964b6-btmrp         1/1    Running  0         5m43s
yellow-quail-vdc-rangeradmin-6f5d58578c-bt4tm       2/2    Running  0         5m42s
yellow-quail-vdc-ui-7b78b49ddb-pnjgf                1/1    Running  0         5m42s

==> v1/Role
NAME                       AGE
yellow-quail-vdc-api-role  5m43s

==> v1/RoleBinding
NAME                               AGE
yellow-quail-vdc-api-role-binding  5m43s

==> v1/Secret
NAME                   TYPE    DATA  AGE
yellow-quail-openldap  Opaque  2     5m43s

==> v1/Service
NAME                                  TYPE       CLUSTER-IP      EXTERNAL-IP  PORT(S)                                                                     AGE
yellow-quail-openldap                 ClusterIP  172.21.219.139  <none>       389/TCP,636/TCP                                                             5m43s
yellow-quail-vdc-atlas-service        NodePort   172.21.195.234  <none>       21000:31000/TCP                                                             5m43s
yellow-quail-vdc-egeria-service       NodePort   172.21.167.42   <none>       8080:30080/TCP                                                              5m43s
yellow-quail-vdc-gaian-service        NodePort   172.21.164.7    <none>       6414:30414/TCP                                                              5m43s
yellow-quail-vdc-ibm-igc-service      NodePort   172.21.119.180  <none>       8080:30081/TCP                                                              5m43s
yellow-quail-vdc-omrsmonitor-service  NodePort   172.21.31.173   <none>       58080:31080/TCP                                                             5m43s
yellow-quail-vdc-postgresql-service   NodePort   172.21.151.152  <none>       5432:30432/TCP                                                              5m43s
yellow-quail-vdc-ranger-service       NodePort   172.21.43.204   <none>       6080:32080/TCP,6182:30182/TCP,6083:32299/TCP,6183:31210/TCP,3306:31763/TCP  5m43s
yellow-quail-vdc-ui-service           NodePort   172.21.101.218  <none>       8443:30443/TCP                                                              5m43s

==> v1beta2/Deployment
NAME                   READY  UP-TO-DATE  AVAILABLE  AGE
yellow-quail-openldap  1/1    1           1          5m43s

Worth mentioning, as this shows the earlier failure is 'just' a timeout. If only helm were a little more flexible in this area...

cmgrote commented 5 years ago

Have you tried this?

$ helm install vdc -f ~/cloud.yaml --timeout 900

Helm does support overriding the timeout; you just need to tell it what value to override it with 😉

planetf1 commented 5 years ago

I think, from prior experiments/reading, that that flag governs how long helm waits for individual commands to complete, not how long it waits for jobs etc. to complete. There have been some discussions in that area, some changes, and additional timeouts in the specs themselves that affect this behaviour.

cmgrote commented 5 years ago

Correct - it's about timeouts of individual steps (which jobs are part of) rather than the deployment in its entirety. But that just means it will only time out if an individual step takes longer than that value, so you could in theory set it to a lower value and still have the install succeed despite the overall deployment taking a long time (as long as each step completes within the timeout)... Adding it works for me 🤷‍♂️
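
To make that concrete (illustrative numbers only): with --timeout 600, a deployment made up of ten steps that each take five minutes would finish after roughly fifty minutes without ever timing out, because no single step exceeds the 600-second limit.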

github-actions[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.


planetf1 commented 3 years ago

The VDC environment will need significant rework to adapt to the many changes made in Egeria to better support metadata integration. Additionally, the helm charts used for the lab and for a simple base config have evolved to better support different types of services, exposing of ports, persistent storage, etc.

As such, specific incremental changes to the current - now old - charts do not really add value, so closing for now.