linode / terraform-linode-dcos

[WORK-IN-PROGRESS] DC/OS Provisioning Terraform module for Linode
https://registry.terraform.io/modules/linode/dcos/linode/

journalctl shows failures on nearly every node during or after installer.sh #1

Closed: displague closed this issue 5 years ago

displague commented 5 years ago

Most of the errors reference the private IP of the bootstrap node, yet the bootstrap node itself seems clean of warnings :-/

Jan 09 21:49:44 linode-dcos-master-00 bootstrap[25577]: [INFO] Clearing proxy environment variables
Jan 09 21:49:44 linode-dcos-master-00 bootstrap[25577]: [INFO] No zk.pid last mtime found at /var/lib/dcos/bootstrap/exhibitor_pid_stat
Jan 09 21:49:44 linode-dcos-master-00 bootstrap[25577]: [INFO] Shortcut failed, waiting for exhibitor to bring up zookeeper and stabilize
Jan 09 21:49:44 linode-dcos-master-00 bootstrap[25577]: [INFO] Expected cluster size: 3
Jan 09 21:49:44 linode-dcos-master-00 bootstrap[25577]: [INFO] Waiting for ZooKeeper cluster to stabilize
Jan 09 21:49:44 linode-dcos-master-00 bootstrap[25577]: [INFO] Serving hosts: `192.168.175.169`, leader: `192.168.175.169`
Jan 09 21:49:44 linode-dcos-master-00 bootstrap[25577]: Traceback (most recent call last):
Jan 09 21:49:44 linode-dcos-master-00 bootstrap[25577]:   File "/opt/mesosphere/bin/bootstrap", line 11, in <module>
Jan 09 21:49:44 linode-dcos-master-00 bootstrap[25577]:     load_entry_point('dcos-internal-utils==0.0.1', 'console_scripts', 'bootstrap')()
Jan 09 21:49:44 linode-dcos-master-00 bootstrap[25577]:   File "/opt/mesosphere/lib/python3.6/site-packages/dcos_internal_utils/cli.py", line 106, in main
Jan 09 21:49:44 linode-dcos-master-00 bootstrap[25577]:     exhibitor.wait(opts.master_count)
Jan 09 21:49:44 linode-dcos-master-00 bootstrap[25577]:   File "/opt/mesosphere/lib/python3.6/site-packages/dcos_internal_utils/exhibitor.py", line 113, in wait
Jan 09 21:49:44 linode-dcos-master-00 bootstrap[25577]:     raise Exception(msg_fmt.format(cluster_size, len(serving), len(leaders)))
Jan 09 21:49:44 linode-dcos-master-00 bootstrap[25577]: Exception: Expected 3 servers and 1 leader, got 1 servers and 1 leaders
Jan 09 21:49:45 linode-dcos-master-00 systemd[1]: dcos-net.service: Control process exited, code=exited status=1
Jan 09 21:49:45 linode-dcos-master-00 systemd[1]: dcos-net.service: Failed with result 'exit-code'.
Jan 09 21:49:45 linode-dcos-master-00 systemd[1]: Failed to start DC/OS Net: A distributed systems & network overlay orchestration engine.
Jan 09 21:49:45 linode-dcos-master-00 systemd[1]: dcos-oauth.service: Service hold-off time over, scheduling restart.
Jan 09 21:49:45 linode-dcos-master-00 systemd[1]: dcos-oauth.service: Scheduled restart job, restart counter is at 233.
Jan 09 21:49:45 linode-dcos-master-00 systemd[1]: Stopped DC/OS Authentication (OAuth): authenticates DC/OS users using OpenID Connect and Auth0.
Jan 09 21:49:45 linode-dcos-master-00 systemd[1]: Starting DC/OS Authentication (OAuth): authenticates DC/OS users using OpenID Connect and Auth0...
Jan 09 21:49:45 linode-dcos-master-00 systemd[1]: Starting Generate resolv.conf: configures network name resolution...
Jan 09 21:49:45 linode-dcos-master-00 bootstrap[25586]: [INFO] Clearing proxy environment variables
Jan 09 21:49:45 linode-dcos-master-00 bootstrap[25586]: [INFO] No zk.pid last mtime found at /var/lib/dcos/bootstrap/exhibitor_pid_stat
Jan 09 21:49:45 linode-dcos-master-00 bootstrap[25586]: [INFO] Shortcut failed, waiting for exhibitor to bring up zookeeper and stabilize
Jan 09 21:49:45 linode-dcos-master-00 bootstrap[25586]: [INFO] Expected cluster size: 3
Jan 09 21:49:45 linode-dcos-master-00 bootstrap[25586]: [INFO] Waiting for ZooKeeper cluster to stabilize
Jan 09 21:49:45 linode-dcos-master-00 bootstrap[25586]: [INFO] Serving hosts: `192.168.175.169`, leader: `192.168.175.169`
Jan 09 21:49:45 linode-dcos-master-00 bootstrap[25586]: Traceback (most recent call last):
Jan 09 21:49:45 linode-dcos-master-00 bootstrap[25586]:   File "/opt/mesosphere/bin/bootstrap", line 11, in <module>
Jan 09 21:49:45 linode-dcos-master-00 bootstrap[25586]:     load_entry_point('dcos-internal-utils==0.0.1', 'console_scripts', 'bootstrap')()
Jan 09 21:49:45 linode-dcos-master-00 bootstrap[25586]:   File "/opt/mesosphere/lib/python3.6/site-packages/dcos_internal_utils/cli.py", line 106, in main
Jan 09 21:49:45 linode-dcos-master-00 bootstrap[25586]:     exhibitor.wait(opts.master_count)
Jan 09 21:49:45 linode-dcos-master-00 bootstrap[25586]:   File "/opt/mesosphere/lib/python3.6/site-packages/dcos_internal_utils/exhibitor.py", line 113, in wait
Jan 09 21:49:45 linode-dcos-master-00 bootstrap[25586]:     raise Exception(msg_fmt.format(cluster_size, len(serving), len(leaders)))
Jan 09 21:49:45 linode-dcos-master-00 bootstrap[25586]: Exception: Expected 3 servers and 1 leader, got 1 servers and 1 leaders
Jan 09 21:49:45 linode-dcos-master-00 systemd[1]: dcos-diagnostics.service: Service hold-off time over, scheduling restart.
Jan 09 21:49:45 linode-dcos-master-00 systemd[1]: dcos-diagnostics.service: Scheduled restart job, restart counter is at 233.
Jan 09 21:49:45 linode-dcos-master-00 systemd[1]: dcos-metrics-master.service: Service hold-off time over, scheduling restart.
Jan 09 21:49:45 linode-dcos-master-00 systemd[1]: dcos-metrics-master.service: Scheduled restart job, restart counter is at 233.
Jan 09 21:49:45 linode-dcos-master-00 systemd[1]: dcos-telegraf.service: Service hold-off time over, scheduling restart.
Jan 09 21:49:45 linode-dcos-master-00 systemd[1]: dcos-telegraf.service: Scheduled restart job, restart counter is at 233.
Jan 09 21:49:45 linode-dcos-master-00 systemd[1]: Stopped Telegraf: collects and reports metrics.
Jan 09 21:49:45 linode-dcos-master-00 systemd[1]: Starting Telegraf: collects and reports metrics...
Jan 09 21:49:45 linode-dcos-master-00 systemd[1]: Stopped DC/OS Metrics Master: exposes node metrics.
Jan 09 21:49:45 linode-dcos-master-00 systemd[1]: Starting DC/OS Metrics Master: exposes node metrics...
Jan 09 21:49:45 linode-dcos-master-00 systemd[1]: Stopped DC/OS Diagnostics Master: aggregates and exposes component health.
Jan 09 21:49:45 linode-dcos-master-00 systemd[1]: Starting DC/OS Diagnostics Master: aggregates and exposes component health...
Jan 09 21:49:45 linode-dcos-master-00 systemd[1]: dcos-oauth.service: Control process exited, code=exited status=1
Jan 09 21:49:45 linode-dcos-master-00 systemd[1]: dcos-oauth.service: Failed with result 'exit-code'.
Jan 09 21:53:33 linode-dcos-public-agent-00 systemd[1]: dcos-net.service: Main process exited, code=exited, status=1/FAILURE
Jan 09 21:53:33 linode-dcos-public-agent-00 systemd[1]: dcos-net.service: Failed with result 'exit-code'.
Jan 09 21:53:35 linode-dcos-public-agent-00 systemd[1]: dcos-mesos-slave-public.service: Service hold-off time over, scheduling restart.
Jan 09 21:53:35 linode-dcos-public-agent-00 systemd[1]: dcos-mesos-slave-public.service: Scheduled restart job, restart counter is at 295.
Jan 09 21:53:35 linode-dcos-public-agent-00 systemd[1]: Stopped Mesos Agent Public: distributed systems kernel public agent.
Jan 09 21:53:35 linode-dcos-public-agent-00 systemd[1]: Starting Mesos Agent Public: distributed systems kernel public agent...
Jan 09 21:53:35 linode-dcos-public-agent-00 mesos-agent[11227]: ping: unknown host ready.spartan
Jan 09 21:53:35 linode-dcos-public-agent-00 systemd[1]: dcos-mesos-slave-public.service: Control process exited, code=exited status=2
Jan 09 21:53:35 linode-dcos-public-agent-00 systemd[1]: dcos-mesos-slave-public.service: Failed with result 'exit-code'.
Jan 09 21:53:35 linode-dcos-public-agent-00 systemd[1]: Failed to start Mesos Agent Public: distributed systems kernel public agent.
Jan 09 21:53:35 linode-dcos-public-agent-00 systemd[1]: Started OpenSSH per-connection server daemon (172.104.2.4:56881).
Jan 09 21:53:35 linode-dcos-public-agent-00 sshd[11229]: Accepted publickey for core from 172.104.2.4 port 56881 ssh2: RSA SHA256:EJVZYwd79ydCK/ezDALjkz4co1ofNsj9+wEPJrWNqgY
Jan 09 21:53:35 linode-dcos-public-agent-00 sshd[11229]: pam_unix(sshd:session): session opened for user core by (uid=0)
Jan 09 21:53:35 linode-dcos-public-agent-00 systemd-logind[796]: New session 7 of user core.
Jan 09 21:53:35 linode-dcos-public-agent-00 systemd[1]: Started Session 7 of user core.
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd[1]: dcos-adminrouter-agent.service: Service hold-off time over, scheduling restart.
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd[1]: dcos-adminrouter-agent.service: Scheduled restart job, restart counter is at 294.
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd[1]: dcos-net.service: Service hold-off time over, scheduling restart.
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd[1]: dcos-net.service: Scheduled restart job, restart counter is at 185.
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd[1]: Stopped DC/OS Net: A distributed systems & network overlay orchestration engine.
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd[1]: Starting DC/OS Net: A distributed systems & network overlay orchestration engine...
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd[1]: Stopped Admin Router Agent: exposes a unified control plane proxy for components and services using NGINX.
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd[1]: Starting Admin Router Agent: exposes a unified control plane proxy for components and services using NGINX...
Jan 09 21:53:38 linode-dcos-public-agent-00 check-time[11243]: Checking whether time is synchronized using the kernel adjtimex API.
Jan 09 21:53:38 linode-dcos-public-agent-00 check-time[11243]: Time can be synchronized via most popular mechanisms (ntpd, chrony, systemd-timesyncd, etc.)
Jan 09 21:53:38 linode-dcos-public-agent-00 check-time[11243]: Time is in sync!
Jan 09 21:53:38 linode-dcos-public-agent-00 ping[11244]: ping: unknown host ready.spartan
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd[1]: dcos-adminrouter-agent.service: Control process exited, code=exited status=2
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd[1]: dcos-adminrouter-agent.service: Failed with result 'exit-code'.
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd[1]: Failed to start Admin Router Agent: exposes a unified control plane proxy for components and services using NGINX.
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd[1]: Stopped Wait for Network to be Configured.
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd[1]: Stopping Wait for Network to be Configured...
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd[1]: Stopping Network Service...
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd[1]: Stopped Network Service.
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd[1]: Starting Network Service...
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd-networkd[11271]: spartan: Gained IPv6LL
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd-networkd[11271]: minuteman: Gained IPv6LL
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd-networkd[11271]: eth0: Gained IPv6LL
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd-networkd[11271]: Enumeration completed
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd[1]: Started Network Service.
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd[1]: Starting Wait for Network to be Configured...
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd-networkd-wait-online[11272]: ignoring: lo
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd-networkd[11271]: spartan: Link is not managed by us
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd-networkd[11271]: lo: Link is not managed by us
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd-networkd[11271]: docker0: Link is not managed by us
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd-networkd[11271]: minuteman: Link is not managed by us
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd[1]: Started Wait for Network to be Configured.
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd-networkd[11271]: lo: Configured
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd-networkd[11271]: eth0: DHCPv4 address 45.79.184.248/24 via 45.79.184.1
Jan 09 21:53:38 linode-dcos-public-agent-00 systemd-networkd[11271]: eth0: Configured
Jan 09 21:53:38 linode-dcos-public-agent-00 dcos-net-setup.py[11275]: RTNETLINK answers: File exists
Jan 09 21:53:39 linode-dcos-public-agent-00 dcos-net-setup.py[11282]: RTNETLINK answers: File exists
Jan 09 21:53:39 linode-dcos-public-agent-00 dcos-net-setup.py[11290]: RTNETLINK answers: File exists
Jan 09 21:53:39 linode-dcos-public-agent-00 dcos-net-setup.py[11293]: RTNETLINK answers: File exists
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[10603]: [INFO] Unlocked fd 4
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[10603]: [INFO] Closing /var/lib/dcos with fd 4
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[10603]: Traceback (most recent call last):
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[10603]:   File "/opt/mesosphere/bin/bootstrap", line 11, in <module>
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[10603]:     load_entry_point('dcos-internal-utils==0.0.1', 'console_scripts', 'bootstrap')()
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[10603]:   File "/opt/mesosphere/lib/python3.6/site-packages/dcos_internal_utils/cli.py", line 116, in main
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[10603]:     bootstrappers[service](b, opts)
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[10603]:   File "/opt/mesosphere/lib/python3.6/site-packages/dcos_internal_utils/cli.py", line 23, in wrapper
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[10603]:     fun(b, opts)
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[10603]:   File "/opt/mesosphere/lib/python3.6/site-packages/dcos_internal_utils/cli.py", line 54, in dcos_telegraf_agent
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[10603]:     b.cluster_id('/var/lib/dcos/cluster-id', readonly=True)
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[10603]:   File "/opt/mesosphere/lib/python3.6/site-packages/dcos_internal_utils/bootstrap.py", line 66, in cluster_id
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[10603]:     zkid = self._consensus('/cluster-id', zkid, ANYONE_READ)
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[10603]:   File "/opt/mesosphere/lib/python3.6/site-packages/dcos_internal_utils/bootstrap.py", line 105, in _consensus
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[10603]:     self.zk.sync(path)
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[10603]:   File "/opt/mesosphere/lib/python3.6/site-packages/dcos_internal_utils/bootstrap.py", line 41, in zk
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[10603]:     self._zk.start()
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[10603]:   File "/opt/mesosphere/lib/python3.6/site-packages/kazoo/client.py", line 567, in start
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[10603]:     raise self.handler.timeout_exception("Connection time-out")
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[10603]: kazoo.handlers.threading.KazooTimeoutError: Connection time-out
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[11019]: [INFO] Locked fd 4
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[11019]: [WARNING] Cannot resolve zk-1.zk: [Errno -2] Name or service not known
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[11019]: [WARNING] Cannot resolve zk-2.zk: [Errno -2] Name or service not known
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[11019]: [WARNING] Cannot resolve zk-3.zk: [Errno -2] Name or service not known
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[11019]: [WARNING] Cannot resolve zk-4.zk: [Errno -2] Name or service not known
Jan 09 21:53:39 linode-dcos-public-agent-00 bootstrap[11019]: [WARNING] Cannot resolve zk-5.zk: [Errno -2] Name or service not known
Jan 09 21:53:39 linode-dcos-public-agent-00 dcos-net-setup.py[11297]: RTNETLINK answers: File exists
Jan 09 21:53:39 linode-dcos-public-agent-00 systemd[1]: dcos-telegraf.service: Control process exited, code=exited status=1
Jan 09 21:53:39 linode-dcos-public-agent-00 systemd[1]: dcos-telegraf.service: Failed with result 'exit-code'.
Jan 09 21:53:39 linode-dcos-public-agent-00 systemd[1]: Failed to start Telegraf: collects and reports metrics.
Jan 09 21:53:39 linode-dcos-public-agent-00 dcos-net-setup.py[11303]: net.ipv6.conf.spartan.disable_ipv6 = 0
Jan 09 21:53:39 linode-dcos-public-agent-00 dcos-net-setup.py[11307]: RTNETLINK answers: File exists
Jan 09 21:53:40 linode-dcos-public-agent-00 systemd[1]: dcos-mesos-slave-public.service: Service hold-off time over, scheduling restart.
Jan 09 21:53:40 linode-dcos-public-agent-00 systemd[1]: dcos-mesos-slave-public.service: Scheduled restart job, restart counter is at 296.
Jan 09 21:53:40 linode-dcos-public-agent-00 systemd[1]: Stopped Mesos Agent Public: distributed systems kernel public agent.
Jan 09 21:53:40 linode-dcos-public-agent-00 systemd[1]: Starting Mesos Agent Public: distributed systems kernel public agent...
Jan 09 21:53:40 linode-dcos-public-agent-00 mesos-agent[11315]: ping: unknown host ready.spartan
Jan 09 21:53:40 linode-dcos-public-agent-00 systemd[1]: dcos-mesos-slave-public.service: Control process exited, code=exited status=2
Jan 09 21:53:40 linode-dcos-public-agent-00 systemd[1]: dcos-mesos-slave-public.service: Failed with result 'exit-code'.
Jan 09 21:53:40 linode-dcos-public-agent-00 systemd[1]: Failed to start Mesos Agent Public: distributed systems kernel public agent.
Jan 09 21:53:40 linode-dcos-public-agent-00 bootstrap[11311]: [INFO] Clearing proxy environment variables
Jan 09 21:53:40 linode-dcos-public-agent-00 bootstrap[11311]: [DEBUG] bootstrapping dcos-net

and so on...
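Since the same failures repeat hundreds of times per node, a summary is easier to read than the raw journal. Below is a minimal sketch, not part of this module, that counts `Failed with result` lines and records the highest `restart counter` per host and unit. The function name `summarize` and both regexes are my own assumptions, written against the systemd line format visible in the output above; other journal formats may need different patterns.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the journal output in this issue:
#   "Jan 09 21:49:45 <host> systemd[1]: <unit>.service: Failed with result 'exit-code'."
PREFIX = r"^\w{3} \d{2} [\d:]{8} (\S+) systemd\[1\]: (\S+\.service): "
FAILED = re.compile(PREFIX + r"Failed with result")
RESTART = re.compile(PREFIX + r"Scheduled restart job, restart counter is at (\d+)\.")


def summarize(lines):
    """Return {(host, unit): {"failures": n, "restarts": highest counter seen}}.

    Hypothetical helper for illustration; lines that match neither
    pattern (tracebacks, sshd, networkd, ...) are ignored.
    """
    summary = defaultdict(lambda: {"failures": 0, "restarts": 0})
    for line in lines:
        match = FAILED.match(line)
        if match:
            summary[match.groups()]["failures"] += 1
            continue
        match = RESTART.match(line)
        if match:
            host, unit, count = match.groups()
            entry = summary[(host, unit)]
            entry["restarts"] = max(entry["restarts"], int(count))
    return dict(summary)
```

To use it, save the journal to a file (for example with `journalctl > journal.txt`), then call `summarize(open("journal.txt"))` and print the entries sorted by restart count.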

displague commented 5 years ago

This error applies to the classic installer. We have new problems now :-)