openshift / openshift-ansible

Install and config an OpenShift 3.x cluster
https://try.openshift.com
Apache License 2.0

failing to create 3.10 cluster because ansible downloads 3.6 version oc tool #7790

Closed vikaschoudhary16 closed 4 years ago

vikaschoudhary16 commented 6 years ago

Description

Trying to install a single-node 3.10 cluster with the pre-release version v3.10.0-0.14.0. Eventually openshift-ansible fails with an error related to node label parsing.

Invalid value: "node-role.kubernetes.io~1master"

It fails because the oc tool that openshift-ansible installs is version 3.6. Why is it not 3.10?

containerized=true

ansible_ssh_user=root
openshift_deployment_type=openshift-enterprise
openshift_image_tag=v3.10.0-0.14.0
openshift_release='3.10'
openshift_playbook_rpm_repos=[{'id': 'aos-playbook-rpm', 'name': 'aos-playbook-rpm', 'baseurl': 'https://mirror.openshift.com/enterprise/all/3.10/v3.10.0-0.14.0_2018-03-27.1/x86_64/os/', 'enabled': 1, 'gpgcheck': 0}]

Another confusing thing that I am not able to understand is the node image that openshift-ansible pulls to the host. The docker images command shows the tag as "latest", but when I run "docker inspect", the version it shows is v3.6.173.0.96, the same as the oc tool version.

[root@dell-r620-01 openshift-ansible]# docker images
REPOSITORY                                              TAG                 IMAGE ID            CREATED             SIZE
registry.reg-aws.openshift.com:443/openshift3/node      latest              2896ff70a1ad        2 months ago        1.16 GB
registry.reg-aws.openshift.com:443/openshift3/ose-pod   latest              dd6523d97eb2        2 months ago        209 MB

[root@dell-r620-01 openshift-ansible]# docker inspect 2896ff70a1ad | grep version
                "version": "v3.6.173.0.96"
                "version": "v3.6.173.0.96"

The same is the case with the CLI image.
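One way to check whether the registry's latest tag really lags behind is to pull the explicit pre-release tag and compare version labels. A minimal sketch, assuming the version string seen in docker inspect above is stored in the image labels (the registry, repository, and tag are the ones already used in this report):

# compare the version label of the explicit tag against :latest
docker pull registry.reg-aws.openshift.com:443/openshift3/node:v3.10.0-0.14.0
docker inspect --format '{{ index .Config.Labels "version" }}' registry.reg-aws.openshift.com:443/openshift3/node:v3.10.0-0.14.0
docker inspect --format '{{ index .Config.Labels "version" }}' registry.reg-aws.openshift.com:443/openshift3/node:latest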

INSTALLER STATUS ************************************************************************************************************************************************************
Initialization             : Complete (0:00:08)
Health Check               : Complete (0:00:01)
etcd Install               : Complete (0:01:53)
Master Install             : Complete (0:03:28)
Master Additional Install  : Complete (0:00:30)
Node Install               : In Progress (0:02:10)
        This phase can be restarted by running: playbooks/openshift-node/config.yml

Wednesday 04 April 2018  18:24:33 -0400 (0:00:20.880)       0:08:10.278 *******
===============================================================================
etcd : Install etcd ------------------------------------------------------------------------------------------------------------------------------------------------- 68.88s
openshift_master : Pre-pull master image ---------------------------------------------------------------------------------------------------------------------------- 52.33s
openshift_cli : Pull CLI Image -------------------------------------------------------------------------------------------------------------------------------------- 27.29s
openshift_manage_node : Label nodes --------------------------------------------------------------------------------------------------------------------------------- 20.88s
restart master ------------------------------------------------------------------------------------------------------------------------------------------------------ 20.70s
openshift_facts ----------------------------------------------------------------------------------------------------------------------------------------------------- 13.94s
openshift_manage_node : Wait for master API to become available before proceeding ----------------------------------------------------------------------------------- 10.85s
openshift_manage_node : Set node schedulability --------------------------------------------------------------------------------------------------------------------- 10.70s
openshift_manage_node : Wait for Node Registration ------------------------------------------------------------------------------------------------------------------ 10.70s
openshift_node : Start and enable node ------------------------------------------------------------------------------------------------------------------------------ 10.34s
openshift_master : Start and enable master api ---------------------------------------------------------------------------------------------------------------------- 10.25s
openshift_master : Start and enable master controller service ------------------------------------------------------------------------------------------------------- 10.25s
openshift_node : Install iSCSI storage plugin dependencies ----------------------------------------------------------------------------------------------------------- 9.75s
openshift_manageiq : Configure role/user permissions ----------------------------------------------------------------------------------------------------------------- 5.86s
nickhammond.logrotate : nickhammond.logrotate | Install logrotate ---------------------------------------------------------------------------------------------------- 5.19s
openshift_node : Install GlusterFS storage plugin dependencies ------------------------------------------------------------------------------------------------------- 5.05s
openshift_node : Install dnsmasq ------------------------------------------------------------------------------------------------------------------------------------- 4.92s
etcd : Install openssl ----------------------------------------------------------------------------------------------------------------------------------------------- 4.92s
etcd : Install etcd -------------------------------------------------------------------------------------------------------------------------------------------------- 4.81s
openshift_node : Install Ceph storage plugin dependencies ------------------------------------------------------------------------------------------------------------ 4.81s

Failure summary:

  1. Hosts:    mynode
     Play:     Additional node config
     Task:     Label nodes
     Message:  {u'cmd': u'/usr/local/bin/oc label node mynode node-role.kubernetes.io/master=true region=infra zone=default --overwrite -n default', u'returncode': 1, u'results': {}, u'stderr': u'The Node "mynode" is invalid: metadata.labels: Invalid value: "node-role.kubernetes.io~1master": name part must consist of alphanumeric characters, \'-\', \'_\' or \'.\', and must start and end with an alphanumeric character (e.g. \'MyName\',  or \'my.name\',  or \'123-abc\', regex used for validation is \'([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]\')\n', u'stdout': u''}

Also, when I tried to label the node manually, it showed the same error:

[root@dell-r620-01 openshift-ansible]# oc label node mynode node-role.kubernetes.io/master=true region=infra zone=default --overwrite -n default
The Node "mynode" is invalid: metadata.labels: Invalid value: "node-role.kubernetes.io~1master": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName', or 'my.name', or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')

Version
[root@dell-r620-01 openshift-ansible]# ansible --version
ansible 2.4.3.0
  config file = /home/vikas/openshift-ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, May  3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]

[root@dell-r620-01 openshift-ansible]# git describe
openshift-ansible-3.10.0-0.15.0

[root@dell-r620-01 openshift-ansible]# oc version
oc v3.6.173.0.96
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO
If I try to label the node with a 3.10 version of the oc tool (built manually), it works. So the main question is: why is openshift-ansible not downloading the 3.10 oc tool?

UPDATE: I was able to get past this error by adding the following to the inventory file:

openshift_cli_image="registry.reg-aws.openshift.com:443/openshift3/ose:v3.10.0-0.14.0"

openshift-ansible is not using openshift_image_tag in the docker pull for the openshift3/ose image. Two questions:

  1. Shouldn't openshift_image_tag be used in the docker pull?
  2. Even if it is not used and latest is used instead, why does latest point to 3.6 and not to 3.10?
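For reference, a minimal sketch of the inventory variables involved, with the workaround applied. The [OSEv3:vars] section header is illustrative; the variable values are exactly the ones quoted in this report:

[OSEv3:vars]
ansible_ssh_user=root
containerized=true
openshift_deployment_type=openshift-enterprise
openshift_release='3.10'
openshift_image_tag=v3.10.0-0.14.0
# workaround: pin the CLI image explicitly so the pull does not fall back to :latest (v3.6)
openshift_cli_image="registry.reg-aws.openshift.com:443/openshift3/ose:v3.10.0-0.14.0"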

/cc @smarterclayton @sdodson @sjenning

smarterclayton commented 6 years ago

Looks like there may be a system container bug with the pull CLI image flow.

Can you turn on -vv and run through that step again?
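For example, re-running just the failed node phase with extra verbosity (the inventory path is a placeholder; the playbook path is the one the installer status above reports as restartable):

ansible-playbook -vv -i /path/to/inventory playbooks/openshift-node/config.yml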


vikaschoudhary16 commented 6 years ago

@smarterclayton After using the following in the ini file, there was no issue related to image tags:

oreg_url='registry.reg-aws.openshift.com:443/openshift3/ose-${component}:v3.10.0-0.14.0'
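For context, a rough sketch of how that oreg_url template resolves, assuming ${component} is substituted with each image's short name and the literal tag is kept (the ose-pod expansion matches an image already listed by docker images earlier in this report):

# e.g. component "pod" -> openshift3/ose-pod with the explicit tag instead of :latest
docker pull registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.10.0-0.14.0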

The next failure that I encountered was with the node service:

● atomic-openshift-node.service
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Mon 2018-04-09 02:20:36 EDT; 813ms ago
  Process: 8433 ExecStopPost=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string: (code=exited, status=0/SUCCESS)
  Process: 8431 ExecStopPost=/usr/bin/rm /etc/dnsmasq.d/node-dnsmasq.conf (code=exited, status=0/SUCCESS)
  Process: 8419 ExecStop=/usr/bin/docker stop atomic-openshift-node (code=exited, status=1/FAILURE)
  Process: 8405 ExecStartPost=/usr/bin/sleep 10 (code=exited, status=0/SUCCESS)
  Process: 8404 ExecStart=/usr/bin/docker run --name atomic-openshift-node --rm --privileged --net=host --pid=host --env-file=/etc/sysconfig/atomic-openshift-node --entrypoint /usr/local/bin/openshift-node -v /:/rootfs:ro,rslave -e CONFIG_FILE=${CONFIG_FILE} -e OPTIONS=${OPTIONS} -e DEBUG_LOGLEVEL=${DEBUG_LOGLEVEL} -e HOST=/rootfs -e HOST_ETC=/host-etc -v /var/lib/origin:/var/lib/origin:rslave -v /etc/origin/node:/etc/origin/node -v /etc/localtime:/etc/localtime:ro -v /etc/machine-id:/etc/machine-id:ro -v /run:/run -v /sys:/sys:rw -v /sys/fs/cgroup:/sys/fs/cgroup:rw -v /usr/bin/docker:/usr/bin/docker:ro -v /var/lib/docker:/var/lib/docker -v /lib/modules:/lib/modules -v /etc/cni:/etc/cni:ro -v /opt/cni:/opt/cni:ro -v /etc/systemd/system:/host-etc/systemd/system -v /var/log:/var/log $NUAGE_ADDTL_BIND_MOUNTS -v /dev:/dev $DOCKER_ADDTL_BIND_MOUNTS -v /etc/pki:/etc/pki:ro -v /var/lib/origin/.docker:/root/.docker:ro registry.reg-aws.openshift.com:443/openshift3/node:v3.10.0-0.14.0:${IMAGE_VERSION} (code=exited, status=125)
  Process: 8401 ExecStartPre=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:/in-addr.arpa/127.0.0.1,/cluster.local/127.0.0.1 (code=exited, status=0/SUCCESS)
  Process: 8398 ExecStartPre=/usr/bin/cp /etc/origin/node/node-dnsmasq.conf /etc/dnsmasq.d/ (code=exited, status=0/SUCCESS)
  Process: 8384 ExecStartPre=/usr/bin/docker rm -f atomic-openshift-node (code=exited, status=1/FAILURE)
 Main PID: 8404 (code=exited, status=125)

Apr 09 02:20:36 mynode systemd[1]: Failed to start atomic-openshift-node.service.
Apr 09 02:20:36 mynode systemd[1]: Unit atomic-openshift-node.service entered failed state.
Apr 09 02:20:36 mynode systemd[1]: atomic-openshift-node.service failed.

The problematic part is this:

-v /var/lib/origin/.docker:/root/.docker:ro registry.reg-aws.openshift.com:443/openshift3/node:v3.10.0-0.14.0:${IMAGE_VERSION}

After making this local change in openshift-ansible-3.10.0-0.14.0, I was able to get the containerized cluster deployed: https://github.com/vikaschoudhary16/openshift-ansible/commit/c0825678c32e2294b3b383ec3793cb71a172e353#diff-330fdfd960ab6907fae5069bde42fcc3
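For clarity, a sketch of the intent behind that change (not the exact diff from the linked commit): the generated ExecStart ended up with the tag twice, once baked into the image name and once appended as ${IMAGE_VERSION}, whereas docker expects a single tag:

# broken reference generated by the installer (tag present twice):
#   registry.reg-aws.openshift.com:443/openshift3/node:v3.10.0-0.14.0:${IMAGE_VERSION}
# reference docker can actually pull (single tag):
docker pull registry.reg-aws.openshift.com:443/openshift3/node:v3.10.0-0.14.0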

openshift-bot commented 4 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 4 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

openshift-bot commented 4 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci-robot commented 4 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/openshift-ansible/issues/7790#issuecomment-662187949):

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting `/reopen`.
> Mark the issue as fresh by commenting `/remove-lifecycle rotten`.
> Exclude this issue from closing again by commenting `/lifecycle frozen`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.