openshift / openshift-ansible

Install and config an OpenShift 3.x cluster
https://try.openshift.com
Apache License 2.0

3.11 HA installation - Fails on "wait for sync DS to set annotations on all nodes" #11348

Closed: ralfbardoel closed this issue 3 years ago

ralfbardoel commented 5 years ago

Description

A deployment of OpenShift Origin 3.11 in an HA setup fails at the task "wait for sync DS to set annotations on all nodes". Things that have already been checked:

We are using Ansible version 2.7.8 and git describe returns "openshift-ansible-3.11.95-1".
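
For context, that task polls the node objects until the sync DaemonSet has annotated every node. A rough manual check (the DaemonSet name and annotation key here are assumptions about the stock 3.11 sync pods, not something confirmed in this report):

oc get ds sync -n openshift-node
# the wait task is believed to look for a node.openshift.io/md5sum annotation written by the sync pods
oc get nodes -o yaml | grep -E 'name: |md5sum'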

Steps To Reproduce

Install an HA setup on AWS with two masters, two infra nodes, and three compute nodes, using the inventory file below (keys and domains are replaced; the playbook invocation is sketched after the inventory):

[OSEv3:children]
masters
nodes
etcd
lb

[OSEv3:vars]
ansible_ssh_user=origin
ansible_become=true
openshift_deployment_type=origin
openshift_disable_check=disk_availability

openshift_image_tag=v3.11.0
openshift_release=3.11.0
openshift_pkg_version=-3.11.0

openshift_cloudprovider_kind=aws
openshift_cloudprovider_aws_access_key=XXXXX
openshift_cloudprovider_aws_secret_key=XXXXXX
openshift_clusterid=XXXXX

openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider'}]
openshift_master_htpasswd_users={'XXXX' : 'XXX'}
openshift_master_default_subdomain=xxx.domain.com
openshift_docker_insecure_registries=172.30.0.0/16

openshift_clock_enabled=true

##S3 registry
openshift_hosted_registry_storage_kind=object
openshift_hosted_registry_storage_provider=s3
openshift_hosted_registry_storage_s3_accesskey=XXXX
openshift_hosted_registry_storage_s3_secretkey=XXXXX
openshift_hosted_registry_storage_s3_bucket=XXXXX
openshift_hosted_registry_storage_s3_region=eu-central-1
openshift_hosted_registry_storage_s3_chunksize=26214400
openshift_hosted_registry_storage_s3_rootdirectory=/registry
openshift_hosted_registry_pullthrough=true
openshift_hosted_registry_acceptschema2=true
openshift_hosted_registry_enforcequota=true

##Master config
openshift_master_cluster_method=native
openshift_master_cluster_hostname=openshift.domain.com
openshift_master_cluster_public_hostname=openshift.domain.com

##Certificates
openshift_master_overwrite_named_certificates=true
openshift_master_named_certificates=[{"cafile": "/home/origin/openshift.ca.pem", "certfile": "/home/origin/openshift.pem", "keyfile": "/home/origin/openshift.key.pem", "names": ["openshift.domain.com"]}]
openshift_hosted_router_certificate={"certfile": "/home/origin/XXX.pem", "keyfile": "/home/origin/XXX.key.pem", "cafile": "/home/origin/XXX.ca.pem"}

[masters]
mastera.openshift.domain.com
masterb.openshift.domain.com

[etcd]
infraa.openshift.domain.com
infrab.openshift.domain.com

[lb]
lb.openshift.domain.com

[nodes]
mastera.openshift.domain.com openshift_node_group_name='node-config-master'
masterb.openshift.domain.com openshift_node_group_name='node-config-master'
node01a.openshift.domain.com openshift_node_group_name='node-config-compute'
node01b.openshift.domain.com openshift_node_group_name='node-config-compute'
node01admz.openshift.domain.com openshift_node_group_name='node-config-compute'
infraa.openshift.domain.com openshift_node_group_name='node-config-infra'
infrab.openshift.domain.com openshift_node_group_name='node-config-infra'
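
The playbooks were invoked in the usual order for a 3.11 install (a sketch; the inventory path is a placeholder, not the exact command line used here):

ansible-playbook -i inventory/hosts playbooks/prerequisites.yml
ansible-playbook -i inventory/hosts playbooks/deploy_cluster.yml -vvv
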
Expected Results

An OpenShift HA deployment that completes without this error.

Observed Results

The following error information is returned when running Ansible with the -vvv flag:

https://gist.github.com/ralfbardoel/5923a2a1781a142155f61c08bbd32522

Additional Information

oc version -> v3.11.0+62803d0-1
kubernetes -> v1.11.0+d4cacc0
OpenShift rpm -> centos-release-openshift-origin-1-1.el7.centos.noarch

Running on CentOS 7 (CentOS Linux release 7.6.1810 (Core)) on AWS.

jcpowermac commented 5 years ago

@ralfbardoel there is also a BZ (Bugzilla report) for this issue. Can you provide the logs from the sync pods?

oc get pod -n openshift-node -l app=sync
oc logs <pod>
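
If it helps, one way to collect the logs from every sync pod in one pass (a small shell sketch, not a command from the playbooks):

for p in $(oc get pod -n openshift-node -l app=sync -o name); do echo "== $p"; oc logs -n openshift-node "$p"; done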
vrutkovs commented 5 years ago

"lastHeartbeatTime": "2019-03-13T10:18:30Z" "lastTransitionTime": "2019-03-11T11:20:30Z"

That doesn't look right. Is the node service running there without errors?
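
On Origin 3.11 the node runs as the origin-node systemd unit, so a quick health check on the affected host (assuming that unit name, i.e. a stock Origin install) would be:

systemctl status origin-node
journalctl -u origin-node --since "1 hour ago" | tail -n 100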

Cybernemo commented 4 years ago

Same problem here. OpenShift Origin: 3.11, Ansible: 2.9.6. The Ansible host resolves each node, and the prerequisites playbook runs without any issue.

Inventory file

[OSEv3:children]
masters
etcd
nodes

[OSEv3:vars]
## Ansible user who can login to all nodes through SSH (e.g. ssh root@os-master1)
ansible_user=root

## Deployment type: "openshift-enterprise" or "origin"
openshift_deployment_type=origin
deployment_type=origin

## Specifies the major version
openshift_release=v3.11.0
openshift_pkg_version=-3.11.0
openshift_image_tag=v3.11.0
openshift_service_catalog_image_version=v3.11.0
template_service_broker_image_version=v3.11.0
openshift_metrics_image_version="v3.11"
openshift_logging_image_version="v3.11"
openshift_logging_elasticsearch_proxy_image_version="v1.0.0"
osm_use_cockpit=true
openshift_metrics_install_metrics=True
openshift_logging_install_logging=True

## Service address space,  /16 = 65,534 IPs
openshift_portal_net=172.30.0.0/16

## Pod address space
osm_cluster_network_cidr=10.128.0.0/14

## Subnet Length of each node, 9 = 510 IPs
osm_host_subnet_length=9

## Master API  port
openshift_master_api_port=8443

## Master console port  (e.g. https://console.openshift.local:443)
openshift_master_console_port=8443

## Clustering method
openshift_master_cluster_method=native

## Hostname used by nodes and other cluster internals
openshift_master_cluster_hostname=console-int.openshift.home

## Hostname used by platform users
openshift_master_cluster_public_hostname=console.openshift.home

## Application wildcard subdomain
openshift_master_default_subdomain=apps.openshift.home

## identity provider
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider'}]

## Users being created in the cluster
## Password abcd1234
openshift_master_htpasswd_users={'admin': '$apr1$BfW0njqt$KbsFn1LKfkb10ARFGxoRX/', 'user1': '$apr1$7erCvbtG$60V7Vx2HBfaDrfG4pUkba.'}

## Persistent storage, NFS
openshift_hosted_registry_storage_kind=nfs
openshift_hosted_registry_storage_access_modes=['ReadWriteMany']
openshift_hosted_registry_storage_host=zion.home
openshift_hosted_registry_storage_nfs_directory=/volume1/SHARED
openshift_hosted_registry_storage_volume_name=registry
openshift_hosted_registry_storage_volume_size=50Gi

## Other vars
containerized=True
os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant'
openshift_disable_check=disk_availability,docker_storage,memory_availability,docker_image_availability

#NFS check bug
openshift_enable_unsupported_configurations=True
#Another Bug 1569476
skip_sanity_checks=true

openshift_node_kubelet_args="{'eviction-hard': ['memory.available<100Mi'], 'minimum-container-ttl-duration': ['10s'], 'maximum-dead-containers-per-container': ['2'], 'maximum-dead-containers': ['5'], 'pods-per-core': ['10'], 'max-pods': ['25'], 'image-gc-high-threshold': ['80'], 'image-gc-low-threshold': ['60']}"

[OSEv3:vars]

[masters]
t4master1.home

[etcd]
t4master1.home

[nodes]
t4master1.home openshift_node_labels="{'region': 'master'}"
t4infra1.home openshift_node_labels="{'region': 'infra'}"
t4node1.home openshift_node_labels="{'region': 'primary'}"
t4node2.home openshift_node_labels="{'region': 'primary'}"

Error message

fatal: [t4master1.home]: FAILED! => {                                                                    
    "attempts": 180,                                                                                     
    "changed": false,                                                                                    
    "invocation": {                                                                                      
        "module_args": {                                                                                 
            "all_namespaces": null,                                                                      
            "content": null,                                                                             
            "debug": false,                                                                              
            "delete_after": false,                                                                       
            "field_selector": null,                                                                      
            "files": null,                                                                               
            "force": false,                                                                              
            "kind": "node",                                                                              
            "kubeconfig": "/etc/origin/master/admin.kubeconfig",                                         
            "name": null,                                                                                
            "namespace": "default",                                                                      
            "selector": "",                                                                              
            "state": "list"                                                                              
        }                                                                                                
    },                                                                                                   
    "module_results": {                                                                                  
        "cmd": "/usr/local/bin/oc get node --selector= -o json -n default",                              
        "results": [                                                                                     
            {                                                                                            
                "apiVersion": "v1",                                                                      
                "items": [                                                                               
                    {                                                                                    
                        "apiVersion": "v1",                                                              
                        "kind": "Node",                                                                  
                        "metadata": {                                                                    
                            "annotations": {                                                             
                                "volumes.kubernetes.io/controller-managed-attach-detach": "true"         
                            },                                                                           
                            "creationTimestamp": "2020-05-16T07:51:40Z",                                 
                            "labels": {                                                                  
                                "beta.kubernetes.io/arch": "amd64",                                      
                                "beta.kubernetes.io/os": "linux",
                                "kubernetes.io/hostname": "t4master1"
                            },
                            "name": "t4master1",
                            "namespace": "",
                            "resourceVersion": "3792",
                            "selfLink": "/api/v1/nodes/t4master1",
                            "uid": "13f89076-974a-11ea-838c-5254008ed04b"
                        },
                        "spec": {},
                        "status": {
                            "addresses": [
                                {
                                    "address": "192.168.1.223",
                                    "type": "InternalIP"
                                },
                                {
                                    "address": "t4master1",
                                    "type": "Hostname"
                                }
                            ],
                            "allocatable": {
                                "cpu": "4",
                                "hugepages-2Mi": "0",
                                "memory": "1779192Ki",
                                "pods": "250"
                            },
                            "capacity": {
                                "cpu": "4",
                                "hugepages-2Mi": "0",
                                "memory": "1881592Ki",
                                "pods": "250"
                            },
                            "conditions": [
                                {
                                    "lastHeartbeatTime": "2020-05-16T08:26:27Z",
                                    "lastTransitionTime": "2020-05-16T07:51:40Z",
                                    "message": "kubelet has sufficient disk space available",
                                    "reason": "KubeletHasSufficientDisk",
                                    "status": "False",
                                    "type": "OutOfDisk"
                                },
                                {
                                    "lastHeartbeatTime": "2020-05-16T08:26:27Z",
                                    "lastTransitionTime": "2020-05-16T07:51:40Z",
                                    "message": "kubelet has sufficient memory available",
                                    "reason": "KubeletHasSufficientMemory",
                                    "status": "False",
                                    "type": "MemoryPressure"
                                },
                                {
                                    "lastHeartbeatTime": "2020-05-16T08:26:27Z",
                                    "lastTransitionTime": "2020-05-16T07:51:40Z",
                                    "message": "kubelet has no disk pressure",
                                    "reason": "KubeletHasNoDiskPressure",
                                    "status": "False",
                                    "type": "DiskPressure"
                                },
                                {
                                    "lastHeartbeatTime": "2020-05-16T08:26:27Z",
                                    "lastTransitionTime": "2020-05-16T07:51:40Z",
                                    "message": "kubelet has sufficient PID available",
                                    "reason": "KubeletHasSufficientPID",
                                    "status": "False",
                                    "type": "PIDPressure"
                                },
                                {
                                    "lastHeartbeatTime": "2020-05-16T08:26:27Z",
                                    "lastTransitionTime": "2020-05-16T07:51:40Z",
                                    "message": "kubelet is posting ready status",                                                                                                                        [85/86498]
                                    "reason": "KubeletReady",
                                    "status": "True",
                                    "type": "Ready"
                                }
                            ],
                            "daemonEndpoints": {
                                "kubeletEndpoint": { 
                                    "Port": 10250
                                }
                            },
                            "images": [
                                {
                                    "names": [
                                        "docker.io/openshift/origin-node@sha256:73a2fe2f4c9f93efd47bd909572a6592907098ba7b7f2839c3ee9165228b0772",
                                        "docker.io/openshift/origin-node:v3.11.0"
                                    ],
                                    "sizeBytes": 1193537132
                                },
                                {
                                    "names": [
                                        "docker.io/openshift/origin-control-plane@sha256:8b10156d1e67d326c88228a005a69dcbd211fa1e53b709ad66d8ff1971708c7b",
                                        "docker.io/openshift/origin-control-plane:v3.11.0"
                                    ],
                                    "sizeBytes": 835849824
                                },
                                {
                                    "names": [
                                        "docker.io/openshift/origin-pod@sha256:3178ea38ef67954ceeb0ad842adcab640019da246aba109226a73aea49f31d54",
                                        "docker.io/openshift/origin-pod:v3.11.0"
                                    ],
                                    "sizeBytes": 265514713
                                },
                                {
                                    "names": [
                                        "quay.io/coreos/etcd@sha256:ed2b69c34840f475929abd84133e17421d0608b26f9c3cbe54c7699918580a99",
                                        "quay.io/coreos/etcd:v3.2.26"
                                    ],
                                    "sizeBytes": 37605387
                                }
                            ],
                            "nodeInfo": {
                                "architecture": "amd64",
                                "bootID": "a4391a38-6de6-4b66-8ee2-e9d3992b8c07",
                                "containerRuntimeVersion": "docker://1.13.1",
                                "kernelVersion": "3.10.0-1062.4.3.el7.x86_64",
                                "kubeProxyVersion": "v1.11.0+d4cacc0",
                                "kubeletVersion": "v1.11.0+d4cacc0",
                                "machineID": "cfd6a6d3aa21425b990bf7fd727c9342",
                                "operatingSystem": "linux",
                                "osImage": "CentOS Linux 7 (Core)",
                                "systemUUID": "CFD6A6D3-AA21-425B-990B-F7FD727C9342"
                            }
                        }                                                                                                         
                    }                                                                                    
                ],                                                                                       
                "kind": "List",                                                                          
                "metadata": {                                                                            
                    "resourceVersion": "",                                                               
                    "selfLink": ""                                                                       
                }                                                                                        
            }                                                                                            
        ],                                                                                               
        "returncode": 0                                                                                  
    },                                                                                                   
    "state": "list"                                                                                      
}                           
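
Reading the output above: the task gave up after 180 attempts, and the only annotation on t4master1 is volumes.kubernetes.io/controller-managed-attach-detach, so the sync pods apparently never wrote their annotation to the node. Assuming the sync DaemonSet lives in the openshift-node namespace as in a stock 3.11 install, a minimal next check would be:

oc get pod -n openshift-node -o wide
oc describe ds sync -n openshift-node

If no sync pods are scheduled at all, the events in the describe output should indicate why.
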
openshift-bot commented 4 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 3 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

openshift-bot commented 3 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci-robot commented 3 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/openshift-ansible/issues/11348#issuecomment-743282748):

>Rotten issues close after 30d of inactivity.
>
>Reopen the issue by commenting `/reopen`.
>Mark the issue as fresh by commenting `/remove-lifecycle rotten`.
>Exclude this issue from closing again by commenting `/lifecycle frozen`.
>
>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.