openshift / openshift-ansible

Install and config an OpenShift 3.x cluster
https://try.openshift.com
Apache License 2.0

Nuage CNI fails to setup network for any pod with Openshift Enterprise on Atomic #5204

Closed rushabh268 closed 4 years ago

rushabh268 commented 7 years ago

Description

Nuage CNI fails to setup network for any pod with Openshift Enterprise on Atomic

Version

ansible --version
ansible 2.3.1.0
  config file = /opt/openshift-ansible/ansible.cfg
  configured module search path = Default w/o overrides
  python version = 2.7.5 (default, May  3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]

[root@ovs-12 openshift-ansible]# git describe
openshift-ansible-3.6.128-1

oc version
oc v3.5.5.5
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ovs-12.mvdcdev33.us.alcatel-lucent.com:8443
openshift v3.5.5.5
kubernetes v1.5.2+43a9be4
Steps To Reproduce
  1. Deploy OpenShift on Atomic hosts using openshift_use_nuage=True (a sample inventory is sketched after these steps)
  2. Deploy a pod to test
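
For reference, a minimal sketch of the kind of inventory used in step 1. Hostnames and all variables other than openshift_use_nuage=True (which is from this report) are placeholders:

  # hypothetical openshift-ansible inventory excerpt (INI format)
  [OSEv3:children]
  masters
  nodes

  [OSEv3:vars]
  ansible_ssh_user=root
  deployment_type=openshift-enterprise
  # use the Nuage SDN/CNI integration instead of openshift-sdn
  openshift_use_nuage=True

  [masters]
  master1.example.com

  [nodes]
  master1.example.com
  node1.example.com openshift_schedulable=True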
Expected Results


The pod should come up on the Nuage CNI network; instead, pod network setup fails with the error shown below.

[root@ovs-12 openshift-ansible]# kubectl describe po/deploy-4266705484-2vglz 
Name:           deploy-4266705484-2vglz
Namespace:      default
Security Policy:    restricted
Node:           ovs-2.test.ose.atomic.com/100.200.55.105
Start Time:     Wed, 23 Aug 2017 09:18:10 -0700
Labels:         pod-template-hash=4266705484
            run=deploy
Status:         Pending
IP:         
Controllers:        ReplicaSet/deploy-4266705484
Containers:
  deploy:
    Container ID:   
    Image:      rstarmer/nginx-curl
    Image ID:       
    Port:       80/TCP
    Args:
      my-nginx
    State:      Waiting
      Reason:       ContainerCreating
    Ready:      False
    Restart Count:  0
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-w8ncb (ro)
    Environment Variables:  <none>
Conditions:
  Type      Status
  Initialized   True 
  Ready     False 
  PodScheduled  True 
Volumes:
  default-token-w8ncb:
    Type:   Secret (a volume populated by a Secret)
    SecretName: default-token-w8ncb
QoS Class:  BestEffort
Tolerations:    <none>
Events:
  FirstSeen LastSeen    Count   From                        SubObjectPath   Type        Reason      Message
  --------- --------    -----   ----                        -------------   --------    ------      -------
  5m        5m      1   {default-scheduler }                        Normal      Scheduled   Successfully assigned deploy-4266705484-2vglz to ovs-2.test.ose.atomic.com
  5m        5s      22  {kubelet ovs-2.test.ose.atomic.com}         Warning     FailedSync  Error syncing pod, skipping: failed to "SetupNetwork" for "deploy-4266705484-2vglz_default" with SetupNetworkError: "Failed to setup network for pod \"deploy-4266705484-2vglz_default(a81ba60a-881e-11e7-918f-faaca6105000)\" using network plugins \"cni\": failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: no such file or directory; Skipping pod"
Observed Results

OpenShift is not able to invoke the Nuage CNI plugin; pod network setup fails with the following error:


Error syncing pod, skipping: failed to "SetupNetwork" for "deploy-4266705484-2vglz_default" with SetupNetworkError: "Failed to setup network for pod \"deploy-4266705484-2vglz_default(a81ba60a-881e-11e7-918f-faaca6105000)\" using network plugins \"cni\": failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: no such file or directory; Skipping pod"
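
A quick way to see whether the CNI config and plugin binaries are actually visible where the kubelet looks for them, both on the node host and (for a containerized install) inside the node container, is a check along these lines (a sketch; the container name atomic-openshift-node is an assumption):

  # on the affected node host: CNI network config and plugin binaries
  ls -l /etc/cni/net.d/ /opt/cni/bin/
  # inside the containerized node service, the same paths must be bind-mounted in
  docker exec atomic-openshift-node ls -l /etc/cni/net.d/ /opt/cni/bin/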
Additional Information


[root@ovs-1 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Atomic Host release 7.3

[root@ovs-12 openshift-ansible]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)


A work-around is mentioned in https://github.com/openshift/openshift-ansible/issues/3805:
add "-v /opt/cni/bin:/opt/cni/bin -v /etc/cni/net.d:/etc/cni/net.d" to /etc/systemd/system/docker.service.wants/origin-node.service (applied as sketched below).
sdodson commented 7 years ago

Can you test this PR? https://github.com/openshift/openshift-ansible/pull/4991

rushabh268 commented 7 years ago

@sdodson That PR won't handle adding -v /opt/cni/bin:/opt/cni/bin -v /etc/cni/net.d:/etc/cni/net.d to the node.service file. I have tested that PR together with the work-around I mentioned above; @rparulek and @vareti are aware of this issue.

rparulek commented 7 years ago

@sdodson We did try PR 4991 on Atomic hosts, but it fails at CNI pod setup with the following error:

Error syncing pod, skipping: failed to "SetupNetwork" for "router-3-deploy_default" with SetupNetworkError: "Failed to setup network for pod "router-3-deploy_default(799d976f-1502-11e7-b3c1-fa163e10ef43)" using network plugins "cni": failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: no such file or directory

This is why we used the workaround mentioned for Calico in this issue: https://github.com/openshift/openshift-ansible/issues/3805

The workaround was to manually add the following host-to-container mappings in "/etc/systemd/system/docker.service.wants/atomic-openshift-node.service" on all of our Atomic nodes:

-v /var/usr/share/vsp-openshift:/var/usr/share/vsp-openshift -v /etc/default:/etc/default -v /var/run:/var/run -v /opt/cni/bin:/opt/cni/bin -v /etc/cni/net.d:/etc/cni/net.d
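
After restarting the node service, one way to confirm that those mappings actually reached the running node container is a check like this (sketch only; the container name atomic-openshift-node is an assumption):

  # list the bind mounts docker actually gave the node container
  docker inspect -f '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{println}}{{end}}' atomic-openshift-node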

Are we missing anything here? Do we need to use some other method of exposing the CNI bin and net.d directories?

rparulek commented 7 years ago

@sdodson To achieve the mounts we need on Atomic nodes so that the Nuage CNI plugin works, is there a way to pass the additional Nuage-specific docker mounts through the "$DOCKER_ADDTL_BIND_MOUNTS" parameter via openshift-ansible, as used in https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_node/templates/openshift.docker.node.service#L24? Would that be possible? (A sketch of what this would amount to follows.)
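
For illustration, the kind of unit-level override this question is aiming at might look like the hypothetical drop-in below (the file name is made up, and whether the node service would actually honour a value set this way is exactly the open question):

  # hypothetical drop-in: /etc/systemd/system/atomic-openshift-node.service.d/nuage-mounts.conf
  [Service]
  Environment="DOCKER_ADDTL_BIND_MOUNTS=-v /var/usr/share/vsp-openshift:/var/usr/share/vsp-openshift -v /etc/default:/etc/default -v /var/run:/var/run -v /opt/cni/bin:/opt/cni/bin -v /etc/cni/net.d:/etc/cni/net.d"

A systemctl daemon-reload and a restart of atomic-openshift-node would then be needed for it to take effect.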

FYI, we are currently using an older openshift-ansible tag (openshift-ansible-3.6.128-1) and adding the above-mentioned Nuage-specific mounts to "/etc/systemd/system/docker.service.wants/atomic-openshift-node.service" on all Atomic hosts. I assume this file corresponds to https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_node/templates/openshift.docker.node.service on today's openshift-ansible master branch, right?

rparulek commented 7 years ago

@sdodson @rushabh268 I have created an upstream PR, https://github.com/openshift/openshift-ansible/pull/5220, to handle adding the custom Nuage docker mounts to the atomic-openshift-node service at Nuage installation time.

rparulek commented 7 years ago

@sdodson The PR I created above does not seem to fix the issue of adding our custom Nuage docker mounts so that the CNI plugin becomes functional, because another OpenShift dep service file sets the $DOCKER_ADDTL_BIND_MOUNTS environment variable, here: https://github.com/openshift/openshift-ansible/blob/4338dce09dbe5497f2a3700992eb4c5afeb4e6f6/roles/openshift_node/templates/openshift.docker.node.dep.service#L9.

Is there a way you would suggest handling this in openshift-ansible so that these extra mounts can be added to atomic-openshift-node.service? Any pointers would be greatly appreciated!

Many Thanks!

rjhowe commented 6 years ago

Is this fixed now?

https://github.com/openshift/openshift-ansible/blob/release-3.7/roles/openshift_node/templates/openshift.docker.node.service#L38 https://github.com/openshift/openshift-ansible/blob/release-3.7/roles/nuage_node/vars/main.yaml#L27

Fixed via this commit?

https://github.com/openshift/openshift-ansible/commit/a468322fda65e49e0ef337d482945b6c5dd40270#diff-874c44ea72270336f5cdab25af95a275

openshift-bot commented 4 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 4 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot commented 4 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci-robot commented 4 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/openshift-ansible/issues/5204#issuecomment-661684050):

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting `/reopen`. Mark the issue as fresh by commenting `/remove-lifecycle rotten`. Exclude this issue from closing again by commenting `/lifecycle frozen`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.