nokia / danm

TelCo grade network management in a Kubernetes cluster
BSD 3-Clause "New" or "Revised" License

Pods get IP from flannel instead of Danm #118

Closed Fillamug closed 5 years ago

Fillamug commented 5 years ago

Is this a BUG REPORT or FEATURE REQUEST?: bug

What happened: I tried to set up the svcwatcher demo, with small modifications to the DanmNet YAML files to fit my environment. The DanmNets were created successfully and set up the host interfaces as shown in the demo video:

67: external.300@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 08:00:27:c5:bc:64 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::a00:27ff:fec5:bc64/64 scope link 
       valid_lft forever preferred_lft forever
68: vx_internal: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default 
    link/ether 06:33:f3:e7:b3:92 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::433:f3ff:fee7:b392/64 scope link 
       valid_lft forever preferred_lft forever

However, when creating the deployments, the pods got their IP addresses from my container networking provider (currently flannel, but I've tried calico and weavenet as well and ran into the same issue) instead of from the CIDRs specified in the DanmNets' YAML files:

NAMESPACE         NAME                                  READY   STATUS    RESTARTS   AGE    IP              NODE         NOMINATED NODE   READINESS GATES
example-vnf       internal-processor-5b7854d89f-958sg   1/1     Running   7          124m   10.244.1.40     node-1       <none>           <none>
example-vnf       internal-processor-5b7854d89f-fh5dn   1/1     Running   7          124m   10.244.1.43     node-1       <none>           <none>
example-vnf       internal-processor-5b7854d89f-n4zbk   1/1     Running   7          124m   10.244.2.42     node-2       <none>           <none>
example-vnf       internal-processor-5b7854d89f-nqd5j   1/1     Running   7          124m   10.244.1.42     node-1       <none>           <none>
example-vnf       internal-processor-5b7854d89f-w5n5f   1/1     Running   7          124m   10.244.2.40     node-2       <none>           <none>
example-vnf       internal-processor-5b7854d89f-wm4ps   1/1     Running   7          124m   10.244.2.41     node-2       <none>           <none>
example-vnf       loadbalancer-5c4fcf5cd8-d8v2l         1/1     Running   7          124m   10.244.1.41     node-1       <none>           <none>
example-vnf       loadbalancer-5c4fcf5cd8-rbbnw         1/1     Running   7          124m   10.244.2.39     node-2       <none>           <none>
external-client   external-client-db5c8847f-gm4fl       1/1     Running   7          124m   10.244.2.38     node-2       <none>           <none>

I assume this also causes the following problem: when I exec into one of the load-balancer pods, for example, the interfaces are not set up the way they are shown in the example video:

/ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
3: eth0@if135: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1450 qdisc noqueue 
    link/ether fa:91:12:98:e7:7d brd ff:ff:ff:ff:ff:ff
    inet 10.244.1.41/24 scope global eth0
       valid_lft forever preferred_lft forever

Nor do the services have their correct endpoints:

Name:              vnf-internal-processor
Namespace:         example-vnf
Labels:            <none>
Annotations:       danm.k8s.io/network: internal
                   danm.k8s.io/selector: {"app":"internal-processor"}
Selector:          <none>
Type:              ClusterIP
IP:                None
Port:              zeromq  5555/TCP
TargetPort:        5555/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>
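
(For reference, the demo Service behind this output looks roughly like the following: a headless, selector-less Service whose endpoints the svcwatcher manages based on the DANM annotations. Reconstructed here from the describe output above, so treat it as a sketch rather than the exact demo manifest:)

apiVersion: v1
kind: Service
metadata:
  name: vnf-internal-processor
  namespace: example-vnf
  annotations:
    danm.k8s.io/network: internal
    danm.k8s.io/selector: '{"app":"internal-processor"}'
spec:
  clusterIP: None
  ports:
  - name: zeromq
    port: 5555
    targetPort: 5555
    protocol: TCP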

What you expected to happen: I expected the same outcome as shown in the example video, since I believe I followed all the steps correctly, but it seems I probably did not.

How to reproduce it: I have set up a Kubernetes cluster in Vagrant with three nodes; one is the master and the other two are workers:

NAME         STATUS   ROLES    AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
k8s-master   Ready    master   22h   v1.15.1   192.168.50.10   <none>        Ubuntu 18.04.2 LTS   4.15.0-51-generic   docker://19.3.0
node-1       Ready    <none>   22h   v1.15.1   192.168.50.11   <none>        Ubuntu 18.04.2 LTS   4.15.0-51-generic   docker://19.3.0
node-2       Ready    <none>   22h   v1.15.1   192.168.50.12   <none>        Ubuntu 18.04.2 LTS   4.15.0-51-generic   docker://19.3.0

I also made the following modifications to the DanmNets' YAML files found in the demo:

external_net.yaml:

apiVersion: danm.k8s.io/v1
kind: DanmNet
metadata:
  name: external
  namespace: external-client
spec:
  NetworkID: external
  NetworkType: ipvlan
  Options:
    host_device: eth0
    container_prefix: eth0
    rt_tables: 150
    vlan: 300
    cidr: 10.100.20.0/24
    allocation_pool:
      start: 10.100.20.50
      end: 10.100.20.60

vnf_external_net.yaml:

apiVersion: danm.k8s.io/v1
kind: DanmNet
metadata:
  name: external
  namespace: example-vnf
spec:
  NetworkID: external
  NetworkType: ipvlan
  Options:
    host_device: eth0
    container_prefix: ext
    rt_tables: 250
    vlan: 300
    cidr: 10.100.20.0/24
    allocation_pool:
      start: 10.100.20.10
      end: 10.100.20.30

vnf_internal_net.yaml:

apiVersion: danm.k8s.io/v1
kind: DanmNet
metadata:
  name: internal
  namespace: example-vnf
spec:
  NetworkID: internal
  NetworkType: ipvlan
  Options:
    host_device: eth1
    container_prefix: int
    rt_tables: 200
    vxlan: 600
    cidr: 10.100.1.0/24
    allocation_pool:
      start: 10.100.1.100
      end: 10.100.1.200

Environment:

Levovar commented 5 years ago

Well, you do actually need the features of the new version, because the issue you describe as a bug is actually how the system was designed to work until https://github.com/nokia/danm/issues/108 was implemented.

If you are using an older version, kindly read the documentation related to that version: https://github.com/nokia/danm/tree/v3.3.0#creating-the-configuration-for-delegated-cni-operations

One final note: even with the referenced feature, you will never be able to integrate Flannel into central IPAM, simply because Flannel is not CNI compliant. The Flannel CNI plugin completely ignores the "ipam" section of its CNI config, and instead uses the CIDR mounted by the Flannel DaemonSet under /var/run/flannel.sock
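
Just to illustrate (this snippet is mine, not taken from your cluster): a typical flannel CNI config looks roughly like the one below, and even if you added a host-local "ipam" block to it, the flannel plugin would ignore it and keep using the per-node subnet computed by flanneld:

{
  "name": "cbr0",
  "cniVersion": "0.3.1",
  "type": "flannel",
  "delegate": {
    "isDefaultGateway": true
  }
}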

Levovar commented 5 years ago

On the other hand I'm happy to assist you in installing DANM from master, if you would detail the exact error :)

Fillamug commented 5 years ago

Ah I see, thank you for your answer. I must've missed the part about the non-CNI standard plugins.

Since it seems I'll have to upgrade to v4.0.0, I'd like to share the problem I had with the webhook (which is why I went back to v3.3.0).

First of all, I encountered a problem when building the Docker image. On line 17 of /integration/docker/webhook/Dockerfile it tries to clone a branch named "webhook", which it does not find, so it stops. I assumed this is because the branch has already been merged, so I edited the Dockerfile by changing the following lines:

&& git clone -b 'webhook' --depth 1 https://github.com/nokia/danm.git $GOPATH/src/github.com/nokia/danm \
&& cd $GOPATH/src/github.com/nokia/danm \

To this:

&& git clone https://github.com/nokia/danm.git $GOPATH/src/github.com/nokia/danm \
&& cd $GOPATH/src/github.com/nokia/danm \
&& git checkout -b 'webhook' 0195b555ea4dc8768528efae593af044132577c9 \

After this modification it seemed to work and the image built successfully.
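
(For completeness, I built and pushed the image roughly like this; the tag is the one I reference in the Deployment below:)

docker build -f integration/docker/webhook/Dockerfile -t fillamug/webhook:latest .
docker push fillamug/webhook:latest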

The next problem came when I was trying to create the DanmNets; the webhook component threw these errors:

Error from server (InternalError): error when creating "external_net.yaml": Internal error occurred: failed calling webhook "danm-netvalidation.nokia.k8s.io": the server could not find the requested resource
Error from server (InternalError): error when creating "vnf_external_net.yaml": Internal error occurred: failed calling webhook "danm-netvalidation.nokia.k8s.io": the server could not find the requested resource
Error from server (InternalError): error when creating "vnf_internal_net.yaml": Internal error occurred: failed calling webhook "danm-netvalidation.nokia.k8s.io": the server could not find the requested resource
Error from server (InternalError): error when creating "vnf_management_net.yaml": Internal error occurred: failed calling webhook "danm-netvalidation.nokia.k8s.io": the server could not find the requested resource

I fail to see what could've caused this problem; if you have an idea, please tell me!

I modified the Deployment in the webhook.yaml file a bit, since I provide the required certificates with a Kubernetes Secret. This is how it looks:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: danm-webhook-deployment
  namespace: kube-system
  labels:
    danm: webhook
spec:
  selector:
    matchLabels:
     danm: webhook
  template:
    metadata:
      annotations:
        # Adapt to your own network environment!
        danm.k8s.io/interfaces: |
          [
            {
              "network":"flannel"
            }
          ]
      name: danm-webhook
      labels:
        danm: webhook
    spec:
      serviceAccountName: danm-webhook
      containers:
        - name: danm-webhook
          image: fillamug/webhook:latest
          command: [ "/usr/local/bin/webhook", "-tls-cert-bundle=/etc/webhook/certs/cert.pem", "-tls-private-key-file=/etc/webhook/certs/key.pem", "bind-port=8443" ]
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: webhook-certs
              mountPath: /etc/webhook/certs
              readOnly: true
     # Configure the directory holding the Webhook's server certificates
      volumes:
        - name: webhook-certs
          secret:
            secretName: danm-webhook-certs
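
(The danm-webhook-certs Secret referenced above was created beforehand, roughly like this; cert.pem and key.pem here stand for the files I generated for the webhook Service's DNS name:)

kubectl create secret generic danm-webhook-certs \
  --from-file=cert.pem=cert.pem \
  --from-file=key.pem=key.pem \
  -n kube-system
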
Levovar commented 5 years ago

Dockerfile: yeah, I copied one of my earlier versions to the repo, but you are right, it definitely needs to be adjusted! I will correct it. It should simply check out latest master, or build from the user's checkout.

Reg error: actually you are getting that error from the K8s API server, not directly from the webhook. I got that kind of error when my webhook configuration was not entirely proper. What does your MutatingWebhookConfiguration look like? Is the danm-webhook-svc Service also created? What happens when you manually contact the webhook? E.g. on my cluster:

[cloudadmin@controller-1 ~]$ curl https://danm-webhook-svc.kube-system.svc.nokia.net:443/netvalidation
curl: (52) Empty reply from server
[cloudadmin@controller-1 ~]$ curl https://danm-webhook-svc.kube-system.svc.nokia.net:443/netvalidation2
404 page not found

Fillamug commented 5 years ago

Oh, sorry, those error messages from the webhook were from before I downgraded to v3.3.0. Now, after upgrading to v4.0.0 again, I got different errors:

Error from server (InternalError): error when creating "danmnets/external_net.yaml": Internal error occurred: failed calling webhook "danm-netvalidation.nokia.k8s.io": Post https://danm-webhook-svc.kube-system.svc:443/netvalidation?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Error from server (InternalError): error when creating "danmnets/vnf_external_net.yaml": Internal error occurred: failed calling webhook "danm-netvalidation.nokia.k8s.io": Post https://danm-webhook-svc.kube-system.svc:443/netvalidation?timeout=30s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Error from server (InternalError): error when creating "danmnets/vnf_internal_net.yaml": Internal error occurred: failed calling webhook "danm-netvalidation.nokia.k8s.io": Post https://danm-webhook-svc.kube-system.svc:443/netvalidation?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Error from server (InternalError): error when creating "danmnets/vnf_management_net.yaml": Internal error occurred: failed calling webhook "danm-netvalidation.nokia.k8s.io": Post https://danm-webhook-svc.kube-system.svc:443/netvalidation?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Which I find strange, since I think I did everything the same as before.

Nonetheless, here is the MutatingWebhookConfiguration you asked for. I didn't change anything in it compared to /integration/manifests/webhook/webhook.yaml, other than filling in the caBundle values:

apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: danm-webhook-config
  namespace: kube-system
webhooks:
  - name: danm-netvalidation.nokia.k8s.io
    clientConfig:
      service:
        name: danm-webhook-svc
        namespace: kube-system
        path: "/netvalidation"
      # Configure your pre-generated certificate matching the details of your environment
      caBundle: <Filled in via a script>
    rules:
      - operations: ["CREATE","UPDATE"]
        apiGroups: ["danm.k8s.io"]
        apiVersions: ["v1"]
        resources: ["danmnets","clusternetworks","tenantnetworks"]
    failurePolicy: Fail
  - name: danm-configvalidation.nokia.k8s.io
    clientConfig:
      service:
        name: danm-webhook-svc
        namespace: kube-system
        path: "/confvalidation"
      # Configure your pre-generated certificate matching the details of your environment
      caBundle: <Filled in via a script>
    rules:
      - operations: ["CREATE","UPDATE"]
        apiGroups: ["danm.k8s.io"]
        apiVersions: ["v1"]
        resources: ["tenantconfigs"]
    failurePolicy: Fail
  - name: danm-netdeletion.nokia.k8s.io
    clientConfig:
      service:
        name: danm-webhook-svc
        namespace: kube-system
        path: "/netdeletion"
      # Configure your pre-generated certificate matching the details of your environment
      caBundle: <Filled in via a script>
    rules:
      - operations: ["DELETE"]
        apiGroups: ["danm.k8s.io"]
        apiVersions: ["v1"]
        resources: ["tenantnetworks"]
    failurePolicy: Fail
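
(The caBundle value is just the base64-encoded CA certificate; my script fills it in roughly along these lines, where ca.pem stands for the CA that signed the webhook's serving certificate:)

CA_BUNDLE=$(base64 -w0 ca.pem)
sed "s|<Filled in via a script>|${CA_BUNDLE}|g" webhook.yaml | kubectl apply -f -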

The danm-webhook-svc Service did get created as well:

NAMESPACE     NAME               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE    SELECTOR
kube-system   danm-webhook-svc   ClusterIP   10.96.199.114   <none>        443/TCP                  128m   danm=webhook

I created it exactly as it is in the /integration/manifests/webhook/webhook.yaml file.

Also, when I tried to manually connect to the webhook I got the following answer:

vagrant@k8s-master:~$ curl http://10.96.199.114:443/netvalidation
curl: (7) Failed to connect to 10.96.199.114 port 443: Connection timed out

Fillamug commented 5 years ago

I figured out why there was a different error this time.

The previous error messages I sent were given when the webhook pod was deployed on the master node:

Error from server (InternalError): error when creating "external_net.yaml": Internal error occurred: failed calling webhook "danm-netvalidation.nokia.k8s.io": the server could not find the requested resource
Error from server (InternalError): error when creating "vnf_external_net.yaml": Internal error occurred: failed calling webhook "danm-netvalidation.nokia.k8s.io": the server could not find the requested resource
Error from server (InternalError): error when creating "vnf_internal_net.yaml": Internal error occurred: failed calling webhook "danm-netvalidation.nokia.k8s.io": the server could not find the requested resource
Error from server (InternalError): error when creating "vnf_management_net.yaml": Internal error occurred: failed calling webhook "danm-netvalidation.nokia.k8s.io": the server could not find the requested resource

In this case curl gives the following warning:

Warning: Binary output can mess up your terminal. Use "--output -" to tell 
Warning: curl to output it to your terminal anyway, or consider "--output 
Warning: <FILE>" to save to a file.

So I tried redirecting curl's output to a file, which then contained this:

^U^C^A^@^B^B

And the latter errors occur when the webhook pod is deployed on a worker node:

Error from server (InternalError): error when creating "danmnets/external_net.yaml": Internal error occurred: failed calling webhook "danm-netvalidation.nokia.k8s.io": Post https://danm-webhook-svc.kube-system.svc:443/netvalidation?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Error from server (InternalError): error when creating "danmnets/vnf_external_net.yaml": Internal error occurred: failed calling webhook "danm-netvalidation.nokia.k8s.io": Post https://danm-webhook-svc.kube-system.svc:443/netvalidation?timeout=30s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Error from server (InternalError): error when creating "danmnets/vnf_internal_net.yaml": Internal error occurred: failed calling webhook "danm-netvalidation.nokia.k8s.io": Post https://danm-webhook-svc.kube-system.svc:443/netvalidation?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Error from server (InternalError): error when creating "danmnets/vnf_management_net.yaml": Internal error occurred: failed calling webhook "danm-netvalidation.nokia.k8s.io": Post https://danm-webhook-svc.kube-system.svc:443/netvalidation?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Here curl gave the following answer, as mentioned before:

curl: (7) Failed to connect to 10.96.199.114 port 443: Connection timed out

So from this I gather that the webhook pod should probably always be deployed on the master node, but that still leaves the warning shown above to deal with; could you help me with that?

Levovar commented 5 years ago

It is not a requirement to deploy the webhook on the master nodes. The requirement is that the K8s API server must be able to reach it through the provided Service.

In both cases it kind of looks like you have some issue with your cluster's network setup. You should be able to reach the webserver from the host through its Service IP, so either there is a connectivity issue, or the webserver itself is not actually running/serving the endpoints, or both.
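
For example, something like this from one of the hosts should at least get an HTTP-level reply if the webserver is running (-k only skips certificate verification; substitute your own danm-webhook-svc ClusterIP):

curl -k https://10.96.199.114:443/netvalidation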

Fillamug commented 5 years ago

Thanks, I will look into it and see what I can do.

Fillamug commented 5 years ago

I did a complete reinstall of the Kubernetes cluster on the VMs, and now when I create the DanmNets the webhook doesn't throw any errors, so I assume it works properly.

Levovar commented 5 years ago

Cool! Anything else I can help you with? Does the feature you need work as expected?

Levovar commented 5 years ago

I got around to testing the feature you were interested in! I needed to make some corrections, see the next PR, but otherwise the concept works quite nicely with a standard CNI plugin, such as bridge for example. Given the following standard bridge CNI config:

[cloudadmin@controller-1 ~]$ sudo cat /etc/cni/net.d/bridge_l3.conf
{
  "name": "mynet",
  "type": "bridge",
  "bridge": "mynet0",
  "isDefaultGateway": true,
  "forceAddress": false,
  "ipMasq": true,
  "hairpinMode": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.10.0.0/16"
  },
  "cniVersion": "0.3.1"
}

And the DANM ClusterNetwork manifest:

[cloudadmin@controller-1 ~]$ kubectl describe cnet bridge | grep -B7 Cidr
  Network ID:    bridge_l3
  Network Type:  bridge
  Options:
    Alloc:  gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAE=
    allocation_pool:
      End:    10.100.50.30
      Start:  10.100.50.10
    Cidr:   10.100.50.0/24
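
(Just for readability, that ClusterNetwork reconstructed roughly as a manifest; field names follow the DanmNet examples earlier in this thread, so take it as a sketch:)

apiVersion: danm.k8s.io/v1
kind: ClusterNetwork
metadata:
  name: bridge
spec:
  NetworkID: bridge_l3
  NetworkType: bridge
  Options:
    cidr: 10.100.50.0/24
    allocation_pool:
      start: 10.100.50.10
      end: 10.100.50.30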

this is what happens when Pods are connected to this network:

[cloudadmin@controller-1 ~]$ kubectl exec test-deployment-848cb89697-zk5mp -n kube-system ip a | grep bridge
5: test_bridge1@if71: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue
    inet 10.10.0.2/16 scope global test_bridge1

[cloudadmin@controller-1 ~]$ kubectl describe cnet bridge | grep Alloc
  Alloc:  gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAE=

[cloudadmin@controller-1 ~]$ kubectl exec test-deployment-85cbb96c6c-qwgwj -n kube-system ip a | grep bridge
5: test_bridge1@if547: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue
    inet 10.100.50.10/24 brd 10.100.50.255 scope global test_bridge1

[cloudadmin@controller-1 ~]$ kubectl describe cnet bridge | grep Alloc
  Alloc:  gCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAE=

Levovar commented 5 years ago

So I consider the issue closed; feel free to follow up if something is still not clear.

Fillamug commented 5 years ago

Thank you very much for your help! Sorry for answering so late, I had other things to attend to over the past couple of days, but I will check out what you did as soon as I'm able.