networkmachinery / networkmachinery-operators

[PoC] Repo that holds in-tree networkmachinery operators (https://www.youtube.com/watch?v=JsJoRkmzoa0)

smokeping tests in failed state (likely installation user error) #2

Open dougbtv opened 5 years ago

dougbtv commented 5 years ago

Hey Adel, I'm giving it a shot -- I do have the networkconnectivity-test-controller running! But first I want to say a huge thank you for the inspiring KubeCon talk! This is really neat, and I'm quite excited about it.

I think I'm kinda close. I probably missed something. Mostly, I'm trying to get the "smokeping" demo working.

Some overall environment info: I used a fresh kube cluster, just a single master that I removed the NoSchedule taint from...
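(For the record, removing the taint was something along these lines -- the taint key here is the stock kubeadm master taint, so adjust if yours differs:)

$ kubectl taint nodes kube-multus-master node-role.kubernetes.io/master:NoSchedule-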

[centos@kube-multus-master kubecon-demo]$ kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.2", GitCommit:"66049e3b21efe110454d67df4fa62b08ea79a19b", GitTreeState:"clean", BuildDate:"2019-05-16T16:23:09Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.2", GitCommit:"66049e3b21efe110454d67df4fa62b08ea79a19b", GitTreeState:"clean", BuildDate:"2019-05-16T16:14:56Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
[centos@kube-multus-master kubecon-demo]$ kubectl get nodes
NAME                 STATUS   ROLES    AGE    VERSION
kube-multus-master   Ready    master   115m   v1.14.2

I installed Helm and then ran helm install with your chart, like so:

$ helm init
$ kubectl create serviceaccount --namespace kube-system tiller
$ kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
$ kubectl patch deploy --namespace kube-system tiller-deploy -p '{"spec":{"template":{"spec":{"serviceAccount":"tiller"}}}}'
$ helm version
Client: &version.Version{SemVer:"v2.14.0", GitCommit:"05811b84a3f93603dd6c2fcfe57944dfa7ab7fd0", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.14.0", GitCommit:"05811b84a3f93603dd6c2fcfe57944dfa7ab7fd0", GitTreeState:"clean"}
# I noticed the controller would fail without some CRDs present during helm install, so I created them first with...
$ find examples/ | grep -i crd | xargs -l1 kubectl apply -f 
# Then install the chart...
$ helm install ./kubernetes/networkconnectivity/

The controller needed to restart itself once to find the certs for the webhook, but then looked healthy in the logs after the restart.
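(For reference, I sanity-checked it with the usual commands -- the restart count shows up in the pod listing:)

$ helm ls
$ kubectl get pods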

Next, I created some pieces from the ./docs/kubecon-demo folder...

kubectl create -f demo-pods.yaml 
kubectl create -f demo-service.yaml
kubectl apply -f networkconnectivity-crd.yaml   # already existed; apply reported "unchanged"
# I copied the original and then changed to default ns
kubectl create -f doug_networkconnectivity_layer3.yaml

Then I described the custom resource...

$ kubectl describe networkconnectivitytest.networkmachinery.io/smokeping
[... snipped ...]
Status:
  Test Status:
    API Version:  networkmachinery.io/v1alpha1
    Ip Endpoints:
      Ip:  8.8.8.8
      Ping Result:
        State:  Failed
    Kind:       PingStatus
    Pod Endpoints:
      Ping Result:
        State:  Failed
      Pod Params:
        Ip:         10.244.0.16
        Name:       demo-pod-1
        Namespace:  default
      Ping Result:
        State:  Failed
      Pod Params:
        Ip:         10.244.0.15
        Name:       demo-pod-2
        Namespace:  default
    Service Endpoints:
      Service Params:
        Ip:         10.105.51.1
        Name:       demo-kubecon
        Namespace:  default
      Service Results:
        Ip:  10.244.0.17
        Ping Result:
          State:  Failed
        Ip:       10.244.0.18
        Ping Result:
          State:  Failed
        Ip:       10.244.0.19
        Ping Result:
          State:  Failed

I looked at the logs; nothing blatant, it just keeps saying it's reconciling the test, which seems... good? (I haven't read any of the code yet, tbqh)

$ kubectl logs -f networkconnectivity-test-controller-7fdd7cd9fc-4srvk
[... snipped ...]
{"level":"info","ts":1558453336.6280477,"logger":"networkconnectivity-test-controller","msg":"Reconciling Network Connectivity Test","Name":"smokeping"}

I tried creating some more pieces to see if I was missing something, so I also ran:

$ kubectl create -f networkmachinery-sflow-cm-example.yaml
$ kubectl create -f networkmachinery-sflow-daemonset-example.yaml

However, it's stuck initializing... So I got the logs from the initContainer...

$ kubectl logs sflow-ovs-installer-kwjrl -c sflow-ovs-installer-init
agent-ip AGENT_IP 10.244.0.1
header-bytes HEADER_BYTES 128
sampling-n SAMPLING_N 64
polling-secs POLLING_SECS 5
bridge-name BRIDGE_NAME k-vswitch0
collector-ip COLLECTOR_IP 10.0.0.10
collector-port COLLECTOR_PORT 6343
k-vswitch0 10.244.0.1 10.0.0.10 6343 128 64 5
Error: exit status 1: ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
Usage:
  sflow-ovs-installer [flags]
  sflow-ovs-installer [command]

Available Commands:
  help        Help about any command
  version     Print the version number of networkmachinery-sflow

Flags:
      --agent-ip string         indicates the interface / ip that the sFlow agent should send traffic from
      --bridge-name string      the name of the OVS bridge to configure
      --collector-ip string     is the sFlow collector IP
      --collector-port string   is the default port number for sFlowTrend (default "6343")
      --header-bytes string     the header bytes (default "128")
  -h, --help                    help for sflow-ovs-installer
      --polling-secs string     frequency of sampling i.e., samples/sec (default "10")
      --sampling-n string       is the type of sampling (default "64")

Use "sflow-ovs-installer [command] --help" for more information about a command.

So I realized... oh, wait, I don't have OvS installed on the host system. I did a basic OvS install and created a bridge with the same name as in your config map -- and the daemonset came up properly; the logs now look like:
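(Roughly what I did -- the openvswitch package came from the CentOS OpenStack repos on this box, so the install step will vary by distro:)

$ sudo yum install -y openvswitch
$ sudo systemctl enable --now openvswitch
$ sudo ovs-vsctl add-br k-vswitch0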

$ kubectl logs sflow-ovs-installer-kwjrl
2019-05-21T16:24:19Z INFO: Starting sFlow-RT 2.3-1375
2019-05-21T16:24:20Z INFO: Version check, running latest
2019-05-21T16:24:20Z INFO: Listening, sFlow port 6343
2019-05-21T16:24:21Z INFO: Listening, HTTP port 8008

My host's ip a output now looks like this (sorry it's big, so I at least trimmed out the veths):

$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovs-system state UP group default qlen 1000
    link/ether 52:54:00:ff:af:2b brd ff:ff:ff:ff:ff:ff
    inet6 fe80::5054:ff:feff:af2b/64 scope link 
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:5c:0b:ca:cd brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:5cff:fe0b:cacd/64 scope link 
       valid_lft forever preferred_lft forever
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default 
    link/ether b6:f0:b3:6b:76:3b brd ff:ff:ff:ff:ff:ff
    inet 10.244.0.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::b4f0:b3ff:fe6b:763b/64 scope link 
       valid_lft forever preferred_lft forever
5: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether c2:fc:f1:e8:35:29 brd ff:ff:ff:ff:ff:ff
    inet 10.244.0.1/24 scope global cni0
       valid_lft forever preferred_lft forever
    inet6 fe80::c0fc:f1ff:fee8:3529/64 scope link 
       valid_lft forever preferred_lft forever
[..trimmed all the veths..]
23: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 9e:8d:95:e9:52:c2 brd ff:ff:ff:ff:ff:ff
24: k-vswitch0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether ae:fe:57:23:4e:43 brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.105/24 scope global k-vswitch0
       valid_lft forever preferred_lft forever
    inet 192.168.122.28/24 brd 192.168.122.255 scope global secondary dynamic k-vswitch0
       valid_lft 2950sec preferred_lft 2950sec
    inet6 fe80::acfe:57ff:fe23:4e43/64 scope link 
       valid_lft forever preferred_lft forever

Any ideas on what I could be missing? I can add the exact steps of how I installed OvS, as well. I'm not an OvS expert by any means, so there's a high likelihood I made a mistake there.

zanetworker commented 5 years ago

@dougbtv

Thank you for attending the talk, and I'm glad you found it useful. For the smokeping demo you just need the CRDs in the cluster; you don't need OVS, just normal network connectivity via any CNI plugin. You also don't need the daemonset unless you want to run and test sFlow.
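A quick way to confirm that part is just to check that the CRDs are registered, e.g.:

$ kubectl get crds | grep networkmachinery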

Also, I can't see from the resource you posted which pod you used as your source. You need to specify the correct source pod name and namespace for it to work.
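Something along these lines (illustrative only -- take the exact field names from the layer-3 example in docs/kubecon-demo; the point is that the source pod name/namespace must match a pod that actually exists in your cluster):

apiVersion: networkmachinery.io/v1alpha1
kind: NetworkConnectivityTest
metadata:
  name: smokeping
spec:
  source:
    name: <source-pod-name>            # a pod that actually exists, e.g. one of your demo pods
    namespace: <source-pod-namespace>  # and its correct namespace
  # ... destinations (IPs, pods, service) as in the original example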

Having the correct source pod should get it to work; if not, let's discuss it tomorrow :). Also, I'm quickly realizing that the project needs more documentation, but I guess that's because I only intended it to be a PoC for starters.

Let me know how it goes.