@caseydavenport This issue is to track the testing & documentation for Calico + kops. 😀
@chrislovecnm Yes, that's correct :) Updated.
@caseydavenport I am assigning this to you.
@caseydavenport you can coordinate with @razic on this. He is dropping in support for weave. Here is the issue https://github.com/kubernetes/kops/issues/777 as well.
@chrislovecnm @razic happy to coordinate.
I'll also introduce @heschlie.
@caseydavenport who do you want this assigned to?
Currently, the following is required when installing Calico on a multi-AZ kops cluster created with cni networking:
1. Modify calico.yaml with the following:
   1.1. Add annotations (this should be fixed with https://github.com/projectcalico/calico/pull/163)
   1.2. Change `latest` tags to actual versions
   1.3. Change `etcd_endpoints` to the list of etcd nodes (`http://etcd-$AZ.internal.$NAME:4001`)
4. SSH to the master:
   4.1. Run `docker run --rm --net=host -e ETCD_ENDPOINTS=http://127.0.0.1:4001 calico/ctl pool add $NETWORK/$SIZE --ipip --nat-outgoing`
   4.2. Apply the modified calico.yaml
I think 4.1 is likely the only part that's somewhat hard to do with just #777, assuming manifests are templates and some variables (in this case the etcd endpoints and some user-defined things like the CIDR for the pool) will be provided to them. One possible way to do it purely within the manifest could be to run it as a Job, make it write some key to etcd, and modify the Calico pod so that it waits until that key exists before starting Calico, but that's a pretty hacky solution.
@Buzer thanks for the detailed steps :)
Sounds like we need to get projectcalico/calico#163 merged and into a release to address 1.1 and 1.2 above.
For 1.3, `etcd_endpoints` will likely need to be templated, or we could set up a Kubernetes service which fronts the etcd cluster with a well-known clusterIP, similar to kubedns?
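To illustrate the Service idea: a selector-less Service with hand-maintained Endpoints could give etcd a fixed clusterIP. A minimal sketch, where the clusterIP, names, and node addresses are all made-up example values rather than anything kops provisions:

```yaml
# Selector-less Service: kube-proxy forwards the clusterIP to the manual
# Endpoints below instead of to label-selected pods.
apiVersion: v1
kind: Service
metadata:
  name: etcd
  namespace: kube-system
spec:
  clusterIP: 100.64.0.15   # assumed "well-known" IP, analogous to kubedns at 100.64.0.10
  ports:
    - port: 4001
      targetPort: 4001
---
# Hand-maintained Endpoints pointing at the etcd nodes' private IPs.
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd
  namespace: kube-system
subsets:
  - addresses:
      - ip: 172.20.32.10   # example etcd node IPs
      - ip: 172.20.64.10
    ports:
      - port: 4001
```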
For 4.1, we do something similar for kubeadm - we tell Calico not to create a pool by default, and use a Job to configure Calico. This seems to work nicely once it has the right annotations to run as a critical pod and to be allowed onto the master.
For 4.2, I suspect we can do this as part of the install in some way so users don't need to SSH in manually?
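To make the Job idea concrete, here is a rough sketch of what it could look like. The image tag, pool CIDR, and annotation values are assumptions pieced together from this thread and era-appropriate scheduler annotations, not the final kops manifest:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: configure-calico
  namespace: kube-system
spec:
  template:
    metadata:
      annotations:
        # Mark the pod critical and tolerate the master taint so it is
        # allowed to run on the tainted master (assumed annotation values).
        scheduler.alpha.kubernetes.io/critical-pod: ''
        scheduler.alpha.kubernetes.io/tolerations: '[{"key": "dedicated", "value": "master"}]'
    spec:
      hostNetwork: true
      restartPolicy: OnFailure
      containers:
        - name: configure-calico
          image: calico/ctl:v0.22.0   # version used elsewhere in this thread
          # Same pool configuration as the manual step in 4.1; substitute
          # your own pool CIDR for $NETWORK/$SIZE.
          args: ["pool", "add", "192.168.0.0/16", "--ipip", "--nat-outgoing"]
          env:
            - name: ETCD_ENDPOINTS
              value: "http://127.0.0.1:4001"
```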
@chrislovecnm could you assign @heschlie? Thanks!
He is not a member of the kubernetes org, so alas I cannot. Will keep it assigned to you.
Ah, right. Fine to keep assigned to me!
SGTM
I'm not too familiar with etcd's (or Kubernetes services') internals, but how well would it deal with various error situations? And does etcd allow writing to any node (e.g. do non-leader nodes internally forward requests that they cannot handle to the current leader, or is it the clients' responsibility)?
The Job approach sounds good. I was considering it initially, but couldn't find a way to disable automatic pool creation with a quick look.
I assume 4.2. will be handled by #777?
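For what it's worth, the knob that makes the Job approach work appears to be calico/node's `NO_DEFAULT_POOLS` environment variable. A hypothetical excerpt of the calico-node container spec (the image tag and ConfigMap wiring are assumptions):

```yaml
# Excerpt only: NO_DEFAULT_POOLS tells calico/node not to create a default
# IP pool on startup, leaving pool creation to an explicit Job.
containers:
  - name: calico-node
    image: calico/node:v1.0.0   # assumed tag
    env:
      - name: ETCD_ENDPOINTS
        valueFrom:
          configMapKeyRef:
            name: calico-config
            key: etcd_endpoints
      - name: NO_DEFAULT_POOLS
        value: "true"
```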
I've been trying to get this working and found that `latest` (master) referenced in the k8s hosted calico.yaml doesn't seem to work for me (it can't route to the internet), but other versions do work, such as `v1.0.0-beta-4-gfd4cf3c`.
Also, with the v1 changes, updating the pool is slightly different:
```
sudo cat << EOF | calicoctl replace -f -
- apiVersion: v1
  kind: ipPool
  metadata:
    cidr: 192.168.0.0/16
  spec:
    ipip:
      enabled: true
    nat-outgoing: true
EOF
```
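If it's useful for verification, `calicoctl get ipPool -o yaml` on the v1.x client should then show the replaced pool (assuming the same `ETCD_ENDPOINTS` is set).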
@caseydavenport I need a tested method. Can you reach out to me?
Another test failed.
```
1m 1m 1 {default-scheduler } Normal Scheduled Successfully assigned dns-controller-844861676-aqetd to ip-172-20-157-53.us-west-2.compute.internal
1m 1s 86 {kubelet ip-172-20-157-53.us-west-2.compute.internal} Warning FailedSync Error syncing pod, skipping: failed to "SetupNetwork" for "dns-controller-844861676-aqetd_kube-system" with SetupNetworkError: "Failed to setup network for pod \"dns-controller-844861676-aqetd_kube-system(2fdf1feb-a910-11e6-9022-029c1f6e6435)\" using network plugins \"cni\": nodes \"ip-172-20-157-53\" not found; Skipping pod"
```
Created on a full private VPC cluster, using kops HEAD and the required nodeup.
https://raw.githubusercontent.com/projectcalico/calico/master/master/getting-started/kubernetes/installation/hosted/k8s-backend/calico.yaml
Install command:
@chrislovecnm will reach out on Monday. That looks like the wrong manifest.
@heschlie has been out sick.
First, I'm currently using the following kops, if it makes a difference:

```
$ kops version
Version git-18879f7
```
@chrislovecnm Here is the manifest I have been trying to get deployed; @caseydavenport might want to review it to make sure it is sane:
https://gist.github.com/heschlie/4c0a137d1a6e9c3dec6d651866e52b26
The Job in the above manifest never gets to run, but it is necessary, so I run the container manually (shown below) to get calicoctl to set up the networking.
I am deploying the cluster with the following command:
```
kops create cluster --zones us-west-2c $NAME --networking cni --master-size m4.large
```
I am using the m4.large at the suggestion of #728
Once the cluster master is online I need to SSH to it and do a couple of things:

```
scp calico.yaml admin@$MASTER_IP:/home/admin/
ssh admin@$MASTER_IP   # to get the internal IP
sudo docker run --rm --net=host -e ETCD_ENDPOINTS=http://127.0.0.1:4001 calico/ctl:v0.22.0 pool add 172.20.96.0/19 --ipip --nat-outgoing
kubectl apply -f calico.yaml
```
I have not been able to get a working deployment. My main issue when running kops with `--networking cni` is that it doesn't seem to create the necessary DNS entries in Route53, and thus the nodes cannot connect to it.
I've tried adding the DNS entries by hand, and it gets the process further along, but docker seems to be having trouble pulling all of the images, and the master is struggling to create the kubedns pods. I'm seeing this when I describe those pods:

```
container "kubedns" is unhealthy, it will be killed and re-created
```
Even after getting the DNS entries (api.$NAME, api.internal.$NAME) into Route53, the kube-dns-v20 pods were still not coming online. kubedns seems to be trying to hit the API at 100.64.0.1:443 but is not able to reach it:
```
$ kubectl logs -n kube-system kube-dns-v20-3531996453-hkrty kubedns
I1113 18:08:34.945961 1 server.go:94] Using https://100.64.0.1:443 for kubernetes master, kubernetes API: <nil>
I1113 18:08:34.946567 1 server.go:99] v1.5.0-alpha.0.1651+7dcae5edd84f06-dirty
I1113 18:08:34.946588 1 server.go:101] FLAG: --alsologtostderr="false"
I1113 18:08:34.946643 1 server.go:101] FLAG: --dns-port="10053"
I1113 18:08:34.946650 1 server.go:101] FLAG: --domain="cluster.local."
I1113 18:08:34.946667 1 server.go:101] FLAG: --federations=""
I1113 18:08:34.946673 1 server.go:101] FLAG: --healthz-port="8081"
I1113 18:08:34.946676 1 server.go:101] FLAG: --kube-master-url=""
I1113 18:08:34.946680 1 server.go:101] FLAG: --kubecfg-file=""
I1113 18:08:34.946744 1 server.go:101] FLAG: --log-backtrace-at=":0"
I1113 18:08:34.946752 1 server.go:101] FLAG: --log-dir=""
I1113 18:08:34.946771 1 server.go:101] FLAG: --log-flush-frequency="5s"
I1113 18:08:34.946811 1 server.go:101] FLAG: --logtostderr="true"
I1113 18:08:34.946823 1 server.go:101] FLAG: --stderrthreshold="2"
I1113 18:08:34.946827 1 server.go:101] FLAG: --v="0"
I1113 18:08:34.946832 1 server.go:101] FLAG: --version="false"
I1113 18:08:34.946848 1 server.go:101] FLAG: --vmodule=""
I1113 18:08:34.946928 1 server.go:138] Starting SkyDNS server. Listening on port:10053
I1113 18:08:34.946977 1 server.go:145] skydns: metrics enabled on : /metrics:
I1113 18:08:34.946993 1 dns.go:166] Waiting for service: default/kubernetes
I1113 18:08:34.949742 1 logs.go:41] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
I1113 18:08:34.949813 1 logs.go:41] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
I1113 18:09:04.948943 1 dns.go:172] Ignoring error while waiting for service default/kubernetes: Get https://100.64.0.1:443/api/v1/namespaces/default/services/kubernetes: dial tcp 100.64.0.1:443: i/o timeout. Sleeping 1s before retrying.
E1113 18:09:04.949791 1 reflector.go:214] pkg/dns/dns.go:155: Failed to list *api.Service: Get https://100.64.0.1:443/api/v1/services?resourceVersion=0: dial tcp 100.64.0.1:443: i/o timeout
E1113 18:09:04.949862 1 reflector.go:214] pkg/dns/dns.go:154: Failed to list *api.Endpoints: Get https://100.64.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 100.64.0.1:443: i/o timeout
```
I'm still learning the intricacies of Kubernetes, but this doesn't look right to me; maybe someone can chime in with some info so I can understand what I might be doing wrong, or what I need to do to get this operational.
The kube-dns pods seem to be the last remaining issue at the moment, but I can't establish whether Calico is working until the kube-dns pods are up, so there could still be more issues.
The DNS entries are created by the dns-controller, and it should start after the CNI is configured (e.g. after Calico starts). Judging from your kubedns logs, you likely have something similar in the dns-controller logs (or errors about accessing Route53). You can also try to exec into the dns-controller, if it has started, to see whether you have network connectivity there or not.
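For example, something like this gets a shell in the dns-controller pod to test connectivity. The `k8s-app=dns-controller` label is an assumption about the kops manifest, and the image needs to ship a shell for this to work:

```bash
# Exec into the first dns-controller pod in kube-system.
POD=$(kubectl --namespace=kube-system get pods -l k8s-app=dns-controller \
      -o name | head -n1 | cut -d/ -f2)
kubectl --namespace=kube-system exec -it "$POD" -- /bin/sh
```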
A few things you might want to check (issues I have run into):

- `kubeconfig` points to an existing file
- the NAT rules exist (`iptables-save | grep felix-masq-ipam-pools`)

Does Calico not allow for a full manifest install? I have to run another docker container? How is that managed by k8s?
Calico can be installed entirely through a k8s manifest - it's basically just a `DaemonSet` and a `ReplicaSet`. An example of that can be found here. It's also possible to use `Job`s to provide arbitrary configuration options to Calico, and a `Secret` for any certificates you might want to provide (e.g. for etcd).
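As a sketch of the `Secret` part (the name and data keys here are illustrative, not from an official manifest):

```yaml
# Hypothetical Secret holding etcd TLS material for Calico; the keys would
# be mounted as files into calico-node and the policy controller.
apiVersion: v1
kind: Secret
metadata:
  name: calico-etcd-secrets
  namespace: kube-system
type: Opaque
data:
  etcd-ca: ""     # base64-encoded CA certificate PEM goes here
  etcd-cert: ""   # base64-encoded client certificate PEM
  etcd-key: ""    # base64-encoded client key PEM
```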
As was hinted at in a few places above, the only things we should need to do:
I'll sync with @heschlie on this early tomorrow and let you know @chrislovecnm.
@Buzer It looks like the dns-controller is starting before I get a chance to deploy the calico.yaml manifest. I've checked that the proper config files exist in `/etc/cni/net.d/` and appear to be set up correctly, that the CNI binaries are in `/opt/cni/bin`, and that the NAT rules are in place (though the Job in the yaml file is still not running, so I still need to run calicoctl via docker run), but the dns-controller never seems to be able to talk to the API.
I also tried restarting docker and the kubelet with no luck, and rebooted the master as well, just in case that would bring it online.
It seems as though the CNI plugin just isn't being used by the containers. I also double-checked that the kubelet was set to use CNI, and it appears to be.
Will sync with @caseydavenport tomorrow AM; just wanted to verify that I couldn't get it up and running with the extra info.
@caseydavenport and I found that the pods necessary to get CNI up and running weren't able to run on the tainted master. We nailed down the proper annotations to add and applied them to the calico-node DS, the configure-calico Job, and the calico-policy-controller RS. After that the cluster came up properly, and policies were being enforced appropriately.
Here is the manifest that works to get the cluster online; the `etcd_endpoints` will need to be changed before deploying it:
https://gist.github.com/heschlie/4c0a137d1a6e9c3dec6d651866e52b26
Here is the process to bring up the cluster:
1. `kops create cluster --zones $ZONES --master-size m4.large --networking cni $CLUSTER_NAME`
2. `ssh admin@$MASTER_IP`
3. `wget https://gist.githubusercontent.com/heschlie/4c0a137d1a6e9c3dec6d651866e52b26/raw/a9645c9f310b51837ff5a2769a66a2b1a3c24342/calico.yaml`
4. Set `etcd_endpoints` to `http://etcd-$ZONE.internal.$NAME:4001`, where $ZONE is one of the zones you picked and $NAME is the cluster name. Repeat for each zone, e.g. `etcd_endpoints: "http://etcd-us-west-2c.internal.k8s.testing.example.com:4001,http://etcd-us-east-2c.internal.k8s.testing.example.com:4001"`
5. `kubectl apply -f calico.yaml`
There are still two lingering problems one can see above:
I know there is an issue about making deploying a CNI provider as simple as `--networking calico`, which could end up resolving both of those problems.
@kris-nova @chrislovecnm I'd like to leave those last two steps in your hands if that is alright.
This is what I've done to get kops working with Calico:
1. `kops create cluster --cloud=aws --master-zones=<master_zone> --zones=<zone_A>,<zone_B> --master-size=t2.large --ssh-public-key=~/.ssh/id_rsa.pub --kubernetes-version=1.4.5 --networking=cni <cluster_name> --yes`
2. `wget https://raw.githubusercontent.com/projectcalico/calico/master/v2.0/getting-started/kubernetes/installation/hosted/calico.yaml`
3. Set `etcd_endpoints` to the master's private IP in calico.yaml
4. Change `cidr` to `100.64.0.0/10` and add `ipip: enabled: true` in calico.yaml
5. Add an `api.internal.<cluster_name>` entry to R53 pointing to the master's private IP (because of https://github.com/projectcalico/calico/pull/163)

Network policies work, but the kube-dns service doesn't seem to. It defaults to `100.64.0.10` in kops, but it doesn't answer DNS traffic from pods.
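For reference, in the v1 resource format shown earlier in this thread, the pool those edits describe would look roughly like this (illustrative only; the hosted manifest may express it differently):

```yaml
# ipPool matching the cidr/ipip edits described above.
- apiVersion: v1
  kind: ipPool
  metadata:
    cidr: 100.64.0.0/10
  spec:
    ipip:
      enabled: true
    nat-outgoing: true
```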
UPDATE: Something changed between when I originally tested this workflow and yesterday, but now the kube-dns service is reachable for me.
Supposedly IPIP is not required on AWS and is bad for performance; do this for all instances:

```
aws ec2 modify-instance-attribute --instance-id $INSTANCE_ID --source-dest-check "{\"Value\": false}"
```
See section 3 on: http://docs.projectcalico.org/v1.5/getting-started/kubernetes/installation/aws
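If it helps, here is a sketch of applying that to every instance in a cluster; the `KubernetesCluster` tag filter is an assumption about how kops tags its instances:

```bash
# Disable source/dest checking on each cluster instance so Calico can
# route pod traffic natively (within a single AZ, per the reply below).
CLUSTER_NAME=k8s.testing.example.com   # example value
for id in $(aws ec2 describe-instances \
    --filters "Name=tag:KubernetesCluster,Values=${CLUSTER_NAME}" \
    --query 'Reservations[].Instances[].InstanceId' --output text); do
  aws ec2 modify-instance-attribute --instance-id "$id" --source-dest-check '{"Value": false}'
done
```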
@jayv IPIP is required for cross-AZ communication, that example is in a single AZ.
Ah bummer, why can't we have nice things :(
@stonith @jayv yeah, it is a bummer.
Ideally we'd use IPIP only across AZ boundaries. See this issue: https://github.com/projectcalico/calico-containers/issues/1310
Closing as the PR is completed
CNI support has been added, hooray! https://github.com/kubernetes/kops/pull/621/files
With the above merged, it should be easy to add Calico.
This issue is to track the testing / documentation for Calico + kops.