@caseydavenport This issue is to track the testing & documentation for Calico + kops. 😀
@chrislovecnm Yes, that's correct :) Updated.
@caseydavenport I am assigning this to you.
@caseydavenport you can coordinate with @razic on this. He is dropping in support for weave. Here is the issue https://github.com/kubernetes/kops/issues/777 as well.
@chrislovecnm @razic happy to coordinate.
I'll also introduce @heschlie.
@caseydavenport who do you want this assigned to?
Currently, the following is required when installing Calico on a multi-AZ kops cluster created with cni networking:
1. Modify calico.yaml with the following:
   1.1. Add annotations (this should be fixed with https://github.com/projectcalico/calico/pull/163)
   1.2. Change `latest` tags to actual versions
   1.3. Change `etcd_endpoints` to the list of etcd nodes (`http://etcd-$AZ.internal.$NAME:4001`)
4. SSH to the master:
   4.1. Run `docker run --rm --net=host -e ETCD_ENDPOINTS=http://127.0.0.1:4001 calico/ctl pool add $NETWORK/$SIZE --ipip --nat-outgoing`
   4.2. Apply the modified calico.yaml
I think 4.1 is likely the only part that's somewhat hard to do with just #777, assuming manifests are templates and some variables (in this case the etcd endpoints and some user-defined things like the CIDR for the pool) will be provided to them. One possible way to do it purely within the manifest could be to run it as a Job, make it write some key to etcd, and modify the Calico pod so that it waits until that key exists before starting Calico, but that's a pretty hacky solution.
@Buzer thanks for the detailed steps :)
Sounds like we need to get projectcalico/calico#163 merged and into a release to address 1.1 and 1.2 above.
For 1.3, `etcd_endpoints` will likely need to be templated, or we could set up a Kubernetes service which fronts the etcd cluster with a well-known clusterIP, similar to kubedns?
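To illustrate the Service idea: a selector-less Service with hand-maintained Endpoints could give etcd a fixed clusterIP. A minimal sketch, where the clusterIP, names, and node addresses are all made-up example values rather than anything kops provisions:

```yaml
# Selector-less Service: kube-proxy forwards the clusterIP to the manual
# Endpoints below instead of to label-selected pods.
apiVersion: v1
kind: Service
metadata:
  name: etcd
  namespace: kube-system
spec:
  clusterIP: 100.64.0.15   # assumed "well-known" IP, analogous to kubedns at 100.64.0.10
  ports:
    - port: 4001
      targetPort: 4001
---
# Hand-maintained Endpoints pointing at the etcd nodes' private IPs.
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd
  namespace: kube-system
subsets:
  - addresses:
      - ip: 172.20.32.10   # example etcd node IPs
      - ip: 172.20.64.10
    ports:
      - port: 4001
```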
For 4.1, we do something similar for kubeadm - we tell Calico not to create a pool by default, and use a Job to configure Calico. This seems to work nicely once it has the right annotations to run as a critical pod and to be allowed onto the master.
For 4.2, I suspect we can do this as part of the install in some way so users don't need to SSH in manually?
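To make the Job idea concrete, here is a rough sketch of what it could look like. The image tag, pool CIDR, and annotation values are assumptions pieced together from this thread and era-appropriate scheduler annotations, not the final kops manifest:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: configure-calico
  namespace: kube-system
spec:
  template:
    metadata:
      annotations:
        # Mark the pod critical and tolerate the master taint so it is
        # allowed to run on the tainted master (assumed annotation values).
        scheduler.alpha.kubernetes.io/critical-pod: ''
        scheduler.alpha.kubernetes.io/tolerations: '[{"key": "dedicated", "value": "master"}]'
    spec:
      hostNetwork: true
      restartPolicy: OnFailure
      containers:
        - name: configure-calico
          image: calico/ctl:v0.22.0   # version used elsewhere in this thread
          # Same pool configuration as the manual step in 4.1; substitute
          # your own pool CIDR for $NETWORK/$SIZE.
          args: ["pool", "add", "192.168.0.0/16", "--ipip", "--nat-outgoing"]
          env:
            - name: ETCD_ENDPOINTS
              value: "http://127.0.0.1:4001"
```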
@chrislovecnm could you assign @heschlie? Thanks!
He is not a member of the kubernetes org, so alas I cannot. Will keep it assigned to you.
Ah, right. Fine to keep assigned to me!
SGTM
I'm not too familiar with etcd's (or Kubernetes services') internals, but how well would it deal with various error situations? And does etcd allow writing to any node (e.g. do non-leader nodes internally forward requests that they cannot handle to the current leader, or is it the clients' responsibility)?
The Job approach sounds good. I was considering it initially, but couldn't find a way to disable automatic pool creation with a quick look.
I assume 4.2. will be handled by #777?
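For what it's worth, the knob that makes the Job approach work appears to be calico/node's `NO_DEFAULT_POOLS` environment variable. A hypothetical excerpt of the calico-node container spec (the image tag and ConfigMap wiring are assumptions):

```yaml
# Excerpt only: NO_DEFAULT_POOLS tells calico/node not to create a default
# IP pool on startup, leaving pool creation to an explicit Job.
containers:
  - name: calico-node
    image: calico/node:v1.0.0   # assumed tag
    env:
      - name: ETCD_ENDPOINTS
        valueFrom:
          configMapKeyRef:
            name: calico-config
            key: etcd_endpoints
      - name: NO_DEFAULT_POOLS
        value: "true"
```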
I've been trying to get this working and found that `latest` (master) referenced in the k8s hosted calico.yaml doesn't seem to work for me (it can't route to the internet), but other versions do work, such as `v1.0.0-beta-4-gfd4cf3c`.
Also, with the v1 changes, updating the pool is slightly different:
```
sudo cat << EOF | calicoctl replace -f -
- apiVersion: v1
  kind: ipPool
  metadata:
    cidr: 192.168.0.0/16
  spec:
    ipip:
      enabled: true
    nat-outgoing: true
EOF
```
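If it's useful for verification, `calicoctl get ipPool -o yaml` on the v1.x client should then show the replaced pool (assuming the same `ETCD_ENDPOINTS` is set).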
@caseydavenport I need a tested method. Can you reach out to me?
Another test failed.
```
1m 1m 1 {default-scheduler } Normal Scheduled Successfully assigned dns-controller-844861676-aqetd to ip-172-20-157-53.us-west-2.compute.internal
1m 1s 86 {kubelet ip-172-20-157-53.us-west-2.compute.internal} Warning FailedSync Error syncing pod, skipping: failed to "SetupNetwork" for "dns-controller-844861676-aqetd_kube-system" with SetupNetworkError: "Failed to setup network for pod \"dns-controller-844861676-aqetd_kube-system(2fdf1feb-a910-11e6-9022-029c1f6e6435)\" using network plugins \"cni\": nodes \"ip-172-20-157-53\" not found; Skipping pod"
```
Created on a full private VPC cluster, using kops HEAD and the required nodeup.
https://raw.githubusercontent.com/projectcalico/calico/master/master/getting-started/kubernetes/installation/hosted/k8s-backend/calico.yaml
Install command:
@chrislovecnm will reach out on Monday. That looks like the wrong manifest.
@heschlie has been out sick.
First, I'm currently using the following kops, if it makes a difference:

```
$ kops version
Version git-18879f7
```
@chrislovecnm Here is the manifest I have been trying to get deployed; @caseydavenport might want to review it to make sure it is sane:
https://gist.github.com/heschlie/4c0a137d1a6e9c3dec6d651866e52b26
The Job in the above manifest never gets to run, but it is necessary, so I run the container manually (shown below) to get calicoctl to set up the networking.
I am deploying the cluster with the following command:
```
kops create cluster --zones us-west-2c $NAME --networking cni --master-size m4.large
```
I am using the m4.large at the suggestion of #728
Once the cluster master is online I need to SSH to it and do a couple of things:

```
scp calico.yaml admin@$MASTER_IP:/home/admin/
ssh admin@$MASTER_IP   # to get the internal IP
sudo docker run --rm --net=host -e ETCD_ENDPOINTS=http://127.0.0.1:4001 calico/ctl:v0.22.0 pool add 172.20.96.0/19 --ipip --nat-outgoing
kubectl apply -f calico.yaml
```
I have not been able to get a working deployment. My main issue when running kops with `--networking cni` is that it doesn't seem to create the necessary DNS entries in Route53, and thus the nodes cannot connect to it.
I've tried adding the DNS entries by hand, and it gets the process further along, but docker seems to be having trouble pulling all of the images, and the master is struggling to create the kubedns pods. I'm seeing this when I describe those pods:

```
container "kubedns" is unhealthy, it will be killed and re-created
```
Even after getting the DNS entries (api.$NAME, api.internal.$NAME) into Route53, the kube-dns-v20 pods were still not coming online. kubedns seems to be trying to hit the API at 100.64.0.1:443 but is not able to reach it:
```
$ kubectl logs -n kube-system kube-dns-v20-3531996453-hkrty kubedns
I1113 18:08:34.945961 1 server.go:94] Using https://100.64.0.1:443 for kubernetes master, kubernetes API: <nil>
I1113 18:08:34.946567 1 server.go:99] v1.5.0-alpha.0.1651+7dcae5edd84f06-dirty
I1113 18:08:34.946588 1 server.go:101] FLAG: --alsologtostderr="false"
I1113 18:08:34.946643 1 server.go:101] FLAG: --dns-port="10053"
I1113 18:08:34.946650 1 server.go:101] FLAG: --domain="cluster.local."
I1113 18:08:34.946667 1 server.go:101] FLAG: --federations=""
I1113 18:08:34.946673 1 server.go:101] FLAG: --healthz-port="8081"
I1113 18:08:34.946676 1 server.go:101] FLAG: --kube-master-url=""
I1113 18:08:34.946680 1 server.go:101] FLAG: --kubecfg-file=""
I1113 18:08:34.946744 1 server.go:101] FLAG: --log-backtrace-at=":0"
I1113 18:08:34.946752 1 server.go:101] FLAG: --log-dir=""
I1113 18:08:34.946771 1 server.go:101] FLAG: --log-flush-frequency="5s"
I1113 18:08:34.946811 1 server.go:101] FLAG: --logtostderr="true"
I1113 18:08:34.946823 1 server.go:101] FLAG: --stderrthreshold="2"
I1113 18:08:34.946827 1 server.go:101] FLAG: --v="0"
I1113 18:08:34.946832 1 server.go:101] FLAG: --version="false"
I1113 18:08:34.946848 1 server.go:101] FLAG: --vmodule=""
I1113 18:08:34.946928 1 server.go:138] Starting SkyDNS server. Listening on port:10053
I1113 18:08:34.946977 1 server.go:145] skydns: metrics enabled on : /metrics:
I1113 18:08:34.946993 1 dns.go:166] Waiting for service: default/kubernetes
I1113 18:08:34.949742 1 logs.go:41] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
I1113 18:08:34.949813 1 logs.go:41] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
I1113 18:09:04.948943 1 dns.go:172] Ignoring error while waiting for service default/kubernetes: Get https://100.64.0.1:443/api/v1/namespaces/default/services/kubernetes: dial tcp 100.64.0.1:443: i/o timeout. Sleeping 1s before retrying.
E1113 18:09:04.949791 1 reflector.go:214] pkg/dns/dns.go:155: Failed to list *api.Service: Get https://100.64.0.1:443/api/v1/services?resourceVersion=0: dial tcp 100.64.0.1:443: i/o timeout
E1113 18:09:04.949862 1 reflector.go:214] pkg/dns/dns.go:154: Failed to list *api.Endpoints: Get https://100.64.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 100.64.0.1:443: i/o timeout
```
I'm still learning the intricacies of Kubernetes, but this doesn't look right to me; maybe someone can chime in with some info so I can understand what I might be doing wrong, or what I need to do to get this operational.
The kube-dns pods seem to be the last remaining issue at the moment, but I can't establish whether Calico is working until the kube-dns pods are up, so there could still be more issues.
The DNS entries are created by the dns-controller, and it should start after the CNI is configured (e.g. after Calico starts). Judging from your kubedns logs, you likely have something similar in the dns-controller logs (or errors about accessing Route53). You can also try to exec into the dns-controller, if it has started, to see whether you have network connectivity there or not.
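For example, something like this gets a shell in the dns-controller pod to test connectivity. The `k8s-app=dns-controller` label is an assumption about the kops manifest, and the image needs to ship a shell for this to work:

```bash
# Exec into the first dns-controller pod in kube-system.
POD=$(kubectl --namespace=kube-system get pods -l k8s-app=dns-controller \
      -o name | head -n1 | cut -d/ -f2)
kubectl --namespace=kube-system exec -it "$POD" -- /bin/sh
```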
A few things you might want to check (issues I have run into):

- `kubeconfig` points to an existing file
- the NAT rules exist (`iptables-save | grep felix-masq-ipam-pools`)

Does Calico not allow for a full manifest install? I have to run another docker container? How is that managed by k8s?
Calico can be installed entirely through a k8s manifest - it's basically just a `DaemonSet` and a `ReplicaSet`. An example of that can be found here. It's also possible to use `Job`s to provide arbitrary configuration options to Calico, and a `Secret` for any certificates you might want to provide (e.g. for etcd).
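As a sketch of the `Secret` part (the name and data keys here are illustrative, not from an official manifest):

```yaml
# Hypothetical Secret holding etcd TLS material for Calico; the keys would
# be mounted as files into calico-node and the policy controller.
apiVersion: v1
kind: Secret
metadata:
  name: calico-etcd-secrets
  namespace: kube-system
type: Opaque
data:
  etcd-ca: ""     # base64-encoded CA certificate PEM goes here
  etcd-cert: ""   # base64-encoded client certificate PEM
  etcd-key: ""    # base64-encoded client key PEM
```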
As was hinted at in a few places above, the only things we should need to do:
I'll sync with @heschlie on this early tomorrow and let you know @chrislovecnm.
@Buzer It looks like the dns-controller is starting before I get a chance to deploy the calico.yaml manifest. I've checked that the proper config files exist in `/etc/cni/net.d/` and appear to be set up correctly, that the CNI binaries are in `/opt/cni/bin`, and that the NAT rules are in place (though the Job in the yaml file is still not running, so I still need to run calicoctl via docker run), but the dns-controller never seems to be able to talk to the API.
I also tried restarting docker and the kubelet with no luck, and rebooted the master as well, just in case that would bring it online.
It seems as though the CNI plugin just isn't being used by the containers. I also double-checked that the kubelet was set to use CNI, and it appears to be.
Will sync with @caseydavenport tomorrow AM; just wanted to verify that I couldn't get it up and running with the extra info.
@caseydavenport and I found that the pods necessary to get CNI up and running weren't able to run on the tainted master. We nailed down the proper annotations to add and applied them to the calico-node DS, the configure-calico Job, and the calico-policy-controller RS. After that the cluster came up properly, and policies were being enforced appropriately.
Here is the manifest that works to get the cluster online; the `etcd_endpoints` will need to be changed before deploying it:
https://gist.github.com/heschlie/4c0a137d1a6e9c3dec6d651866e52b26
Here is the process to bring up the cluster:
1. `kops create cluster --zones $ZONES --master-size m4.large --networking cni $CLUSTER_NAME`
2. `ssh admin@$MASTER_IP`
3. `wget https://gist.githubusercontent.com/heschlie/4c0a137d1a6e9c3dec6d651866e52b26/raw/a9645c9f310b51837ff5a2769a66a2b1a3c24342/calico.yaml`
4. Set `etcd_endpoints` to `http://etcd-$ZONE.internal.$NAME:4001`, where $ZONE is one of the zones you picked and $NAME is the cluster name. Repeat for each zone, e.g. `etcd_endpoints: "http://etcd-us-west-2c.internal.k8s.testing.example.com:4001,http://etcd-us-east-2c.internal.k8s.testing.example.com:4001"`
5. `kubectl apply -f calico.yaml`
There are still two lingering problems one can see above:
I know there is an issue about making deploying a CNI provider as simple as `--networking calico`, which could end up resolving both of those problems.
@kris-nova @chrislovecnm I'd like to leave those last two steps in your hands if that is alright.
This is what I've done to get kops working with Calico:
1. `kops create cluster --cloud=aws --master-zones=<master_zone> --zones=<zone_A>,<zone_B> --master-size=t2.large --ssh-public-key=~/.ssh/id_rsa.pub --kubernetes-version=1.4.5 --networking=cni <cluster_name> --yes`
2. `wget https://raw.githubusercontent.com/projectcalico/calico/master/v2.0/getting-started/kubernetes/installation/hosted/calico.yaml`
3. Set `etcd_endpoints` to the master's private IP in calico.yaml
4. Change `cidr` to `100.64.0.0/10` and add `ipip: enabled: true` in calico.yaml
5. Add an `api.internal.<cluster_name>` entry to R53 pointing to the master's private IP (because of https://github.com/projectcalico/calico/pull/163)

Network policies work, but the kube-dns service doesn't seem to. It defaults to `100.64.0.10` in kops, but it doesn't answer DNS traffic from pods.
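For reference, in the v1 resource format shown earlier in this thread, the pool those edits describe would look roughly like this (illustrative only; the hosted manifest may express it differently):

```yaml
# ipPool matching the cidr/ipip edits described above.
- apiVersion: v1
  kind: ipPool
  metadata:
    cidr: 100.64.0.0/10
  spec:
    ipip:
      enabled: true
    nat-outgoing: true
```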
UPDATE: Something changed between when I originally tested this workflow and yesterday, but now the kube-dns service is reachable for me.
Supposedly IPIP is not required on AWS and is bad for performance; do this for all instances:

```
aws ec2 modify-instance-attribute --instance-id $INSTANCE_ID --source-dest-check "{\"Value\": false}"
```
See section 3 on: http://docs.projectcalico.org/v1.5/getting-started/kubernetes/installation/aws
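If it helps, here is a sketch of applying that to every instance in a cluster; the `KubernetesCluster` tag filter is an assumption about how kops tags its instances:

```bash
# Disable source/dest checking on each cluster instance so Calico can
# route pod traffic natively (within a single AZ, per the reply below).
CLUSTER_NAME=k8s.testing.example.com   # example value
for id in $(aws ec2 describe-instances \
    --filters "Name=tag:KubernetesCluster,Values=${CLUSTER_NAME}" \
    --query 'Reservations[].Instances[].InstanceId' --output text); do
  aws ec2 modify-instance-attribute --instance-id "$id" --source-dest-check '{"Value": false}'
done
```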
@jayv IPIP is required for cross-AZ communication, that example is in a single AZ.
Ah bummer, why can't we have nice things :(
@stonith @jayv yeah, it is a bummer.
Ideally we'd use IPIP only across AZ boundaries. See this issue: https://github.com/projectcalico/calico-containers/issues/1310
Closing as the PR is completed
CNI support has been added, hooray! https://github.com/kubernetes/kops/pull/621/files
With the above merged, it should be easy to add Calico.
This issue is to track the testing / documentation for Calico + kops.