projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

calicoctl doesn't properly handle TLS errors when connecting to etcd #5264

Closed giacomobartoli closed 6 months ago

giacomobartoli commented 4 years ago

Expected Behavior

Running the command calicoctl get nodes, I expect to see two Kubernetes nodes.

Current Behavior

The command gets stuck, even though calicoctl for OSX was installed correctly.

Steps to Reproduce (for bugs)

  1. ibmcloud ks cluster config --cluster btct7mhf06d8fo6mmi6g --admin --network
     Output: The configuration for btct7mhf06d8fo6mmi6g was downloaded successfully. Network Config: /Users/it001366/.bluemix/plugins/container-service/clusters/mycluster-free-btct7mhf06d8fo6mmi6g-admin/calicoctl.cfg
  2. mv /Users/it001366/Downloads/calicoctl-darwin-amd64 /usr/local/bin/calicoctl
  3. chmod +x /usr/local/bin/calicoctl
  4. sudo mkdir /etc/calico
  5. sudo mv /Users/it001366/.bluemix/plugins/container-service/clusters/mycluster-free-btct7mhf06d8fo6mmi6g-admin/calicoctl.cfg /etc/calico
  6. calicoctl get nodes

After (6) there is no answer and the command is pending..

Context

I am trying to follow this tutorial to install Calico CLI: https://cloud.ibm.com/docs/containers?topic=containers-network_policies#cli_install

Your Environment

caseydavenport commented 4 years ago

Note that there has been a report that calicoctl v3.16.1 isn't working on macOS here: https://github.com/projectcalico/calicoctl/issues/2182

This might be related. I'd suggest trying an earlier version to see whether you get the same results; that will help us tell whether this is the same issue or something different.

giacomobartoli commented 4 years ago

Hi @caseydavenport, and thanks for your support. Issue projectcalico/calicoctl#2182 is different: when the user runs 'calicoctl' it returns no command as output. (I guess the user downloaded the wrong binary file.) In my case, the command line is kind of stuck, waiting for something..

giacomobartoli commented 4 years ago

Moreover, previous versions show the same behaviour.

caseydavenport commented 4 years ago

After (6) there is no answer and the command is pending..

Ah, right. The command is pending. I'd double check that:

You might also want to try running with debug logging on to see if there are any clues:

calicoctl -l debug get nodes
giacomobartoli commented 4 years ago

Hi @caseydavenport. Yes, I confirm that I have access to the cluster and that my calicoctl is pointed at it. I ran the command you suggested, and this is the output:

INFO[0000] Log level set to debug                       
INFO[0000] Executing config command                     
DEBU[0000] Resource: projectcalico.org/v3, Kind=Node    
DEBU[0000] Data: - apiVersion: projectcalico.org/v3
  kind: Node
  metadata:
    creationTimestamp: null
  spec: {} 
DEBU[0000] Loading config from JSON or YAML data        
DEBU[0000] Datastore type: etcdv3                       
INFO[0000] Loaded client config: apiconfig.CalicoAPIConfigSpec{DatastoreType:"etcdv3", EtcdConfig:apiconfig.EtcdConfig{EtcdEndpoints:"https://c6.mil01.containers.cloud.ibm.com:20131", EtcdDiscoverySrv:"", EtcdUsername:"", EtcdPassword:"", EtcdKeyFile:"/Users/it001366/.bluemix/plugins/container-service/clusters/mycluster-free-btct7mhf06d8fo6mmi6g-admin/admin-key.pem", EtcdCertFile:"/Users/it001366/.bluemix/plugins/container-service/clusters/mycluster-free-btct7mhf06d8fo6mmi6g-admin/admin.pem", EtcdCACertFile:"/Users/it001366/.bluemix/plugins/container-service/clusters/mycluster-free-btct7mhf06d8fo6mmi6g-admin/ca.pem", EtcdKey:"", EtcdCert:"", EtcdCACert:""}, KubeConfig:apiconfig.KubeConfig{Kubeconfig:"", K8sAPIEndpoint:"", K8sKeyFile:"", K8sCertFile:"", K8sCAFile:"", K8sAPIToken:"", K8sInsecureSkipTLSVerify:false, K8sDisableNodePoll:false, K8sUsePodCIDR:false}} 
DEBU[0000] Using datastore type 'etcdv3'                
INFO[0000] Client: {{{CalicoAPIConfig projectcalico.org/v3} {      0 {{0 0 <nil>}} <nil> <nil> map[] map[] [] []  []} {etcdv3 {https://c6.mil01.containers.cloud.ibm.com:20131    /Users/it001366/.bluemix/plugins/container-service/clusters/mycluster-free-btct7mhf06d8fo6mmi6g-admin/admin-key.pem /Users/it001366/.bluemix/plugins/container-service/clusters/mycluster-free-btct7mhf06d8fo6mmi6g-admin/admin.pem /Users/it001366/.bluemix/plugins/container-service/clusters/mycluster-free-btct7mhf06d8fo6mmi6g-admin/ca.pem   } {      false false false}}} 0xc000452090 0xc000450190} 
DEBU[0000] Processing List request                       list-interface=Node rev=
DEBU[0000] Get Global Resource key from /calico/resources/v3/projectcalico.org/nodes 
DEBU[0000] Didn't match regex                           
DEBU[0000] List options is a parent prefix, ensure path ends in /  list-interface=Node rev=
DEBU[0000] Adding / to path                              list-interface=Node rev=
DEBU[0000] Calling Get on etcdv3 client                  etcdv3-etcdKey=/calico/resources/v3/projectcalico.org/nodes/ list-interface=Node rev=
caseydavenport commented 4 years ago

DEBU[0000] Calling Get on etcdv3 client etcdv3-etcdKey=/calico/resources/v3/projectcalico.org/nodes/ list-interface=Node rev=

Yeah, it looks like it's performing a request against the etcd cluster to get the node information and isn't receiving a response back.

marcelo-devsres commented 4 years ago

short version

Yeah, it looks like it's performing a request against the etcd cluster to get the node information and isn't receiving a response back.

That may be so, but I'd say the cause could be any basic TLS problem, such as an invalid CA or invalid or expired certs. I found out (the hard way) that calicoctl gives no insight into problems of this kind: it stays in a 'trying-to-establish-connection' loop instead of exiting with a meaningful error. This happened on both 3.15 and 3.16.

Wanna know the rationale? The long story is below.

long story

I hit the exact same situation in a 100% Linux production environment running calico + etcd with TLS validation. calicoctl was stuck, and debug logging didn't produce any meaningful messages. I came here to post a full report, but since I found this issue, I decided to comment instead.

I have already solved my problem: my certificates had expired.

Check this diagnostic 'thread':

INFO[0000] Log level set to debug
INFO[0000] Executing config command
DEBU[0000] Resource: projectcalico.org/v3, Kind=Node
DEBU[0000] Data: - apiVersion: projectcalico.org/v3
  kind: Node
  metadata:
    creationTimestamp: null
  spec: {}
  status: {}
DEBU[0000] Loading config from JSON or YAML data
DEBU[0000] Datastore type: etcdv3
INFO[0000] Loaded client config: apiconfig.CalicoAPIConfigSpec{DatastoreType:"etcdv3", EtcdConfig:apiconfig.EtcdConfig{EtcdEndpoints:"https://192.168.0.11:2379,https://192.168.0.12:2379,https://192.168.0.13:2379", EtcdDiscoverySrv:"", EtcdUsername:"", EtcdPassword:"", EtcdKeyFile:"/etc/calico/tls/acof/tls.key", EtcdCertFile:"/etc/calico/tls/acof/tls.crt", EtcdCACertFile:"/etc/calico/tls/acof/tls.ca", EtcdKey:"", EtcdCert:"", EtcdCACert:""}, KubeConfig:apiconfig.KubeConfig{Kubeconfig:"", K8sAPIEndpoint:"", K8sKeyFile:"", K8sCertFile:"", K8sCAFile:"", K8sAPIToken:"", K8sInsecureSkipTLSVerify:false, K8sDisableNodePoll:false, K8sUsePodCIDR:false, KubeconfigInline:"", K8sClientQPS:0}}
DEBU[0000] Using datastore type 'etcdv3'
INFO[0000] Client: {{{CalicoAPIConfig projectcalico.org/v3} { 0 {{0 0 }} map[] map[] [] [] []} {etcdv3 {https://192.168.0.11:2379,https://192.168.0.12:2379,https://192.168.0.13:2379 /etc/calico/tls/acof/tls.key /etc/calico/tls/acof/tls.crt /etc/calico/tls/acof/tls.ca } { false false false 0}}} 0xc00000ec18 0xc0001fed70}
DEBU[0000] Processing List request list-interface=Node rev=
DEBU[0000] Get Global Resource key from /calico/resources/v3/projectcalico.org/nodes
DEBU[0000] Didn't match regex
DEBU[0000] List options is a parent prefix, ensure path ends in / list-interface=Node rev=
DEBU[0000] Adding / to path list-interface=Node rev=
DEBU[0000] Calling Get on etcdv3 client etcdv3-etcdKey=/calico/resources/v3/projectcalico.org/nodes/ list-interface=Node rev=

On etcd logs:

Oct 15 16:54:08 etcd1.intranet rkt[1085]: 2020-10-15 19:54:08.313941 I | embed: rejected connection from "10.10.10.42:40514" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
Oct 15 16:54:10 etcd1.intranet rkt[1085]: 2020-10-15 19:54:10.975503 I | embed: rejected connection from "10.10.10.42:40520" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
Oct 15 16:54:14 etcd1.intranet rkt[1085]: 2020-10-15 19:54:14.509551 I | embed: rejected connection from "10.10.10.42:40528" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
Oct 15 16:54:20 etcd1.intranet rkt[1085]: 2020-10-15 19:54:20.949211 I | embed: rejected connection from "10.10.10.42:40538" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")


So it seems calico tries to establish a connection, fails at the TLS stage, retries every 2 seconds or so, and keeps retrying forever.

In my case, a single run produced 61 connection attempts:

journalctl -b -u etcd | fgrep expired | wc -l

61
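
Until calicoctl reports the failure itself, one way to keep a script from hanging on the endless retry loop is to put a deadline around the call. A minimal sketch, assuming GNU coreutils timeout is available (it is not installed by default on macOS); the function name run_with_deadline is hypothetical:

```shell
#!/bin/sh
# run_with_deadline SECONDS COMMAND [ARGS...]
# Runs COMMAND and terminates it if it has not finished after SECONDS.
# Assumes GNU coreutils `timeout`, which exits with status 124 on expiry.
run_with_deadline() {
    deadline="$1"
    shift
    timeout "$deadline" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "command timed out after ${deadline}s: $*" >&2
    fi
    return "$status"
}
```

For example, run_with_deadline 15 calicoctl -l debug get nodes gives up after 15 seconds. The wrapper only bounds the wait; it does not say why the connection failed, so the etcd logs are still the place to look.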


---

So, for curiosity's sake, I also tried other TLS failures to see what happens. **ALL of them** ended with calicoctl stuck in a connection loop without any meaningful messages.

I tried these with curl:

* Invalid CA:

curl https://etcd1.local:2379

curl: (60) Peer's Certificate issuer is not recognized.

* Name mismatch

curl --cacert tls/sp2/ca.pem --resolve hostname.invalido:2379:10.99.17.11 https://hostname.invalido:2379

curl: (51) Unable to communicate securely with peer: requested domain name does not match the server's certificate.

* CA ok, name ok, but no client certs:

curl --cacert tls/sp2/ca.pem https://etcd1.local:2379

curl: (58) NSS: client certificate not found (nickname not specified)

These were calicoctl tests:

* Invalid CA:

ETCD_ENDPOINTS=https://etcd1.local:2379 calicoctl -l debug get nodes

...
INFO[0000] Loaded client config: apiconfig.CalicoAPIConfigSpec{DatastoreType:"etcdv3", EtcdConfig:apiconfig.EtcdConfig{EtcdEndpoints:"https://etcd1.local:2379", EtcdDiscoverySrv:"", EtcdUsername:"", EtcdPassword:"", EtcdKeyFile:"", EtcdCertFile:"", EtcdCACertFile:"", EtcdKey:"", EtcdCert:"", EtcdCACert:""}, KubeConfig:apiconfig.KubeConfig{Kubeconfig:"", K8sAPIEndpoint:"", K8sKeyFile:"", K8sCertFile:"", K8sCAFile:"", K8sAPIToken:"", K8sInsecureSkipTLSVerify:false, K8sDisableNodePoll:false, K8sUsePodCIDR:false, KubeconfigInline:"", K8sClientQPS:0}}
...
^C


* Name mismatch (it was on /etc/hosts)

I created an entry in /etc/hosts for etcd.devsres.com; the name does not resolve.

ETCD_ENDPOINTS=https://etcd.devsres.com:2379 ETCD_CA_CERT_FILE=tls/tls.ca calicoctl -l debug get nodes

...
INFO[0000] Loaded client config: apiconfig.CalicoAPIConfigSpec{DatastoreType:"etcdv3", EtcdConfig:apiconfig.EtcdConfig{EtcdEndpoints:"https://etcd.devsres.com:2379", EtcdDiscoverySrv:"", EtcdUsername:"", EtcdPassword:"", EtcdKeyFile:"", EtcdCertFile:"", EtcdCACertFile:"tls/tls.ca", EtcdKey:"", EtcdCert:"", EtcdCACert:"tls/tls.ca"}, KubeConfig:apiconfig.KubeConfig{Kubeconfig:"", K8sAPIEndpoint:"", K8sKeyFile:"", K8sCertFile:"", K8sCAFile:"", K8sAPIToken:"", K8sInsecureSkipTLSVerify:false, K8sDisableNodePoll:false, K8sUsePodCIDR:false, KubeconfigInline:"", K8sClientQPS:0}}
...
^C

* CA ok, name ok, but no client certs:

ETCD_ENDPOINTS=https://sp2srvvpkv00001:2379 ETCD_CA_CERT_FILE=tls/tls.ca calicoctl -l debug get nodes

...
INFO[0000] Loaded client config: apiconfig.CalicoAPIConfigSpec{DatastoreType:"etcdv3", EtcdConfig:apiconfig.EtcdConfig{EtcdEndpoints:"https://sp2srvvpkv00001:2379", EtcdDiscoverySrv:"", EtcdUsername:"", EtcdPassword:"", EtcdKeyFile:"", EtcdCertFile:"", EtcdCACertFile:"tls/tls.ca", EtcdKey:"", EtcdCert:"", EtcdCACert:""}, KubeConfig:apiconfig.KubeConfig{Kubeconfig:"", K8sAPIEndpoint:"", K8sKeyFile:"", K8sCertFile:"", K8sCAFile:"", K8sAPIToken:"", K8sInsecureSkipTLSVerify:false, K8sDisableNodePoll:false, K8sUsePodCIDR:false, KubeconfigInline:"", K8sClientQPS:0}}
...
^C



I believe this should be easy to reproduce, and it should give @giacomobartoli a hint: if he has full connectivity to the server, he might want to debug all the TLS stages and look for misconfigurations like the one I ran into.
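
As a first step in that kind of TLS debugging, the expiry of the client certificate can be checked locally before calicoctl is run at all. A minimal sketch, assuming the openssl command-line tool is installed; the function names cert_is_current and cert_expiry are hypothetical:

```shell
#!/bin/sh
# cert_is_current FILE [SECONDS]
# Succeeds if the PEM certificate in FILE will not have expired SECONDS
# from now (default 0, i.e. it has not expired yet).
# `openssl x509 -checkend N` exits non-zero when the certificate
# expires within N seconds.
cert_is_current() {
    openssl x509 -in "$1" -noout -checkend "${2:-0}" >/dev/null
}

# cert_expiry FILE
# Prints the expiry date of the certificate as a "notAfter=..." line.
cert_expiry() {
    openssl x509 -in "$1" -noout -enddate
}
```

With the paths from the log above, cert_is_current /etc/calico/tls/acof/tls.crt || echo "client certificate has expired" would have flagged the problem directly. This only covers expiry; an untrusted CA or a name mismatch needs separate checks.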
lmm commented 4 years ago

Thanks for the details, @marcelo-devsres, that's really helpful. I would expect calicoctl to at the very least log the etcd failures, and ideally to report the hard failure back to the user instead of retrying forever.

giacomobartoli commented 4 years ago

@marcelo-devsres does calico version 3.16.4 fix this bug?

giacomobartoli commented 4 years ago

This is the log I get from calicoctl -l debug get globalnetworkpolicies:

MacBook-Pro-di-Giacomo:Downloads Giacomo$ calicoctl -l debug  get globalnetworkpolicies
INFO[0000] Log level set to debug                       
INFO[0000] Executing config command                     
DEBU[0000] Resource: projectcalico.org/v3, Kind=GlobalNetworkPolicy 
DEBU[0000] Data: - apiVersion: projectcalico.org/v3
  kind: GlobalNetworkPolicy
  metadata:
    creationTimestamp: null
  spec: {} 
DEBU[0000] Loading config from JSON or YAML data        
DEBU[0000] Datastore type: etcdv3                       
INFO[0000] Loaded client config: apiconfig.CalicoAPIConfigSpec{DatastoreType:"etcdv3", EtcdConfig:apiconfig.EtcdConfig{EtcdEndpoints:"https://c6.mil01.containers.cloud.ibm.com:20131", EtcdDiscoverySrv:"", EtcdUsername:"", EtcdPassword:"", EtcdKeyFile:"/Users/Giacomo/.bluemix/plugins/container-service/clusters/mycluster-free-btct7mhf06d8fo6mmi6g-admin/admin-key.pem", EtcdCertFile:"/Users/Giacomo/.bluemix/plugins/container-service/clusters/mycluster-free-btct7mhf06d8fo6mmi6g-admin/admin.pem", EtcdCACertFile:"/Users/Giacomo/.bluemix/plugins/container-service/clusters/mycluster-free-btct7mhf06d8fo6mmi6g-admin/ca.pem", EtcdKey:"", EtcdCert:"", EtcdCACert:""}, KubeConfig:apiconfig.KubeConfig{Kubeconfig:"", K8sAPIEndpoint:"", K8sKeyFile:"", K8sCertFile:"", K8sCAFile:"", K8sAPIToken:"", K8sInsecureSkipTLSVerify:false, K8sDisableNodePoll:false, K8sUsePodCIDR:false, KubeconfigInline:"", K8sClientQPS:0}} 
DEBU[0000] Using datastore type 'etcdv3'                
INFO[0000] Client: {{{CalicoAPIConfig projectcalico.org/v3} {      0 {{0 0 <nil>}} <nil> <nil> map[] map[] [] []  []} {etcdv3 {https://c6.mil01.containers.cloud.ibm.com:20131    /Users/Giacomo/.bluemix/plugins/container-service/clusters/mycluster-free-btct7mhf06d8fo6mmi6g-admin/admin-key.pem /Users/Giacomo/.bluemix/plugins/container-service/clusters/mycluster-free-btct7mhf06d8fo6mmi6g-admin/admin.pem /Users/Giacomo/.bluemix/plugins/container-service/clusters/mycluster-free-btct7mhf06d8fo6mmi6g-admin/ca.pem   } {      false false false  0}}} 0xc00000e058 0xc0001fa090} 
DEBU[0000] Processing List request                       list-interface=GlobalNetworkPolicy rev=
DEBU[0000] Get Global Resource key from /calico/resources/v3/projectcalico.org/globalnetworkpolicies 
DEBU[0000] Didn't match regex                           
DEBU[0000] List options is a parent prefix, ensure path ends in /  list-interface=GlobalNetworkPolicy rev=
DEBU[0000] Adding / to path                              list-interface=GlobalNetworkPolicy rev=
DEBU[0000] Calling Get on etcdv3 client                  etcdv3-etcdKey=/calico/resources/v3/projectcalico.org/globalnetworkpolicies/ list-interface=GlobalNetworkPolicy rev=
marcelo-devsres commented 4 years ago

@marcelo-devsres does calico version 3.16.4 fix this bug?

Still stuck.

The etcd servers require TLS certs for the connection. I didn't specify any certs, and calicoctl gets stuck instead of exiting with a TLS error message.

# docker run --rm -v it -e ETCD_ENDPOINTS=http//etcd1.intra:2379 calico/ctl:v3.16.4 -l debug get nodes
...
time="2020-11-06T18:38:55Z" level=info msg="Loaded client config: apiconfig.CalicoAPIConfigSpec{DatastoreType:\"etcdv3\", EtcdConfig:apiconfig.EtcdConfig{EtcdEndpoints:\"http//etcd1.intra:2379\", EtcdDiscoverySrv:\"\", EtcdUsername:\"\", EtcdPassword:\"\", EtcdKeyFile:\"\", EtcdCertFile:\"\", EtcdCACertFile:\"\", EtcdKey:\"\", EtcdCert:\"\", EtcdCACert:\"\"}, KubeConfig:apiconfig.KubeConfig{Kubeconfig:\"\", K8sAPIEndpoint:\"\", K8sKeyFile:\"\", K8sCertFile:\"\", K8sCAFile:\"\", K8sAPIToken:\"\", K8sInsecureSkipTLSVerify:false, K8sDisableNodePoll:false, K8sUsePodCIDR:false, KubeconfigInline:\"\", K8sClientQPS:0}}"

time="2020-11-06T18:36:44Z" level=debug msg="Calling Get on etcdv3 client" etcdv3-etcdKey=/calico/resources/v3/projectcalico.org/nodes/ list-interface=Node rev=
^C
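
One thing worth ruling out in a run like this: the endpoint in the command above is written http//etcd1.intra:2379, without the colon after the scheme, and the same string shows up in the loaded client config. A minimal sketch of a pre-flight check that flags such values before calicoctl is started; the function name check_endpoints is hypothetical, and it only inspects the URL scheme, not anything about TLS:

```shell
#!/bin/sh
# check_endpoints LIST
# LIST is a comma-separated endpoint list in the form used for ETCD_ENDPOINTS.
# Returns non-zero and names every entry that does not start with
# http:// or https:// followed by at least one more character.
check_endpoints() {
    rc=0
    old_ifs=$IFS
    IFS=,
    for ep in $1; do
        case "$ep" in
            http://?*|https://?*) ;;
            *) echo "malformed endpoint: $ep" >&2; rc=1 ;;
        esac
    done
    IFS=$old_ifs
    return "$rc"
}
```

For example, check_endpoints "$ETCD_ENDPOINTS" || exit 1 at the top of a wrapper script stops early on a mistyped scheme instead of leaving calicoctl waiting.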
giacomobartoli commented 3 years ago

@caseydavenport So, how can I run calicoctl without running into this issue?

Alex-ander-s commented 3 years ago

@giacomobartoli @marcelo-devsres How you should set the ETCD_ENDPOINTS variable depends on your cluster setup. On our system, the internal certificates for the etcd connection were issued to IP addresses, which means we also have to use the IP address, not the hostname, in ETCD_ENDPOINTS. So replacing your ETCD_ENDPOINTS=http://etcd1.intra:2379 with ETCD_ENDPOINTS=http://YOUR-ETCD-IP:2379 or ETCD_ENDPOINTS=https://YOUR-ETCD-IP:2379 might do the trick. Just my 2 ¢ ...
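
To see whether a certificate was issued to hostnames or to IP addresses, its Subject Alternative Name entries can be listed. A minimal sketch, assuming the openssl command-line tool and a local PEM copy of the etcd server certificate; the function name cert_sans is hypothetical:

```shell
#!/bin/sh
# cert_sans FILE
# Prints the Subject Alternative Name extension of the PEM certificate in
# FILE: the header line followed by the line listing the DNS / IP entries.
# Prints nothing (and returns non-zero) if the certificate has no SANs.
cert_sans() {
    openssl x509 -in "$1" -noout -text | grep -A1 'Subject Alternative Name'
}
```

If the output lists only "IP Address:" entries, the value in ETCD_ENDPOINTS has to use one of those IPs; if it lists "DNS:" entries, the hostname in the endpoint has to match one of them.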