rancher / rio

Application Deployment Engine for Kubernetes
https://rio.io
Apache License 2.0

Failure running with minikube: failed calling webhook "api-validator.rio.io" #1058

Closed: gtirloni closed this issue 3 years ago

gtirloni commented 3 years ago

Describe the bug

The documentation says Rio works with minikube, but we get this error when running the rio-demo app:

FATA[0000] Internal error occurred: failed calling webhook "api-validator.rio.io": Post "https://rio-api-validator.rio-system.svc:443/?timeout=30s": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0

To Reproduce

  1. minikube start --kubernetes-version=v1.19.1
  2. rio install
  3. rio run -p 80:8080 https://github.com/rancher/rio-demo

Expected behavior

rio-demo is executed without errors.

Kubernetes version & type (GKE, on-prem): kubectl version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.1", GitCommit:"206bcadf021e76c27513500ca24182692aabd17e", GitTreeState:"clean", BuildDate:"2020-09-09T11:26:42Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.1", GitCommit:"206bcadf021e76c27513500ca24182692aabd17e", GitTreeState:"clean", BuildDate:"2020-09-09T11:18:22Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}

Rio version:

$ rio info
Rio Version: v0.8.0-rc2 (04372696)
Rio CLI Version: v0.8.0-rc2 (04372696)
Cluster Domain: tms9cm.on-rio.io
Cluster Domain IPs: 172.17.0.2
System Namespace: rio-system
Wildcard certificates: tms9cm.on-rio.io(true)

Additional context: output of rio system logs:

rio-controller | 2020/09/11 18:05:57 [INFO] acme: Registering account for cert@rancher.dev
rio-controller | 2020/09/11 18:06:10 [INFO] [*.tms9cm.on-rio.io] acme: Obtaining bundled SAN certificate
rio-controller | 2020/09/11 18:06:12 [INFO] [*.tms9cm.on-rio.io] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/7149180227
rio-controller | 2020/09/11 18:06:12 [INFO] [*.tms9cm.on-rio.io] acme: use dns-01 solver
rio-controller | 2020/09/11 18:06:12 [INFO] [*.tms9cm.on-rio.io] acme: Preparing to solve DNS-01
rio-controller | 2020/09/11 18:06:14 [INFO] [*.tms9cm.on-rio.io] acme: Trying to solve DNS-01
rio-controller | 2020/09/11 18:06:14 [INFO] [*.tms9cm.on-rio.io] acme: Checking DNS record propagation using [10.96.0.10:53]
rio-controller | 2020/09/11 18:06:14 [INFO] Wait for propagation [timeout: 30s, interval: 5s]
rio-controller | 2020/09/11 18:06:16 [INFO] [*.tms9cm.on-rio.io] acme: Waiting for DNS record propagation.
rio-controller | 2020/09/11 18:06:21 [INFO] [*.tms9cm.on-rio.io] acme: Waiting for DNS record propagation.
rio-controller | 2020/09/11 18:06:26 [INFO] [*.tms9cm.on-rio.io] acme: Waiting for DNS record propagation.
rio-controller | 2020/09/11 18:06:32 [INFO] [*.tms9cm.on-rio.io] acme: Waiting for DNS record propagation.
rio-controller | 2020/09/11 18:06:37 [INFO] [*.tms9cm.on-rio.io] acme: Waiting for DNS record propagation.
rio-controller | 2020/09/11 18:06:50 [INFO] [*.tms9cm.on-rio.io] The server validated our request
rio-controller | 2020/09/11 18:06:50 [INFO] [*.tms9cm.on-rio.io] acme: Cleaning DNS-01 challenge
rio-controller | 2020/09/11 18:06:51 [INFO] [*.tms9cm.on-rio.io] acme: Validations succeeded; requesting certificates
rio-controller | 2020/09/11 18:06:55 [INFO] [*.tms9cm.on-rio.io] Server responded with a certificate.
rio-controller | 2020/09/11 18:06:57 [INFO] [*.tms9cm.on-rio.io] acme: Obtaining bundled SAN certificate
rio-controller | 2020/09/11 18:06:59 [INFO] [*.tms9cm.on-rio.io] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/7149180227
rio-controller | 2020/09/11 18:06:59 [INFO] [*.tms9cm.on-rio.io] acme: authorization already valid; skipping challenge
rio-controller | 2020/09/11 18:06:59 [INFO] [*.tms9cm.on-rio.io] acme: Validations succeeded; requesting certificates
rio-controller | 2020/09/11 18:07:04 [INFO] [*.tms9cm.on-rio.io] Server responded with a certificate.
rio-controller | 2020/09/11 18:07:15 http: TLS handshake error from 172.18.0.1:64261: remote error: tls: bad certificate
kuetemeier commented 3 years ago

I have the same problem with k3s on a cloud server: k3s version v1.19.1+k3s1 (b66760fc).

kuetemeier commented 3 years ago

After some further research, I think Go 1.15 causes the issue: https://github.com/golang/go/issues/39568

There was some discussion about CN, SAN, and RFCs. The reality is that Kubernetes 1.19 is built with Go 1.15, and that Go version no longer supports the deprecated CN field. This is a problem for self-signed certificates that do not use a SAN.
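The error message in the original report even names the escape hatch: Go 1.15 still honors GODEBUG=x509ignoreCN=0 on the client side. Since the failing client here is the apiserver embedded in the k3s process, a temporary workaround could be to set that variable on the k3s service; a sketch, assuming a standard k3s systemd install:

    echo 'GODEBUG=x509ignoreCN=0' | sudo tee -a /etc/systemd/system/k3s.service.env
    sudo systemctl restart k3s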

I found a referenced issue about linkerd: https://github.com/linkerd/linkerd2/issues/4918

So, the solution (until this gets fixed) is to use Kubernetes 1.18. Just tested: it works with v1.18.9+k3s1 (630bebf9) and rio v0.8.0-rc2 (04372696).
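Pinning the version looks roughly like this (a sketch; the version strings are the ones tested above):

    # minikube
    minikube start --kubernetes-version=v1.18.9
    # k3s, via the official install script
    curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.18.9+k3s1" sh -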

Perhaps we could get an update for linkerd (and, if required, cert-manager and all other components). Until then we cannot use the new etcd cluster feature in k3s (fun fact: even etcd had a problem with Go 1.15: https://github.com/NixOS/nixpkgs/issues/59364).

YuikoTakada commented 3 years ago

Hi, as @kuetemeier says, I also think it's a cert issue caused by Go 1.15. I ran the steps below and saw the same issue.

minikube start --kubernetes-version=v1.19.1
rio install
rio run -p 80:8080 https://github.com/rancher/rio-demo
$ kubectl get secret -n rio-system rio-api-validator -o go-template='{{index .data "tls.crt"}}' | base64 -d > tls.crt
$ openssl x509 -in tls.crt -noout -text

In the above output, there is no SAN among the X509v3 extensions.

        X509v3 extensions:
            X509v3 Key Usage: critical
                Certificate Sign, CRL Sign
            X509v3 Extended Key Usage:
                TLS Web Server Authentication, TLS Web Client Authentication
            X509v3 Basic Constraints: critical
                CA:TRUE
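A quicker way to check for just that extension is the -ext flag (assuming OpenSSL 1.1.1+, which introduced it); it prints the DNS names when a SAN is present, and on this cert it should come back empty:

    $ openssl x509 -in tls.crt -noout -ext subjectAltName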

Would this issue be fixed by creating a new cert file that includes a SAN?
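For reference, a self-signed cert that includes a SAN could be generated roughly like this (an untested sketch, assuming OpenSSL 1.1.1+ for -addext; the name is taken from the webhook URL in the error above):

    $ openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
        -keyout tls.key -out tls.crt \
        -subj "/CN=rio-api-validator.rio-system.svc" \
        -addext "subjectAltName=DNS:rio-api-validator.rio-system.svc"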

StrongMonkey commented 3 years ago

This is fixed in v0.8.0-rc3

YuikoTakada commented 3 years ago

Can we close this ticket? The solution to this issue is using rio v0.8.0-rc2 with k8s v1.18.x, or upgrading to v0.8.0-rc3.

StrongMonkey commented 3 years ago

This should be fixed in https://github.com/rancher/rio/releases/tag/v0.8.0. If you still see the issue, please re-open.

boredland commented 3 years ago

Still seeing this in v0.8.0.

StrongMonkey commented 3 years ago

@boredland Can you describe your issue if you are seeing this in v0.8.0?

boredland commented 3 years ago

sure:

  1. ran rio install using rio v0.8.0 on DOKS 1.19
  2. ran rio dashboard - fails:
    time="2020-11-30T17:11:05Z" level=info msg="Rancher version dev is starting"
    time="2020-11-30T17:11:05Z" level=info msg="Rancher arguments {Config:{Kubeconfig: UserKubeconfig: HTTPSPort:443 HTTPPort:80 Namespace: WebhookConfig:{WebhookAuthentication:false WebhookKubeconfig: WebhookURL: CacheTTLSeconds:0}} AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features:}"
    time="2020-11-30T17:11:05Z" level=info msg="Starting API controllers"
    I1130 17:11:05.685587       7 leaderelection.go:241] attempting to acquire leader lease  kube-system/cattle-controllers...
    I1130 17:11:05.685789       7 leaderelection.go:241] attempting to acquire leader lease  kube-system/cloud-controllers...
    time="2020-11-30T17:11:06Z" level=info msg="Starting apiregistration.k8s.io/v1, Kind=APIService controller"
    time="2020-11-30T17:11:06Z" level=info msg="Refreshing all schemas"
    time="2020-11-30T17:11:06Z" level=info msg="Starting apiextensions.k8s.io/v1beta1, Kind=CustomResourceDefinition controller"
    time="2020-11-30T17:11:07Z" level=info msg="Refreshing all schemas"
    time="2020-11-30T17:11:07Z" level=fatal msg="unable to retrieve the complete list of server APIs: tap.linkerd.io/v1alpha1: the server is currently unable to handle the request"
  3. ran rio up:
    FATA[0005] failed to create dev/**** rio.cattle.io/v1, Kind=Service for  dev/****: Internal error occurred: failed calling webhook "api-validator.rio.io": Post "https://rio-api-validator.rio-system.svc:443/?timeout=30s": EOF 
StrongMonkey commented 3 years ago

@boredland That looks like a different issue. Can you check whether linkerd is installed properly in your setup? It looks like this has caused rio-controller to crash (which also serves as the webhook server).
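To confirm the crash, something like the following should show restart counts and the previous container's logs (a sketch; namespace and deployment name as they appear earlier in this thread):

    kubectl -n rio-system get pods
    kubectl -n rio-system logs deploy/rio-controller --previous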

boredland commented 3 years ago

How do I check that?

StrongMonkey commented 3 years ago

There is a linkerd-install pod. You should be able to check the logs of that pod.
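Roughly like this (a sketch; the exact pod name will differ, so look it up first):

    kubectl get pods -A | grep linkerd-install
    kubectl -n <namespace> logs <linkerd-install-pod>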

boredland commented 3 years ago

Perhaps I need to upgrade linkerd? Here is the linkerd-install pod log:

service/linkerd-identity created
deployment.apps/linkerd-identity created
service/linkerd-controller-api created
deployment.apps/linkerd-controller created
service/linkerd-dst created
deployment.apps/linkerd-destination created
cronjob.batch/linkerd-heartbeat created
service/linkerd-web created
deployment.apps/linkerd-web created
configmap/linkerd-prometheus-config created
service/linkerd-prometheus created
deployment.apps/linkerd-prometheus created
deployment.apps/linkerd-proxy-injector created
service/linkerd-proxy-injector created
service/linkerd-sp-validator created
deployment.apps/linkerd-sp-validator created
service/linkerd-tap created
deployment.apps/linkerd-tap created
configmap/linkerd-config-addons created
serviceaccount/linkerd-grafana created
configmap/linkerd-grafana-config created
service/linkerd-grafana created
deployment.apps/linkerd-grafana created
+ linkerd check
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running
linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
    is running version 2.8.1 but the latest stable version is 2.9.0
    see https://linkerd.io/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.8.1 but the latest stable version is 2.9.0
    see https://linkerd.io/checks/#l5d-version-control for hints
√ control plane and cli versions match
linkerd-addons
--------------
√ 'linkerd-config-addons' config map exists
linkerd-grafana
---------------
√ grafana add-on service account exists
√ grafana add-on config map exists
√ grafana pod is running
Status check results are √
+ [[ 0 -ne 0 ]]
boredland commented 3 years ago

After upgrading linkerd, I'm not facing this error anymore.
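For anyone else landing here, the upgrade was roughly this (a sketch following the linkerd 2.x docs; double-check against the current docs for your version):

    curl -sL https://run.linkerd.io/install | sh   # update the CLI
    linkerd upgrade | kubectl apply -f -           # upgrade the control plane
    linkerd check                                  # verify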