rancher / rio

Application Deployment Engine for Kubernetes
https://rio.io
Apache License 2.0

rio install gets stuck with active firewall #988

Open tullo opened 4 years ago

tullo commented 4 years ago

Describe the bug: rio install gets stuck with an active firewall (ufw).

To Reproduce: steps to reproduce the behavior (a shell sketch of these steps follows the list):

  1. download rio
  2. rio install on fresh ubuntu box (19.10)
  3. installation gets stuck
  4. ufw disable
  5. installation succeeds
  6. dashboard works
  7. ufw enable
  8. dashboard no longer accessible
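
For reference, a minimal shell sketch of the steps above, assuming the rio binary has already been downloaded and is on the PATH:

sudo ufw status     # firewall is active on the fresh Ubuntu 19.10 box
rio install         # hangs while the firewall is up
sudo ufw disable    # installation then completes and the dashboard works
sudo ufw enable     # dashboard is no longer accessible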

Expected behavior: installation does not hang and k3s deployments do not crash.

Kubernetes version & type: cloud server. kubectl version output:

Client Version: v1.17.0+k3s.1
Server Version: v1.17.0+k3s.1

Rio version (rio info output):

Rio Version: v0.7.0 (4afd4901)
Rio CLI Version: v0.7.0 (4afd4901)

Additional context rio system logs output:

rio-controller | time="2020-01-07T20:48:44Z" level=info msg="Starting rio-controller, version: v0.7.0, git commit: 4afd4901"
rio-controller | time="2020-01-07T20:49:14Z" level=fatal msg="Get https://10.43.0.1:443/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions: dial tcp 10.43.0.1:443: i/o timeout"
tullo commented 4 years ago

The svclb-gateway-proxy pods are still Pending. They never made it to Running, even with the firewall off.

sudo k3s kubectl get all -n rio-system
NAME                                  READY   STATUS             RESTARTS   AGE
pod/svclb-gateway-proxy-hh8rj         0/2     Pending            0          5h59m
pod/gateway-6c496445d7-c2nxn          1/1     Running            0          5h59m
pod/gateway-proxy-7bdfc54996-tjxb6    1/1     Running            0          5h59m
pod/cert-manager-759c4847bc-fttkm     1/1     Running            0          5h59m
pod/gloo-7f984f76cc-j9lzn             1/1     Running            0          5h59m
pod/socat-rnr5m                       1/1     Running            0          5h55m
pod/webhook-678fc6f47c-p6zn5          1/1     Running            0          5h55m
pod/buildkitd-798d9df44d-8c4zf        2/2     Running            0          5h55m
pod/dashboard-8698cbd7b7-8b6ws        1/1     Running            1          5h38m
pod/rio-controller-597fb9d959-9nwrm   0/1     CrashLoopBackOff   41         5h59m

NAME                        TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
service/rio-api-validator   ClusterIP      10.43.105.101   <none>        443/TCP                      6h5m
service/gateway-proxy       LoadBalancer   10.43.3.200     <pending>     80:31700/TCP,443:32442/TCP   5h59m
service/gloo                ClusterIP      10.43.158.254   <none>        9977/TCP,9988/TCP,9966/TCP   5h59m
service/cert-manager        ClusterIP      10.43.7.29      <none>        80/TCP                       5h59m
service/cert-manager-v0     ClusterIP      10.43.87.233    <none>        80/TCP                       5h59m
service/webhook             ClusterIP      10.43.1.141     <none>        8090/TCP                     5h56m
service/buildkitd           ClusterIP      10.43.204.134   <none>        8080/TCP,80/TCP              5h55m
service/webhook-v0          ClusterIP      10.43.243.251   <none>        8090/TCP                     5h55m
service/buildkitd-v0        ClusterIP      10.43.98.4      <none>        8080/TCP,80/TCP              5h55m
service/dashboard           ClusterIP      10.43.124.238   <none>        80/TCP                       5h38m
service/dashboard-v0        ClusterIP      10.43.33.23     <none>        80/TCP                       5h38m

NAME                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/svclb-gateway-proxy   1         1         0       1            0           <none>          5h59m
daemonset.apps/socat                 1         1         1       1            1           <none>          5h55m

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/gateway          1/1     1            1           5h59m
deployment.apps/gateway-proxy    1/1     1            1           5h59m
deployment.apps/cert-manager     1/1     1            1           5h59m
deployment.apps/gloo             1/1     1            1           5h59m
deployment.apps/webhook          1/1     1            1           5h55m
deployment.apps/buildkitd        1/1     1            1           5h55m
deployment.apps/dashboard        1/1     1            1           5h38m
deployment.apps/rio-controller   0/1     1            0           6h5m

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/gateway-6c496445d7          1         1         1       5h59m
replicaset.apps/gateway-proxy-7bdfc54996    1         1         1       5h59m
replicaset.apps/cert-manager-759c4847bc     1         1         1       5h59m
replicaset.apps/gloo-7f984f76cc             1         1         1       5h59m
replicaset.apps/webhook-678fc6f47c          1         1         1       5h55m
replicaset.apps/buildkitd-798d9df44d        1         1         1       5h55m
replicaset.apps/dashboard-8698cbd7b7        1         1         1       5h38m
replicaset.apps/rio-controller-597fb9d959   1         1         0       5h59m
StrongMonkey commented 4 years ago

@tullo The svclb-gateway-proxy pod staying Pending seems to be a separate issue: since you are using k3s, it uses svclb for its own Traefik ingress controller by default, so the ports conflict. Try kubectl delete svc traefik -n kube-system to see if that resolves it. The real problem is the error Get https://10.43.0.1:443/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions: dial tcp 10.43.0.1:443: i/o timeout, which appears when you turn the firewall on: 10.43.0.1:443 is the Kubernetes API endpoint the controller needs to talk to, and the firewall is blocking it.
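
If disabling ufw entirely is not an option, opening the firewall for the in-cluster networks usually unblocks that controller-to-API-server traffic. A sketch, assuming the k3s defaults (pod CIDR 10.42.0.0/16, service CIDR 10.43.0.0/16, API server on port 6443):

sudo ufw allow 6443/tcp                   # k3s API server
sudo ufw allow from 10.42.0.0/16 to any   # pod network
sudo ufw allow from 10.43.0.0/16 to any   # service network (10.43.0.1 above is the in-cluster API endpoint)
sudo ufw reload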

krishofmans commented 4 years ago

Running into the same issue with k3d/k3s on Mac. I attempted to run with k3d c --publish 8081:80 --publish 8082:443 --server-arg "--no-deploy=traefik"

I also attempted a "regular" install (with Traefik) and deleted the service as suggested, but the workloads still aren't available.

The complete cluster looks like this (screenshot omitted):

rio-controller | time="2020-02-07T14:06:31Z" level=info msg="Starting rio-controller, version: v0.7.0, git commit: 4afd4901"
rio-controller | time="2020-02-07T14:06:31Z" level=info msg="Updating CRD services.rio.cattle.io"
rio-controller | time="2020-02-07T14:06:31Z" level=info msg="Updating CRD stacks.rio.cattle.io"
rio-controller | I0207 14:06:32.923118       1 leaderelection.go:241] attempting to acquire leader lease  rio-system/rio...
rio-controller | time="2020-02-07T14:06:32Z" level=info msg="listening at :443"
rio-controller | I0207 14:06:32.952197       1 leaderelection.go:251] successfully acquired lease rio-system/rio
rio-controller | time="2020-02-07T14:06:34Z" level=info msg="Starting /v1, Kind=ConfigMap controller"
rio-controller | time="2020-02-07T14:06:35Z" level=info msg="Starting /v1, Kind=Endpoints controller"
rio-controller | time="2020-02-07T14:06:35Z" level=info msg="Starting apps/v1, Kind=StatefulSet controller"
rio-controller | time="2020-02-07T14:06:35Z" level=info msg="Starting apps/v1, Kind=Deployment controller"
rio-controller | time="2020-02-07T14:06:35Z" level=info msg="Starting apps/v1, Kind=DaemonSet controller"
rio-controller | time="2020-02-07T14:06:35Z" level=info msg="Starting cert-manager.io/v1alpha2, Kind=Certificate controller"
rio-controller | time="2020-02-07T14:06:35Z" level=info msg="Starting /v1, Kind=Service controller"
rio-controller | time="2020-02-07T14:06:35Z" level=info msg="Starting tekton.dev/v1alpha1, Kind=TaskRun controller"
rio-controller | time="2020-02-07T14:06:35Z" level=info msg="Starting extensions/v1beta1, Kind=Ingress controller"
rio-controller | time="2020-02-07T14:06:35Z" level=info msg="Starting /v1, Kind=Secret controller"
rio-controller | time="2020-02-07T14:06:35Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRole controller"
rio-controller | time="2020-02-07T14:06:35Z" level=info msg="Starting rio.cattle.io/v1, Kind=Service controller"
rio-controller | time="2020-02-07T14:06:35Z" level=info msg="Starting rio.cattle.io/v1, Kind=Stack controller"
rio-controller | time="2020-02-07T14:06:35Z" level=info msg="Starting rio.cattle.io/v1, Kind=ExternalService controller"
rio-controller | time="2020-02-07T14:06:35Z" level=info msg="Starting admin.rio.cattle.io/v1, Kind=ClusterDomain controller"
rio-controller | time="2020-02-07T14:06:35Z" level=info msg="Starting admin.rio.cattle.io/v1, Kind=PublicDomain controller"
rio-controller | time="2020-02-07T14:06:35Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller"
rio-controller | time="2020-02-07T14:06:35Z" level=info msg="Starting rio.cattle.io/v1, Kind=Router controller"
rio-controller | time="2020-02-07T14:06:35Z" level=info msg="Starting gitwatcher.cattle.io/v1, Kind=GitCommit controller"
rio-controller | time="2020-02-07T14:06:35Z" level=info msg="Starting gloo.solo.io/v1, Kind=Settings controller"
rio-controller | E0207 14:06:45.202967       1 controller.go:135] error syncing 'default/competent-johnson-v0gcx87': handler service-build: failed to update default/competent-johnson-v0gcx87-ee709-4e40c tekton.dev/v1alpha1, Kind=TaskRun for service-build default/competent-johnson-v0gcx87: Internal error occurred: failed calling webhook "webhook.tekton.dev": Post https://tekton-pipelines-webhook.tekton-pipelines.svc:443/?timeout=27s: dial tcp 10.43.64.249:443: connect: no route to host, handler template: skip processing, requeuing
rio info
Rio Version: v0.7.0 (4afd4901)
Rio CLI Version: v0.7.0 (4afd4901)
Cluster Domain: aiaai1.on-rio.io
Cluster Domain IPs: 172.19.0.2
System Namespace: rio-system
Wildcard certificates: aiaai1.on-rio.io(true)
StrongMonkey commented 4 years ago

@krishofmans Looks like you need to publish ports 80 and 443 if you are using k3d.
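
With the k3d v1 CLI used above, that would look something like the following (the cluster name is illustrative):

k3d create --name rio-demo --publish 80:80 --publish 443:443 --server-arg "--no-deploy=traefik"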

krishofmans commented 4 years ago

@StrongMonkey The trick that worked for me was installing Rio like this on my k3s-in-Docker setup: rio install --ip-address=127.0.0.1. Before that, it tried to resolve something at the Docker IP, which it could not reach from inside the Docker container. With a native k3s binary it would probably have worked.
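
A sketch of that workaround on a k3s-in-Docker cluster; assuming the flag behaves as described, rio info should afterwards report 127.0.0.1 as the cluster domain IP instead of the Docker-internal 172.19.0.2 shown earlier:

rio install --ip-address=127.0.0.1   # advertise the host loopback instead of the Docker bridge IP
rio info                             # check the Cluster Domain IPs line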

StrongMonkey commented 4 years ago

Yes, if you are on k3d for Mac that's the only workaround you can use, because of the networking problems. The k3s binary on Linux won't have this issue.