kubernetes-retired / cluster-api-provider-nested

Cluster API Provider for Nested Clusters
Apache License 2.0
301 stars 67 forks

Access https://kubernetes:443 timeout in OLM operators on virtual cluster on OpenShift #164

Closed jinsongo closed 2 years ago

jinsongo commented 3 years ago

What steps did you take and what happened:

  1. Followed https://github.com/kubernetes-sigs/cluster-api-provider-nested/blob/main/virtualcluster/doc/demo.md to install a virtual cluster on OpenShift 4.7.13, using the following workarounds for some security context constraint (SCC) problems:

    Examples of the failures:

    # oc logs vc-syncer-55c5bc5898-8przt -n vc-manager
     I0707 01:55:52.772652       1 mccontroller.go:195] start mc-controller "-mccontroller"
    F0707 01:55:52.773095       1 syncer.go:265] listen tcp :80: bind: permission denied
    # oc describe daemonset.apps/vn-agent -n vc-manager
    Warning  FailedCreate  81s (x18 over 7m23s)  daemonset-controller  Error creating: pods "vn-agent-" is forbidden: unable to validate against any security context constraint: [provider restricted: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used]
    # kubectl vc create -f https://raw.githubusercontent.com/kubernetes-sigs/cluster-api-provider-nested/master/virtualcluster/config/sampleswithspec/virtualcluster_1_nodeport.yaml -o vc-1.kubeconfig
    2021-07-07 02:19:13.903398 C | etcdmain: cannot access data directory: mkdir /var/lib/etcd: permission denied

    Workarounds:

    oc adm policy add-scc-to-user anyuid -z vc-syncer -n vc-manager
    oc adm policy add-scc-to-user privileged -z vn-agent -n vc-manager
    oc adm policy add-scc-to-user anyuid -z default -n default-62383d-vc-sample-1

    To activate the workarounds, the related pods need to be deleted so they restart.

  2. Install OLM on virtual cluster

    kubectl --kubeconfig ./vc-1.kubeconfig apply -f https://github.com/operator-framework/operator-lifecycle-manager/releases/download/v0.17.0/crds.yaml
    kubectl --kubeconfig ./vc-1.kubeconfig apply -f https://github.com/operator-framework/operator-lifecycle-manager/releases/download/v0.17.0/olm.yaml
  3. Check the OLM pods, which are in CrashLoopBackOff status

    # oc --kubeconfig ./vc-1.kubeconfig get pod -n olm
    NAME                                READY   STATUS             RESTARTS   AGE
    catalog-operator-7f656b6d67-jxqtp   0/1     CrashLoopBackOff   61         5h17m
    olm-operator-8d9bf86c9-brv7r        0/1     CrashLoopBackOff   61         5h17m
    # oc --kubeconfig ./vc-1.kubeconfig logs catalog-operator-7f656b6d67-jxqtp -n olm
    time="2021-07-08T07:56:28Z" level=panic msg="error configuring operator: Get https://kubernetes:443/api?timeout=32s: dial tcp 172.30.69.161:443: i/o timeout"
    # oc --kubeconfig ./vc-1.kubeconfig logs olm-operator-8d9bf86c9-brv7r -n olm
    2021-07-08T08:46:04.270Z        ERROR   controller-runtime.manager      Failed to get API Group-Resources       {"error": "Get https://kubernetes:443/api?timeout=32s: dial tcp 172.30.69.161:443: i/o timeout"}
    time="2021-07-08T08:46:04Z" level=fatal msg="error configuring controller manager" error="Get https://kubernetes:443/api?timeout=32s: dial tcp 172.30.69.161:443: i/o timeout"

    BTW, I also deployed nginx on the virtual cluster, and it works well.

    # oc --kubeconfig ./vc-1.kubeconfig get pod -n nginx
NAME                       READY   STATUS    RESTARTS   AGE
my-nginx-59c9f8dff-5t85p   1/1     Running   0          4h57m
my-nginx-59c9f8dff-mdtxg   1/1     Running   0          4h57m
my-nginx-59c9f8dff-vpt5v   1/1     Running   0          4h57m

What did you expect to happen: The OLM pods run normally.

Anything else you would like to add:

On virtual cluster:

# oc --kubeconfig ./vc-1.kubeconfig describe svc kubernetes
Name:              kubernetes
Namespace:         default
Labels:            component=apiserver
                   provider=kubernetes
Annotations:       transparency.tenancy.x-k8s.io/clusterIP: 172.30.69.161
Selector:          <none>
Type:              ClusterIP
IP:                10.32.0.1
Port:              https  443/TCP
TargetPort:        6443/TCP
Endpoints:         10.254.28.88:6443
Session Affinity:  None
Events:            <none>

On super cluster:

# oc describe svc kubernetes -n default-99e968-vc-sample-1-default
Name:              kubernetes
Namespace:         default-99e968-vc-sample-1-default
Labels:            component=apiserver
                   provider=kubernetes
                   tenancy.x-k8s.io/vcname=vc-sample-1
                   tenancy.x-k8s.io/vcnamespace=default
Annotations:       tenancy.x-k8s.io/cluster: default-99e968-vc-sample-1
                   tenancy.x-k8s.io/clusterIP: 10.32.0.1
                   tenancy.x-k8s.io/namespace: default
                   tenancy.x-k8s.io/ownerReferences: null
                   tenancy.x-k8s.io/uid: 0cd40990-e31d-4e43-bfea-b2d5f11e020d
                   tenancy.x-k8s.io/vcname: vc-sample-1
                   tenancy.x-k8s.io/vcnamespace: default
                   transparency.tenancy.x-k8s.io/clusterIP: 172.30.69.161
Selector:          <none>
Type:              ClusterIP
IP:                172.30.69.161
Port:              https  443/TCP
TargetPort:        6443/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>

I checked: the endpoint 10.254.28.88:6443 is reachable. But https://kubernetes:443 resolves to the kubernetes service IP 172.30.69.161 in the super cluster, not the kubernetes service IP 10.32.0.1 in the virtual cluster, and the Endpoints of that kubernetes service in the super cluster is <none>. Furthermore, I debugged with "telnet 10.32.0.1 443" in a pod on the virtual cluster; the traffic could not be forwarded to any endpoint, so no connection can be established.

With the following workaround, the https://kubernetes:443 timeout problem can be resolved:

echo "
apiVersion: v1
kind: Endpoints
metadata:
  name: kubernetes
subsets:
- addresses:
  - ip: 10.254.28.88
  ports:
  - name: https
    port: 6443
    protocol: TCP
" | oc -n default-99e968-vc-sample-1-default apply -f -

Environment:

/kind bug

jinsongo commented 3 years ago

@Fei-Guo @christopherhein @gyliu513 @vincent-pli

Fei-Guo commented 3 years ago

We should sync the kubernetes ep from the virtual cluster to the super cluster. Based on the dws code for ep, we only filter an ep if its service has a selector. The kubernetes service does not have a selector, so the ep should be synced.

    vService := &v1.Service{}
    err := c.MultiClusterController.Get(request.ClusterName, request.Namespace, request.Name, vService)
    if err != nil && !errors.IsNotFound(err) {
        return reconciler.Result{Requeue: true}, fmt.Errorf("fail to query service from tenant master %s", request.ClusterName)
    }
    if err == nil {
        if vService.Spec.Selector != nil {
            // Supermaster ep controller handles the service ep lifecycle, quit.
            return reconciler.Result{}, nil
        }
    }

Can you double-check why the ep is not synced from the virtual cluster to the super cluster?

FYI, this is what I see from my local setup

kubectl get ep -n tenant1admin-f7ea3a-vc-sample-1-default
NAME         ENDPOINTS         AGE
kubernetes   172.17.0.7:6443   54d

vincent-pli commented 3 years ago

@wangjsty The comment above is right. Could you check the syncer log to see if there is an exception there?

To @Fei-Guo: it seems we change /etc/hosts in the container so that https://kubernetes:443 points to the tenant's "kubernetes" SVC in the super cluster. How do we do this? Thanks.

Fei-Guo commented 3 years ago

@vincent-pli Here https://github.com/kubernetes-sigs/cluster-api-provider-nested/blob/6e8b8db5f596623fb5d7112cbd2bf0845ca27d3e/virtualcluster/pkg/syncer/conversion/mutate.go#L117
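To make the mechanism concrete: the syncer rewrites the tenant pod so that the name "kubernetes" resolves to the tenant apiserver's service IP in the super cluster. Below is a minimal, hypothetical sketch of such a rewrite using local stand-in types rather than the real k8s.io/api/core/v1 types, and assuming a HostAlias-style mapping; the actual implementation is in the mutate.go link above.

```go
package main

import "fmt"

// Minimal stand-ins for the Kubernetes API types (hypothetical; the real
// syncer operates on k8s.io/api/core/v1 objects).
type HostAlias struct {
	IP        string
	Hostnames []string
}

type PodSpec struct {
	HostAliases []HostAlias
}

// mutateServiceLink appends a host alias so that "kubernetes" inside the
// tenant pod resolves to the given ClusterIP in the super cluster. The
// function name and hostname list are illustrative assumptions.
func mutateServiceLink(spec *PodSpec, clusterIP string) {
	spec.HostAliases = append(spec.HostAliases, HostAlias{
		IP:        clusterIP,
		Hostnames: []string{"kubernetes", "kubernetes.default", "kubernetes.default.svc"},
	})
}

func main() {
	spec := &PodSpec{}
	// 172.30.69.161 is the tenant kubernetes service IP in the super
	// cluster from the describe output earlier in this issue.
	mutateServiceLink(spec, "172.30.69.161")
	fmt.Println(spec.HostAliases[0].IP, spec.HostAliases[0].Hostnames)
}
```

This also explains why the service alone is not enough: resolution lands on the super-cluster service IP, so that service still needs a synced Endpoints object behind it.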

jinsongo commented 3 years ago

@Fei-Guo No endpoint was created; that's why I manually created one as a workaround.

# kubectl get ep -n default-3fbd77-vc-sample-1-default
No resources found in default-3fbd77-vc-sample-1-default namespace

Fei-Guo commented 3 years ago

@wangjsty You can try updating the EP in the vc, e.g., adding a dummy label, and see if it is created in super; in the meantime, check the syncer log for any errors.

jinsongo commented 3 years ago

@Fei-Guo @vincent-pli I reproduced the problem again, and here are some logs from syncer:

E0709 04:13:10.212959       1 dws.go:77] failed reconcile endpoints default/kubernetes CREATE of cluster default-3fbd77-vc-sample-1 endpoints "kubernetes" is forbidden: endpoint address 10.254.28.64 is not allowed
E0709 04:13:10.212998       1 mccontroller.go:445] endpoints-mccontroller dws request is rejected: endpoints "kubernetes" is forbidden: endpoint address 10.254.28.64 is not allowed

E0709 04:27:06.132262       1 dws.go:66] failed reconcile serviceaccount olm/default CREATE of cluster default-3fbd77-vc-sample-1 pServiceAccount default-3fbd77-vc-sample-1-olm/ exists but its delegated UID is different
E0709 04:27:06.132309       1 mccontroller.go:461] olm/default dws request reconcile failed: pServiceAccount default-3fbd77-vc-sample-1-olm/ exists but its delegated UID is different
E0709 04:27:06.134087       1 dws.go:66] failed reconcile serviceaccount operators/default CREATE of cluster default-3fbd77-vc-sample-1 pServiceAccount default-3fbd77-vc-sample-1-operators/ exists but its delegated UID is different
E0709 04:27:06.134104       1 mccontroller.go:461] operators/default dws request reconcile failed: pServiceAccount default-3fbd77-vc-sample-1-operators/ exists but its delegated UID is different
I0709 04:27:06.136775       1 mutate.go:306] vc default-3fbd77-vc-sample-1 does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
I0709 04:27:06.140429       1 mutate.go:306] vc default-3fbd77-vc-sample-1 does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
E0709 04:27:06.231048       1 dws.go:104] failed reconcile Pod olm/olm-operator-8d9bf86c9-k46gw UPDATE of cluster default-3fbd77-vc-sample-1 Operation cannot be fulfilled on pods "olm-operator-8d9bf86c9-k46gw": the object has been modified; please apply your changes to the latest version and try again
E0709 04:27:06.231100       1 mccontroller.go:461] olm/olm-operator-8d9bf86c9-k46gw dws request reconcile failed: Operation cannot be fulfilled on pods "olm-operator-8d9bf86c9-k46gw": the object has been modified; please apply your changes to the latest version and try again
I0709 04:27:12.999392       1 mutate.go:306] vc default-3fbd77-vc-sample-1 does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
I0709 04:27:13.881893       1 mutate.go:306] vc default-3fbd77-vc-sample-1 does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
I0709 04:27:13.895515       1 mutate.go:306] vc default-3fbd77-vc-sample-1 does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
E0709 04:27:13.949619       1 dws.go:104] failed reconcile Pod olm/packageserver-59f5468bcd-cvb2f UPDATE of cluster default-3fbd77-vc-sample-1 Operation cannot be fulfilled on pods "packageserver-59f5468bcd-cvb2f": the object has been modified; please apply your changes to the latest version and try again
E0709 04:27:13.949646       1 mccontroller.go:461] olm/packageserver-59f5468bcd-cvb2f dws request reconcile failed: Operation cannot be fulfilled on pods "packageserver-59f5468bcd-cvb2f": the object has been modified; please apply your changes to the latest version and try again
E0709 04:27:13.962594       1 dws.go:104] failed reconcile Pod olm/packageserver-59f5468bcd-xmwhc UPDATE of cluster default-3fbd77-vc-sample-1 Operation cannot be fulfilled on pods "packageserver-59f5468bcd-xmwhc": the object has been modified; please apply your changes to the latest version and try again
E0709 04:27:13.962623       1 mccontroller.go:461] olm/packageserver-59f5468bcd-xmwhc dws request reconcile failed: Operation cannot be fulfilled on pods "packageserver-59f5468bcd-xmwhc": the object has been modified; please apply your changes to the latest version and try again

E0709 04:37:14.077736       1 dws.go:83] failed reconcile endpoints olm/packageserver-service DELETE of cluster default-3fbd77-vc-sample-1 To be deleted pEndpoints default-3fbd77-vc-sample-1-olm/packageserver-service delegated UID is different from deleted object.
E0709 04:37:14.077781       1 mccontroller.go:461] olm/packageserver-service dws request reconcile failed: To be deleted pEndpoints default-3fbd77-vc-sample-1-olm/packageserver-service delegated UID is different from deleted object.
E0709 04:37:14.083498       1 dws.go:83] failed reconcile endpoints olm/packageserver-service DELETE of cluster default-3fbd77-vc-sample-1 To be deleted pEndpoints default-3fbd77-vc-sample-1-olm/packageserver-service delegated UID is different from deleted object.
E0709 04:37:14.083535       1 mccontroller.go:461] olm/packageserver-service dws request reconcile failed: To be deleted pEndpoints default-3fbd77-vc-sample-1-olm/packageserver-service delegated UID is different from deleted object.
E0709 04:37:14.094135       1 dws.go:83] failed reconcile endpoints olm/packageserver-service DELETE of cluster default-3fbd77-vc-sample-1 To be deleted pEndpoints default-3fbd77-vc-sample-1-olm/packageserver-service delegated UID is different from deleted object.
E0709 04:37:14.094187       1 mccontroller.go:461] olm/packageserver-service dws request reconcile failed: To be deleted pEndpoints default-3fbd77-vc-sample-1-olm/packageserver-service delegated UID is different from deleted object.

jinsongo commented 3 years ago

@vincent-pli @Fei-Guo I will use oc adm policy add-scc-to-user privileged -z vc-syncer -n vc-manager as a workaround, then try again. Currently I'm just using "anyuid" here, which may not be enough for vc-syncer.

jinsongo commented 3 years ago

@vincent-pli @Fei-Guo I tried, but oc adm policy add-scc-to-user privileged -z vc-syncer -n vc-manager did not help.

vincent-pli commented 3 years ago

Seems it's not a permission problem; OCP enforces some restrictions when a user tries to create an ep manually: https://github.com/openshift/kubernetes/blob/90622d8244d0124fe2d44c336e68d4a4f03da1b6/openshift-kube-apiserver/admission/network/restrictedendpoints/endpoint_admission.go#L113-L128

and this:

cluster-config-v1 configmap in kube-system namespace

    The observed configmap install-config is decoded and the networking.podCIDR and networking.serviceCIDR is extracted and used as input for admissionPluginConfig.openshift.io/RestrictedEndpointsAdmission.configuration.restrictedCIDRs and servicesSubnet

https://github.com/openshift/cluster-kube-apiserver-operator/blob/master/README.md
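In other words, the admission plugin rejects Endpoints whose addresses fall inside the cluster's pod or service CIDRs. A minimal sketch of that containment check follows; the CIDR values here are illustrative assumptions, not the actual install-config values of this cluster.

```go
package main

import (
	"fmt"
	"net"
)

// isRestricted reports whether ip falls inside any of the restricted CIDRs
// (pod CIDR / service CIDR), mirroring the containment check that
// RestrictedEndpointsAdmission performs on manually created Endpoints.
func isRestricted(ip string, restrictedCIDRs []string) bool {
	parsed := net.ParseIP(ip)
	if parsed == nil {
		return false
	}
	for _, cidr := range restrictedCIDRs {
		_, network, err := net.ParseCIDR(cidr)
		if err != nil {
			continue // skip malformed CIDR entries
		}
		if network.Contains(parsed) {
			return true
		}
	}
	return false
}

func main() {
	// Example pod/service CIDRs (assumptions for illustration).
	cidrs := []string{"10.254.0.0/16", "172.30.0.0/16"}
	fmt.Println(isRestricted("10.254.28.64", cidrs)) // the rejected endpoint address -> true
	fmt.Println(isRestricted("192.0.2.10", cidrs))   // an address outside both CIDRs -> false
}
```

This would explain the syncer error earlier in the thread ("endpoint address 10.254.28.64 is not allowed"), presumably because that address sits inside a restricted CIDR on this install.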

@wangjsty

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-nested/issues/164#issuecomment-989461496):

>The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
>This bot triages issues and PRs according to the following rules:
>- After 90d of inactivity, `lifecycle/stale` is applied
>- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
>- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
>You can:
>- Reopen this issue or PR with `/reopen`
>- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
>- Offer to help out with [Issue Triage][1]
>
>Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
>/close
>
>[1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.