operator-framework / operator-lifecycle-manager

A management framework for extending Kubernetes with Operators
https://olm.operatorframework.io
Apache License 2.0

packages.operators apiregistration fails to authenticate to packageserver endpoint. #3136

Open epheo opened 10 months ago

epheo commented 10 months ago

Hi, after installing OLM (with either operator-sdk or install.sh), packageserver logs connect: connection refused when connecting to operatorhubio-catalog, although I see no issue when connecting from a grpc_cli debugging container.

This is a very simple single-node install of Kubernetes with all pods patched onto the same bridge.

$ kubectl version
Client Version: v1.29.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.0

The packageserver ClusterServiceVersion stays in the Installing phase.

$ kubectl get csv packageserver -n olm
NAME            DISPLAY          VERSION   REPLACES   PHASE
packageserver   Package Server   0.26.0               Installing
$ k get apiservices v1.packages.operators.coreos.com -o yaml
[...]
  conditions:
  - lastTransitionTime: "2023-12-19T22:40:59Z"
    message: 'failing or missing response from https://10.32.0.29:5443/apis/packages.operators.coreos.com/v1:
      bad status from https://10.32.0.29:5443/apis/packages.operators.coreos.com/v1:
      403'
    reason: FailedDiscoveryCheck
    status: "False"
    type: Available
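
To isolate the failing condition from that output, a minimal Python sketch could filter the conditions of the object returned by `kubectl get apiservices v1.packages.operators.coreos.com -o json`. The function name `failing_conditions` is hypothetical, and the sample object below only reproduces the fields shown above (with the message shortened); it is not a full APIService.

```python
def failing_conditions(apiservice):
    """Return the status conditions whose status is not "True"."""
    conditions = apiservice.get("status", {}).get("conditions", [])
    return [c for c in conditions if c.get("status") != "True"]


# Sample assembled by hand from the fields quoted above (assumption: the
# JSON output carries the same keys as the YAML output).
sample = {
    "status": {
        "conditions": [
            {
                "lastTransitionTime": "2023-12-19T22:40:59Z",
                "message": "bad status from "
                "https://10.32.0.29:5443/apis/packages.operators.coreos.com/v1: 403",
                "reason": "FailedDiscoveryCheck",
                "status": "False",
                "type": "Available",
            }
        ]
    }
}

for cond in failing_conditions(sample):
    print(cond["type"], cond["reason"], cond["message"])
```

Against a live cluster, the same function could be fed `json.load(sys.stdin)` with the kubectl output piped in.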

From a grpc_cli debugging container I can reach the operatorhubio-catalog.olm.svc endpoint and list its services.

$ kubectl run -it --rm --restart=Never --image=webplates/grpc-cli:latest grpccli ls operatorhubio-catalog.olm.svc.cluster.local:50051 api.Registry
ListPackages
GetPackage
GetBundle
GetBundleForChannel
GetChannelEntriesThatReplace
GetBundleThatReplaces
GetChannelEntriesThatProvide
GetLatestChannelEntriesThatProvide
GetDefaultBundleThatProvides
ListBundles

Within the operatorhubio-catalog pod, the served configs seem OK.

<<K9s-Shell>> Pod: olm/operatorhubio-catalog-r52b7 | Container: registry-server
/ $ ps
PID   USER     TIME  COMMAND
    1 1001      0:35 /bin/opm serve /configs --cache-dir=/tmp/cache
 1922 1001      0:00 sh
 1942 1001      0:00 ps

/ $ grpc_health_probe -addr 127.0.0.1:50051
status: SERVING

/ $ /bin/opm validate /configs
/ $ /bin/opm version
Version: version.Version{OpmVersion:"v1.33.0", GitCommit:"5e23ef59", BuildDate:"2023-11-28T15:00:47Z", GoOs:"linux", GoArch:"amd64"}
/ $

All containers appear as Running and the liveness probes seem to have been satisfied.

$ k get all -n olm
Warning: kubevirt.io/v1 VirtualMachineInstancePresets is now deprecated and will be removed in v2.
NAME                                                                  READY   STATUS      RESTARTS      AGE
pod/0b9f8e8106e6bc92a5b3edb6791ceaab0e8a22f5493895798082899af768bmj   0/1     Completed   0             12h
pod/9b9d47c94b554c8bd984f185a7385db635c1dbd74e304e2f4d34960f8bdvm5j   0/1     Completed   0             12h
pod/catalog-operator-7676fc5cc8-jr6th                                 1/1     Running     0             13h
pod/olm-operator-7c897bd449-jgnlk                                     1/1     Running     0             13h
pod/operatorhubio-catalog-r52b7                                       1/1     Running     0             13h
pod/packageserver-5966d674f8-fmjsn                                    1/1     Running     0             13h
pod/packageserver-5966d674f8-hwpxn                                    1/1     Running     0             13h

NAME                            TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)     AGE
service/operatorhubio-catalog   ClusterIP   10.32.0.101   <none>        50051/TCP   13h
service/packageserver-service   ClusterIP   10.32.0.138   <none>        5443/TCP    51s

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/catalog-operator   1/1     1            1           13h
deployment.apps/olm-operator       1/1     1            1           13h
deployment.apps/packageserver      2/2     2            2           13h

NAME                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/catalog-operator-7676fc5cc8   1         1         1       13h
replicaset.apps/olm-operator-7c897bd449       1         1         1       13h
replicaset.apps/packageserver-5966d674f8      2         2         2       13h

NAME                                                                        COMPLETIONS   DURATION   AGE
job.batch/0b9f8e8106e6bc92a5b3edb6791ceaab0e8a22f5493895798082899af72da17   1/1           9s         12h
job.batch/9b9d47c94b554c8bd984f185a7385db635c1dbd74e304e2f4d34960f8bdc287   1/1           7s         12h

But the log from a packageserver pod shows:

time="2023-12-20T11:26:48Z" level=warning msg="error getting bundle stream" action="refresh cache" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.32.0.101:50051: connect: connection refused\"" source="{operatorhubio-catalog olm}"
W1220 11:26:49.978844       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {operatorhubio-catalog.olm.svc:50051 operatorhubio-catalog.olm.svc:50051 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 10.32.0.101:50051: connect: connection refused". Reconnecting...

I included what felt relevant from the olm-operator, operatorhubio-catalog and packageserver logs.

catalog-operator.log operatorhubio-catalog.log packageserver.log olm-operator.log

epheo commented 10 months ago

Update: the connection refused errors in the packageserver pod log only occur while opm is starting up; package-server connects correctly over gRPC afterward.

The actual issue appears to concern authentication to the packageserver endpoint: the healthz, livez and readyz endpoints all return 200 OK, but the apis/packages.operators.coreos.com/v1 endpoint returns 403 Forbidden.

message: 'failing or missing response from https://10.32.0.210:5443/apis/packages.operators.coreos.com/v1:
  bad status from https://10.32.0.210:5443/apis/packages.operators.coreos.com/v1:
  403'
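
The comparison between the health endpoints and the API group endpoint can be repeated with a small probe. This is a sketch only: it sends anonymous requests and skips certificate verification (like `curl -k`), so it does not necessarily reproduce the request the kube-apiserver sends for its discovery check. The helper names `fetch_status` and `probe` are hypothetical.

```python
import ssl
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for url, without verifying the TLS certificate."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False          # must be disabled before CERT_NONE
    ctx.verify_mode = ssl.CERT_NONE     # equivalent in spirit to curl -k
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; keep only the code.
        return err.code


def probe(base_url, paths, fetch=fetch_status):
    """Map each path to the status code returned by fetch."""
    return {path: fetch(base_url.rstrip("/") + path) for path in paths}


PATHS = ["/healthz", "/livez", "/readyz", "/apis/packages.operators.coreos.com/v1"]
# Example call from inside the cluster, using the address from the message above:
# print(probe("https://10.32.0.210:5443", PATHS))
```

The `fetch` parameter is injectable so the mapping logic can be exercised without a cluster.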

If I run another package-server with --authorization-always-allow-paths /apis/packages.operators.coreos.com/v1, the endpoint returns the expected result.

/bin/package-server -v=4 --secure-port 5444 --global-namespace olm --debug --authorization-always-allow-paths /apis/packages.operators.coreos.com/v1
dnstools# curl -k https://10.200.0.94:5444/apis/packages.operators.coreos.com/v1
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "packages.operators.coreos.com/v1",
  "resources": [
    {
      "name": "packagemanifests",
      "singularName": "packagemanifest",
      "namespaced": true,
      "kind": "PackageManifest",
      "verbs": [
        "get",
        "list"
      ]
    },
    {
      "name": "packagemanifests/icon",
      "singularName": "",
      "namespaced": true,
      "kind": "PackageManifest",
      "verbs": [
        "get"
      ]
    }
  ]
}
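
For context on why the flag changes the result, here is a simplified Python model of how an always-allow path list could be matched: exact paths, plus prefix matching for entries ending in `*`. It is a sketch under stated assumptions, not the package-server implementation. In particular, the default list of `/healthz`, `/readyz` and `/livez` is an assumption that happens to be consistent with the 200 responses reported above, and `make_always_allow` is a hypothetical name.

```python
def _trim(path):
    """Drop a single leading slash, if present."""
    return path[1:] if path.startswith("/") else path


def make_always_allow(paths):
    """Build a predicate that tells whether a request path skips authorization."""
    exact = set()
    prefixes = []
    for raw in paths:
        p = _trim(raw)
        if "*" in p[:-1]:
            raise ValueError(f"only a trailing * is allowed in {raw!r}")
        if p.endswith("*"):
            prefixes.append(p[:-1])   # "metrics/*" matches anything under metrics/
        else:
            exact.add(p)

    def allowed(request_path):
        p = _trim(request_path)
        return p in exact or any(p.startswith(prefix) for prefix in prefixes)

    return allowed


API_PATH = "/apis/packages.operators.coreos.com/v1"

# Assumed defaults: only the health endpoints bypass authorization.
defaults = make_always_allow(["/healthz", "/readyz", "/livez"])
# With the extra flag value from the command above.
patched = make_always_allow(["/healthz", "/readyz", "/livez", API_PATH])

print(defaults(API_PATH), patched(API_PATH))
```

In this model, a path that is not on the list is not rejected outright; the decision falls through to whatever authorizer comes next, which is where the 403 would originate.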

https://github.com/openshift/library-go/blob/7a65fdb398e28782ee1650959a5e0419121e97ae/pkg/config/serving/server.go#L63 refers to system:masters, which matches the certificate I used to create the OLM resources.

What component or configuration might I be missing in my Kubernetes deployment?