operator-framework / operator-lifecycle-manager

A management framework for extending Kubernetes with Operators
https://olm.operatorframework.io
Apache License 2.0

Packageserver can't connect to the gRPC server #1288

Open cmoulliard opened 4 years ago

cmoulliard commented 4 years ago

Issue

The OLM packageserver cannot connect to the gRPC server backed by an image built with operator-registry.

The following CatalogSource has been deployed successfully on Kubernetes 1.15:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: prometheus-manifests
  namespace: demo
spec:
  displayName: Prometheus Operator
  publisher: Snowdrop
  sourceType: grpc
  image: quay.io/cmoulliard/olm-index:0.1.0

but no PackageManifests are created in the demo namespace.
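
A minimal way to confirm what packageserver is serving is to query the aggregated PackageManifest API directly (assuming kubectl access to the cluster; an empty result is expected while the gRPC connection is failing):

# PackageManifests are served by packageserver itself, so this also
# exercises the failing connection path
kubectl get packagemanifests -n demo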

When I look at the packageserver running as a pod in the olm namespace, I see this error:

W0212 20:57:16.234189       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {prometheus-manifests.demo.svc:50051 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.109.228.141:50051: i/o timeout". Reconnecting...
I0212 20:57:16.234271       1 balancer_conn_wrappers.go:127] pickfirstBalancer: HandleSubConnStateChange: 0xc000029dd0, TRANSIENT_FAILURE
I0212 20:57:17.239568       1 balancer_conn_wrappers.go:127] pickfirstBalancer: HandleSubConnStateChange: 0xc000029dd0, CONNECTING
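
The dialed IP (10.109.228.141) can be cross-checked against what the Service and its endpoints actually hold (a minimal check, assuming kubectl access):

# Compare the Service clusterIP and the registered endpoint IPs
kubectl -n demo get svc prometheus-manifests -o jsonpath='{.spec.clusterIP}{"\n"}'
kubectl -n demo get endpoints prometheus-manifests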

A Service resource was created to expose it:

kind: Service
apiVersion: v1
metadata:
  name: prometheus-manifests
  namespace: demo
  ownerReferences:
    - apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: prometheus-manifests
      uid: 0b863564-d8c0-471a-b3c7-63c1c1433153
      controller: false
      blockOwnerDeletion: false
spec:
  ports:
    - name: grpc
      protocol: TCP
      port: 50051
      targetPort: 50051
  selector:
    olm.catalogSource: prometheus-manifests
  clusterIP: 10.107.184.132
  type: ClusterIP
  sessionAffinity: None
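
Whether prometheus-manifests.demo.svc actually resolves to that clusterIP can be checked from inside the cluster with a throwaway pod (the busybox image is just a convenient example):

# Resolve the exact name packageserver is dialing, from inside the cluster
kubectl -n demo run dns-test --rm -it --restart=Never --image=busybox -- \
  nslookup prometheus-manifests.demo.svc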

Here is the Pod resource created for the gRPC server:

kind: Pod
apiVersion: v1
metadata:
  name: prometheus-manifests-kdnmf
  generateName: prometheus-manifests-
  namespace: demo
  selfLink: /api/v1/namespaces/demo/pods/prometheus-manifests-kdnmf
  uid: a57ef224-9729-44c0-a591-c885bb1695e7
  resourceVersion: '15873'
  creationTimestamp: '2020-02-12T20:56:56Z'
  labels:
    olm.catalogSource: prometheus-manifests
  ownerReferences:
    - apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: prometheus-manifests
      uid: 0b863564-d8c0-471a-b3c7-63c1c1433153
      controller: false
      blockOwnerDeletion: false
spec:
  volumes:
    - name: default-token-rzx22
      secret:
        secretName: default-token-rzx22
        defaultMode: 420
  containers:
    - name: registry-server
      image: 'quay.io/cmoulliard/olm-index:0.1.0'
      ports:
        - name: grpc
          containerPort: 50051
          protocol: TCP
      resources:
        limits:
          cpu: 100m
          memory: 100Mi
        requests:
          cpu: 10m
          memory: 50Mi
      volumeMounts:
        - name: default-token-rzx22
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      livenessProbe:
        exec:
          command:
            - grpc_health_probe
            - '-addr=localhost:50051'
        initialDelaySeconds: 10
        timeoutSeconds: 1
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 3
      readinessProbe:
        exec:
          command:
            - grpc_health_probe
            - '-addr=localhost:50051'
        initialDelaySeconds: 5
        timeoutSeconds: 1
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 3
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      imagePullPolicy: IfNotPresent
  restartPolicy: Always
  terminationGracePeriodSeconds: 30
  dnsPolicy: ClusterFirst
  nodeSelector:
    beta.kubernetes.io/os: linux
  serviceAccountName: default
  serviceAccount: default
  nodeName: k8s-115
  securityContext: {}
  schedulerName: default-scheduler
  tolerations:
    - operator: Exists
  priority: 0
  enableServiceLinks: true
status:
  phase: Running
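
Since the image ships grpc_health_probe (the probes above invoke it), the same health check can be run by hand inside the pod (pod name taken from the dump above):

# Run the registry's own health probe directly in the catalog pod
kubectl -n demo exec prometheus-manifests-kdnmf -- grpc_health_probe -addr=localhost:50051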

If I SSH into the VM running the cluster, I can use the grpcurl tool:

[root@k8s-115 ~]# grpcurl -plaintext 10.107.184.132:50051 list api.Registry
api.Registry.GetBundle
api.Registry.GetBundleForChannel
api.Registry.GetBundleThatReplaces
api.Registry.GetChannelEntriesThatProvide
api.Registry.GetChannelEntriesThatReplace
api.Registry.GetDefaultBundleThatProvides
api.Registry.GetLatestChannelEntriesThatProvide
api.Registry.GetPackage
api.Registry.ListPackages

but ListPackages returns nothing:

 grpcurl -plaintext 10.107.184.132:50051 api.Registry.ListPackages
[root@k8s-115 ~]# 
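
An empty ListPackages while the endpoint itself answers suggests the index image may contain no packages, which can be ruled out off-cluster (a minimal sketch, assuming a local container runtime):

# Run the index image locally and query it the same way
docker run -d --rm -p 50051:50051 quay.io/cmoulliard/olm-index:0.1.0
grpcurl -plaintext localhost:50051 api.Registry.ListPackages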

Additional info

kubernetes cluster: 1.15
olm version: 0.14.1
image index: quay.io/cmoulliard/olm-index:0.1.0
operator-registry: master
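
If the index turns out to be empty, rebuilding it with opm from operator-registry is the usual path (a hedged sketch; the bundle image reference below is a placeholder, not the actual bundle used here):

# Add a bundle image to the index; the bundle ref is hypothetical
opm index add \
  --bundles quay.io/example/prometheus-operator-bundle:0.1.0 \
  --tag quay.io/cmoulliard/olm-index:0.1.0
docker push quay.io/cmoulliard/olm-index:0.1.0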

flickerfly commented 4 years ago

Just to try to summarize:

You have a catalog deployed into a different namespace than the one the PackageServer is deployed into. The PackageServer resolves prometheus-manifests.demo.svc to IP 10.109.228.141, but the Service is set up on IP 10.107.184.132 and works just fine when hit with grpcurl.

So it appears the PackageServer is trying to reach the pod directly rather than going through the Service. If it were running in the same namespace, as is the default, that would work.

Does the Catalog Operator correctly access the catalog?
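
One way to check that is the connection state the catalog operator records on the CatalogSource itself (assuming that status field is populated on this OLM version):

# READY means the catalog operator can reach the gRPC server
kubectl -n demo get catalogsource prometheus-manifests \
  -o jsonpath='{.status.connectionState.lastObservedState}{"\n"}'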

I wonder if you could actually set up the PackageServer in the same namespace and whether that would solve your problem. I don't know if that would be an official solution, but it might be a workaround.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

gmarcy commented 4 years ago

Seeing the same issue on the OKD4 cluster I just installed today, following https://medium.com/@craig_robinson/openshift-4-4-okd-bare-metal-install-on-vmware-home-lab-6841ce2d37eb

W0504 17:39:03.726845 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {community-operators.openshift-marketplace.svc:50051 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.30.183.136:50051: i/o timeout". Reconnecting...
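
An i/o timeout to a service IP usually points at cluster networking rather than the registry itself; a comparable query can be replayed from a throwaway pod (the fullstorydev/grpcurl image is just one convenient option):

# Query the catalog service from inside the cluster, the way packageserver does
kubectl -n openshift-marketplace run grpc-test --rm -it --restart=Never \
  --image=fullstorydev/grpcurl -- \
  -plaintext community-operators.openshift-marketplace.svc:50051 api.Registry.ListPackages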

gmarcy commented 4 years ago

Noticed this, in case it helps:

$ kubectl -n openshift-marketplace get event
...
14m    Warning   Unhealthy   pod/community-operators-5b7f9bb9bf-b2v9v   Readiness probe failed: timeout: failed to connect service "localhost:50051" within 1s
105s   Warning   Unhealthy   pod/community-operators-5b7f9bb9bf-b2v9v   Liveness probe failed: timeout: failed to connect service "localhost:50051" within 1s
15m    Warning   Unhealthy   pod/community-operators-5b7f9bb9bf-b2v9v   Readiness probe failed: command timed out
14m    Warning   Unhealthy   pod/community-operators-5b7f9bb9bf-b2v9v   Liveness probe failed: command timed out
...
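
Those probes run with a 1s timeout; running the same probe by hand with a longer connect timeout distinguishes a slow server from an unreachable one (pod name from the events above):

# Same check the kubelet runs, but with a 5s budget instead of 1s
kubectl -n openshift-marketplace exec community-operators-5b7f9bb9bf-b2v9v -- \
  grpc_health_probe -addr=localhost:50051 -connect-timeout 5s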

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Kampe commented 4 years ago

Seeing the same issues, with the readiness and liveness probes failing to contact localhost in the pods.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ghost commented 3 years ago

Any update on the above issue? I'm facing the same issue on a Mac M1 with a go-grpc Consul setup.