google / knative-gcp

GCP event implementations to use with Knative Eventing.
https://github.com/knative/eventing
Apache License 2.0
160 stars 74 forks source link

Channel live locks when PullSubscription already exists #352

Closed Harwayne closed 5 years ago

Harwayne commented 5 years ago

I created ~320 Triggers on a Broker that is backed by a PubSub Channel. The Channel has created PullSubscriptions for ~5 of the subscribers, and has gotten permanently live locked.

  subscribablestatus:
    subscribers:
    - observedGeneration: 1
      ready: "True"
      uid: 9f79a0ca-f515-11e9-b13c-42010aa80029
    - observedGeneration: 1
      ready: "True"
      uid: 7975668a-f517-11e9-b13c-42010aa80029
    - observedGeneration: 1
      ready: "True"
      uid: 7cdf8679-f517-11e9-b13c-42010aa80029
    - message: PullSubscription cre-sub-90da4450-f517-11e9-b13c-42010aa80029 is not
        ready
      observedGeneration: 1
      ready: "False"
      uid: 90da4450-f517-11e9-b13c-42010aa80029
k get pullsubscription -oyaml cre-sub-90e327a7-f517-11e9-b13c-42010aa80029
apiVersion: pubsub.cloud.google.com/v1alpha1
kind: PullSubscription
metadata:
  annotations:
    metrics-resource-group: channels.messaging.cloud.google.com
    metrics-resource-name: default-kne-trigger
  creationTimestamp: "2019-10-22T22:01:57Z"
  finalizers:
  - cloud-run-events-pubsub-pullsubscription-controller
  generation: 1
  labels:
    events.cloud.google.com/channel: cloud-run-events-channel-controller
    events.cloud.google.com/channel-name: default-kne-trigger
    events.cloud.google.com/channel-subscriber: cre-sub-90e327a7-f517-11e9-b13c-42010aa80029
    events.cloud.google.com/controller-uid: c8b19f38-f514-11e9-b13c-42010aa80029
  name: cre-sub-90e327a7-f517-11e9-b13c-42010aa80029
  namespace: default
  ownerReferences:
  - apiVersion: messaging.cloud.google.com/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Channel
    name: default-kne-trigger
    uid: c8b19f38-f514-11e9-b13c-42010aa80029
  resourceVersion: "18345"
  selfLink: /apis/pubsub.cloud.google.com/v1alpha1/namespaces/default/pullsubscriptions/cre-sub-90e327a7-f517-11e9-b13c-42010aa80029
  uid: 9102af29-f517-11e9-b13c-42010aa80029
spec:
  ackDeadline: 30s
  mode: CloudEventsBinary
  retentionDuration: 168h0m0s
  secret:
    key: key.json
    name: google-cloud-key
  sink:
    uri: http://cre-default-kne-ingress-chan-publish.default.svc.cluster.local
  topic: cre-chan-c8b19f38-f514-11e9-b13c-42010aa80029
  transformer:
    uri: http://default-broker-filter.default.svc.cluster.local/triggers/default/generated-szs4x/90e0dac4-f517-11e9-b13c-42010aa80029
status:
  conditions:
  - lastTransitionTime: "2019-10-22T22:01:58Z"
    status: "True"
    type: Deployed
  - lastTransitionTime: "2019-10-22T22:02:04Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2019-10-22T22:01:57Z"
    status: "True"
    type: SinkProvided
  - lastTransitionTime: "2019-10-22T22:02:04Z"
    status: "True"
    type: Subscribed
  - lastTransitionTime: "2019-10-22T22:01:57Z"
    severity: Info
    status: "True"
    type: TransformerProvided
  observedGeneration: 1
  projectId: harwayne2
  sinkUri: http://cre-default-kne-ingress-chan-publish.default.svc.cluster.local
  subscriptionId: cre-pull-9102af29-f517-11e9-b13c-42010aa80029
  transformerUri: http://default-broker-filter.default.svc.cluster.local/triggers/default/generated-szs4x/90e0dac4-f517-11e9-b13c-42010aa80029
controller-794ff5f8dc-xjp94 controller {"level":"info","ts":"2019-10-22T22:35:41.055Z","logger":"controller.cloud-run-events-channel-controller","caller":"channel/channel.go:249","msg":"Channel \"default-kne-trigger\" will create subscription cre-sub-90e327a7-f517-11e9-b13c-42010aa80029","events.cloud.google.com/controller":"cloud-run-events-channel-controller"}                                             
controller-794ff5f8dc-xjp94 controller {"level":"error","ts":"2019-10-22T22:35:41.068Z","logger":"controller.cloud-run-events-channel-controller","caller":"controller/controller.go:357","msg":"Reconcile error","events.cloud.google.com/controller":"cloud-run-events-channel-controller","error":"pullsubscriptions.pubsub.cloud.google.com \"cre-sub-90e327a7-f517-11e9-b13c-42010aa80029\" already exists","stacktrace":"gke-internal/knative/cloudrun/vendor/knative.dev/pkg/controller.(*Impl).handleErr\n\t/usr/local/google/home/harwayne/go/src/gke-internal/knative/cloudrun/vendor/knative.dev/pkg/controller/controller.go:357\ngke-internal/knative/cloudrun/vendor/knative.dev/pkg/controller.(*Impl).processNextWorkItem\n\t/usr/local/google/home/harwayne/go/src/gke-internal/knative/cloudrun/vendor/knative.dev/pkg/controller/controller.go:343\ngke-internal/knative/cloudrun/vendor/knative.dev/pkg/controller.(*Impl).Run.func2\n\t/usr/local/google/home/harwayne/go/src/gke-internal/knative/cloudrun/vendor/knative.dev/pkg/controller/controller.go:291"}
controller-794ff5f8dc-xjp94 controller {"level":"info","ts":"2019-10-22T22:35:41.068Z","logger":"controller.cloud-run-events-channel-controller","caller":"controller/controller.go:344","msg":"Reconcile failed. Time taken: 23.376327ms.","events.cloud.google.com/controller":"cloud-run-events-channel-controller","knative.dev/traceid":"e2c70ffe-23dc-4053-ad62-4a602d296f80","knative.dev/key":"default/default-kne-trigger"}
controller-794ff5f8dc-xjp94 controller {"level":"info","ts":"2019-10-22T22:35:41.069Z","logger":"controller.cloud-run-events-channel-controller.event-broadcaster","caller":"record/event.go:258","msg":"Event(v1.ObjectReference{Kind:\"Channel\", Namespace:\"default\", Name:\"default-kne-trigger\", UID:\"c8b19f38-f514-11e9-b13c-42010aa80029\", APIVersion:\"messaging.cloud.google.com/v1alpha1\", ResourceVersion:\"21039\", FieldPath:\"\"}): type: 'Warning' reason: 'CreateSubscriberFailed' Creating Subscriber \"cre-sub-90e327a7-f517-11e9-b13c-42010aa80029\" failed","events.cloud.google.com/controller":"cloud-run-events-channel-controller"}
controller-794ff5f8dc-xjp94 controller {"level":"info","ts":"2019-10-22T22:35:41.069Z","logger":"controller.cloud-run-events-channel-controller.event-broadcaster","caller":"record/event.go:258","msg":"Event(v1.ObjectReference{Kind:\"Channel\", Namespace:\"default\", Name:\"default-kne-trigger\", UID:\"c8b19f38-f514-11e9-b13c-42010aa80029\", APIVersion:\"messaging.cloud.google.com/v1alpha1\", ResourceVersion:\"21039\", FieldPath:\"\"}): type: 'Warning' reason: 'InternalError' pullsubscriptions.pubsub.cloud.google.com \"cre-sub-90e327a7-f517-11e9-b13c-42010aa80029\" already exists","events.cloud.google.com/controller":"cloud-run-events-channel-controller"}

Theory: The Channel reconciler was reconciling a Channel, in particular it was creating the PullSubscription. It tried to write out the new status, but the Subscription controller raced with it and had already injected a new subscriber into the Channel's spec, so the Channel reconciler's UpdateStatus call was rejected. On every subsequent reconcile loop, the PullSubscription's UID was not in the status, so it was marked for creation. It already existed, so creation failed.

Possible fix: If creating a PullSubscription fails because it already exists, then check if its owner is the current Channel. If so, add it to the subUpdates slice. If not, then return the error. This would modify the following code: https://github.com/google/knative-gcp/blob/68086a12567deadd93d2e58f854a2c8b09bc467d/pkg/reconciler/channel/channel.go#L261-L265

Harwayne commented 5 years ago

If the possible fix works, then we will need to check every reconciler that creates objects and ensure it has the same 'adopt if owned by me' behavior.

yolocs commented 5 years ago

/assign