open-feature / open-feature-operator

A Kubernetes feature flag operator
https://openfeature.dev
Apache License 2.0

OpenFeature is disabled? #579

Closed agardnerIT closed 4 months ago

agardnerIT commented 10 months ago

In the same investigation as open-feature/open-feature-operator#575 and open-feature/flagd#1063

2023-12-06T23:11:23Z    DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/mutate-v1-pod", "UID": "8f021e26-6d9b-47f2-90f2-588236ec388d", "kind": "/v1, Kind=Pod", "resource": {"group":"","version":"v1","resource":"pods"}}
2023-12-06T23:11:23Z    DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/mutate-v1-pod", "code": 200, "reason": "OpenFeature is disabled", "UID": "8f021e26-6d9b-47f2-90f2-588236ec388d", "allowed": true}

Also noticed:

2023-12-06T23:10:31Z    ERROR   setup   podMutator backfill permissions error   {"error": "Index with name field:metadata.annotations.openfeature.dev/allowkubernetessync does not exist"}

OFO v0.5.2 deployed w/ Argo (Helm) + flagd proxy.

Application and Flag Config

apiVersion: core.openfeature.dev/v1beta1
kind: FeatureFlag
metadata:
  name: sample-flags
  namespace: openfeatureplayground-team70-preprod-cd
  labels:
    app: open-feature-demo
spec:
  flagSpec:
    flags:
      new-welcome-message:
        state: ENABLED
        variants:
          'on': true
          'off': false
        defaultVariant: 'off'
---
apiVersion: core.openfeature.dev/v1beta1
kind: FeatureFlagSource
metadata:
  name: flag-source
  namespace: openfeatureplayground-team70-preprod-cd
spec:
  sources:
  - source: openfeatureplayground-team70-preprod-cd/sample-flags
    provider: flagd-proxy
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: "openfeatureplayground-team70"
  namespace: "openfeatureplayground-team70-preprod-cd"
  labels:
    dt.owner: "team70"
spec:
  replicas: 2
  strategy:
    canary:
      steps:
      - setWeight: 50
      - pause: {duration: 5s}
      - setWeight: 100
  revisionHistoryLimit: 0
  selector:
    matchLabels:
      app.kubernetes.io/name: openfeaturedemo
  template:
    metadata:
      annotations:
        openfeature.dev/enabled: "true"
        openfeature.dev/featureflagsource: "openfeatureplayground-team70-preprod-cd/flag-source"
      labels:
        dt.owner: "team70"
        app.kubernetes.io/name: openfeaturedemo
        app.kubernetes.io/part-of: "openfeatureplayground-team70"
        app.kubernetes.io/version: "0.12.1"
        dynatrace-release-stage: "preprod"
    spec:
      containers:
      - name: openfeature-demo
        image: ghcr.io/open-feature/playground-app:v0.12.1
        args:
          - flagd
        env:
        - name: DT_RELEASE_VERSION
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['app.kubernetes.io/version']
        - name: DT_RELEASE_PRODUCT
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['app.kubernetes.io/part-of']
        - name: DT_RELEASE_STAGE
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['dynatrace-release-stage']
        - name: DT_CUSTOM_PROP
          value: "owner=team70 project=openfeatureplayground stage=preprod"
        - name: DT_TAGS
          value: "dt.owner=team70"
        ports:
        - name: http
          containerPort: 30000
          protocol: TCP
        resources:
          requests:
            memory: 3Mi
            cpu: 5m

kubectl version

$ kubectl version
Client Version: v1.28.1-eks-43840fb
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.3-eks-4f4795d
toddbaert commented 10 months ago

@agardnerIT I'm not too familiar with Argo Rollouts, but I'm a bit suspicious that the openfeature.dev/enabled annotation isn't working in that context (we'd typically see it on Deployments). Specifically, I'm wondering whether it even makes it onto the Pod (where it needs to be) from there.

I see in the Rollout docs that you can reference a workload (deployment) with workloadRef instead of using a template directly. Can you try that with a deployment that has the annotation?
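A minimal sketch of that approach, assuming the Rollout references a separate Deployment (the names below are illustrative) and the OpenFeature annotations live on the Deployment's pod template:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: openfeatureplayground-team70
  namespace: openfeatureplayground-team70-preprod-cd
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: openfeaturedemo
  # Reference an existing Deployment instead of embedding a pod template.
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: openfeaturedemo
  strategy:
    canary:
      steps:
      - setWeight: 50
      - pause: {duration: 5s}
      - setWeight: 100
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openfeaturedemo
  namespace: openfeatureplayground-team70-preprod-cd
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: openfeaturedemo
  template:
    metadata:
      annotations:
        # The OFO pod mutator looks for these on the resulting Pods.
        openfeature.dev/enabled: "true"
        openfeature.dev/featureflagsource: "openfeatureplayground-team70-preprod-cd/flag-source"
      labels:
        app.kubernetes.io/name: openfeaturedemo
    spec:
      containers:
      - name: openfeature-demo
        image: ghcr.io/open-feature/playground-app:v0.12.1
        args:
          - flagd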

odubajDT commented 10 months ago

I guess the difference between the index creation here and the index used when backfilling the permissions here might be the problem.

The index creation is happening with spec.template.metadata.annotations.openfeature.dev/openfeature.dev/allowkubernetessync, but the permission backfilling function is searching for metadata.annotations.openfeature.dev/openfeature.dev/allowkubernetessync. I would say these need to be aligned. Honestly, I'm not sure which one is correct; I'll try to find out when this was changed. It might also be me, as I did a bigger refactoring of the webhooks recently and this probably slipped through.

odubajDT commented 10 months ago

OK, I found the problem, and it was me who introduced it. Before the refactoring, the indexer was creating an index for metadata.annotations.openfeature.dev/openfeature.dev/allowkubernetessync; currently it's creating an index for spec.template.metadata.annotations.openfeature.dev/openfeature.dev/allowkubernetessync.

That makes sense, as this annotation is injected into the pod and not into the deployment. Therefore the indexed field should be the annotation on the pod, not the annotation on the pod template that is part of the deployment.
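For illustration, this is roughly where the indexed annotation ends up once the webhook has mutated a pod (the pod name and annotation value below are assumptions for the sketch):

apiVersion: v1
kind: Pod
metadata:
  name: openfeaturedemo-abc123   # illustrative pod name
  namespace: openfeatureplayground-team70-preprod-cd
  annotations:
    # Injected onto the Pod itself, not onto the Deployment's pod template,
    # so the field index has to target metadata.annotations on Pods.
    openfeature.dev/allowkubernetessync: "true"   # assumed value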

I already have a fix and will submit a PR soon

Sorry about the problems

odubajDT commented 10 months ago

Fix available here https://github.com/open-feature/open-feature-operator/pull/582

The bug was introduced here https://github.com/open-feature/open-feature-operator/commit/d234410a809760ba1c8592f95be56891e0cae855#diff-2873f79a86c0d8b3335cd7731b0ecf7dd4301eb19a82ef7a1cba7589b5252261L188

toddbaert commented 10 months ago

Reopening this pending resolution of this comment.

odubajDT commented 9 months ago

Judging by Adam's manifests, I think the enablement of OF should be OK. Also, if OF were not enabled, we wouldn't see the problems with backfilling permissions -> we wouldn't be able to reach that part of the code at all.

Maybe Argo was creating a Pod in a completely different namespace at that time (completely unrelated to this Deployment)? In that case it would make sense that the podMutator kicks in, checks whether the Pod is annotated, finds that it isn't, writes out the message and lets the pod be bound to a node without any mutation. I think we need more information about what was happening in the system generally... Just thinking out loud about the possibilities; I don't have any proof for this theory.

@agardnerIT can you maybe provide us with some more insight? Thank you!

edxz7 commented 9 months ago

Hi, I'm also facing the same error.

I started playing with the killercoda scenario for the OpenFeature Operator for a POC, but after following the instructions I noticed that the changes made to the CRDs had no effect on the demo app, so I investigated further and found this:

[screenshot: Screen Shot 2023-12-23 at 18 11 11]

Note: I had previously made the changes mentioned here to migrate the CRs for versions of the OpenFeature Operator above 0.5.0, because the example files are outdated.

I also tried installing everything in my k8s cluster following the instructions provided in your docs

After not seeing the flagd sidecar being injected, I checked the logs of the pod running the operator and again found the same error:

2023-12-23T20:20:26Z    ERROR   setup   podMutator backfill permissions error   {"error": "Index with name field:metadata.annotations.openfeature.dev/allowkubernetessync does not exist"}

and I also have this in the logs:

2023-12-23T20:20:39Z    DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/mutate-v1-pod", "UID": "d18b1152-6286-43df-bc5c-e2659363a1c5", "kind": "/v1, Kind=Pod", "resource": {"group":"","version":"v1","resource":"pods"}}
2023-12-23T20:20:39Z    DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/mutate-v1-pod", "code": 200, "reason": "OpenFeature is disabled", "UID": "d18b1152-6286-43df-bc5c-e2659363a1c5", "allowed": true}

So the issue is not related to Argo. It's also worth mentioning that I tried versions 0.5.1 and 0.5.0 of the OF operator, but they all have the same bug; you can confirm this by installing those versions in the killercoda scenario I mentioned above.

toddbaert commented 9 months ago

@edxz7 @agardnerIT I've released 0.5.3, which should (at the very least partially) resolve this issue. Please confirm when you have time.

edxz7 commented 9 months ago

Hello @toddbaert

I did a quick test in the killercoda scenario for the OpenFeature Operator with the version you just released (0.5.3). The error message ERROR setup podMutator backfill permissions error is gone, but OpenFeature is still disabled; I got the same logs as before:

2023-12-30T06:32:09Z    DEBUG   controller-runtime.webhook.webhooks     wrote response  {"webhook": "/mutate-v1-pod", "code": 200, "reason": "OpenFeature is disabled", "UID": "fe49eaf0-2906-4f44-aac1-e2bb1203c873", "allowed": true}

and the OFO still doesn't inject the sidecar. I don't know if this is the expected behavior (having OpenFeature disabled by default) and, if it is, how can I enable it manually?
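(For reference, a minimal sketch of the pod-template annotations that I understand are expected to enable injection, mirroring the manifests earlier in this thread; the names below are placeholders:)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-feature-demo   # placeholder name
spec:
  selector:
    matchLabels:
      app: open-feature-demo
  template:
    metadata:
      labels:
        app: open-feature-demo
      annotations:
        # The OFO webhook checks these on the pod template / resulting pod
        # when deciding whether to inject the flagd sidecar.
        openfeature.dev/enabled: "true"
        openfeature.dev/featureflagsource: "default/flag-source"   # <namespace>/<FeatureFlagSource name>
    spec:
      containers:
      - name: demo
        image: ghcr.io/open-feature/playground-app:v0.12.1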

Btw, I also tried to test it in my k8s cluster (which is running in EKS), but the installation of the new version ends with an error; these are the logs of the installation:

clusterrolebinding.rbac.authorization.k8s.io/open-feature-operator-flagd-kubernetes-sync configured
clusterrolebinding.rbac.authorization.k8s.io/open-feature-operator-manager-rolebinding unchanged
clusterrolebinding.rbac.authorization.k8s.io/open-feature-operator-proxy-rolebinding unchanged
configmap/open-feature-operator-manager-config unchanged
service/open-feature-operator-controller-manager-metrics-service unchanged
service/open-feature-operator-webhook-service unchanged
deployment.apps/open-feature-operator-controller-manager unchanged
certificate.cert-manager.io/open-feature-operator-serving-cert unchanged
issuer.cert-manager.io/open-feature-operator-selfsigned-issuer unchanged
mutatingwebhookconfiguration.admissionregistration.k8s.io/open-feature-operator-mutating-webhook-configuration configured
error: no matching resources found

I don't know why that error happens, but the pods run without issues and they produce the same logs as in the killercoda scenario; that is, the error message ERROR setup podMutator backfill permissions error is gone, but the message OpenFeature is disabled is still present.

toddbaert commented 9 months ago

I was just looking at the killercoda demo and there may be other issues, but the main one is that the demo hasn't been updated since our change to v1beta1 (migration path outlined here). It's using older CR definitions that are no longer supported in beta. I have created an issue to update the killercoda example and to lock it to a specific OFO version to prevent this from happening again.
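For comparison, a minimal sketch of the current v1beta1 resource shapes (matching the manifests at the top of this issue; names and namespaces are illustrative):

apiVersion: core.openfeature.dev/v1beta1
kind: FeatureFlag
metadata:
  name: sample-flags
spec:
  flagSpec:
    flags:
      new-welcome-message:
        state: ENABLED
        variants:
          'on': true
          'off': false
        defaultVariant: 'off'
---
apiVersion: core.openfeature.dev/v1beta1
kind: FeatureFlagSource
metadata:
  name: flag-source
spec:
  sources:
  - source: default/sample-flags   # <namespace>/<FeatureFlag name>
    provider: flagd-proxy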

@edxz7 sorry about that!

I believe that outside of killercoda, things should be OK. I just ran through https://openfeature.dev/docs/tutorials/ofo/ and it works well. If you can run kind on your machine, I recommend going through that - the content is similar and up to date.

agardnerIT commented 9 months ago

Hi gents, I think I've now updated the killercoda OpenFeature Operator demo for these fixes.

Please try: https://killercoda.com/agardnerit/scenario/testing-ground and LMK here. If all is well, I'll open a PR.

One thing I noticed is a typo in the docs tutorial. When the welcome message flag is changed, the second text is Fib3r: Math at the speed of the internet!, not Welcome to Fib3r: Fibonacci as a Service!, so the docs need to be updated. (https://github.com/open-feature/openfeature.dev/issues/321 created to track this fix)

Change Summary:

  1. Pinned to a specific OFO version (see intro_foreground.sh)
  2. Updated the text to match the docs tutorial (see note about the bug above)
  3. Due to the way flagd appends the port number and the way killercoda builds URLs, there's a weird setup-and-replace step in the CRD. First we get the external URL from killercoda, replace their PORT with 30002 and remove the leading https:// (otherwise we end up with duplicates). Then we set FLAGD_HOST_WEB to the correct value and FLAGD_PORT_WEB to 443 (if omitted, flagd will append :8013, which breaks things). Finally, FLAGD_TLS: true re-adds https://; otherwise we end up with http://https://KILLERCODA-URL... See the sketch below.
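A rough sketch of what that fix-up ends up looking like for the demo app, assuming FLAGD_HOST_WEB, FLAGD_PORT_WEB and FLAGD_TLS end up as plain container env vars (the host value is a placeholder):

env:
- name: FLAGD_HOST_WEB
  # Killercoda external hostname with the leading https:// stripped and
  # Killercoda's port replaced by 30002 (placeholder value shown).
  value: "PORT-30002-SESSION-ID.killercoda.example"
- name: FLAGD_PORT_WEB
  value: "443"    # if omitted, :8013 gets appended, which breaks the URL
- name: FLAGD_TLS
  value: "true"   # re-adds https://, avoiding http://https://KILLERCODA-URL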
beeme1mr commented 9 months ago

Thanks @agardnerIT, I've updated the tutorial to include the correct banner text.

The tutorial works great, but there were a few minor issues I noticed.

Thanks!

agardnerIT commented 8 months ago

Sorry it's taken so long, but I think the issues noted above have now been fixed. You should see DEBUG_VERSION=6 at the top of the terminal output.

If you give me the thumbs up, I'll get the PR done.

agardnerIT commented 7 months ago

@beeme1mr just chasing this up. Can we merge the tutorial into the main repo?

beeme1mr commented 7 months ago

Yes, please open a PR when you have a moment.

beeme1mr commented 4 months ago

@agardnerIT please update the Killercoda tutorials when you have a moment. I'm going to close this ticket because the underlying issue has been addressed.