openfaas / faas-netes

Serverless Functions For Kubernetes
https://www.openfaas.com
MIT License

Function CRD - unusual behaviour with constraints: [ ] #829

Closed aslanpour closed 2 years ago

aslanpour commented 3 years ago

I have generated a CRD-type YAML file for my function so that I can create it with kubectl. It works perfectly, except in one case the function is constantly rescheduled every 300s. This happens when I set constraints to an empty list, i.e. constraints: [], whereas the pattern constraints: (no value) causes no issue.

Expected Behaviour

The function is expected to be scheduled once and not rescheduled every 300s.

Current Behaviour

The function is re-scheduled every 300s.

Are you a GitHub Sponsor (Yes/No?)

Check at: https://github.com/sponsors/openfaas

List All Possible Solutions and Workarounds

Perhaps verifying the constraints value and treating [] (i.e., zero length) the same as "" or null would be a good fix; a sketch of this is shown below.

For now I have to either comment out constraints: [], set it to a bare constraints:, or give it a value.
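A minimal sketch of that normalization, assuming a Go helper inside the operator (the package name, function name, and field type here are illustrative, not the operator's actual code):

```go
// Package operator is a hypothetical home for this helper.
package operator

// normalizeConstraints maps a zero-length constraints slice (constraints: [])
// to nil, so that downstream comparisons see the same value as an omitted
// constraints: field.
func normalizeConstraints(constraints []string) []string {
	if len(constraints) == 0 {
		return nil
	}
	return constraints
}
```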

Which Solution Do You Recommend?

I think it should work as normal with constraints: [].

Steps to Reproduce (for bugs)

  1. Function CRD:

     ---
     apiVersion: openfaas.com/v1
     kind: Function
     metadata:
       name: test
       namespace: openfaas-fn
     spec:
       name: test
       image: aslanpour/irrigation:latest
       environment:
         REDIS_SERVER_IP: 10.43.242.161
         REDIS_SERVER_PORT: "3679"
         exec_timeout: 10s
         handler_wait_duration: 10s
         read_timeout: 10s
         write_debug: "true"
         write_timeout: 10s
       labels:
         com.openfaas.scale.max: "1"
         com.openfaas.scale.min: "1"
         com.openfaas.scale.zero: "true"
       annotations:
         linkerd.io/inject: disabled
       limits:
         memory: ""
         cpu: 280m
       requests:
         memory: ""
         cpu: 140m
       constraints: []
  2. Command to deploy: kubectl create -f test.crd.yaml
  3. The function is deployed.
  4. Right after deployment, kubectl get function -n openfaas-fn shows the pod:
     test-6477fc76cc-srcxj 1/1 Running 0 3s 10.42.0.47 master <none> <none>
  5. Around 5 minutes and 15s after deployment, the same command shows the pod was re-created:
     test-6477fc76cc-srcxj 1/1 Running 0 15s 10.42.0.47 master <none> <none>
  6. kubectl describe function/test -n openfaas-fn
     Name:         test
     Namespace:    openfaas-fn
     Labels:       <none>
     Annotations:  <none>
     API Version:  openfaas.com/v1
     Kind:         Function
     Metadata:
       Creation Timestamp:  2021-08-18T23:57:41Z
       Generation:          1
       Managed Fields:
         API Version:  openfaas.com/v1
         Fields Type:  FieldsV1
         fieldsV1:
           f:spec:
             .:
             f:annotations:
               .:
               f:linkerd.io/inject:
             f:constraints:
             f:environment:
               .:
               f:REDIS_SERVER_IP:
               f:REDIS_SERVER_PORT:
               f:exec_timeout:
               f:handler_wait_duration:
               f:read_timeout:
               f:write_debug:
               f:write_timeout:
             f:image:
             f:labels:
               .:
               f:com.openfaas.scale.max:
               f:com.openfaas.scale.min:
               f:com.openfaas.scale.zero:
             f:limits:
               .:
               f:cpu:
               f:memory:
             f:name:
             f:requests:
               .:
               f:cpu:
               f:memory:
         Manager:         kubectl-create
         Operation:       Update
         Time:            2021-08-18T23:57:41Z
       Resource Version:  3286655
       UID:               ead1826a-479c-43d4-be2e-4ae5b2ab571e
     Spec:
       Annotations:
         linkerd.io/inject:  disabled
       Constraints:
       Environment:
         REDIS_SERVER_IP:        10.43.242.161
         REDIS_SERVER_PORT:      3679
         exec_timeout:           10s
         handler_wait_duration:  10s
         read_timeout:           10s
         write_debug:            true
         write_timeout:          10s
       Image:                    aslanpour/irrigation:latest
       Labels:
         com.openfaas.scale.max:   1
         com.openfaas.scale.min:   1
         com.openfaas.scale.zero:  true
       Limits:
         Cpu:     280m
         Memory:
       Name:      test
       Requests:
         Cpu:     140m
         Memory:
     Events:
       Type    Reason  Age                    From               Message
       ----    ------  ----                   ----               -------
       Normal  Synced  3m41s (x2 over 5m47s)  openfaas-operator  Function synced successfully
       Normal  Synced  3m40s                  openfaas-operator  Function synced successfully

Context

The function is being rescheduled when it should not be.

Your Environment

CLI: commit: 72816d486cf76c3089b915dfb0b66b85cf096634 version: 0.13.13

alexellis commented 3 years ago

Hi @aslanpour please can you format any code blocks that you've written using three backticks? It's very difficult to read the example at the moment. Thanks

aslanpour commented 3 years ago

done

LucasRoesler commented 2 years ago

:thinking: @alexellis I took a peek at the operator today and I couldn't see any clear distinction between these two cases for constraints. For example, we do things like len(constraints) > 0, which would ignore the difference between a naked constraints: and one with the empty list constraints: [].

But then I thought about the "every 300s" part. That sounds like a resync, so I jumped to deploymentNeedsUpdate and thought about how that code works. Because of how the json package works, I think it is possible that when we diff the current state against the previous state stored in the annotation, we get a difference from comparing a nil slice with an empty slice. If this is right, we should be able to see it in the operator logs because we do this:

glog.V(2).Infof("Change detected for %s diff\n%s", function.Name, diff)

I will give it a try in a bit
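To illustrate that suspicion, here is a standalone Go sketch (the struct is a stand-in, not the operator's actual types or diff code) showing that encoding/json and a DeepEqual-style comparison both distinguish a nil slice from an empty one:

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// spec is a stand-in for the part of the Function spec that holds constraints.
type spec struct {
	Constraints []string `json:"constraints"`
}

func main() {
	var omitted spec                       // `constraints:` left out -> nil slice
	empty := spec{Constraints: []string{}} // `constraints: []`       -> empty slice

	a, _ := json.Marshal(omitted)
	b, _ := json.Marshal(empty)
	fmt.Println(string(a)) // {"constraints":null}
	fmt.Println(string(b)) // {"constraints":[]}

	// A DeepEqual-style diff also reports a difference, which a resync
	// loop could read as "the spec changed" and trigger a redeploy.
	fmt.Println(reflect.DeepEqual(omitted, empty)) // false
}
```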

LucasRoesler commented 2 years ago

I just tried this with

---
apiVersion: openfaas.com/v1
kind: Function
metadata:
  name: test
  namespace: openfaas-fn
spec:
  name: test
  image: ghcr.io/openfaas/nodeinfo:latest
  environment:
    exec_timeout: 10s
    handler_wait_duration: 10s
    read_timeout: 10s
    write_debug: "true"
    write_timeout: 10s
  labels:
    com.openfaas.scale.max: "1"
    com.openfaas.scale.min: "1"
    com.openfaas.scale.zero: "true"
  limits:
    memory: ""
    cpu: 280m
  requests:
    memory: ""
    cpu: 140m
  constraints: []

And I cannot reproduce it. @aslanpour, if this is still happening, perhaps you need to check the Kubernetes events to see why it is being rescheduled. At least for me, the function pod seems to be stable:

$ kubectl get po -n openfaas-fn
NAME                    READY   STATUS    RESTARTS   AGE
test-7dfd96f45c-vqflr   1/1     Running   0          8m44s
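(For reference, those events can be inspected with standard kubectl commands, using the test deployment name from the example above:)

$ kubectl get events -n openfaas-fn --sort-by=.lastTimestamp
$ kubectl describe deploy/test -n openfaas-fn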

I set up my cluster using kind with this config:

# cluster.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
  extraPortMappings:
  - containerPort: 80
    hostPort: 80
    protocol: TCP
  - containerPort: 443
    hostPort: 443
    protocol: TCP
  - containerPort: 31112 # this is the NodePort created by the helm chart
    hostPort: 8080 # this is your port on localhost
    protocol: TCP
$ kind create cluster --image kindest/node:v1.22.1 --config=cluster.yaml
$ arkade install openfaas -a=false --operator
$ kubectl rollout status deploy gateway -n openfaas
$ kubectl apply -f function.yaml
aslanpour commented 2 years ago

Just to let you know, I am using K3s.

alexellis commented 2 years ago

@aslanpour could you check the event log?

Are you able to reproduce the problem using a function from the store?

faas-cli generate --from-store nodeinfo --namespace openfaas-fn > nodeinfo.yaml

Here are two versions - one with defaults, and one without.

---
apiVersion: openfaas.com/v1
kind: Function
metadata:
  name: nodeinfo
  namespace: openfaas-fn
spec:
  name: nodeinfo
  image: ghcr.io/openfaas/nodeinfo:latest
  labels: {}
  annotations: {}
---
apiVersion: openfaas.com/v1
kind: Function
metadata:
  name: nodeinfo2
  namespace: openfaas-fn
spec:
  name: nodeinfo2
  image: ghcr.io/openfaas/nodeinfo:latest
  labels: {}
  annotations: {}
  constraints: []

Alex

alexellis commented 2 years ago

I was also unable to reproduce this using Linode and 3x K3s nodes.


What's probably happening is that you have a service or webhook in the background that is mutating your pods.
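(One way to check for that is to list any mutating admission webhooks registered in the cluster:)

$ kubectl get mutatingwebhookconfigurations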

Without a repro, this falls into the realms of R&D and I think we should draw a line under this now. @LucasRoesler thanks for the time you spent on it.

@aslanpour if you do reproduce this or figure it out, feel free to comment.

alexellis commented 2 years ago

/add label: support,question