openfaas / faas-netes

Serverless Functions For Kubernetes
https://www.openfaas.com
MIT License

Trouble Executing CRD/Operator on ARM #614

Closed mjallday closed 4 years ago

mjallday commented 4 years ago

Expected Behaviour

Following the instructions on how to apply a function on an ARM OpenFaaS cluster, the function's Deployment and Pod should be created.

Current Behaviour

The Function custom resource is created, but the related Deployment, Pod, and Service are not.

$ faas-cli generate my-fn  -f ./my-fn.yml > fn.yaml
$ kubectl -n openfaas-fn apply -f fn.yaml
function.openfaas.com/my-fn created
kubectl -n openfaas-fn get functions
NAME                      AGE
my-fn         16s
kubectl -n openfaas-fn get all
NAME                            READY   STATUS    RESTARTS   AGE
pod/nodeinfo-5f86ccf74f-nd47d   1/1     Running   0          23h

NAME               TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/nodeinfo   ClusterIP   172.20.248.1   <none>        8080/TCP   23h

NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/nodeinfo   1/1     1            1           23h

NAME                                  DESIRED   CURRENT   READY   AGE
replicaset.apps/nodeinfo-5f86ccf74f   1         1         1       23h

I expect to see a Deployment and Pod for the function here. There are no events attached to "my-fn":

kubectl -n openfaas-fn describe functions
Name:         my-fn
Namespace:    openfaas-fn
Labels:       app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/version=1.0.0
              helm.sh/chart=openfaas-functions-1.0.0
Annotations:  helm.fluxcd.io/antecedent: openfaas-fn:helmrelease/functions
API Version:  openfaas.com/v1
Kind:         Function
Metadata:
  Creation Timestamp:  2020-04-17T02:36:20Z
  Generation:          1
  Resource Version:    512642
  Self Link:           /apis/openfaas.com/v1/namespaces/openfaas-fn/functions/my-fn
  UID:                 38927f6d-8054-11ea-9b5f-06a170d92848
Spec:
  Annotations:
    com.openfaas.health.http.initialDelay:  2s
    com.openfaas.health.http.path:          /healthz
  Environment:
    PODINFO_PORT:      8080
    PODINFO_UI_COLOR:  <nil>
  Image:               quay.io/verygoodsecurity/my-fn:latest
  Labels:
    com.openfaas.scale.max:  1
    com.openfaas.scale.min:  1
  Limits:
    Cpu:                      200m
    Memory:                   256Mi
  Name:                       my-fn
  Read Only Root Filesystem:  true
  Requests:
    Cpu:     10m
    Memory:  128Mi
Events:      <none>

Deploying via the CLI (faas-cli) or the web UI works fine.

Steps to Reproduce (for bugs)

  1. Deploy an ARM64 Kubernetes cluster
  2. Run the commands above

Context

Your Environment

CLI: commit: 2d183c713b32385831dc7f69c073e57c06e3b76c version: 0.12.2

* Docker version `docker version` (e.g. Docker 17.0.05 ):

* What version and distribution of Kubernetes are you using? `kubectl version`

kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:16:51Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.9-eks-f459c0", GitCommit:"f459c0672169dd35e77af56c24556530a05e9ab1", GitTreeState:"clean", BuildDate:"2020-03-18T04:24:17Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}


* Operating System and version (e.g. Linux, Windows, MacOS):

* Link to your project or a code example to reproduce issue:

* What network driver are you using and what CIDR? i.e. Weave net / Flannel

alexellis commented 4 years ago

It all seems to be working as designed? What's the ask here?

mjallday commented 4 years ago

Sorry, it may not be clear. The running pod is from another function that I manually created via the UI.

There should be a Deployment with a Pod for the function I created via the operator, but it's not being created.

The running pod is nodeinfo, which is one of the built-in functions.

I'm trying to debug why my pod is not being launched.

I expect to see a Pod/Deployment named my-fn running the image quay.io/verygoodsecurity/my-fn:latest.
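
A quick way to check for it (just a sketch; it assumes the controller applies its usual faas_function label to the resources it creates):

# list anything the controller might have created for this function
kubectl -n openfaas-fn get deploy,pod,svc -l faas_function=my-fn
# and check the Function resource itself for any recorded events
kubectl -n openfaas-fn describe function my-fn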


mjallday commented 4 years ago

Looking at logs from the pods:

kubectl -n openfaas logs -f gateway-58548dc44d-7md2h -c faas-netes

W0417 15:18:37.183131       1 reflector.go:326] k8s.io/client-go/informers/factory.go:135: watch of *v1.Endpoints ended with: too old resource version: 592310 (592584)
W0417 15:43:35.194190       1 reflector.go:326] k8s.io/client-go/informers/factory.go:135: watch of *v1.Endpoints ended with: too old resource version: 594372 (595259)
W0417 16:00:54.204263       1 reflector.go:326] k8s.io/client-go/informers/factory.go:135: watch of *v1.Endpoints ended with: too old resource version: 597052 (597118)

This looks unrelated. When I deploy a new function, I don't see any log output from any of the pods.

Is there a particular pod I can watch to see what the operator is doing?
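
For reference, these are roughly the commands I've been watching (a sketch; it assumes the controller is the faas-netes sidecar in the gateway pod):

# list the OpenFaaS control-plane pods
kubectl -n openfaas get pods
# follow the controller sidecar's logs while re-applying the Function
kubectl -n openfaas logs deploy/gateway -c faas-netes -f
# check for events in the function namespace
kubectl -n openfaas-fn get events --sort-by=.metadata.creationTimestamp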

alexellis commented 4 years ago

@stefanprodan should be able to help with this. Can you please copy/paste "Your Environment" into this issue at the top too?

mjallday commented 4 years ago

Thanks. By the way, here's the definition of the function:

cat fn.yaml

---
apiVersion: openfaas.com/v1alpha2
kind: Function
metadata:
  name: my-fn
spec:
  name: my-fn
  image: quay.io/verygoodsecurity/my-fn:latest

I've tried a few others. All have similar results: the Function resource gets created, but no Pods or Deployments are created.

There are no events on the Function resource, so I can't tell why it's not being applied. Creating functions via the API or the OpenFaaS UI works just fine.

kubectl -n openfaas describe pod gateway-58548dc44d-7md2h

Name:           gateway-58548dc44d-7md2h
Namespace:      openfaas
Priority:       0
Node:           ip-10-14-99-10.us-west-2.compute.internal/10.14.99.10
Start Time:     Thu, 16 Apr 2020 10:25:42 -0700
Labels:         app=gateway
                pod-template-hash=58548dc44d
Annotations:    kubernetes.io/psp: eks.privileged
                prometheus.io.port: 8082
                prometheus.io.scrape: true
Status:         Running
IP:             10.14.101.201
Controlled By:  ReplicaSet/gateway-58548dc44d
Containers:
  gateway:
    Container ID:   docker://ba9da345b6e1a9890e7f88adf68fcd6aafb5873d039c217c2f1b82b7f58ae4d5
    Image:          openfaas/gateway:0.18.13-arm64
    Image ID:       docker-pullable://openfaas/gateway@sha256:a5ebb0005d623c81b8e8b47a89dd5f90139ecc2bcff83dc8ea93281b94024d45
    Port:           8080/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Thu, 16 Apr 2020 10:25:45 -0700
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:      50m
      memory:   120Mi
    Liveness:   http-get http://:8080/healthz delay=0s timeout=5s period=10s #success=1 #failure=3
    Readiness:  http-get http://:8080/healthz delay=0s timeout=5s period=10s #success=1 #failure=3
    Environment:
      read_timeout:             65s
      write_timeout:            65s
      upstream_timeout:         60s
      functions_provider_url:   http://127.0.0.1:8081/
      direct_functions:         true
      direct_functions_suffix:  openfaas-fn.svc.cluster.local
      function_namespace:       openfaas-fn
      faas_nats_address:        nats.openfaas.svc.cluster.local
      faas_nats_port:           4222
      faas_nats_channel:        faas-request
      basic_auth:               true
      secret_mount_path:        /var/secrets
      auth_proxy_url:           http://basic-auth-plugin.openfaas:8080/validate
      auth_pass_body:           false
      scale_from_zero:          true
      max_idle_conns:           1024
      max_idle_conns_per_host:  1024
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from openfaas-openfaas-controller-token-fcjbf (ro)
      /var/secrets from auth (ro)
  faas-netes:
    Container ID:   docker://5120a9e59cb1d253ccc02043b9f6c8e324f490f921ca4ff1fdd842507683d923
    Image:          openfaas/faas-netes:0.10.2-arm64
    Image ID:       docker-pullable://openfaas/faas-netes@sha256:5d3323092ec536df47de651c65aea3db09d1798e3b30a6ee2a965e37a28bf72c
    Port:           8081/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Thu, 16 Apr 2020 10:25:49 -0700
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     50m
      memory:  120Mi
    Environment:
      port:                                   8081
      function_namespace:                     openfaas-fn
      read_timeout:                           60s
      write_timeout:                          60s
      image_pull_policy:                      Always
      http_probe:                             true
      set_nonroot_user:                       false
      readiness_probe_initial_delay_seconds:  2
      readiness_probe_timeout_seconds:        1
      readiness_probe_period_seconds:         2
      liveness_probe_initial_delay_seconds:   2
      liveness_probe_timeout_seconds:         1
      liveness_probe_period_seconds:          2
    Mounts:
      /tmp from faas-netes-temp-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from openfaas-openfaas-controller-token-fcjbf (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  faas-netes-temp-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  auth:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  basic-auth
    Optional:    false
  openfaas-openfaas-controller-token-fcjbf:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  openfaas-openfaas-controller-token-fcjbf
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/arch=arm64
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

alexellis commented 4 years ago

Does it use auth?

What are the logs of the controller and what events are in the namespace?

I wonder if your EKS version of v1.14 isn't supported by the client-go 1.17 version we updated to recently? @stefanprodan @LucasRoesler

mjallday commented 4 years ago

Does it use auth?

Auth only exists on the UI. This is mostly deployed using the Helm chart specified in https://github.com/openfaas/faas-netes/commit/77851960b31b980f0328d55fd0f8c2b168bac8b7

The only customization I've done is adding the ARM64 values. Here's the exact HelmRelease I'm applying:

---
apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: openfaas
  namespace: openfaas
  annotations:
    flux.weave.works/automated: "false"
spec:
  chart:
    git: https://github.com/openfaas/faas-netes
    path: chart/openfaas
    ref: 77851960b31b980f0328d55fd0f8c2b168bac8b7
  values:

    ingress:
      enabled: false

    functionNamespace: openfaas-fn
    generateBasicAuth: true
    istio:
      mtls: false

    basic_auth: true

    # arm64 specific values
    gateway:
      image: openfaas/gateway:0.18.13-arm64
      directFunctions: true
      replicas: 1

    oauth2Plugin:
      enabled: false

    faasnetes:
      image: openfaas/faas-netes:0.10.2-arm64
      httpProbe: true

    operator:
      image: openfaas/faas-netes:0.10.2-arm64
      create: false

    queueWorker:
      image: openfaas/queue-worker:0.9.0-arm64

    prometheus:
      image: prom/prometheus:v2.11.0
      create: true
      resources:
        requests:
          memory: "125Mi"

    alertmanager:
      image: prom/alertmanager:v0.18.0
      create: true

    faasIdler:
      image: openfaas/faas-idler:0.3.0-arm64

    basicAuthPlugin:
      image: openfaas/basic-auth-plugin:0.18.13-arm64
      replicas: 1

    ingressOperator:
      create: false

    nodeSelector:
      beta.kubernetes.io/arch: arm64
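
The plain-Helm equivalent of that release would be roughly this (a sketch only; it assumes the values: block above is saved locally as values-arm64.yaml and the chart is taken from the same commit):

# hypothetical plain-Helm install mirroring the HelmRelease above
git clone https://github.com/openfaas/faas-netes
cd faas-netes && git checkout 77851960b31b980f0328d55fd0f8c2b168bac8b7
helm upgrade --install openfaas ./chart/openfaas \
  --namespace openfaas \
  -f values-arm64.yaml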

What are the logs of the controller and what events are in the namespace?

I've looked at the output from the gateway pod. Nothing interesting there (only health-check and UI access logs show up, no other output). Happy to supply logs from other pods, but I also don't see anything relevant there; just tell me what you'd like to see.

No events on the function itself. It's like the operator isn't even processing it.

alexellis commented 4 years ago

The question about auth is about whether your quay.io image is using auth. You redacted the image name; why?

alexellis commented 4 years ago

Can you provide a values.yaml instead of a FluxCD config? Flux isn't supported on ARM64 unless people build their own images or use community images, and I'm not going to do that.

mjallday commented 4 years ago

Ah, I see. I'd say it's working because it works if I pull the image using the UI.

Let me confirm, though, by using a public image. I'll update shortly.

alexellis commented 4 years ago

Did you setup a proper configuration for the image pull secrets? https://docs.openfaas.com/deployment/kubernetes/#use-a-private-registry-with-kubernetes

Also, did you run kubectl get events -n openfaas-fn ?
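
For reference, the private-registry setup in that page boils down to roughly the following (a sketch; the secret name and credentials are placeholders):

# create a registry secret in the function namespace
kubectl create secret docker-registry my-quay-secret \
  --docker-server=quay.io \
  --docker-username=<username> \
  --docker-password=<password> \
  --namespace openfaas-fn
# attach it to the service account used by function pods
kubectl patch serviceaccount default -n openfaas-fn \
  -p '{"imagePullSecrets": [{"name": "my-quay-secret"}]}'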

mjallday commented 4 years ago

Did you setup a proper configuration for the image pull secrets? https://docs.openfaas.com/deployment/kubernetes/#use-a-private-registry-with-kubernetes

Yes, but for clarity here's a new function with a public image, to take that concern out of the picture:

cat fn2.yaml
---
apiVersion: openfaas.com/v1alpha2
kind: Function
metadata:
  name: fn-2
spec:
  name: fn-2
  image: functions/nodeinfo:arm64
kubectl -n openfaas-fn apply -f fn2.yaml
function.openfaas.com/fn-2 created
kubectl -n openfaas-fn get functions
NAME                      AGE
fn-2                      13s
my-fn                      15h
kubectl get events -n openfaas-fn
LAST SEEN   TYPE     REASON              OBJECT                            MESSAGE
21m         Normal   Scheduled           pod/fn-manual-7657c974b6-dhptv    Successfully assigned openfaas-fn/fn-manual-7657c974b6-dhptv to ip-10-14-100-161.us-west-2.compute.internal
21m         Normal   Pulling             pod/fn-manual-7657c974b6-dhptv    Pulling image "quay.io/verygoodsecurity/test-fn:latest"
21m         Normal   Pulled              pod/fn-manual-7657c974b6-dhptv    Successfully pulled image "quay.io/verygoodsecurity/test-fn:latest"
21m         Normal   Created             pod/fn-manual-7657c974b6-dhptv    Created container fn-manual
21m         Normal   Started             pod/fn-manual-7657c974b6-dhptv    Started container fn-manual
21m         Normal   SuccessfulCreate    replicaset/fn-manual-7657c974b6   Created pod: fn-manual-7657c974b6-dhptv
21m         Normal   ScalingReplicaSet   deployment/fn-manual              Scaled up replica set fn-manual-7657c974b6 to 1
34s         Normal   ChartSynced         helmrelease/functions             Chart managed by HelmRelease processed

(fn-manual is something I created via the UI to ensure I'm not going crazy here; everything works just fine when I'm deploying via the UI.)

By the way, helmrelease/functions is a HelmRelease object that renders a series of functions; this is something we use successfully on our x86 cluster.

alexellis commented 4 years ago

What architecture is your image that you're creating? Are you building it on an ARM64 device?

mjallday commented 4 years ago

It's arm64. That's why I switched to nodeinfo, so that we don't need to worry about the image I'm building. I've already deployed and successfully run nodeinfo (the arm-tagged version) on this cluster.

(screenshot: the function pod shown in the Running state after deploying it manually via the UI)

In the screenshot above you can see that if I deploy it manually, it's in the Running state.

alexellis commented 4 years ago

This works fine for me:

 kubectl get deploy,pod,function,service -n openfaas-fn
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/nodeinfo   1/1     1            1           6m7s
deployment.apps/fn-2       1/1     1            1           41s

NAME                            READY   STATUS    RESTARTS   AGE
pod/nodeinfo-6dbd6bfc98-2qxcj   1/1     Running   0          6m7s
pod/fn-2-5bb7b8d977-pxrws       1/1     Running   0          41s

NAME                             AGE
function.openfaas.com/nodeinfo   6m7s
function.openfaas.com/fn-2       41s

NAME               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/nodeinfo   ClusterIP   10.43.52.60     <none>        8080/TCP   6m7s
service/fn-2       ClusterIP   10.43.135.167   <none>        8080/TCP   41s

And kubectl version shows a more modern version:

kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.0", GitCommit:"70132b0f130acc0bed193d9ba59dd186f0e634cf", GitTreeState:"clean", BuildDate:"2019-12-07T21:20:10Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.2+k3s1", GitCommit:"cdab19b09a84389ffbf57bebd33871c60b1d6b28", GitTreeState:"clean", BuildDate:"2020-01-27T18:08:16Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/arm64"}

My suspicion is that you have something misconfigured, or the latest update to client-go 1.17 broke compatibility with EKS on 1.14.

Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.9-eks-f459c0", GitCommit:"f459c0672169dd35e77af56c24556530a05e9ab1", GitTreeState:"clean", BuildDate:"2020-03-18T04:24:17Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}

Perhaps you should try upgrading EKS to 1.15?

https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html

The Kubernetes compatibility matrix implies that the 1.17 client-go doesn't work with 1.14, but does with 1.15: https://github.com/kubernetes/client-go#compatibility-matrix

GKE also has 1.15 available, so that seems like the minimum version you should target?

https://cloud.google.com/kubernetes-engine/docs/release-notes#no-channel

mjallday commented 4 years ago

Looks like there might be an ARM EKS release for 1.15, so we'll look at launching that cluster, see if that's where the problem lies, and report back.

Thanks for verifying that everything looks OK.

alexellis commented 4 years ago

What's the use case here? Is it commercial?

mjallday commented 4 years ago

It's a commercial use case. We're running image classification software that's optimized for ARM; in this particular use case it's running TensorFlow.

mjallday commented 4 years ago

Just to update: we deployed the same setup on our x86 cluster and aren't seeing any issues (not on EKS though, so not exact parity). We tried upgrading EKS ARM to 1.15, but it doesn't look like that version is compatible yet.

We'll update once we have any progress to share on this.

mjallday commented 4 years ago

We managed to tweak the config and successfully upgraded the cluster. Everything is now working as expected. I will close this issue.

Thanks for the help in debugging, glad we got it solved!

kubectl version

Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:16:51Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-eks-af3caf", GitCommit:"af3caf6136cd355f467083651cc1010a499f59b1", GitTreeState:"clean", BuildDate:"2020-03-27T21:51:36Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
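
For anyone who lands here with the same EKS version issue, the upgrade step is roughly the following (a hypothetical sketch; it assumes the cluster is managed with eksctl, and the cluster name is a placeholder):

# upgrade the EKS control plane to 1.15, then bring the node groups up to match
eksctl upgrade cluster --name <cluster-name> --version 1.15 --approve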