numaproj / numaflow

Kubernetes-native platform to run massively parallel data/streaming jobs
https://numaflow.numaproj.io/
Apache License 2.0

numaflow-server crashes on start with server.configs.insecure=true #1734

th0ger closed this issue 4 months ago

th0ger commented 5 months ago

Describe the bug

Numaflow-server installed with helm cannot start when TLS for the UX server is disabled (server.configs.insecure=true).

To Reproduce

kind create cluster
helm repo add numaflow https://numaproj.io/helm-charts
helm repo update
helm install numaflow numaflow/numaflow --version "0.0.2" -f values.yaml

with this values.yaml:

server:
  configs:
    # -- Whether to disable TLS for UX server.
    insecure: true
    # -- Port to listen on for UX server, defaults to 8443 or 8080 if insecure is set.
    # port: 8443

The server.configs.insecure value was changed from the default value.

$ watch kubectl get svc
NAME                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
kubernetes            ClusterIP   10.96.0.1       <none>        443/TCP    7m39s
numaflow-dex-server   ClusterIP   10.96.240.68    <none>        5556/TCP   7m33s
numaflow-server       ClusterIP   10.96.45.218    <none>        8443/TCP   7m33s
numaflow-webhook      ClusterIP   10.96.128.182   <none>        443/TCP    7m33s
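
The Service above still exposes 8443 even though insecure is set. A minimal sketch of a check for that mismatch, assuming the defaults stated in the values.yaml comment (8443 with TLS, 8080 without). `expected_port` and `check_service_port` are hypothetical helper names, not part of Numaflow or Helm.

```shell
# expected_port INSECURE
# Default UX server port per the values.yaml comment:
# 8080 when insecure is "true", otherwise 8443.
expected_port() {
  if [ "$1" = "true" ]; then
    echo 8080
  else
    echo 8443
  fi
}

# check_service_port INSECURE ACTUAL_PORT
# Compares the port a Service exposes with the expected default.
# Returns non-zero on a mismatch.
check_service_port() {
  local want
  want="$(expected_port "$1")"
  if [ "$2" = "$want" ]; then
    echo "ok: service port $2 matches"
  else
    echo "mismatch: service exposes $2, server listens on $want"
    return 1
  fi
}

# Usage against a live cluster (requires kubectl):
#   check_service_port true \
#     "$(kubectl get svc numaflow-server -o jsonpath='{.spec.ports[0].port}')"
```

With the values from this report (insecure set, Service on 8443), the check reports a mismatch.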

It crashes/restarts every minute:

kubectl get pods
NAME                                   READY   STATUS             RESTARTS      AGE
numaflow-controller-854d57798c-89796   1/1     Running            0             7m22s
numaflow-dex-server-7c98b855db-pwt9v   1/1     Running            0             7m22s
numaflow-server-669d687d8-jn99r        0/1     CrashLoopBackOff   6 (43s ago)   7m22s
numaflow-webhook-586fc66c64-drf7c      1/1     Running            0             7m22s

No error logs found:

[GIN-debug] GET    /dex/*name                --> github.com/numaproj/numaflow/server/cmd/server.(*server).Start.NewDexReverseProxy.func3 (3 handlers)
[GIN-debug] POST   /dex/*name                --> github.com/numaproj/numaflow/server/cmd/server.(*server).Start.NewDexReverseProxy.func3 (3 handlers)
[GIN-debug] PUT    /dex/*name                --> github.com/numaproj/numaflow/server/cmd/server.(*server).Start.NewDexReverseProxy.func3 (3 handlers)
[GIN-debug] PATCH  /dex/*name                --> github.com/numaproj/numaflow/server/cmd/server.(*server).Start.NewDexReverseProxy.func3 (3 handlers)
[GIN-debug] HEAD   /dex/*name                --> github.com/numaproj/numaflow/server/cmd/server.(*server).Start.NewDexReverseProxy.func3 (3 handlers)
[GIN-debug] OPTIONS /dex/*name                --> github.com/numaproj/numaflow/server/cmd/server.(*server).Start.NewDexReverseProxy.func3 (3 handlers)
[GIN-debug] DELETE /dex/*name                --> github.com/numaproj/numaflow/server/cmd/server.(*server).Start.NewDexReverseProxy.func3 (3 handlers)
[GIN-debug] CONNECT /dex/*name                --> github.com/numaproj/numaflow/server/cmd/server.(*server).Start.NewDexReverseProxy.func3 (3 handlers)
[GIN-debug] TRACE  /dex/*name                --> github.com/numaproj/numaflow/server/cmd/server.(*server).Start.NewDexReverseProxy.func3 (3 handlers)
[GIN-debug] GET    /livez                    --> github.com/numaproj/numaflow/server/routes.Routes.func1 (3 handlers)
[GIN-debug] GET    /auth/v1/login            --> github.com/numaproj/numaflow/server/apis/v1.(*noAuthHandler).Login-fm (3 handlers)
[GIN-debug] POST   /auth/v1/login            --> github.com/numaproj/numaflow/server/apis/v1.(*noAuthHandler).LoginLocalUsers-fm (3 handlers)
[GIN-debug] GET    /auth/v1/logout           --> github.com/numaproj/numaflow/server/apis/v1.(*noAuthHandler).Logout-fm (3 handlers)
[GIN-debug] GET    /auth/v1/callback         --> github.com/numaproj/numaflow/server/apis/v1.(*noAuthHandler).Callback-fm (3 handlers)
[GIN-debug] GET    /api/v1/authinfo          --> github.com/numaproj/numaflow/server/apis/v1.(*handler).AuthInfo-fm (3 handlers)
[GIN-debug] GET    /api/v1/namespaces        --> github.com/numaproj/numaflow/server/apis/v1.(*handler).ListNamespaces-fm (3 handlers)
[GIN-debug] GET    /api/v1/cluster-summary   --> github.com/numaproj/numaflow/server/apis/v1.(*handler).GetClusterSummary-fm (3 handlers)
[GIN-debug] POST   /api/v1/namespaces/:namespace/pipelines --> github.com/numaproj/numaflow/server/apis/v1.(*handler).CreatePipeline-fm (3 handlers)
[GIN-debug] GET    /api/v1/namespaces/:namespace/pipelines --> github.com/numaproj/numaflow/server/apis/v1.(*handler).ListPipelines-fm (3 handlers)
[GIN-debug] GET    /api/v1/namespaces/:namespace/pipelines/:pipeline --> github.com/numaproj/numaflow/server/apis/v1.(*handler).GetPipeline-fm (3 handlers)
[GIN-debug] GET    /api/v1/namespaces/:namespace/pipelines/:pipeline/health --> github.com/numaproj/numaflow/server/apis/v1.(*handler).GetPipelineStatus-fm (3 handlers)
[GIN-debug] PUT    /api/v1/namespaces/:namespace/pipelines/:pipeline --> github.com/numaproj/numaflow/server/apis/v1.(*handler).UpdatePipeline-fm (3 handlers)
[GIN-debug] DELETE /api/v1/namespaces/:namespace/pipelines/:pipeline --> github.com/numaproj/numaflow/server/apis/v1.(*handler).DeletePipeline-fm (3 handlers)
[GIN-debug] PATCH  /api/v1/namespaces/:namespace/pipelines/:pipeline --> github.com/numaproj/numaflow/server/apis/v1.(*handler).PatchPipeline-fm (3 handlers)
[GIN-debug] POST   /api/v1/namespaces/:namespace/isb-services --> github.com/numaproj/numaflow/server/apis/v1.(*handler).CreateInterStepBufferService-fm (3 handlers)
[GIN-debug] GET    /api/v1/namespaces/:namespace/isb-services --> github.com/numaproj/numaflow/server/apis/v1.(*handler).ListInterStepBufferServices-fm (3 handlers)
[GIN-debug] GET    /api/v1/namespaces/:namespace/isb-services/:isb-service --> github.com/numaproj/numaflow/server/apis/v1.(*handler).GetInterStepBufferService-fm (3 handlers)
[GIN-debug] PUT    /api/v1/namespaces/:namespace/isb-services/:isb-service --> github.com/numaproj/numaflow/server/apis/v1.(*handler).UpdateInterStepBufferService-fm (3 handlers)
[GIN-debug] DELETE /api/v1/namespaces/:namespace/isb-services/:isb-service --> github.com/numaproj/numaflow/server/apis/v1.(*handler).DeleteInterStepBufferService-fm (3 handlers)
[GIN-debug] GET    /api/v1/namespaces/:namespace/pipelines/:pipeline/isbs --> github.com/numaproj/numaflow/server/apis/v1.(*handler).ListPipelineBuffers-fm (3 handlers)
[GIN-debug] GET    /api/v1/namespaces/:namespace/pipelines/:pipeline/watermarks --> github.com/numaproj/numaflow/server/apis/v1.(*handler).GetPipelineWatermarks-fm (3 handlers)
[GIN-debug] PUT    /api/v1/namespaces/:namespace/pipelines/:pipeline/vertices/:vertex --> github.com/numaproj/numaflow/server/apis/v1.(*handler).UpdateVertex-fm (3 handlers)
[GIN-debug] GET    /api/v1/namespaces/:namespace/pipelines/:pipeline/vertices/metrics --> github.com/numaproj/numaflow/server/apis/v1.(*handler).GetVerticesMetrics-fm (3 handlers)
[GIN-debug] GET    /api/v1/namespaces/:namespace/pipelines/:pipeline/vertices/:vertex/pods --> github.com/numaproj/numaflow/server/apis/v1.(*handler).ListVertexPods-fm (3 handlers)
[GIN-debug] GET    /api/v1/metrics/namespaces/:namespace/pods --> github.com/numaproj/numaflow/server/apis/v1.(*handler).ListPodsMetrics-fm (3 handlers)
[GIN-debug] GET    /api/v1/namespaces/:namespace/pods/:pod/logs --> github.com/numaproj/numaflow/server/apis/v1.(*handler).PodLogs-fm (3 handlers)
[GIN-debug] GET    /api/v1/namespaces/:namespace/events --> github.com/numaproj/numaflow/server/apis/v1.(*handler).GetNamespaceEvents-fm (3 handlers)
[GIN-debug] GET    /api/v1/sysinfo           --> github.com/numaproj/numaflow/server/routes.Routes.func2 (3 handlers)
{
    "level": "info",
    "ts": "2024-05-17T11:58:48.81624813Z",
    "logger": "numaflow.server",
    "caller": "server/start.go:115",
    "msg": "Starting server (TLS disabled) on :8080",
    "version": "Version: v1.2.1, BuildDate: 2024-05-07T08:25:20Z,
                     GitCommit: 89ea33f1d69785f6f5f17f1d5854ac189003918a, 
                     GitTag: v1.2.1, GitTreeState: clean, 
                     GoVersion: go1.21.9, Compiler: gc, Platform: linux/amd64",
    "disable-auth": true,
    "dex-server-addr": "https://numaflow-dex-server:5556/dex",
    "server-addr": "https://localhost:8443"
}
<line-wrapped for readability>

Expected behavior

Don't crash.

th0ger commented 5 months ago

I notice the logs saying

    "msg": "Starting server (TLS disabled) on :8080",
    "disable-auth": true,
    "server-addr": "https://localhost:8443"

The first two lines are as expected, but is the server-addr port supposed to be 8443?
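
The question above compares two ports reported in the same log entry. A minimal sketch for pulling both out of the server log so they can be compared side by side. `listen_port` and `server_addr_port` are hypothetical helper names, and the patterns assume the log fields look like the excerpt above. Whether the server-addr value contributes to the crash is not established in this thread.

```shell
# listen_port: reads log text on stdin and prints the port from the
# "Starting server ... on :PORT" message.
listen_port() {
  sed -n 's/.*"msg": *"Starting server [^"]* on :\([0-9][0-9]*\)".*/\1/p'
}

# server_addr_port: reads log text on stdin and prints the port from
# the "server-addr" field (scheme://host:PORT).
server_addr_port() {
  sed -n 's/.*"server-addr": *"[a-z]*:\/\/[^:"]*:\([0-9][0-9]*\)".*/\1/p'
}

# Usage against a live cluster (requires kubectl):
#   kubectl logs deploy/numaflow-server | listen_port
#   kubectl logs deploy/numaflow-server | server_addr_port
```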

whynowy commented 5 months ago

@th0ger - thanks for reporting the issue! The helm chart template needs to be fixed. Created an issue - https://github.com/numaproj/helm-charts/issues/10.

th0ger commented 5 months ago

@whynowy You're welcome. I did indeed wonder if this was a helm or service issue. But it was not obvious to me to test it with manifests/kustomize.

whynowy commented 5 months ago

I can help you with a kustomize manifest change if that would get you unblocked.

whynowy commented 5 months ago

@th0ger

cat kustomization.yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - https://github.com/numaproj/numaflow/config/cluster-install?ref=v1.2.1

patches:
  - patch: |
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: numaflow-cmd-params-config
      data:
        server.insecure: "true"
  - patch: |
      - op: replace
        path: /spec/template/spec/containers/0/livenessProbe/httpGet/port
        value: 8080
      - op: replace
        path: /spec/template/spec/containers/0/livenessProbe/httpGet/scheme
        value: HTTP
    target:
      kind: Deployment
      name: numaflow-server
  - patch: |
      - op: replace
        path: /spec/ports/0/targetPort
        value: 8080
      - op: replace
        path: /spec/ports/0/port
        value: 8080
    target:
      kind: Service
      name: numaflow-server

whynowy commented 4 months ago

@th0ger - with the latest fix in the helm charts, the issue should be fixed. Let me know if it works for you when you get a chance. Thanks!

th0ger commented 4 months ago

You forgot to release the chart 0.0.3, again ;-)

$ helm repo update
$ helm search repo numaflow/numaflow --versions
NAME                    CHART VERSION   APP VERSION     DESCRIPTION
numaflow/numaflow       0.0.2                           A Helm chart for installing Numaflow in Kubernetes
numaflow/numaflow       0.0.1                           A Helm chart for installing Numaflow in Kubernetes

But the fix works great!

$ git clone git@github.com:numaproj/helm-charts.git
$ helm install numaflow-git ./helm-charts/charts/numaflow/ -f values.yaml
$ kubectl get svc | grep numaflow-server
numaflow-server       ClusterIP   10.96.127.131   <none>        8080/TCP   2m50s

Pods no longer crashing. Port 8080 changed as expected. I can port-forward and run the ui on http://localhost:8080.
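
The verification described above (port-forward, then reach the UI over plain HTTP) can be sketched as follows. `livez_url` is a hypothetical helper. The /livez path comes from the route table in the server log, and the sketch assumes the scheme is HTTP when insecure is true and HTTPS otherwise.

```shell
# livez_url INSECURE PORT
# Builds the URL of the /livez route for a locally forwarded port.
livez_url() {
  local scheme=https
  if [ "$1" = "true" ]; then
    scheme=http
  fi
  echo "${scheme}://localhost:$2/livez"
}

# Usage against a live cluster (requires kubectl and curl):
#   kubectl port-forward svc/numaflow-server 8080:8080 &
#   curl -fsS "$(livez_url true 8080)"
```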

whynowy commented 4 months ago

Thanks @th0ger!

I'll close this issue.

@chandankumar4 - could you please release 0.0.3?

chandankumar4 commented 4 months ago

I have automated the release process of numaflow and released helm chart version 0.0.3. Thanks!

th0ger commented 4 months ago

@chandankumar4 @whynowy chart 0.0.3 works and the initial issue is fixed.