numaproj / numaflow

Kubernetes-native platform to run massively parallel data/streaming jobs
https://numaflow.numaproj.io/
Apache License 2.0

Pipeline never reaches "Paused" phase in the case of previously unreconciled Pipeline #1991

Closed juliev0 closed 1 month ago

juliev0 commented 2 months ago

Describe the bug

If you submit a Pipeline with lifecycle.desiredPhase set to "Paused", it stays in the "Pausing" phase indefinitely instead of moving to "Paused" once the pause timeout elapses. The controller logs the following error:

{"level":"error","ts":"2024-08-22T01:28:15.648231917Z","logger":"numaflow.controller-manager","caller":"pipeline/controller.go:153","msg":"Updated desired pipeline phase failed: {error 26 0  rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp: lookup simple-pipeline-daemon-svc.example-namespace.svc on 10.43.0.10:53: no such host\"}","namespace":"example-namespace","pipeline":"simple-pipeline","stacktrace":"github.com/numaproj/numaflow/pkg/reconciler/pipeline.(*pipelineReconciler).reconcile\n\t/Users/jwang21/workspace/numaproj/numaflow/pkg/reconciler/pipeline/controller.go:153\ngithub.com/numaproj/numaflow/pkg/reconciler/pipeline.(*pipelineReconciler).Reconcile\n\t/Users/jwang21/workspace/numaproj/numaflow/pkg/reconciler/pipeline/controller.go:83\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/Users/jwang21/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/Users/jwang21/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/Users/jwang21/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/Users/jwang21/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227"}

To Reproduce

Steps to reproduce the behavior:

  1. Submit a Pipeline with lifecycle.desiredPhase: Paused
  2. After 30 seconds (pause timeout) do kubectl get pipelines
  3. Pipeline remains in Pausing phase

Expected behavior

The Pipeline should be able to reach the "Paused" phase.

Additional Information

I think the issue is that pausing isn't attempted unless the Daemon Pod is running.

But there are cases like this one, in which the Pipeline never had any Pods running in the first place, where it could at least adhere to the pause timeout. Other examples include:


Message from the maintainers:

Impacted by this bug? Give it a 👍. We often sort issues this way to know what to prioritize.

For quick help and support, join our slack channel.

juliev0 commented 2 months ago

FYI this is related to this issue

kohlisid commented 1 month ago

1) Pipeline created before ISB Service, and desiredPhase=Paused

skohli@macos-JQWR9T560R numaflow % kubectl get pl
NAME              PHASE    VERTICES   AGE   MESSAGE
simple-pipeline   Failed   3          13s   ISB Service not found.

We keep the phase as Failed and do not try to pause; see https://github.com/numaproj/numaflow/issues/1992#issuecomment-2310697806