gopaljayanthi opened this issue 5 months ago
@gopaljayanthi May I ask why you were using the old version (v0.7.2) of numaflow? Perhaps you could try updating to the latest version (v1.2.1) to see if it resolves the issue.
This bug should have been fixed in later versions.
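If it helps, the running controller version can be confirmed and the upgrade applied with something like the sketch below (the `numaflow-system` namespace, `numaflow-controller` deployment name, and `config/install.yaml` path are assumptions based on the standard install layout; adjust to your setup):

```sh
# Check which image (and therefore version) the Numaflow controller is actually running
kubectl -n numaflow-system get deploy numaflow-controller \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Apply the manifest for a pinned release
kubectl apply -n numaflow-system -f \
  https://raw.githubusercontent.com/numaproj/numaflow/v1.2.1/config/install.yaml
```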
I upgraded to v1.2.1, and now I am seeing pod failures with not much in the logs.
The pipeline I am using is here: https://raw.githubusercontent.com/gopaljayanthi/numalogic-prometheus/master/manifests/pipeline/numalogic-prometheus-pipeline.yaml
```
k -n numalogic-prometheus get po
NAME                                                      READY   STATUS             RESTARTS        AGE
isbsvc-default-js-0                                       3/3     Running            0               45m
isbsvc-default-js-1                                       3/3     Running            0               45m
isbsvc-default-js-2                                       3/3     Running            0               45m
mlflow-sqlite-5bf68fc797-sxc2r                            1/1     Running            2 (6h22m ago)   4d9h
numalogic-prometheus-pipeline-daemon-79f6c9449-rjp4j      1/1     Running            0               38m
numalogic-prometheus-pipeline-decode-0-2oljb              0/2     CrashLoopBackOff   15 (4m6s ago)   38m
numalogic-prometheus-pipeline-filter-0-shtha              0/2     CrashLoopBackOff   18 (38s ago)    38m
numalogic-prometheus-pipeline-inference-0-tkvfb           0/2     CrashLoopBackOff   17 (16s ago)    38m
numalogic-prometheus-pipeline-input-0-inpqs               1/1     Running            0               38m
numalogic-prometheus-pipeline-input-output-0-06ex2        1/1     Running            0               38m
numalogic-prometheus-pipeline-output-0-tu4iy              1/1     Running            0               38m
numalogic-prometheus-pipeline-postprocess-0-jakah         0/2     CrashLoopBackOff   18 (37s ago)    38m
numalogic-prometheus-pipeline-preprocess-0-iq3hh          0/2     CrashLoopBackOff   18 (96s ago)    38m
numalogic-prometheus-pipeline-prometheus-pusher-0-sffgp   2/2     Running            2 (27m ago)     38m
numalogic-prometheus-pipeline-threshold-0-kfxad           0/2     CrashLoopBackOff   18 (98s ago)    38m
numalogic-prometheus-pipeline-trainer-0-hqlrf             0/2     CrashLoopBackOff   17 (20s ago)    38m
numalogic-prometheus-pipeline-training-output-0-mrcpc     1/1     Running            0               38m
numalogic-prometheus-pipeline-window-0-gbsfk              0/2     CrashLoopBackOff   18 (37s ago)    38m
numalogic-redis-cluster-0                                 1/1     Running            4 (6h22m ago)   4d6h
numalogic-redis-cluster-1                                 1/1     Running            4 (6h21m ago)   4d6h
numalogic-redis-cluster-2                                 1/1     Running            6 (6h21m ago)   4d6h
numalogic-redis-cluster-3                                 1/1     Running            4 (2d4h ago)    4d6h
numalogic-redis-cluster-4                                 1/1     Running            5 (6h21m ago)   4d6h
numalogic-redis-cluster-5                                 0/1     Running            5 (27m ago)     4d6h
```
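For reference, these are the kinds of commands that usually surface more detail on CrashLoopBackOff vertex pods (pod name taken from the listing above; if your kubectl version rejects the flag combination, target a single container with `-c` instead):

```sh
# Events (probe failures, OOM kills, image pull problems) for one of the crashing pods
kubectl -n numalogic-prometheus describe pod numalogic-prometheus-pipeline-decode-0-2oljb

# Logs from the previous (crashed) run of the pod's containers
kubectl -n numalogic-prometheus logs numalogic-prometheus-pipeline-decode-0-2oljb \
  --previous --all-containers=true
```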
In the input vertex pod, I am seeing this:

```
{"level":"error","ts":"2024-06-21T21:35:42.500605264Z","logger":"numaflow","caller":"publish/publisher.go:278","msg":"put to bucket failed","entityID":"numalogic-prometheus-pipeline-input-0","otStore":"numalogic-prometheus-numalogic-prometheus-pipeline-input_SOURCE_OT","hbStore":"numalogic-prometheus-numalogic-prometheus-pipeline-input_SOURCE_PROCESSORS","bucket":"numalogic-prometheus-numalogic-prometheus-pipeline-input_SOURCE_PROCESSORS","error":"nats: timeout","stacktrace":"github.com/numaproj/numaflow/pkg/watermark/publish.(*publish).publishHeartbeat\n\t/home/runner/work/numaflow/numaflow/pkg/watermark/publish/publisher.go:278"}
```
In the decode vertex pod:

```
2024/06/21 22:05:45 starting the gRPC server with unix domain socket... /var/run/numaflow/function.sock
....
Error: failed to wait until server info is ready: context deadline exceeded
2024/06/21 22:39:08 | ERROR | {"level":"error","ts":"2024-06-21T22:39:08.590708576Z","logger":"numaflow.MapUDF-processor","caller":"nats/nats_client.go:68","msg":"Nats default: disconnected","pipeline":"numalogic-prometheus-pipeline","vertex":"decode","stacktrace":"github.com/numaproj/numaflow/pkg/shared/clients/nats.NewNATSClient.func3\n\t/home/runner/work/numaflow/numaflow/pkg/shared/clients/nats/nats_client.go:68\ngithub.com/nats-io/nats%2ego.(Conn).close.func1\n\t/home/runner/go/pkg/mod/github.com/nats-io/nats.go@v1.33.1/nats.go:5122\ngithub.com/nats-io/nats%2ego.(asyncCallbacksHandler).asyncCBDispatcher\n\t/home/runner/go/pkg/mod/github.com/nats-io/nats.go@v1.33.1/nats.go:2901"}
2024/06/21 22:39:08 | INFO | {"level":"info","ts":"2024-06-21T22:39:08.590858178Z","logger":"numaflow.MapUDF-processor","caller":"nats/nats_client.go:62","msg":"Nats default: connection closed","pipeline":"numalogic-prometheus-pipeline","vertex":"decode"}
2024/06/21 22:39:08 | ERROR | {"level":"error","ts":"2024-06-21T22:39:08.590881565Z","logger":"numaflow.MapUDF-processor","caller":"nats/nats_client.go:68","msg":"Nats default: disconnected","pipeline":"numalogic-prometheus-pipeline","vertex":"decode","stacktrace":"github.com/numaproj/numaflow/pkg/shared/clients/nats.NewNATSClient.func3\n\t/home/runner/work/numaflow/numaflow/pkg/shared/clients/nats/nats_client.go:68\ngithub.com/nats-io/nats%2ego.(Conn).close.func1\n\t/home/runner/go/pkg/mod/github.com/nats-io/nats.go@v1.33.1/nats.go:5122\ngithub.com/nats-io/nats%2ego.(asyncCallbacksHandler).asyncCBDispatcher\n\t/home/runner/go/pkg/mod/github.com/nats-io/nats.go@v1.33.1/nats.go:2901"}
2024/06/21 22:39:08 | INFO | {"level":"info","ts":"2024-06-21T22:39:08.590895017Z","logger":"numaflow.MapUDF-processor","caller":"nats/nats_client.go:62","msg":"Nats default: connection closed","pipeline":"numalogic-prometheus-pipeline","vertex":"decode"}
2024/06/21 22:39:08 | ERROR | {"level":"error","ts":"2024-06-21T22:39:08.590915688Z","logger":"numaflow.MapUDF-processor","caller":"nats/nats_client.go:68","msg":"Nats default: disconnected","pipeline":"numalogic-prometheus-pipeline","vertex":"decode","stacktrace":"github.com/numaproj/numaflow/pkg/shared/clients/nats.NewNATSClient.func3\n\t/home/runner/work/numaflow/numaflow/pkg/shared/clients/nats/nats_client.go:68\ngithub.com/nats-io/nats%2ego.(Conn).close.func1\n\t/home/runner/go/pkg/mod/github.com/nats-io/nats.go@v1.33.1/nats.go:5122\ngithub.com/nats-io/nats%2ego.(asyncCallbacksHandler).asyncCBDispatcher\n\t/home/runner/go/pkg/mod/github.com/nats-io/nats.go@v1.33.1/nats.go:2901"}
2024/06/21 22:39:08 | INFO | {"level":"info","ts":"2024-06-21T22:39:08.590927835Z","logger":"numaflow.MapUDF-processor","caller":"nats/nats_client.go:62","msg":"Nats default: connection closed","pipeline":"numalogic-prometheus-pipeline","vertex":"decode"}
2024/06/21 22:39:08 | INFO | {"level":"info","ts":"2024-06-21T22:39:08.590957969Z","logger":"numaflow.MapUDF-processor","caller":"jetstream/kv_store.go:166","msg":"stopping WatchAll","pipeline":"numalogic-prometheus-pipeline","vertex":"decode","kvName":"numalogic-prometheus-numalogic-prometheus-pipeline-input-decode_PROCESSORS","watcher":"numalogic-prometheus-numalogic-prometheus-pipeline-input-decode_PROCESSORS"}
2024/06/21 22:39:08 | ERROR | {"level":"error","ts":"2024-06-21T22:39:08.590985218Z","logger":"numaflow.MapUDF-processor","caller":"jetstream/kv_store.go:170","msg":"Failed to stop","pipeline":"numalogic-prometheus-pipeline","vertex":"decode","kvName":"numalogic-prometheus-numalogic-prometheus-pipeline-input-decode_PROCESSORS","watcher":"numalogic-prometheus-numalogic-prometheus-pipeline-input-decode_PROCESSORS","error":"nats: connection closed","stacktrace":"github.com/numaproj/numaflow/pkg/shared/kvs/jetstream.(jetStreamStore).Watch.func1\n\t/home/runner/work/numaflow/numaflow/pkg/shared/kvs/jetstream/kv_store.go:170"}
2024/06/21 22:39:08 | INFO | {"level":"info","ts":"2024-06-21T22:39:08.591035612Z","logger":"numaflow.MapUDF-processor","caller":"jetstream/kv_store.go:166","msg":"stopping WatchAll","pipeline":"numalogic-prometheus-pipeline","vertex":"decode","kvName":"numalogic-prometheus-numalogic-prometheus-pipeline-input-decode_OT","watcher":"numalogic-prometheus-numalogic-prometheus-pipeline-input-decode_OT"}
2024/06/21 22:39:08 | ERROR | {"level":"error","ts":"2024-06-21T22:39:08.591091852Z","logger":"numaflow.MapUDF-processor","caller":"jetstream/kv_store.go:170","msg":"Failed to stop","pipeline":"numalogic-prometheus-pipeline","vertex":"decode","kvName":"numalogic-prometheus-numalogic-prometheus-pipeline-input-decode_OT","watcher":"numalogic-prometheus-numalogic-prometheus-pipeline-input-decode_OT","error":"nats: connection closed","stacktrace":"github.com/numaproj/numaflow/pkg/shared/kvs/jetstream.(jetStreamStore).Watch.func1\n\t/home/runner/work/numaflow/numaflow/pkg/shared/kvs/jetstream/kv_store.go:170"}
Usage:
  numaflow processor [flags]

Flags:
  -h, --help                 help for processor
      --isbsvc-type string   ISB Service type, e.g. jetstream
      --type string          Processor type, 'source', 'sink' or 'udf'

panic: failed to wait until server info is ready: context deadline exceeded

goroutine 1 [running]:
github.com/numaproj/numaflow/cmd/commands.Execute(...)
	/home/runner/work/numaflow/numaflow/cmd/commands/root.go:33
main.main()
	/home/runner/work/numaflow/numaflow/cmd/main.go:24 +0x3c
```
Please help.
Hello @gopaljayanthi!
From the logs I can see the error:

```
2024/06/21 22:05:45 starting the gRPC server with unix domain socket... /var/run/numaflow/function.sock
```
This is an older version of the UDF, which is not compatible with Numaflow v1.2.1.
Since you are using numalogic, can you upgrade your code to the latest version of Numalogic, which supports the latest Numaflow SDK versions?
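Concretely, that means rebuilding the UDF images against a Numalogic release that uses the newer Numaflow SDK, then pointing each vertex at the rebuilt image. The snippet below is only a sketch of where that image reference lives in the Pipeline spec; the image name and tag are placeholders, not values from this deployment:

```yaml
# Illustrative vertex fragment; replace the image with your rebuilt, SDK-compatible UDF image
- name: decode
  udf:
    container:
      image: <your-registry>/numalogic-prometheus-udf:<sdk-compatible-tag>
```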
**Describe the bug**
In prometheus.yml I am using remote write to send metrics to a source vertex of http type, and getting this error:

```
ERROR 2024-06-19 21:54:43 {"level":"error","ts":1718814283.906287,"logger":"numaflow.Source-processor","caller":"forward/forward.go:473","msg":"Retrying failed msgs","vertex":"simple-pipeline-in","errors":{"expected to write body size of -24158 but got 41378":1},"stacktrace":"github.com/numaproj/numaflow/pkg/forward.(InterStepDataForward).writeToBuffer\n\t/home/runner/work/numaflow/numaflow/pkg/forward/forward.go:473\ngithub.com/numaproj/numaflow/pkg/forward.(InterStepDataForward).writeToBuffers\n\t/home/runner/work/numaflow/numaflow/pkg/forward/forward.go:428\ngithub.com/numaproj/numaflow/pkg/forward.(InterStepDataForward).forwardAChunk\n\t/home/runner/work/numaflow/numaflow/pkg/forward/forward.go:314\ngithub.com/numaproj/numaflow/pkg/forward.(InterStepDataForward).Start.func1\n\t/home/runner/work/numaflow/numaflow/pkg/forward/forward.go:143"}
```
**To Reproduce**
Steps to reproduce the behavior:
1. Install numaflow and prometheus
2. Create interstepbufferservices
3. Create the pipeline using the pipeline YAML below
```yaml
apiVersion: numaflow.numaproj.io/v1alpha1
kind: Pipeline
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"numaflow.numaproj.io/v1alpha1","kind":"Pipeline","metadata":{"annotations":{},"name":"simple-pipeline","namespace":"default"},"spec":{"edges":[{"from":"in","to":"cat"},{"from":"cat","to":"out"}],"vertices":[{"name":"in","source":{"generator":{"duration":"1s","rpu":5}}},{"name":"cat","udf":{"builtin":{"name":"cat"}}},{"name":"out","sink":{"log":{}}}]}}
  creationTimestamp: "2024-06-19T15:54:29Z"
  finalizers:
```
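Note that the last-applied annotation above still shows a generator source for the `in` vertex; for Prometheus remote write to reach the pipeline, that vertex needs an HTTP source. A minimal sketch of such a vertex, assuming the standard Numaflow HTTP source fields (illustrative only, not the exact spec used here):

```yaml
- name: in
  source:
    http:
      service: true   # exposes the source via a cluster-internal Service
```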
In the Prometheus config prometheus.yml, include the following remote_write:

```yaml
remote_write:
```
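The remote_write block itself was cut off above; a stanza pointed at a Numaflow HTTP source generally looks roughly like the sketch below. The service host, port 8443, and the `/vertices/in` path are assumptions based on the pipeline and vertex names above and Numaflow's documented HTTP source endpoint pattern, not values copied from this setup:

```yaml
remote_write:
  - url: "https://simple-pipeline-in.default.svc.cluster.local:8443/vertices/in"
    tls_config:
      insecure_skip_verify: true   # the HTTP source serves a self-signed certificate by default
```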
**Expected behavior**
The source accepts the metrics from Prometheus and forwards them to the cat vertex and then to the out vertex.
**Message from the maintainers**:
Impacted by this bug? Give it a 👍. We often sort issues this way to know what to prioritize.