Open guoyiang opened 3 years ago
@tzolov: Could you share how we correlate the app/pod-specific metrics to reconstitute the stream definition, and likewise the stats at the level of apps and streams? Perhaps we could document that so we have the details to answer the questions posted by @guoyiang here.
There is some new development in the works to associate source and target systems about the event and interactivity overall. That's on top of the message tracing support we shipped in 2.8.x; see: https://dataflow.spring.io/docs/feature-guides/streams/tracing/
Problem description: Environment: Spring Cloud Data Flow server is running in Kubernetes with Prometheus metrics. Prometheus is deployed with the Prometheus Operator.
In the metrics collected by Prometheus, there is no way to tell which pods were the source of the metrics, especially when running the Data Flow server with multiple replicas.
Here are some examples (one for a task and one for the server):
You will notice there is no real way to identify the actual source of the metrics.
The instance and pod labels are added by Prometheus when scraping the metrics, but because Data Flow uses the rsocket proxy to collect metrics, their values are those of the proxy pods instead of the source (task pods and Data Flow server pods, respectively). There is an indirect link to the pod for tasks via the task execution id, but for the Data Flow server it is impossible to tell the source when we have multiple replicas.
Though this description is specific to Prometheus, I would assume it's a similar case for other types of metric stores.
Solution description: The identity of the source should be included in the metrics, and it can be added by either the rsocket proxy or the client.
If the rsocket proxy is the one to do it, it should be able to add a tag (or label) identifying the connected client. Or, if the client is the one to do it, it can push some data about itself, such as its own IP, hostname, pod name, etc.
Description of alternatives: As a workaround, an additional tag can be added on the client (Data Flow server and tasks):
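A minimal sketch of what such a setting could look like. This is not the snippet from the original report: it assumes the client is a Spring Boot application that honors the management.metrics.tags.* common-tag properties, and that the HOSTNAME environment variable is set inside the container.

```properties
# Hypothetical sketch, not the configuration from the original report.
# Assumes Spring Boot's common-tag properties (management.metrics.tags.*)
# and that HOSTNAME is available in the container environment.
management.metrics.tags.application_host=${HOSTNAME}
```

For tasks launched by Data Flow, the same property would have to reach each task application as one of its application properties.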
This way the client will add a tag application_host whose value is its hostname, which is the pod name.
Additional context: A similar issue was reported on the rsocket proxy: https://github.com/micrometer-metrics/prometheus-rsocket-proxy/issues/12
Similar issue on missing external execution id on task metrics: https://github.com/spring-cloud/spring-cloud-dataflow/issues/4437