spring-cloud / spring-cloud-dataflow

A microservices-based Streaming and Batch data processing in Cloud Foundry and Kubernetes
https://dataflow.spring.io
Apache License 2.0

Allow to identify source of data flow metrics #4472

Open guoyiang opened 3 years ago

guoyiang commented 3 years ago

Problem description: Environment: the Spring Cloud Data Flow server is running in Kubernetes with Prometheus metrics enabled. Prometheus is deployed with the Prometheus Operator.

In the metrics collected by Prometheus, there is no way to tell which pod a given series actually came from, especially when running the Data Flow server with multiple replicas.

Here are two examples (one for a task and one for the server):

jvm_threads_live_threads{application="spring-cloud-dataflow-server",application_version="2.7.1",cluster="test",container="prometheus-proxy",endpoint="http",exported_service="scdf server",instance="10.0.0.251:8080",job="spring-cloud-dataflow-prometheus-proxy",namespace="qa",pod="spring-cloud-dataflow-prometheus-proxy-8659b5f6ff-kvwx2",prometheus="monitoring/prometheus",service="spring-cloud-dataflow-prometheus-proxy"}

jvm_threads_live_threads{application="tasks-730",cluster="test",container="prometheus-proxy",endpoint="http",exported_service="task-application",instance="10.0.0.251:8080",job="spring-cloud-dataflow-prometheus-proxy",namespace="qa",pod="spring-cloud-dataflow-prometheus-proxy-8659b5f6ff-kvwx2",prometheus="monitoring/prometheus",service="spring-cloud-dataflow-prometheus-proxy",task_execution_id="730",task_external_execution_id="unknown",task_name="tasks",task_parent_execution_id="unknown"}

You will notice there is no reliable way to identify the real source of these metrics. The instance and pod labels are added by Prometheus when scraping, but because Data Flow uses the RSocket proxy to collect metrics, those values point at the proxy pod rather than the source (the task pod and the Data Flow server pod, respectively). For tasks there is an indirect link back to the source via the task execution id, but for the Data Flow server there is no way to tell replicas apart.

Though this description is specific to Prometheus, I would assume the situation is similar for other types of metric stores.

Solution description: The identity of the source should be included in the metrics, and it could be added by either the RSocket proxy or the client.

If the RSocket proxy does it, it should add a tag (or label) identifying the connected client. If the client does it, it can push some data about itself, such as its own IP, hostname, or pod name.
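As an illustration of the client-side option, here is a minimal sketch (not an existing Data Flow feature): the Kubernetes Downward API injects the pod identity into the container, and the application exposes it as Micrometer common tags. The env var names and tag keys are arbitrary choices for this example:

# Deployment container spec: expose pod metadata via the Downward API
env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP

# application.yml of the client: attach the identity as common metric tags
management:
  metrics:
    tags:
      application.pod: ${POD_NAME:unknown}
      application.pod.ip: ${POD_IP:unknown}

With something like this in place, every series pushed through the proxy would carry application_pod and application_pod_ip labels that identify the originating pod, independent of what Prometheus records for instance and pod.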

Description of alternatives: As a workaround, an additional tag can be added on the client side (the Data Flow server and the task applications):

management:
  metrics:
    tags:
      application.host: ${HOSTNAME:unknown}

This way the client adds a tag application_host whose value is its hostname, which in Kubernetes is the pod name.
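With that tag in place, the server example above would carry an extra label along the lines of the following (the pod name here is purely illustrative):

jvm_threads_live_threads{application="spring-cloud-dataflow-server",application_host="spring-cloud-dataflow-server-7f9c6d5b8-abcde",application_version="2.7.1",...}

which makes it possible to group or filter series per replica even though instance and pod still point at the proxy.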

Additional context: A similar issue reported on rsocket proxy: https://github.com/micrometer-metrics/prometheus-rsocket-proxy/issues/12

Similar issue on missing external execution id on task metrics: https://github.com/spring-cloud/spring-cloud-dataflow/issues/4437

sabbyanandan commented 3 years ago

@tzolov: Could you share how we correlate the app/pod-specific metrics to reconstitute the stream definition, and likewise the stats at the level of apps and streams? Perhaps we could document that, so we have the details to answer the questions posted by @guoyiang here.

sabbyanandan commented 3 years ago

There is some new development in the works to associate source and target systems with the events and interactions overall. That is on top of the message tracing support we shipped in 2.8.x; see: https://dataflow.spring.io/docs/feature-guides/streams/tracing/