open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Prometheus receiver misses some metrics #34727

Open peachisai opened 2 months ago

peachisai commented 2 months ago

Component(s)

cmd/otelcontribcol

What happened?

Description

When I use the Prometheus receiver to scrape metrics, the Collector misses some of them, although it picks up other metrics that have a similar structure.

Steps to Reproduce

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "nacos-monitoring"
          scrape_interval: 30s
          metrics_path: "/nacos/actuator/prometheus"
          static_configs:
            - targets: ['127.0.0.1:8848']
          relabel_configs:
            - source_labels: [ ]
              target_label: cluster
              replacement: nacos-cluster
            - source_labels: [ __address__ ]
              regex: (.+)
              target_label: node
              replacement: $$1

processors:
  batch:

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers:
        - prometheus
      processors:
        - batch
      exporters:
        - debug

Expected Result

Original data:

nacos_monitor{module="naming",name="serviceCount",} 0.0
nacos_monitor{module="naming",name="ipCount",} 0.0

Actual Result

Only ipCount is received:

NumberDataPoints #7
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(naming)
     -> name: Str(ipCount)
     -> node: Str(43.139.166.178:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-19 02:48:13.416 +0000 UTC
Value: 0.000000

Collector version

v0.107.0


github-actions[bot] commented 2 months ago

Pinging code owners for receiver/prometheus: @Aneurysm9 @dashpole. See Adding Labels via Comments if you do not have permissions to add labels yourself.

dashpole commented 2 months ago

Do you see anything in the logs?

Can you enable debug logging and let us know if there are any scrape failures, etc.?

Can you share the full scrape response for that metric?

Can you look at the up and scrape_* metrics to see if any targets are failing to be scraped, or if any metrics are being dropped by the receiver?
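
A minimal sketch of one way to turn on debug logging, assuming the configuration from the issue description. The Collector's own log level is set under service.telemetry.logs and is separate from the debug exporter's verbosity setting, which only controls how exported metrics are printed. The pipeline below simply repeats the one from the report.

```yaml
service:
  telemetry:
    logs:
      level: debug   # Collector's internal log level; scrape errors, if any, should appear in these logs
  pipelines:
    metrics:
      receivers:
        - prometheus
      processors:
        - batch
      exporters:
        - debug
```

With this in place, the Collector's log output can be searched for messages from the prometheus receiver around each 30s scrape.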

peachisai commented 2 months ago

> Do you see anything in the logs?
>
> Can you enable debug logging and let us know if there are any scrape failures, etc.?
>
> Can you share the full scrape response for that metric?
>
> Can you look at the up and scrape_* metrics to see if any targets are failing to be scraped, or if any metrics are being dropped by the receiver?

Hi, thank you for the reply. I use the debug exporter:

exporters:
  debug:
    verbosity: detailed

Here is part of my log. I didn't find any errors or failures, and I can't find the missing metric names in it.

StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
NumberDataPoints #2
Data point attributes:
     -> action: Str(end of minor GC)
     -> cause: Str(Allocation Failure)
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
Metric #37
Descriptor:
     -> Name: executor_pool_max_threads
     -> Description: The maximum allowed number of threads in the pool
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> name: Str(applicationTaskExecutor)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 2147483647.000000
NumberDataPoints #1
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> name: Str(taskScheduler)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 2147483647.000000
Metric #38
Descriptor:
     -> Name: nacos_naming_subscriber
     -> Description:
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
     -> version: Str(v1)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
NumberDataPoints #1
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
     -> version: Str(v2)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
Metric #39
Descriptor:
     -> Name: jvm_classes_loaded_classes
     -> Description: The number of classes that are currently loaded in the Java virtual machine
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 14983.000000
Metric #40
Descriptor:
     -> Name: tomcat_sessions_created_sessions_total
     -> Description:
     -> Unit:
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 2024-08-20 06:42:13.391 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
Metric #41
Descriptor:
     -> Name: tomcat_sessions_alive_max_seconds
     -> Description:
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
Metric #42
Descriptor:
     -> Name: nacos_naming_publisher
     -> Description:
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
     -> version: Str(v1)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
NumberDataPoints #1
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
     -> version: Str(v2)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
Metric #43
Descriptor:
     -> Name: jvm_gc_memory_allocated_bytes_total
     -> Description: Incremented for an increase in the size of the (young) heap memory pool after one GC to before the next
     -> Unit:
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 2024-08-20 06:42:13.391 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 31471073024.000000
Metric #44
Descriptor:
     -> Name: executor_completed_tasks_total
     -> Description: The approximate total number of tasks that have completed execution
     -> Unit:
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> name: Str(applicationTaskExecutor)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 2024-08-20 06:42:13.391 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
NumberDataPoints #1
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> name: Str(taskScheduler)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 2024-08-20 06:42:13.391 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 181528.000000
Metric #45
Descriptor:
     -> Name: nacos_timer_seconds
     -> Description:
     -> Unit:
     -> DataType: Summary
SummaryDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(config)
     -> name: Str(writeConfigRpcRt)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 2024-08-20 06:42:13.391 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Count: 2
Sum: 0.114000
Metric #46
Descriptor:
     -> Name: jdbc_connections_min
     -> Description: Minimum number of idle connections in the pool.
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> name: Str(dataSource)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: -1.000000
Metric #47
Descriptor:
     -> Name: http_server_requests_seconds_max
     -> Description:
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> exception: Str(None)
     -> method: Str(GET)
     -> node: Str(127.0.0.1:8848)
     -> outcome: Str(SUCCESS)
     -> status: Str(200)
     -> uri: Str(/v2/core/cluster/node/list)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
NumberDataPoints #1
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> exception: Str(None)
     -> method: Str(GET)
     -> node: Str(127.0.0.1:8848)
     -> outcome: Str(SUCCESS)
     -> status: Str(200)
     -> uri: Str(/actuator/prometheus)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.003789
NumberDataPoints #2
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> exception: Str(None)
     -> method: Str(GET)
     -> node: Str(127.0.0.1:8848)
     -> outcome: Str(SUCCESS)
     -> status: Str(200)
     -> uri: Str(/v1/console/namespaces)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
NumberDataPoints #3
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> exception: Str(None)
     -> method: Str(GET)
     -> node: Str(127.0.0.1:8848)
     -> outcome: Str(SERVER_ERROR)
     -> status: Str(501)
     -> uri: Str(root)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
Metric #48
Descriptor:
     -> Name: jdbc_connections_max
     -> Description: Maximum number of active connections that can be allocated at the same time.
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> name: Str(dataSource)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: -1.000000
Metric #49
Descriptor:
     -> Name: executor_queued_tasks
     -> Description: The approximate number of tasks that are queued for execution
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0

peachisai commented 2 months ago

@dashpole Hi, I see this issue has been assigned. If there are any details I should provide, please ping me.

dashpole commented 2 months ago

Were you able to check this?

Can you look at the up and scrape_* metrics to see if any targets are failing to be scraped, or if any metrics are being dropped by the receiver?

peachisai commented 2 months ago

> Were you able to check this?
>
> Can you look at the up and scrape_* metrics to see if any targets are failing to be scraped, or if any metrics are being dropped by the receiver?

Hi, I did not find any errors. Do you mean I should configure the receiver to produce a scrape log? Sorry, I don't know how to do that. Could you give me some advice? This is my receiver config:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "nacos-monitoring"
          scrape_interval: 30s
          metrics_path: "/nacos/actuator/prometheus"
          static_configs:
            - targets: ['127.0.0.1:8848']
          relabel_configs:
            - source_labels: [ ]
              target_label: cluster
              replacement: nacos-cluster
            - source_labels: [ __address__ ]
              regex: (.+)
              target_label: node
              replacement: $$1

dashpole commented 2 months ago

You should get additional metrics named "up" and "scrape_series_added", and a few other scrape_* metrics. The scrape_* metrics let you know if any metrics were dropped or rejected by Prometheus.
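
A hedged sketch of how those metrics could be isolated in the debug output, assuming the filter processor from opentelemetry-collector-contrib is included in the Collector build. The names filter/scrape_health and metrics/scrape_health are hypothetical, and the fragment is meant to be merged into the existing configuration, which already defines the prometheus receiver and the debug exporter.

```yaml
processors:
  filter/scrape_health:          # hypothetical instance name
    metrics:
      include:
        match_type: regexp
        metric_names:
          - ^up$                 # 1 if the last scrape succeeded, 0 if it failed
          - ^scrape_.*           # scrape_samples_scraped, scrape_series_added, ...

service:
  pipelines:
    metrics/scrape_health:       # hypothetical second pipeline, alongside the existing one
      receivers:
        - prometheus
      processors:
        - filter/scrape_health
      exporters:
        - debug
```

The data point values are what matter here: an up value of 0 means the target could not be scraped, and a scrape_samples_post_metric_relabeling value lower than scrape_samples_scraped means samples were dropped during metric relabeling.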

peachisai commented 2 months ago

Hi, I filtered for the up and scrape_* metrics but still found nothing:

Descriptor:
     -> Name: up
     -> Description: The scraping was successful
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0

Metric #4
Descriptor:
     -> Name: scrape_series_added
     -> Description: The approximate number of new series in this scrape
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0

Descriptor:
     -> Name: scrape_samples_post_metric_relabeling
     -> Description: The number of samples remaining after metric relabeling was applied
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0

Metric #1
Descriptor:
     -> Name: scrape_duration_seconds
     -> Description: Duration of the scrape
     -> Unit: s
     -> DataType: Gauge
NumberDataPoints #0

dashpole commented 2 months ago

Right, you will need to look at the values of those metrics to see if any are being dropped, or if the target is down. Otherwise, if you can provide the full output of the prometheus endpoint (e.g. using curl), we can try to reproduce.

peachisai commented 2 months ago

> Right, you will need to look at the values of those metrics to see if any are being dropped, or if the target is down. Otherwise, if you can provide the full output of the prometheus endpoint (e.g. using curl), we can try to reproduce.

I went through the log in detail but still found nothing that mentions an error or a drop. May I email you my remote endpoint?

dashpole commented 2 months ago

> I went through the log in detail but still found nothing that mentions an error or a drop. May I email you my remote endpoint?

No, sorry. Please don't email me links. I also don't actually need your logs--I need the metrics scrape response.

peachisai commented 2 months ago

> > I went through the log in detail but still found nothing that mentions an error or a drop. May I email you my remote endpoint?
>
> No, sorry. Please don't email me links. I also don't actually need your logs--I need the metrics scrape response.

Hi, I found no drops or errors in the metrics scrape response, but the receiver skipped certain parts of it:

nacos_monitor{module="naming",name="mysqlHealthCheck",} 0.0
nacos_monitor{module="naming",name="emptyPush",} 0.0
nacos_monitor{module="config",name="configCount",} 2.0
nacos_monitor_count{module="core",name="raft_read_from_leader",} 0.0
nacos_monitor_sum{module="core",name="raft_read_from_leader",} 0.0
nacos_monitor{module="naming",name="tcpHealthCheck",} 0.0
nacos_monitor{module="naming",name="serviceChangedEventQueueSize",} 0.0
nacos_monitor{module="core",name="longConnection",} 0.0
nacos_monitor{module="naming",name="totalPush",} 0.0
nacos_monitor{module="naming",name="serviceSubscribedEventQueueSize",} 0.0
nacos_monitor{module="naming",name="serviceCount",} 0.0
nacos_monitor{module="naming",name="httpHealthCheck",} 0.0
nacos_monitor{module="naming",name="maxPushCost",} -1.0
nacos_monitor{module="config",name="longPolling",} 0.0
nacos_monitor{module="naming",name="failedPush",} 0.0
nacos_monitor{module="naming",name="leaderStatus",} 0.0
nacos_monitor{module="config",name="publish",} 0.0
nacos_monitor{module="config",name="dumpTask",} 0.0
nacos_monitor_count{module="core",name="raft_read_index_failed",} 0.0
nacos_monitor_sum{module="core",name="raft_read_index_failed",} 0.0
nacos_monitor{module="config",name="notifyTask",} 0.0
nacos_monitor{module="config",name="fuzzySearch",} 0.0
nacos_monitor{module="naming",name="avgPushCost",} -1.0
nacos_monitor{module="config",name="getConfig",} 0.0
nacos_monitor{module="naming",name="totalPushCountForAvg",} 0.0
nacos_monitor{module="naming",name="subscriberCount",} 0.0
nacos_monitor{module="naming",name="ipCount",} 0.0
nacos_monitor{module="config",name="notifyClientTask",} 0.0
nacos_monitor{module="naming",name="totalPushCostForAvg",} 0.0
nacos_monitor{module="naming",name="pushPendingTaskCount",} 0.0
# HELP nacos_monitor_max  

The samples above nacos_monitor_sum{module="core",name="raft_read_index_failed",} 0.0 are not scraped. The metrics below it are scraped.

Here is the scrape log:

Descriptor:
     -> Name: disk_total_bytes
     -> Description: Total space for path
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
     -> path: Str(D:\ideaprojects\github\nacos\.)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: 296022437888.000000
Metric #69
Descriptor:
     -> Name: nacos_monitor
     -> Description:
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(core)
     -> name: Str(raft_read_index_failed)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: 0.000000
NumberDataPoints #1
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(config)
     -> name: Str(notifyTask)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: 0.000000
NumberDataPoints #2
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(config)
     -> name: Str(fuzzySearch)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: 0.000000
NumberDataPoints #3
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(naming)
     -> name: Str(avgPushCost)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: -1.000000
NumberDataPoints #4
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(config)
     -> name: Str(getConfig)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: 0.000000
NumberDataPoints #5
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(naming)
     -> name: Str(totalPushCountForAvg)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: 0.000000
NumberDataPoints #6
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(naming)
     -> name: Str(subscriberCount)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: 0.000000
NumberDataPoints #7
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(naming)
     -> name: Str(ipCount)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: 0.000000
peachisai commented 2 months ago

I will try to debug the code.

github-actions[bot] commented 3 days ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
