raffis / mongodb-query-exporter

Prometheus MongoDB aggregation query exporter
MIT License

Collected metric was collected before with the same name and label values #272

Open Madjew opened 5 months ago

Madjew commented 5 months ago

Describe the bug

An error has occurred while serving metrics: collected metric "count_docs_by_type_and_region_last_hour_global" { label:<name:"_idForm" value:"107-1/\321\203" > label:<name:"_idRegion" value:"10" > label:<name:"server" value:"main" > gauge: } was collected before with the same name and label values

To Reproduce

configuration

version: 3.0
bind: 0.0.0.0:9412
metricsPath: /metrics
log:
  encoding: json
  level: debug
  development: false
  disableCaller: false
global:
  queryTimeout: "120s"
  maxConnection: 3
  defaultCache: 0
servers:
- name: main
  uri: mongodb://url
aggregations:
- database: db
  collection: coll
  metrics:
  - name: count_docs_by_type_and_region_last_hour_global
    type: gauge
    help: ''
    value: count
    labels: [_idRegion,_idForm]
    constLabels: {} 
  mode: pull
  cache: "35m"
  pipeline: |
    [
      {
        "$group": {
          "_id": {
            "region": "$clinic.address.regionCode",
            "form": "$form.type"
          },
          "count": {
            "$sum": 1
          }
        }
      },
      {
        "$project": {
          "_idRegion": "$_id.region",
          "_idForm": "$_id.form",
          "count": 1,
          "_id": 0
        }
      }
    ]
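
With this config, every document returned by the aggregation becomes one gauge sample, with label values taken from _idRegion and _idForm plus the server label added by the exporter. On /metrics that looks roughly like this (sample values taken from the debug log further below; the exact HELP/TYPE lines may differ since help is empty):

    # TYPE count_docs_by_type_and_region_last_hour_global gauge
    count_docs_by_type_and_region_last_hour_global{_idForm="107-1/у",_idRegion="27",server="main"} 305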

Environment

Additional context

We use this pipeline to find documents created within the last hour by prepending the following stages (the field is still named thirtyDaysAgo, but the offset of -3600000 ms is one hour); the combined pipeline is shown after the fragment:

      {
        "$addFields": {
          "thirtyDaysAgo": {
            "$add": [
              "$$NOW",
              -3600000
            ]
          }
        }
      },
      {
        "$match": {
          "$expr": {
            "$gte": [
              "$creationDate",
              "$thirtyDaysAgo"
            ]
          }
        }
      },
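
For reference, with these stages prepended the full pipeline in the exporter config would look roughly like this:

      [
        {
          "$addFields": {
            "thirtyDaysAgo": { "$add": ["$$NOW", -3600000] }
          }
        },
        {
          "$match": {
            "$expr": { "$gte": ["$creationDate", "$thirtyDaysAgo"] }
          }
        },
        {
          "$group": {
            "_id": {
              "region": "$clinic.address.regionCode",
              "form": "$form.type"
            },
            "count": { "$sum": 1 }
          }
        },
        {
          "$project": {
            "_idRegion": "$_id.region",
            "_idForm": "$_id.form",
            "count": 1,
            "_id": 0
          }
        }
      ]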

But when we run the same query over all documents, i.e. without the time filter, we get the error: An error has occurred while serving metrics: collected metric "count_docs_by_type_and_region_last_hour_global" { label:<name:"_idForm" value:"107-1/\321\203" > label:<name:"_idRegion" value:"10" > label:<name:"server" value:"main" > gauge: } was collected before with the same name and label values

Logs

"level":"info","ts":1717515121.2494338,"caller":"v3/config.go:103","msg":"will listen on 0.0.0.0:9412"}
{"level":"debug","ts":1717515121.2498593,"caller":"collector/collector.go:171","msg":"register metric count_docs_by_type_and_region_last_hour_global"}
{"level":"debug","ts":1717515171.8772416,"caller":"collector/collector.go:225","msg":"start collecting metrics"}
{"level":"debug","ts":1717515171.8773785,"caller":"collector/collector.go:367","msg":"run aggregation [\n  {\n    \"$group\": {\n      \"_id\": {\n        \"region\": \"$clinic.address.regionCode\",\n        \"form\": \"$form.type\"\n      },\n      \"count\": {\n        \"$sum\": 1\n      }\n    }\n  },\n  {\n    \"$project\": {\n      \"_idRegion\": \"$_id.region\",\n      \"_idForm\": \"$_id.form\",\n      \"count\": 1,\n      \"_id\": 0\n    }\n  }\n]\n on server main"}
{"level":"debug","ts":1717515179.8331606,"caller":"collector/collector.go:225","msg":"start collecting metrics"}
{"level":"debug","ts":1717515179.83426,"caller":"collector/collector.go:367","msg":"run aggregation [\n  {\n    \"$group\": {\n      \"_id\": {\n        \"region\": \"$clinic.address.regionCode\",\n        \"form\": \"$form.type\"\n      },\n      \"count\": {\n        \"$sum\": 1\n      }\n    }\n  },\n  {\n    \"$project\": {\n      \"_idRegion\": \"$_id.region\",\n      \"_idForm\": \"$_id.form\",\n      \"count\": 1,\n      \"_id\": 0\n    }\n  }\n]\n on server main"}
{"level":"debug","ts":1717515229.8149161,"caller":"collector/collector.go:386","msg":"found record map[_idForm:107-1/у _idRegion:27 count:%!s(int32=305)] from aggregation [\n  {\n    \"$group\": {\n      \"_id\": {\n        \"region\": \"$clinic.address.regionCode\",\n        \"form\": \"$form.type\"\n      },\n      \"count\": {\n        \"$sum\": 1\n      }\n    }\n  },\n  {\n    \"$project\": {\n      \"_idRegion\": \"$_id.region\",\n      \"_idForm\": \"$_id.form\",\n      \"count\": 1,\n      \"_id\": 0\n    }\n  }\n]\n"}
{"level":"debug","ts":1717515229.8149953,"caller":"collector/collector.go:386","msg":"found record map[_idForm:148-1/у-04(л) _idRegion:30 count:%!s(int32=187458)] from aggregation [\n  {\n    \"$group\": {\n      \"_id\": {\n        \"region\": \"$clinic.address.regionCode\",\n        \"form\": \"$form.type\"\n      },\n      \"count\": {\n        \"$sum\": 1\n      }\n    }\n  },\n  {\n    \"$project\": {\n      \"_idRegion\": \"$_id.region\",\n      \"_idForm\": \"$_id.form\",\n      \"count\": 1,\n      \"_id\": 0\n    }\n  }\n]\n"}
{"level":"debug","ts":1717515229.8150616,"caller":"collector/collector.go:386","msg":"found record map[_idForm:148-1/у-88 _idRegion:54 count:%!s(int32=1)] from aggregation [\n  {\n    \"$group\": {\n      \"_id\": {\n        \"region\": \"$clinic.address.regionCode\",\n        \"form\": \"$form.type\"\n      },\n      \"count\": {\n        \"$sum\": 1\n      }\n    }\n  },\n  {\n    \"$project\": {\n      \"_idRegion\": \"$_id.region\",\n      \"_idForm\": \"$_id.form\",\n      \"count\": 1,\n      \"_id\": 0\n    }\n  }\n]\n"}
{"level":"debug","ts":1717515229.8150969,"caller":"collector/collector.go:386","msg":"found record map[_idForm:107-1/у _idRegion:64 count:%!s(int32=183)] from aggregation [\n  {\n    \"$group\": {\n      \"_id\": {\n        \"region\": \"$clinic.address.regionCode\",\n        \"form\": \"$form.type\"\n      },\n      \"count\": {\n        \"$sum\": 1\n      }\n    }\n  },\n  {\n    \"$project\": {\n      \"_idRegion\": \"$_id.region\",\n      \"_idForm\": \"$_id.form\",\n      \"count\": 1,\n      \"_id\": 0\n    }\n  }\n]\n"}
Loki-Afro commented 1 week ago

I'm encountering this issue right now too.

In my case it complains about the very first aggregation, even though I only added a new one at the end and was just checking the logs.

pipeline in question:

      [
        {"$count":"count"}
      ]
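
$count returns at most one document of the form below; when the collection is empty it returns no document at all, which is what overrideEmpty/emptyValue in the config further down handle (the number here is just an example):

      { "count": 1234 }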

A new constLabel was recently added to that aggregation, along with cache: 5m, which might be the cause.

Update

This is the fixed version. Applying cache: 5m causes the collector to stutter and only report every 5m, and not just this metric but all other metrics as well (a sketch of the cached variant follows the config below for comparison):

aggregations: 
  - database: scapp
    collection: schools
    metrics:
    - name: sc_schools
      type: gauge
      help: 'The number of schools'
      value: count
      overrideEmpty: true
      emptyValue: 0
      labels: []
      constLabels:
        mongodb_collection: schools
        longterm: 'true'
    mode: pull
    pipeline: |
      [
        {"$count":"count"}
      ]
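
For comparison, the variant that showed the problem presumably differed only by the per-aggregation cache, roughly like this (a reconstruction, not the exact original config):

    mode: pull
    cache: "5m"   # presumed difference: the per-aggregation cache
    pipeline: |
      [
        {"$count":"count"}
      ]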

Looking at the ServiceMonitor, the interval is set to 60s with a scrapeTimeout of 10s, if that matters. For other metrics which have a cache like 1h this does not happen.
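
For context, the ServiceMonitor settings described above would correspond to something like this (the name, selector labels and port name are placeholders, not the actual manifest):

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: mongodb-query-exporter        # placeholder
    spec:
      selector:
        matchLabels:
          app: mongodb-query-exporter     # placeholder
      endpoints:
      - port: metrics                     # placeholder port name
        path: /metrics
        interval: 60s
        scrapeTimeout: 10s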

Hope that gives some insight.