m3db / m3

M3 monorepo - Distributed TSDB, Aggregator and Query Engine, Prometheus Sidecar, Graphite Compatible, Metrics Platform
https://m3db.io/
Apache License 2.0

[BUG] Wrong result (sudden peak) when using irate/rate/increase to smooth original points #3894

Open · naughtyGitCat opened this issue 2 years ago

naughtyGitCat commented 2 years ago

When querying counter datapoints stored in M3DB, irate/rate/increase show a different trend:

image

image

The expected result is a smooth rate graph.

I have tried querying directly with m3query, with m3coordinator, and via prometheus -> remote_read -> m3coordinator; all show the same wrong result.
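For reference, the sketch below (plain Go, not M3 or Prometheus code) mimics irate's documented semantics: only the last two raw samples in the selected range are used, and any decrease between them is treated as a counter reset, so a single stale or out-of-order datapoint returned by the store is enough to produce a sudden peak.

```go
package main

import "fmt"

// sample is a single raw datapoint of a counter series.
type sample struct {
	ts  int64   // unix seconds
	val float64 // counter value
}

// irate mimics the instant-rate semantics: use only the last two samples,
// and treat a decrease as a counter reset (the last value becomes the increase).
func irate(samples []sample) float64 {
	if len(samples) < 2 {
		return 0
	}
	prev, last := samples[len(samples)-2], samples[len(samples)-1]
	delta := last.val - prev.val
	if delta < 0 { // counter appears to have reset
		delta = last.val
	}
	return delta / float64(last.ts-prev.ts)
}

func main() {
	clean := []sample{{0, 1000}, {15, 1015}, {30, 1030}}
	fmt.Println(irate(clean)) // 1 per second: a smooth rate

	// A single stale or out-of-order datapoint looks like a counter reset:
	spiky := []sample{{0, 1000}, {15, 1015}, {30, 990}}
	fmt.Println(irate(spiky)) // 990 / 15 = 66 per second: a sudden peak
}
```

This only illustrates the PromQL function's behaviour; whether M3DB is actually returning such datapoints is what a raw dump would show.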

naughtyGitCat commented 2 years ago

Below is an irate result graph vs. the original points:

image

image

naughtyGitCat commented 2 years ago

Version

m3dbnode --help
2021/11/02 17:27:23 Go Runtime version: go1.13.8
2021/11/02 17:27:23 Build Version:      v0.15.9
2021/11/02 17:27:23 Build Revision:     1b7e6a758
2021/11/02 17:27:23 Build Branch:       HEAD
2021/11/02 17:27:23 Build Date:         2020-08-12-15:32:29
2021/11/02 17:27:23 Build TimeUnix:     1597246349
./m3coordinator --help
2021/11/02 17:29:06 Go Runtime version: go1.13.8
2021/11/02 17:29:06 Build Version:      v1.0.0
2021/11/02 17:29:06 Build Revision:     a3853ee56
2021/11/02 17:29:06 Build Branch:       HEAD
2021/11/02 17:29:06 Build Date:         2020-11-19-10:00:56
2021/11/02 17:29:06 Build TimeUnix:     1605780056
Prometheus version: 2.15.2

./m3query --help
2021/11/02 17:31:13 Go Runtime version: go1.16.5
2021/11/02 17:31:13 Build Version:      v1.3.0
2021/11/02 17:31:13 Build Revision:     4cd1b14a4
2021/11/02 17:31:13 Build Branch:       HEAD
2021/11/02 17:31:13 Build Date:         2021-10-10-12:36:17
2021/11/02 17:31:13 Build TimeUnix:     1633869377

Prometheus config


global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  scrape_timeout:      10s
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 127.0.0.1:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - rules/*.rules
  # - "first_rules.yml"
  # - "second_rules.yml"

remote_read:
  # m3db coordinator
  - url: "http://localhost:7201/api/v1/prom/remote/read"
    read_recent: true
    remote_timeout: 1m

Coordinator config

listenAddress: "0.0.0.0:7201"

tagOptions:
# Configuration setting for generating metric IDs from tags.
  idScheme: quoted

metrics:
  scope:
    prefix: "coordinator"
  prometheus:
    handlerPath: /metrics
    listenAddress: ZZZZZ:7203 # until https://github.com/m3db/m3/issues/682 is resolved
  sanitization: prometheus
  samplingRate: 1.0
  extended: none

clusters:
   - namespaces:
# Namespaces served by this cluster, with their retention (and resolution for aggregated).
       - namespace: unaggregated
         retention: 168h
         type: unaggregated
       - namespace: aggregated
         retention: 26888h
         type: aggregated
         resolution: 1m
     client:
       config:
         service:
           env: default_env
           zone: embedded
           service: m3db
           cacheDir: /data1/m3cache
           etcdClusters:
             - zone: embedded
               endpoints:
# We have five M3DB nodes, but only the three seed nodes are listed here.
                 - 127.0.0.1:2379
                 - XXXXXXX:2379
                 - YYYYYYY:2379
       writeConsistencyLevel: majority
       # readConsistencyLevel: unstrict_majority
       readConsistencyLevel: majority
       writeTimeout: 10s
       fetchTimeout: 15s
       connectTimeout: 20s
       writeRetry:
         initialBackoff: 500ms
         backoffFactor: 3
         maxRetries: 2
         jitter: true
       fetchRetry:
         initialBackoff: 500ms
         backoffFactor: 2
         maxRetries: 3
         jitter: true
       backgroundHealthCheckFailLimit: 4
       backgroundHealthCheckFailThrottleFactor: 0.5
naughtyGitCat commented 2 years ago

Expected result

This is another cluster where the sidecar scrapes data with Prometheus and writes directly to Prometheus; it gives the correct result:

image

naughtyGitCat commented 2 years ago

@schallert

wesleyk commented 2 years ago

@naughtyGitCat is the raw prom metric data equivalent to the raw m3db data? Could you post a dump of datapoints from both raw timeseries (prom vs m3db)? We can try to repro it in a test case.
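A rough sketch for producing such a dump, assuming both Prometheus (on :9090) and m3coordinator (on :7201) serve the standard Prometheus /api/v1/query endpoint; the metric name, label, and 5m window are placeholders:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// dump fetches and prints the raw JSON response for a PromQL query.
func dump(base, query string) {
	u := base + "/api/v1/query?query=" + url.QueryEscape(query)
	resp, err := http.Get(u)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("--- %s ---\n%s\n", base, body)
}

func main() {
	// An instant query with a range-vector selector returns the raw samples in the window.
	const q = `some_counter_total{instance="example"}[5m]`
	dump("http://localhost:9090", q) // Prometheus
	dump("http://localhost:7201", q) // m3coordinator / m3query
}
```

Since a range selector returns the raw samples rather than an evaluated value, the two responses are easy to diff directly.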

naughtyGitCat commented 2 years ago

m3-prom-points.zip

Here are the raw points; something goes wrong when applying irate(xxx[2m]) to them.
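A quick way to sanity-check the attached points, assuming they can be exported as plain "timestamp value" lines (the zip's actual format isn't described here): the sketch below flags decreases, which rate/irate/increase treat as counter resets, as well as duplicate or out-of-order timestamps.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	f, err := os.Open("points.txt") // hypothetical plain-text export of the attached datapoints
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var prevTS int64
	var prevVal float64
	first := true
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		var ts int64
		var val float64
		if _, err := fmt.Sscan(sc.Text(), &ts, &val); err != nil {
			continue // skip malformed lines
		}
		if !first {
			if ts <= prevTS {
				fmt.Printf("duplicate/out-of-order timestamp at %d\n", ts)
			}
			if val < prevVal {
				fmt.Printf("decrease at %d: %v -> %v (counter functions treat this as a reset)\n", ts, prevVal, val)
			}
		}
		prevTS, prevVal, first = ts, val, false
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
}
```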

pahla1 commented 2 years ago

Same problem. The raw results from Prometheus and M3DB are the same, but the irate function on M3DB returns all zero values.

Any update?