m3db / m3

M3 monorepo - Distributed TSDB, Aggregator and Query Engine, Prometheus Sidecar, Graphite Compatible, Metrics Platform
https://m3db.io/
Apache License 2.0

Remote Write Failure due to conflicts in Timeseries protobuf def #4255

Closed · christopherzli closed this issue 8 months ago

christopherzli commented 8 months ago

Filing M3 Issues

General Issues

General issues are any non-performance related issues (data integrity, ease of use, error messages, configuration, documentation, etc).

Please provide the following information along with a description of the issue that you're experiencing:

  1. What service is experiencing the issue? (M3Coordinator, M3DB, M3Aggregator, etc) M3 Coordinator
  2. What is the configuration of the service? Please include any YAML files, as well as namespace / placement configuration (with any sensitive information anonymized if necessary). The proto definition in question is https://github.com/m3db/m3/blob/master/src/query/generated/proto/prompb/types.proto#L17
  3. How are you using the service? For example, are you performing read/writes to the service via Prometheus, or are you using a custom script? We use Grafana Agent (Prometheus-compatible) to remote-write metrics to the M3 coordinator.
  4. Is there a reliable way to reproduce the behavior? If so, please provide detailed instructions. Yes: send any exemplar metric/histogram from a Prometheus-compatible agent to the M3 write coordinator. Offending PR: https://github.com/m3db/m3/pull/2628/files#diff-ba5cbda5df32bcfab5e7acaf8f7a5324b9647684ea4f4b19574a4170c37fb93f We found an inconsistency between the M3DB TimeSeries protobuf and the upstream Prometheus definition: M3 declares a `MetricType type` field where Prometheus declares an `Exemplar exemplars` field, so remote-write requests from our Prometheus-compatible metric scraper fail against M3DB.

error message for our metric scraper:

caller=dedupe.go:112 agent=prometheus instance=xxxx component=remote level=error remote_name=xxx url=http://localhost:9900/api/v1/prom/remote/write msg="non-recoverable error" count=1992 exemplarCount=8 err="server returned HTTP status 400 Bad Request: {\"status\":\"error\",\"error\":\"proto: wrong wireType = 2 for field Type\"}"

We should add support for exemplars/histograms. At the very least, M3 should not conflict with the Prometheus protobuf definition: the field number of the TimeSeries `Type` field could be changed (e.g., moved to 103) to avoid the clash.
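The 400 error above is a classic protobuf wire-type clash: two definitions reuse the same field number with different wire types, so a decoder expecting one wire type rejects payloads encoded with the other. A minimal Go sketch of how protobuf field keys are formed (the field number 3 used here is an assumption inferred from the error message, not taken from the actual .proto files):

```go
package main

import "fmt"

// Protobuf encodes each field as a key: (field_number << 3) | wire_type.
const (
	wireVarint = 0 // wire type 0: varints, used for enum fields such as MetricType
	wireBytes  = 2 // wire type 2: length-delimited, used for embedded messages such as Exemplar
)

// tag computes the protobuf key for a field number and wire type.
func tag(fieldNumber, wireType int) int {
	return fieldNumber<<3 | wireType
}

func main() {
	// Hypothetical illustration: if M3 declares an enum (varint) field at the
	// same field number where Prometheus sends repeated Exemplar messages
	// (length-delimited), the encoder and decoder disagree on the wire type,
	// and the decoder fails with "wrong wireType = 2 for field Type".
	fmt.Println(tag(3, wireVarint)) // 24 (0x18): varint field 3
	fmt.Println(tag(3, wireBytes))  // 26 (0x1a): length-delimited field 3
}
```

Renumbering the M3-specific field to something well clear of upstream's numbers (such as the proposed 103) would make the two definitions wire-compatible for the fields they share.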

Performance issues

If the issue is performance related, please provide the following information along with a description of the issue that you're experiencing:

  1. What service is experiencing the performance issue? (M3Coordinator, M3DB, M3Aggregator, etc)
  2. Approximately how many datapoints per second is the service handling?
  3. What is the approximate series cardinality that the service is handling in a given time window? I.e., how many unique time series are being measured?
  4. What is the hardware configuration (number of CPU cores, amount of RAM, disk size and types, etc.) that the service is running on? Is the service the only process running on the host, or is it colocated with other software?
  5. What is the configuration of the service? Please include any YAML files, as well as namespace / placement configuration (with any sensitive information anonymized if necessary).
  6. How are you using the service? For example, are you performing read/writes to the service via Prometheus, or are you using a custom script?

In addition to the above information, CPU and heap profiles are always greatly appreciated.

CPU / Heap Profiles

CPU and heap profiles are critical to helping us debug performance issues. All our services run with the net/http/pprof server enabled by default.

Instructions for obtaining CPU / heap profiles for various services are below, please attach these profiles to the issue whenever possible.

M3Coordinator

CPU curl <HOST_NAME>:<PORT(default 7201)>/debug/pprof/profile?seconds=5 > m3coord_cpu.out

Heap curl <HOST_NAME>:<PORT(default 7201)>/debug/pprof/heap > m3coord_heap.out

M3DB

CPU curl <HOST_NAME>:<PORT(default 9004)>/debug/pprof/profile?seconds=5 > m3db_cpu.out

Heap curl <HOST_NAME>:<PORT(default 9004)>/debug/pprof/heap > m3db_heap.out
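The curl commands above work because Go's net/http/pprof package registers the /debug/pprof/* handlers as a side effect of being imported. A minimal sketch of how a service exposes them (the port comment is taken from the coordinator example above; the in-process request is just to demonstrate the handlers are registered):

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	_ "net/http/pprof" // side-effect import: registers /debug/pprof/* on http.DefaultServeMux
)

// pprofIndexStatus issues an in-process request against the default mux to
// confirm the profiling index endpoint is registered.
func pprofIndexStatus() int {
	req := httptest.NewRequest("GET", "/debug/pprof/", nil)
	rec := httptest.NewRecorder()
	http.DefaultServeMux.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(pprofIndexStatus()) // 200 once net/http/pprof is imported
	// A real service would then serve the mux, e.g.:
	// http.ListenAndServe(":7201", nil) // 7201 matches the coordinator port above
}
```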

M3DB Grafana Dashboard Screenshots

If the service experiencing performance issues is M3DB and you're monitoring it using Prometheus, any screenshots you could provide using this dashboard would be helpful.