go-graphite / carbonapi

Implementation of graphite API (graphite-web) in golang

[BUG] Duplicated series with VictoriaMetrics backend #646

Open keliss opened 3 years ago

keliss commented 3 years ago

Describe the bug
Hi. Any metric queried from carbonapi results in multiple absolutely identical series with the same name. The number of duplicates seems to depend on whether there were any data merges in VictoriaMetrics (e.g. the number of duplicates can be 5, and after forcing a merge it drops to 2).

CarbonAPI Version
carbonapi-0.15.4~1-1

Logs
Literally any metric and request should reproduce the issue, but let me know if request-response logs are still needed.

CarbonAPI Configuration:

listen: "0.0.0.0:8182"
cache:
  type: "mem"
  size_mb: 0
  defaultTimeoutSec: 60
cpus: 0
tz: ""
graphite:
  host: "localhost:2003"
  interval: "60s"
  prefix: "carbon.api"
  pattern: "{prefix}.{fqdn}"
pidFile: ""
logger:
  - logger: ""
    file: "/var/log/carbonapi/carbonapi-1.log"
    level: "warn"
    encoding: "console"
upstreams:
  buckets: 10
  slowLogThreshold: 7s
  timeouts:
    find: "3s"
    render: "15s"
    connect: "200ms"
  keepAliveInterval: "30s"
  maxIdleConnsPerHost: 1000
  backendsv2:
    backends:
      - groupName: "victoriametrics"
        protocol: "prometheus" # also tried the victoriametrics protocol
        lbMethod: "rr"
        maxTries: 2
        backendOptions:
          step: "1m"
        servers:
          - "http://localhost:8481/select/0/prometheus"

VictoriaMetrics Configuration: Cluster version, 1.64.1. Metrics are ingested like this: carbon-c-relay -> vmagent -> vminsert -> vmstorage, and read like this: grafana -> carbonapi -> vmselect -> vmstorage.

VMSTORAGE_ARGS="-retentionPeriod=3y -storageDataPath=/opt/vmstorage/data/ -dedup.minScrapeInterval=60s -finalMergeDelay=10080m"
VMINSERT_ARGS="-storageNode=metrics-storage-1:8400"
VMSELECT_ARGS="-storageNode=metrics-storage-1:8401 -search.disableCache -search.denyPartialResponse -search.noStaleMarkers -search.maxUniqueTimeseries=3000000"
VMAGENT_ARGS="-remoteWrite.url=http://localhost:8480/insert/0/prometheus/api/v1/write -promscrape.config=/etc/victoriametrics/prometheus.yml -promscrape.noStaleMarkers -graphiteListenAddr=:2004 -graphiteTrimTimestamp=1m -remoteWrite.maxDiskUsagePerURL=20GB -remoteWrite.showURL -remoteWrite.tmpDataPath=/opt/vmagent/data/"
Civil commented 3 years ago

Can you describe a way for me to replicate that behavior on an artificial setup (e.g. imagine that I've just installed VM and carbonapi: what should I feed it to trigger that behavior)?

My test setup for VM + carbonapi looks like this:

write path: telegraf -> (influxdb writer) -> VM

read path: carbonapi -> VM

With that setup I've never seen such behavior. I'll try to reproduce it again, though (I'm not sure how I can delay data merges; it might depend on what VM returns). I suspect that in my setup I just don't usually see unmerged data (I don't have that much load), so I'll try to throttle disk I/O for VM to replicate that behavior.

It might be helpful if you could provide what VM returns for a typical request from carbonapi (any metric that triggers that behavior would work). You can enable debug logs in carbonapi and it will eventually print the exact query it issues towards VictoriaMetrics; you can then query some data yourself and replace the actual metric names with something that doesn't contain any sensitive information (as well as, for example, replacing all data with 1.0 instead of whatever values were there).
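For clarity, that means flipping the level in the logger section of your config to debug; a sketch based on the config you posted above:

logger:
  - logger: ""
    file: "/var/log/carbonapi/carbonapi-1.log"
    level: "debug"
    encoding: "console"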

Civil commented 3 years ago

Also, are you sure that it happens for both protocol: prometheus and protocol: victoriametrics? Prometheus is a fallback in case it was impossible to identify the feature set (btw, the logs might state what version carbonapi thinks it is), and the VM protocol is actually far more tested than the prometheus one (it uses VM-specific endpoints for almost everything at this moment).

But version detection relies on you using official packages, or at least on VM returning its version correctly on the /metrics endpoint (the vm_app_version metric and its short_version field).
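You can check what VM reports yourself; a sketch, adjust the host/port to your vmselect or single-node instance:

# inspect the version string that carbonapi's detection would see
curl -s http://localhost:8481/metrics | grep vm_app_version
# the output should look roughly like:
# vm_app_version{version="victoria-metrics-...", short_version="v1.64.1"} 1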

keliss commented 3 years ago

Unfortunately, I had to terminate my setup because the debug mode quickly filled up the root disk, so I can't pull the logs anymore, sorry. I'm sure that it is reproducible on both the prometheus and the victoriametrics protocols. Maybe my config is incorrect, or you're running a different version of carbonapi/VM? You should be able to reproduce it without introducing any artificial delay for merges, just with the configs and versions I listed above. The metric name is also arbitrary, as I'm seeing this for all metrics where the wildcard matches more than one node (like when some intermediary node is a hostname: my.metric.base.name.10_0_5_1.value and my.metric.base.name.10_0_5_2.value).
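If it helps with reproduction, series like these can be fed straight into the Graphite listener; a sketch, where port 2004 matches the vmagent config above and the nc flags vary between netcat implementations:

echo "my.metric.base.name.10_0_5_1.value 1 $(date +%s)" | nc -q0 localhost 2004
echo "my.metric.base.name.10_0_5_2.value 2 $(date +%s)" | nc -q0 localhost 2004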

keliss commented 3 years ago

In the end, I decided to give it another try with a single-node setup. And the weirdest thing about this bug is that everything works fine on the single-node version. Perhaps something's wrong with how carbonapi interacts with vmselect?

keliss commented 3 years ago

And now I was able to reproduce it on the single-node version too. It appears less often, but it's still there. Maybe the flags -selfScrapeInterval=60s and -search.disableCache play some role here, because I was only able to see it after restarting the node with these parameters. Protocol used: victoriametrics.
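For reference, the single-node instance was started roughly like this (a sketch; the binary path is a placeholder, the two flags above are the relevant part):

/path/to/victoria-metrics -selfScrapeInterval=60s -search.disableCache -graphiteListenAddr=:2004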

Civil commented 3 years ago

Thx. I'll try to run VM with those flags and see how it goes. I'll likely try that next week or so.

I also had plans to try implementing VictoriaMetrics' Graphite /render API (or at least to remember why I haven't done that before). Maybe it will work better there (as well).

keliss commented 3 years ago

FYI: unfortunately, the same problem appears even with ClickHouse as a backend.

Civil commented 3 years ago

And what are the replies from the backend in that case? carbonapi by design does not do any deduplication of metrics that the backend returned.

I mean, can you provide an example query, the reply from the backend, and the reply from carbonapi?

keliss commented 3 years ago

Yes, here they are:

carbonapi:
curl localhost:8182/render?target=node1.node2.node3.node4.node5.node6.node7.node8.%2A.node10.node11.node12.node13\&from=1629999830\&until=1630000764\&format=json\&maxDataPoints=25

[
   {
      "target":"node1.node2.node3.node4.node5.node6.node7.node8.node9.node10.node11.node12.node13",
      "datapoints":[
         [
            6043941,
            1629999840
         ],
         [
            6044388,
            1629999900
         ],
         [
            6044934,
            1629999960
         ],
         [
            6045494,
            1630000020
         ],
         [
            6045995,
            1630000080
         ],
         [
            6046596,
            1630000140
         ],
         [
            6047207,
            1630000200
         ],
         [
            6047848,
            1630000260
         ],
         [
            6048477,
            1630000320
         ],
         [
            6049073,
            1630000380
         ],
         [
            6049736,
            1630000440
         ],
         [
            6050316,
            1630000500
         ],
         [
            6050915,
            1630000560
         ],
         [
            6051398,
            1630000620
         ],
         [
            6051685,
            1630000680
         ],
         [
            6052129,
            1630000740
         ]
      ],
      "tags":{
         "name":"node1.node2.node3.node4.node5.node6.node7.node8.node9.node10.node11.node12.node13"
      }
   },
   {
      "target":"node1.node2.node3.node4.node5.node6.node7.node8.node9.node10.node11.node12.node13",
      "datapoints":[
         [
            6043941,
            1629999840
         ],
         [
            6044388,
            1629999900
         ],
         [
            6044934,
            1629999960
         ],
         [
            6045494,
            1630000020
         ],
         [
            6045995,
            1630000080
         ],
         [
            6046596,
            1630000140
         ],
         [
            6047207,
            1630000200
         ],
         [
            6047848,
            1630000260
         ],
         [
            6048477,
            1630000320
         ],
         [
            6049073,
            1630000380
         ],
         [
            6049736,
            1630000440
         ],
         [
            6050316,
            1630000500
         ],
         [
            6050915,
            1630000560
         ],
         [
            6051398,
            1630000620
         ],
         [
            6051685,
            1630000680
         ],
         [
            6052129,
            1630000740
         ]
      ],
      "tags":{
         "name":"node1.node2.node3.node4.node5.node6.node7.node8.node9.node10.node11.node12.node13"
      }
   },
   {
      "target":"node1.node2.node3.node4.node5.node6.node7.node8.node9_another.node10.node11.node12.node13",
      "datapoints":[
         [
            935,
            1629999840
         ],
         [
            935,
            1629999900
         ],
         [
            936,
            1629999960
         ],
         [
            936,
            1630000020
         ],
         [
            936,
            1630000080
         ],
         [
            936,
            1630000140
         ],
         [
            936,
            1630000200
         ],
         [
            936,
            1630000260
         ],
         [
            936,
            1630000320
         ],
         [
            936,
            1630000380
         ],
         [
            936,
            1630000440
         ],
         [
            936,
            1630000500
         ],
         [
            936,
            1630000560
         ],
         [
            936,
            1630000620
         ],
         [
            936,
            1630000680
         ],
         [
            936,
            1630000740
         ]
      ],
      "tags":{
         "name":"node1.node2.node3.node4.node5.node6.node7.node8.node9_another.node10.node11.node12.node13"
      }
   },
   {
      "target":"node1.node2.node3.node4.node5.node6.node7.node8.node9_another.node10.node11.node12.node13",
      "datapoints":[
         [
            935,
            1629999840
         ],
         [
            935,
            1629999900
         ],
         [
            936,
            1629999960
         ],
         [
            936,
            1630000020
         ],
         [
            936,
            1630000080
         ],
         [
            936,
            1630000140
         ],
         [
            936,
            1630000200
         ],
         [
            936,
            1630000260
         ],
         [
            936,
            1630000320
         ],
         [
            936,
            1630000380
         ],
         [
            936,
            1630000440
         ],
         [
            936,
            1630000500
         ],
         [
            936,
            1630000560
         ],
         [
            936,
            1630000620
         ],
         [
            936,
            1630000680
         ],
         [
            936,
            1630000740
         ]
      ],
      "tags":{
         "name":"node1.node2.node3.node4.node5.node6.node7.node8.node9_another.node10.node11.node12.node13"
      }
   }
]
graphite-clickhouse:
curl -H 'Content-Type: application/json' -H 'X-Gch-Debug-Output: a' -d "{\"metrics\":[{\"name\":\"test\",\"startTime\":1629999830,\"stopTime\":1630000764,\"pathExpression\":\"node1.node2.node3.node4.node5.node6.node7.node8.*.node10.node11.node12.node13\",\"maxDataPoints\":25}]}" 'localhost:9090/render/?format=json'

{
   "metrics":[
      {
         "name":"node1.node2.node3.node4.node5.node6.node7.node8.node9_another.node10.node11.node12.node13",
         "pathExpression":"node1.node2.node3.node4.node5.node6.node7.node8.*.node10.node11.node12.node13",
         "consolidationFunc":"avg",
         "startTime":1629999840,
         "stopTime":1630000800,
         "stepTime":60,
         "xFilesFactor":0.000000,
         "values":[
            935.000000,
            935.000000,
            936.000000,
            936.000000,
            936.000000,
            936.000000,
            936.000000,
            936.000000,
            936.000000,
            936.000000,
            936.000000,
            936.000000,
            936.000000,
            936.000000,
            936.000000,
            936.000000
         ],
         "requestStartTime":1629999830,
         "requestStopTime":1630000764
      },
      {
         "name":"node1.node2.node3.node4.node5.node6.node7.node8.node9.node10.node11.node12.node13",
         "pathExpression":"node1.node2.node3.node4.node5.node6.node7.node8.*.node10.node11.node12.node13",
         "consolidationFunc":"avg",
         "startTime":1629999840,
         "stopTime":1630000800,
         "stepTime":60,
         "xFilesFactor":0.000000,
         "values":[
            6043941.000000,
            6044388.000000,
            6044934.000000,
            6045494.000000,
            6045995.000000,
            6046596.000000,
            6047207.000000,
            6047848.000000,
            6048477.000000,
            6049073.000000,
            6049736.000000,
            6050316.000000,
            6050915.000000,
            6051398.000000,
            6051685.000000,
            6052129.000000
         ],
         "requestStartTime":1629999830,
         "requestStopTime":1630000764
      }
   ]
}

I've also compared the data in CH to what graphite-clickhouse returned, and the results seem to match (both the values and the total number of values):

SELECT * FROM graphite.graphite_data WHERE Path = 'node1.node2.node3.node4.node5.node6.node7.node8.node9.node10.node11.node12.node13' AND (Date >='2021-08-26' AND Date <= '2021-08-26') and Time between 1629999830 and 1630000764;

As you can see, the duplication happens at the carbonapi level. Do you have any idea what could be causing this?
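To make the duplication easier to spot, the carbonapi JSON can be grouped by target name; a sketch, assuming jq is available and using the same render query as above:

curl -s 'localhost:8182/render?target=node1.node2.node3.node4.node5.node6.node7.node8.%2A.node10.node11.node12.node13&from=1629999830&until=1630000764&format=json&maxDataPoints=25' \
  | jq '[.[].target] | group_by(.) | map({target: .[0], copies: length})'

For the response above this prints copies: 2 for each of the two names, while graphite-clickhouse returns each name exactly once.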

Civil commented 3 years ago

Can you enable the debug log level in carbonapi and post a full log for the request that duplicates metrics? Ideally from application start (it dumps the config the way it was parsed, etc.).

keliss commented 3 years ago
2021-08-26T22:16:50.181Z        INFO    functionInit    failed to read config file, using default       {"function": "timeShift", "error": "Config File \"config\" Not Found in \"[]\""}
2021-08-26T22:16:50.181Z        WARN    functionInit    timeShift function in graphite-web have a default value for resetEnd set to true.carbonapi currently forces this to be false. This behavior will change in next major release (0.15)to be compatible with graphite-web. Please change your dashboards to explicitly pass resetEnd parameteror create a config file for this function that sets it to false.Please see https://github.com/go-graphite/carbonapi/blob/main/doc/configuration.md#example-for-timeshift     {"function": "timeShift"}
2021-08-26T22:16:50.182Z        DEBUG   functionInit    no config file specified        {"function": "aliasByRedis", "message": "this function requrires config file to work properly"}
2021-08-26T22:16:50.182Z        DEBUG   functionInit    no config file specified        {"function": "aliasByPostgres", "message": "this function requrires config file to work properly"}
2021-08-26T22:16:50.182Z        DEBUG   functionInit    no config file specified        {"function": "graphiteWeb", "message": "this function requrires config file to work properly"}
2021-08-26T22:16:50.183Z        DEBUG   zipper  initializing zipper
2021-08-26T22:16:50.183Z        DEBUG   zipper  creating lb group       {"name": "clickhouse", "servers": ["http://localhost:9090"], "type": "rr"}
2021-08-26T22:16:50.183Z        DEBUG   dns     no caching dns initialized, will return typical DialContext
2021-08-26T22:16:50.183Z        DEBUG   zipper  zipper config   {"config": {"Buckets":10,"SlowLogThreshold":7000000000,"ConcurrencyLimitPerServer":0,"MaxIdleConnsPerHost":1000,"Backends":null,"BackendsV2":{"Backends":[{"GroupName":"clickhouse","Protocol":"carbonapi_v3_pb","LBMethod":"rr","Servers":["http://localhost:9090"],"Timeouts":{"Find":10000000000,"Render":30000000000,"Connect":500000000},"ConcurrencyLimit":null,"KeepAliveInterval":null,"MaxIdleConnsPerHost":null,"MaxTries":2,"MaxBatchSize":null,"BackendOptions":null,"ForceAttemptHTTP2":false,"DoMultipleRequestsIfSplit":false}],"MaxIdleConnsPerHost":0,"ConcurrencyLimitPerServer":0,"Timeouts":{"Find":10000000000,"Render":30000000000,"Connect":500000000},"KeepAliveInterval":0,"MaxTries":0,"MaxBatchSize":100},"MaxBatchSize":100,"FallbackMaxBatchSize":100,"MaxTries":0,"DoMultipleRequestsIfSplit":false,"CarbonSearch":{"Backend":"","Prefix":"virt.v1.*"},"CarbonSearchV2":{"Backends":null,"MaxIdleConnsPerHost":0,"ConcurrencyLimitPerServer":0,"Timeouts":{"Find":0,"Render":0,"Connect":0},"KeepAliveInterval":0,"MaxTries":0,"MaxBatchSize":100,"Prefix":""},"ExpireDelaySec":0,"TLDCacheDisabled":false,"InternalRoutingCache":600000000000,"Timeouts":{"Find":10000000000,"Render":30000000000,"Connect":500000000},"KeepAliveInterval":30000000000,"ScaleToCommonStep":true}}
2021-08-26T22:16:50.183Z        DEBUG   zipper  doing request   {"type": "protoV3Group", "name": "clickhouse", "function": "prober", "request": ["*"]}
2021-08-26T22:16:50.183Z        DEBUG   zipper  trying to get slot      {"type": "protoV3Group", "name": "clickhouse", "type": "find", "request": ["*"], "function": "HttpQuery.doRequest", "server": "http://localhost:9090", "name": "clickhouse", "uri": "http://localhost:9090/metrics/find/?format=carbonapi_v3_pb", "name": "http://localhost:9090"}
2021-08-26T22:16:50.183Z        DEBUG   zipper  got slot for server     {"type": "protoV3Group", "name": "clickhouse", "type": "find", "request": ["*"], "function": "HttpQuery.doRequest", "server": "http://localhost:9090", "name": "clickhouse", "uri": "http://localhost:9090/metrics/find/?format=carbonapi_v3_pb", "name": "http://localhost:9090"}
2021-08-26T22:16:50.204Z        DEBUG   zipper  will return data        {"type": "protoV3Group", "name": "clickhouse", "function": "prober", "tlds": ["carbon", "node", "clickhouse_metrics_1"]}

2021-08-26T22:16:53.316Z        DEBUG   zipper  will try to fetch data  {"type": "broadcastGroup", "groupName": "root", "type": "fetch", "request": ["node1.node2.node3.node4.node5.node6.node7.node8.*.node10.node11.node12.node13"], "carbonapi_uuid": "6e6d9001-0596-4673-82c0-81ebbe132c86"}
2021-08-26T22:16:53.316Z        DEBUG   zipper  single fetch    {"type": "broadcastGroup", "groupName": "root", "type": "fetch", "request": ["node1.node2.node3.node4.node5.node6.node7.node8.*.node10.node11.node12.node13"], "carbonapi_uuid": "6e6d9001-0596-4673-82c0-81ebbe132c86", "client": {}}
2021-08-26T22:16:53.316Z        DEBUG   zipper  trying to get slot      {"type": "protoV3Group", "name": "clickhouse", "type": "find", "request": ["node1.node2.node3.node4.node5.node6.node7.node8.*.node10.node11.node12.node13"], "function": "HttpQuery.doRequest", "server": "http://localhost:9090", "name": "clickhouse", "uri": "http://localhost:9090/metrics/find/?format=carbonapi_v3_pb", "name": "http://localhost:9090"}
2021-08-26T22:16:53.316Z        DEBUG   zipper  got slot for server     {"type": "protoV3Group", "name": "clickhouse", "type": "find", "request": ["node1.node2.node3.node4.node5.node6.node7.node8.*.node10.node11.node12.node13"], "function": "HttpQuery.doRequest", "server": "http://localhost:9090", "name": "clickhouse", "uri": "http://localhost:9090/metrics/find/?format=carbonapi_v3_pb", "name": "http://localhost:9090"}
2021-08-26T22:16:53.324Z        DEBUG   zipper  waiting for slot        {"type": "broadcastGroup", "groupName": "root", "type": "fetch", "request": ["node1.node2.node3.node4.node5.node6.node7.node8.*.node10.node11.node12.node13"], "carbonapi_uuid": "6e6d9001-0596-4673-82c0-81ebbe132c86", "multi_fetch": false, "backend_name": "clickhouse", "max_connections": 0}
2021-08-26T22:16:53.324Z        DEBUG   zipper  got slot        {"type": "broadcastGroup", "groupName": "root", "type": "fetch", "request": ["node1.node2.node3.node4.node5.node6.node7.node8.*.node10.node11.node12.node13"], "carbonapi_uuid": "6e6d9001-0596-4673-82c0-81ebbe132c86", "multi_fetch": false, "backend_name": "clickhouse"}
2021-08-26T22:16:53.324Z        DEBUG   zipper  sending request {"type": "broadcastGroup", "groupName": "root", "type": "fetch", "request": ["node1.node2.node3.node4.node5.node6.node7.node8.*.node10.node11.node12.node13"], "carbonapi_uuid": "6e6d9001-0596-4673-82c0-81ebbe132c86", "multi_fetch": false, "backend_name": "clickhouse"}
2021-08-26T22:16:53.325Z        DEBUG   zipper  trying to get slot      {"type": "protoV3Group", "name": "clickhouse", "type": "fetch", "request": "&MultiFetchRequest{Metrics:[{node1.node2.node3.node4.node5.node6.node7.node8.node9_another.node10.node11.node12.node13 1629999830 1630000764 false node1.node2.node3.node4.node5.node6.node7.node8.*.node10.node11.node12.node13 [] 0} {node1.node2.node3.node4.node5.node6.node7.node8.node9.node10.node11.node12.node13 1629999830 1630000764 false node1.node2.node3.node4.node5.node6.node7.node8.*.node10.node11.node12.node13 [] 0}],}", "function": "HttpQuery.doRequest", "server": "http://localhost:9090", "name": "clickhouse", "uri": "http://localhost:9090/render/?format=carbonapi_v3_pb", "name": "http://localhost:9090"}
2021-08-26T22:16:53.325Z        DEBUG   zipper  got slot for server     {"type": "protoV3Group", "name": "clickhouse", "type": "fetch", "request": "&MultiFetchRequest{Metrics:[{node1.node2.node3.node4.node5.node6.node7.node8.node9_another.node10.node11.node12.node13 1629999830 1630000764 false node1.node2.node3.node4.node5.node6.node7.node8.*.node10.node11.node12.node13 [] 0} {node1.node2.node3.node4.node5.node6.node7.node8.node9.node10.node11.node12.node13 1629999830 1630000764 false node1.node2.node3.node4.node5.node6.node7.node8.*.node10.node11.node12.node13 [] 0}],}", "function": "HttpQuery.doRequest", "server": "http://localhost:9090", "name": "clickhouse", "uri": "http://localhost:9090/render/?format=carbonapi_v3_pb", "name": "http://localhost:9090"}
2021-08-26T22:16:53.344Z        DEBUG   zipper  got response    {"type": "broadcastGroup", "groupName": "root", "type": "fetch", "request": ["node1.node2.node3.node4.node5.node6.node7.node8.*.node10.node11.node12.node13"], "carbonapi_uuid": "6e6d9001-0596-4673-82c0-81ebbe132c86", "multi_fetch": false, "backend_name": "clickhouse"}
2021-08-26T22:16:53.344Z        DEBUG   zipper  got some fetch responses        {"type": "broadcastGroup", "groupName": "root", "type": "fetch", "request": ["node1.node2.node3.node4.node5.node6.node7.node8.*.node10.node11.node12.node13"], "carbonapi_uuid": "6e6d9001-0596-4673-82c0-81ebbe132c86", "backends_count": 1, "response_count": 1, "have_errors": false, "errors": null, "response_count": 4}
2021-08-26T22:16:53.344Z        INFO    access  request served  {"data": {"handler":"render","carbonapi_uuid":"6e6d9001-0596-4673-82c0-81ebbe132c86","url":"/render?target=node1.node2.node3.node4.node5.node6.node7.node8.%2A.node10.node11.node12.node13&from=1629999830&until=1630000764&format=json&maxDataPoints=25","peer_ip":"[","host":"localhost:8182","format":"json","use_cache":true,"targets":["node1.node2.node3.node4.node5.node6.node7.node8.*.node10.node11.node12.node13"],"cache_timeout":60,"metrics":["node1.node2.node3.node4.node5.node6.node7.node8.*.node10.node11.node12.node13"],"runtime":0.028408644,"http_code":200,"carbonzipper_response_size_bytes":1388,"carbonapi_response_size_bytes":2161,"from":1629999830,"until":1630000764,"max_data_points":25,"from_raw":"1629999830","until_raw":"1630000764","uri":"/render?target=node1.node2.node3.node4.node5.node6.node7.node8.%2A.node10.node11.node12.node13&from=1629999830&until=1630000764&format=json&maxDataPoints=25","from_cache":false,"used_backend_cache":false,"request_headers":{}}}
keliss commented 3 years ago

I've tried out different versions and different backend config types (the old-style backends vs the newer backendsv2), here are the configs:

v1:

listen: "0.0.0.0:8182"
concurency: 20
cache:
  type: "mem"
  size_mb: 0
  defaultTimeoutSec: 60
cpus: 0
tz: ""
graphite:
  host: "localhost:2003"
  interval: "60s"
  prefix: "carbon.api"
  pattern: "{prefix}.{fqdn}"
pidFile: ""
sendGlobsAsIs: false
idleConnections: 10
maxBatchSize: 100000
logger:
  - logger: ""
    file: "/var/log/carbonapi/carbonapi-1.log"
    level: "warn"
    encoding: "console"
upstreams:
  buckets: 10
  slowLogThreshold: 7s
  timeouts:
    find: "10s"
    render: "30s"
    connect: "500ms"
  keepAliveInterval: "30s"
  concurrencyLimit: 0
  maxIdleConnsPerHost: 100
  backends:
    - "http://localhost:9090"
  graphite09compat: false
expireDelaySec: 10

v2:

listen: "0.0.0.0:8182"
cache:
  type: "mem"
  size_mb: 0
  defaultTimeoutSec: 60
cpus: 0
tz: ""
graphite:
  host: "localhost:2003"
  interval: "60s"
  prefix: "carbon.api"
  pattern: "{prefix}.{fqdn}"
pidFile: ""
logger:
  - logger: ""
    file: "/var/log/carbonapi/carbonapi-1.log"
    level: "warn"
    encoding: "console"
upstreams:
  buckets: 10
  slowLogThreshold: 7s
  timeouts:
    find: "10s"
    render: "30s"
    connect: "500ms"
  keepAliveInterval: "30s"
  maxIdleConnsPerHost: 1000
  backendsv2:
    backends:
      - groupName: "clickhouse"
        protocol: "carbonapi_v3_pb"
        lbMethod: "rr"
        maxTries: 1
        servers:
          - "http://localhost:9090"

Here are the results for the clickhouse backend:

- 0.12.6-1: the v1 config works, and v2 works too (no duplication). Performance on high-cardinality metrics with the averageSeriesWithWildcards function is good.
- 0.13.0-0: the v1 config works, but performance on the abovementioned type of metrics is really poor; the query times out with a 404 error for some reason. The v2 config is broken: I'm seeing duplication.
- 0.14.1-1: the v1 config is unstable; metrics are often missing from the response.
- 0.15.0-1: the v1 config is broken completely; in addition to the missing responses, there are now partial responses.

As you can see, the latest stable version available as an rpm package is 0.12.6-1. Maybe that gives you a hint. By the way, rpm package uninstallation doesn't work correctly either: it leaves a systemd service running.

Civil commented 3 years ago

Can you test whether the problem is still there with maxBatchSize: 0 in the config? (See https://github.com/go-graphite/carbonapi/blob/main/doc/configuration.md#for-victoriametrics or https://github.com/go-graphite/carbonapi/blob/main/doc/configuration.md#for-graphite-clickhouse for recommendations.) The most tested way to query these backends is for carbonapi to send requests containing globs as-is and let the backend handle glob expansion (mostly because the backend should be able to do a much better job of that than carbonapi, and it saves multiple round trips to the backend). To be honest, I haven't tried leaving it at the default (100, which is meant for go-carbon) for quite some time.
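With the v2 config above that would look roughly like this (a sketch based on the linked recommendations; the rest of the config stays the same):

upstreams:
  backendsv2:
    backends:
      - groupName: "clickhouse"
        protocol: "carbonapi_v3_pb"
        lbMethod: "rr"
        maxTries: 1
        maxBatchSize: 0   # send globs as-is and let the backend expand them
        servers:
          - "http://localhost:9090"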

keliss commented 3 years ago

No difference, still having duplicated results.

Civil commented 3 years ago

I've tried to use your config from https://github.com/go-graphite/carbonapi/issues/646#issuecomment-907396570 (the v2 version), but I cannot reproduce the metrics duplication.

However, I've noticed an odd thing in your logs:

2021-08-26T22:16:53.344Z        DEBUG   zipper  got some fetch responses        {"type": "broadcastGroup", "groupName": "root", "type": "fetch", "request": ["node1.node2.node3.node4.node5.node6.node7.node8.*.node10.node11.node12.node13"], "carbonapi_uuid": "6e6d9001-0596-4673-82c0-81ebbe132c86", "backends_count": 1, "response_count": 1, "have_errors": false, "errors": null, "response_count": 4}

And to be precise, this part:

"response_count": 4

That value is populated here: https://github.com/go-graphite/carbonapi/blob/bb33f26c8466deb9c9134c1085d6fcf77cc02205/zipper/broadcast/broadcast_group.go#L470-L476

as len(result.Response.Metrics)

I'll push some changes to the logging to get rid of the duplicated response_count (the first one is about backends, the second one about metrics), and I'll log the number of metrics in the response in several different places to make it easier to debug.

As for your remark that "it leaves a systemd service running": that is by design. Packages do not stop or restart the service on uninstall, as that is something the user should decide on (including how they want to do it). I will try to remember to add a message stating that the service will be left running.

Also, about "missing responses now partial responses are added": I'm not sure I understand what you mean by missing responses.

Also, if the backend somehow duplicates results, carbonapi won't do any deduplication on its own.
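As an illustration only (a hypothetical stop-gap, not a fix), identical series could be collapsed on the client side, e.g.:

# keep one series per target name in the JSON output
curl -s 'localhost:8182/render?target=node1.node2.node3.node4.node5.node6.node7.node8.%2A.node10.node11.node12.node13&from=1629999830&until=1630000764&format=json' \
  | jq 'unique_by(.target)'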

Civil commented 3 years ago

In the current master I've added some metrics to the debug log level. I hope it will help to better understand what's going on.