go-graphite / carbonapi

Implementation of graphite API (graphite-web) in golang

[BUG] handling find query takes insanely long time #820

Closed interfan7 closed 5 months ago

interfan7 commented 5 months ago

Describe the bug It seems that a find query takes ~2 minutes to process. According to the logs, this query is processed in a fraction of a second in go-carbon, which responds with almost 61K metrics. It then takes 2 minutes for carbonapi to finish processing, which results in ~2.5K metrics.

CarbonAPI Version 0.16.1

Logs Pasting the whole log might be too long and unnecessary, so I'll paste just the last 2 entries of the log. The log level is debug.

{
  "level": "INFO",
  "timestamp": "2024-02-01T00:54:00.220Z",
  "logger": "access",
  "message": "request served",
  "data": {
    "handler": "find",
    "carbonapi_uuid": "99171551-7bab-4bbf-9dc1-a8253be1da52",
    "url": "/metrics/find?query=a.*b*.c.d.*",
    "peer_ip": "127.0.0.1",
    "peer_port": "39294",
    "host": "localhost:8081",
    "metrics": [
      "a.*b*.c.d.*"
    ],
    "runtime": 122.749326233,
    "http_code": 200,
    "uri": "/metrics/find?query=a.*b*.c.d.*",
    "from_cache": false,
    "used_backend_cache": false,
    "zipper_requests": 1,
    "total_metrics_count": 60948,
    "request_headers": {}
  }
}
{
  "level": "WARN",
  "timestamp": "2024-02-01T00:54:00.220Z",
  "logger": "slow",
  "message": "Slow Request",
  "time": 122.751682618,
  "slowLogThreshold": 1,
  "url": "/metrics/find?query=a.*b*.c.d.*",
  "referer": ""
}

CarbonAPI Configuration: I'll provide it as needed, but it's mostly just the default with the timeout bumped up to 65 secs. Interestingly, processing the request takes 2 minutes, and if I set the timeout to 50 secs the request handling fails after 50 secs (although the log doesn't state it's because of the timeout; the verbosity could be improved in that regard). I'll be happy to paste it if needed.

Simplified query (if applicable)

curl 'http://localhost:8081/metrics/find?query=a.*b*.c.d.*'

Backend metric retention and aggregation schemas Not sure it's relevant. Will provide if required.

Backend response (if possible) There is only a single backend in our case; actually, that go-carbon process runs on the same host too. The go-carbon log seems to indicate this query is processed in a fraction of a second. I'll paste it if needed.

Additional context We have not had this issue with the Python carbon. I'm not sure what carbonapi does, but sorting 61K lines takes a fraction of a second in a Linux shell using the sort command.

Civil commented 5 months ago

The full log might indeed be unnecessary, but the debug log has one important thing: it logs the time of the different stages of the process. So it is currently not clear to me what part of the process took that long (e.g. waiting for another server? Merging results? Something else?)

It would probably also be helpful if you could tell me whether carbonapi consumes a lot of CPU during that time or just sits idle (the latter would be another indication that it is waiting on something).

Of course, ideally a pprof capture during that slow query would be helpful (an SVG export is usually fine).

interfan7 commented 5 months ago

Hi @Civil 🙂,

The cores seem somewhat relaxed to me. This is the top snapshot during the query:

[attached screenshot: top output during the query]

The log I've prepared for the request was extracted by grepping for the query, so related log entries that don't contain a field with the query are not part of it. Still, the extract came to some 25MB due to the printing of the backend response. So I removed the response field from the 4 long entries which contain it, which reduced it to 5KB. Here is the log: curl_65s_success_log_carbonapi_json_gh_no_top_level_response_field.txt

As for pprof, I can try that later.

Civil commented 5 months ago

Thanks, I'll try to have a look later today (but more likely this weekend, and very likely Sunday, Zurich evening).

Though if CPU usage is relatively ok, I suspect that it is an HTTP client problem. I was recently looking at adding alternative clients (fasthttp, for example, as it has an http/2 plugin nowadays). And that is probably one of the differences between go-graphite's and booking's carbonapis - the latter migrated to cURL via CGO as far as I remember (I might be wrong though).

interfan7 commented 5 months ago

@Civil Thanks! Appreciate it.

Grepping the timestamps in the log, you can clearly see 2 considerable "jumps" in the timeline, each about 1 minute long.

Just for clarity and maybe future readers, these are the 2 jumps:

{
  "level": "DEBUG",
  "timestamp": "2024-02-01T00:51:57.764Z",
  "logger": "zipper",
  "message": "got some find responses",
  "type": "find",
  "groupName": "backends",
  "request": [
    "a.*b*.c.d.*"
  ],
  "backends_count": 1,
  "response_count": 1,
  "have_errors": false,
  "errors": null
}
{
  "level": "DEBUG",
  "timestamp": "2024-02-01T00:52:57.957Z",
  "logger": "zipper",
  "message": "fetched response",
  "type": "find",
  "groupName": "root",
  "request": [
    "a.*b*.c.d.*"
  ],
  "group_name": "root",
  "backend_name": "backends"
}

Both log entries originally contained a "response" field. Those fields are very long.

and:

{
  "level": "DEBUG",
  "timestamp": "2024-02-01T00:52:57.988Z",
  "logger": "zipper",
  "message": "got some find responses",
  "type": "find",
  "groupName": "root",
  "request": [
    "a.*b*.c.d.*"
  ],
  "backends_count": 1,
  "response_count": 1,
  "have_errors": false,
  "errors": null
}
{
  "level": "INFO",
  "timestamp": "2024-02-01T00:54:00.220Z",
  "logger": "access",
  "message": "request served",
  "data": {
    "handler": "find",
    "carbonapi_uuid": "99171551-7bab-4bbf-9dc1-a8253be1da52",
    "url": "/metrics/find?query=a.*b*.c.d.*",
    "peer_ip": "127.0.0.1",
    "peer_port": "39294",
    "host": "localhost:8081",
    "metrics": [
      "a.*b*.c.d.*"
    ],
    "runtime": 122.749326233,
    "http_code": 200,
    "uri": "/metrics/find?query=a.*b*.c.d.*",
    "from_cache": false,
    "used_backend_cache": false,
    "zipper_requests": 1,
    "total_metrics_count": 60948,
    "request_headers": {}
  }
}

The 1st of these entries originally had a "response" field. Clearly, the log entries with the message "got some find responses" are followed by a query-related entry timestamped about a minute later.
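
For anyone trying to reproduce this analysis, the gaps can also be located mechanically instead of eyeballing grepped timestamps. A minimal sketch in Go, assuming the log file has one JSON object per line (zap's json encoding) with an ISO-8601 "timestamp" field as in the excerpts above; the 10-second threshold is arbitrary and the program itself is illustrative, not part of carbonapi:

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// Reads JSON log lines on stdin and reports any gap between consecutive
// timestamps that exceeds the threshold.
func main() {
	const threshold = 10 * time.Second
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 64*1024), 64*1024*1024) // some entries are many MB long
	var prev time.Time
	var prevMsg string
	for sc.Scan() {
		var e struct {
			Timestamp time.Time `json:"timestamp"`
			Message   string    `json:"message"`
		}
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil || e.Timestamp.IsZero() {
			continue // skip lines that are not JSON or have no timestamp
		}
		if !prev.IsZero() && e.Timestamp.Sub(prev) > threshold {
			fmt.Printf("%s gap between %q and %q (at %s)\n",
				e.Timestamp.Sub(prev), prevMsg, e.Message, e.Timestamp.Format(time.RFC3339))
		}
		prev, prevMsg = e.Timestamp, e.Message
	}
}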

interfan7 commented 5 months ago

I've captured 2 CPU profiles during the same long curl, using these commands:

curl http://127.0.0.1:8081/debug/pprof/profile > curl_cpu_1.prof
go tool pprof -svg /usr/bin/carbonapi curl_cpu_1.prof > curl_cpu_1.svg
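# Note (addition, not from the original report): /debug/pprof/profile samples
# for 30 seconds by default (net/http/pprof); for a ~2 minute request the
# window can be widened, e.g.:
#   curl 'http://127.0.0.1:8081/debug/pprof/profile?seconds=120' > curl_cpu_long.prof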

[attached profiles: curl_cpu_1, curl_cpu_2]

Civil commented 5 months ago

Interesting... According to the profile it is debug logging :) I think that would be an easy fix then.

interfan7 commented 5 months ago

@Civil

I've changed the carbonapi log level from "debug" to "info". That halved the runtime of the query, but 1 minute 8 seconds is still unexpectedly long.

This is the logger config:

logger:
    - logger: ""
      file: "stderr"
      level: "debug"
      encoding: "console"
      encodingTime: "iso8601"
      encodingDuration: "seconds"
    - logger: ""
      file: "/var/log/carbonapi/carbonapi.log"
      level: "info"
      encoding: "json"

Could that change account for a whole minute of the running time? Even hundreds of thousands of log lines shouldn't be the reason. What about the 1st logger? Why don't I see its output in the console, as it's supposed to go to stderr? Might it still be the reason for the minute-long execution?

interfan7 commented 5 months ago

Oh well it seems so!!!

logger:
    - logger: ""
      file: "stderr"
      level: "info"
      encoding: "console"
      encodingTime: "iso8601"
      encodingDuration: "seconds"
    - logger: ""
      file: "/var/log/carbonapi/carbonapi.log"
      level: "info"
      encoding: "json"

Instantaneous execution!

I think if it's really an easy fix, it should be done, as people may have no idea why large queries are so slow. I went through various sorts of Linux networking tweaking :-)

Civil commented 5 months ago

I meant that according to the profile, it is string concatenation in the debug log call. So the debug log is probably trying to log something that it shouldn't (probably the response or something like that). On the one hand, I never thought that people would use debug in production. But on the other hand, there is no way to force debug logging either per request or for requests matching some criteria... (zap provides a way to do that, but I was always too lazy to actually implement it).
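
For future readers, the usual zap pattern for making such a call cheap is to build the expensive payload only when the debug level is actually enabled, via Check. A minimal sketch; the message string is taken from the logs above, but the response variable and field are illustrative, not carbonapi's actual code:

package main

import (
	"fmt"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

func main() {
	logger, _ := zap.NewProduction() // info level by default, so the guarded branch below is skipped
	defer logger.Sync()

	response := make([]string, 60948) // stand-in for a huge backend find response

	// Check returns nil when the debug level is disabled, so the expensive
	// fmt.Sprintf is never evaluated in non-debug configurations.
	if ce := logger.Check(zapcore.DebugLevel, "got some find responses"); ce != nil {
		ce.Write(zap.String("response", fmt.Sprintf("%+v", response)))
	}
}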

interfan7 commented 5 months ago

@Civil I see, thanks.

The "debug" level for the stderr is the default config packaged in carbonapi. I've set the file log to "debug" too only because I had the issue for too long request handling (1+ min).

Yes, the portion of the log which tracks this find request is about 25MB. The debug print is very big, but "big" is a relative adjective. Why would this amount of data take a minute on such powerful machines? Something doesn't add up to me.

That also means that if users set the log level to debug and have a request with a big response, they'll jeopardize their service.

Civil commented 5 months ago

I've pushed a fix; it should make logging a bit cheaper.