Closed interfan7 closed 5 months ago
The full log might indeed be unnecessary, but the debug log has one important thing: it records the time of the different stages of the process. So it is currently not clear to me which part of the process took that long (e.g. waiting for another server? Merging results? Something else?).
It would probably also be helpful if you could tell me whether carbonapi consumes a lot of CPU during that time or is just sitting idle (idleness is another indication of waiting on something).
Of course, ideally, a pprof capture during that slow query would be helpful (an svg export is usually fine).
Hi @Civil 🙂,
The cores seem somewhat relaxed to me. This is the top snapshot during the query:
The log I've prepared for the request was extracted by grepping for the query, so related entries that don't carry a field with the query are not included. Even so, the log came to about 25MB because the backend response is printed. I've removed that specific response field from the 4 long entries that contain it, which reduced the log to 5KB.
Here is the log:
curl_65s_success_log_carbonapi_json_gh_no_top_level_response_field.txt
As for pprof, I can try that later.
Thanks, I'll try to have a look later today (but more likely this weekend, and very likely Sunday, Zurich evening).
Though if CPU usage is relatively ok, I suspect it is an HTTP client problem. I was recently looking at adding alternative clients (fasthttp, for example, as it has an http/2 plugin nowadays). That is probably one of the differences between go-graphite's and Booking's carbonapis - the latter migrated to cURL via CGO, as far as I remember (I might be wrong, though).
@Civil Thanks! Appreciate it.
Grepping the timestamps in the log, you can clearly see 2 considerable "jumps" in the timeline, each about 1 minute long.
Just for clarity, and maybe for future readers, these are the 2 jumps:
{
"level": "DEBUG",
"timestamp": "2024-02-01T00:51:57.764Z",
"logger": "zipper",
"message": "got some find responses",
"type": "find",
"groupName": "backends",
"request": [
"a.*b*.c.d.*"
],
"backends_count": 1,
"response_count": 1,
"have_errors": false,
"errors": null
}
{
"level": "DEBUG",
"timestamp": "2024-02-01T00:52:57.957Z",
"logger": "zipper",
"message": "fetched response",
"type": "find",
"groupName": "root",
"request": [
"a.*b*.c.d.*"
],
"group_name": "root",
"backend_name": "backends"
}
Both log entries originally contained a "response" field; those fields are very long.
and:
{
"level": "DEBUG",
"timestamp": "2024-02-01T00:52:57.988Z",
"logger": "zipper",
"message": "got some find responses",
"type": "find",
"groupName": "root",
"request": [
"a.*b*.c.d.*"
],
"backends_count": 1,
"response_count": 1,
"have_errors": false,
"errors": null
}
{
"level": "INFO",
"timestamp": "2024-02-01T00:54:00.220Z",
"logger": "access",
"message": "request served",
"data": {
"handler": "find",
"carbonapi_uuid": "99171551-7bab-4bbf-9dc1-a8253be1da52",
"url": "/metrics/find?query=a.*b*.c.d.*",
"peer_ip": "127.0.0.1",
"peer_port": "39294",
"host": "localhost:8081",
"metrics": [
"a.*b*.c.d.*"
],
"runtime": 122.749326233,
"http_code": 200,
"uri": "/metrics/find?query=a.*b*.c.d.*",
"from_cache": false,
"used_backend_cache": false,
"zipper_requests": 1,
"total_metrics_count": 60948,
"request_headers": {}
}
}
The 1st of these entries originally had a "response" field.
Clearly, the log entries with the message "got some find responses" are each followed by a query-related entry timestamped about a minute later.
I've captured 2 CPU profiles during the same long curl, using these commands:
curl http://127.0.0.1:8081/debug/pprof/profile > curl_cpu_1.prof
go tool pprof -svg /usr/bin/carbonapi curl_cpu_1.prof > curl_cpu_1.svg
Interesting... According to the profile it is debug logging :) I think that would be an easy fix then.
@Civil
I've changed the carbonapi log level from "debug" to "info". It has halved the runtime of the query, but 1:08 minutes is still unexpectedly long.
This is the logger config:
logger:
- logger: ""
file: "stderr"
level: "debug"
encoding: "console"
encodingTime: "iso8601"
encodingDuration: "seconds"
- logger: ""
file: "/var/log/carbonapi/carbonapi.log"
level: "info"
encoding: "json"
Would that change alone cut a minute off the running time? Even hundreds of thousands of log lines shouldn't be the reason.
And what about the 1st logger? Why don't I see its output in the console as I'm supposed to (stderr)? Might it still be the reason for the minute-long execution?
Oh well it seems so!!!
logger:
- logger: ""
file: "stderr"
level: "info"
encoding: "console"
encodingTime: "iso8601"
encodingDuration: "seconds"
- logger: ""
file: "/var/log/carbonapi/carbonapi.log"
level: "info"
encoding: "json"
Instantaneous execution!
I think if it's really an easy fix, it should be done, as people may have no idea why long queries are so slow. I went through all sorts of Linux networking tweaking :-)
I meant that, according to the profile, it is string concatenation in a debug log call. So probably a debug log tried to log something it shouldn't have (probably the response, or something like that). On the other hand, I never thought people would use debug in production. Then again, there is no way to force debug logging for a single request, or for requests matching some criteria... (zap provides a way to do that, but I've always been too lazy to actually implement it).
@Civil I see, thanks.
The "debug" level for the stderr logger is the default config packaged with carbonapi.
I set the file log to "debug" too only because I had this issue of overly long request handling (1+ min).
Yes, the portion of the log that tracks this find request is about 25MB. The debug print is very big, but "big" is a relative adjective: why would that amount of data take a minute on such strong machines? Something doesn't add up for me.
It also means that if users set the log level to debug and get a request with a big response, they'll jeopardize their service.
I've pushed a fix, it should make logging a bit cheaper.
Describe the bug It seems that a find query takes ~2 minutes to process. According to the logs, this query is processed in a fraction of a second in go-carbon, which responds with almost 61K metrics. It then takes 2 minutes for carbonapi to finish the processing, which results in ~2.5K metrics.
CarbonAPI Version 0.16.1
Logs Pasting the whole log might be too long and unnecessary. Thus I want to paste just the 2 last entries of the log. The log level is debug.
CarbonAPI Configuration: I'll provide it as needed, but it's mostly just the default with the timeout bumped up to 65 secs. Interestingly, processing the request takes 2 minutes, yet if I set the timeout to 50 secs, the request handling fails after 50 secs (although the log doesn't state it's because of a timeout; the verbosity could be improved in that regard). I'll be happy to paste the config if needed.
Simplified query (if applicable)
Backend metric retention and aggregation schemas Not sure it's relevant. Will provide if required.
Backend response (if possible) Only a single backend in our case. Actually, that go-carbon process runs on the same host too. The logs of go-carbon seem to indicate this query is processed in a fraction of a second. I'll paste them if needed.
Additional context We have not had this issue with the Python carbon. I'm not sure what carbonapi does, but sorting 61K lines takes a fraction of a second in a Linux shell using the sort command.