grafana / grafana-image-renderer

A Grafana backend plugin that handles rendering of panels & dashboards to PNGs using a headless browser (Chromium/Chrome)

Remote rendering with dockerized container sometimes fails #202

Open · languitar opened this issue 4 years ago

languitar commented 4 years ago

What happened:

We have Grafana 7 running on EKS with the dockerized remote rendering service deployed for rendering alert images. Sometimes, instead of the correct plot, only a default test image is posted to Slack and error messages are emitted by Grafana and the renderer.

What you expected to happen:

Images are always rendered correctly

How to reproduce it (as minimally and precisely as possible):

Unfortunately, I couldn't find a way to reproduce this in a limited test case.

Anything else we need to know?:

The remote rendering container is deployed using the method I have described in this issue comment. Thus, the renderer and Grafana share the same network namespace and can communicate directly via localhost. No caching is involved.

When rendering fails, the remote renderer emits the following log messages:

{"message":"::ffff:127.0.0.1 - - [18/Jun/2020:01:04:34 +0000] \"GET /render?deviceScaleFactor=1.000000&domain=localhost&encoding=&height=500&renderKey=U8TxLVDSM1M7VcWHEQi9WyNcdH8YjAXN&timeout=15&timezone=&url=http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2F09cVU_9Wk%2Fdatabase-consistency-alerting%3ForgId%3D1%26panelId%3D10%26render%3D1&width=1000 HTTP/1.1\" - - \"-\" \"Grafana/7.0.3\"\n","level":"debug"}
{"url":"/render?deviceScaleFactor=1.000000&domain=localhost&encoding=&height=500&renderKey=U8TxLVDSM1M7VcWHEQi9WyNcdH8YjAXN&timeout=15&timezone=&url=http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2F09cVU_9Wk%2Fdatabase-consistency-alerting%3ForgId%3D1%26panelId%3D10%26render%3D1&width=1000","stack":"Error: Request aborted\n    at onaborted (/usr/src/app/node_modules/express/lib/response.js:1025:15)\n    at Immediate._onImmediate (/usr/src/app/node_modules/express/lib/response.js:1067:9)\n    at processImmediate (internal/timers.js:456:21)","level":"error","message":"Request failed"}
TypeError: Cannot read property 'statusCode' of undefined
    at /usr/src/app/build/service/http-server.js:92:46
    at Layer.handle_error (/usr/src/app/node_modules/express/lib/router/layer.js:71:5)
    at trim_prefix (/usr/src/app/node_modules/express/lib/router/index.js:315:13)
    at /usr/src/app/node_modules/express/lib/router/index.js:284:7
    at Function.process_params (/usr/src/app/node_modules/express/lib/router/index.js:335:12)
    at next (/usr/src/app/node_modules/express/lib/router/index.js:275:10)
    at next (/usr/src/app/node_modules/express/lib/router/route.js:127:14)
    at /usr/src/app/build/service/http-server.js:51:21
    at /usr/src/app/node_modules/express/lib/response.js:434:22
    at onaborted (/usr/src/app/node_modules/express/lib/response.js:1027:5)
{"url":"http://localhost:3000/d-solo/QOfuckAZk/api-events?orgId=1&panelId=10&render=1","level":"debug","message":"Render request received"}
{"file":"/tmp/f13be133.png","level":"debug","message":"Deleting temporary file"}

And Grafana reports the following:

t=2020-06-18T11:45:28+0000 lvl=eror msg="Failed to send request to remote rendering service." logger=rendering renderer=http error="Get \"http://localhost:8081/render?deviceScaleFactor=1.000000&domain=localhost&encoding=&height=500&renderKey=2HUHoZbj07yWyaZF1RMt9tqRCz5EbUWI&timeout=15&timezone=&url=http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2FL4WNBvtZz%2Flog-alerting%3ForgId%3D1%26panelId%3D2%26render%3D1&width=1000\": context deadline exceeded"
t=2020-06-18T11:45:28+0000 lvl=eror msg="Failed to render and upload alert panel image." logger=alerting.notifier ruleId=21 error="Failed to send request to remote rendering service. Get \"http://localhost:8081/render?deviceScaleFactor=1.000000&domain=localhost&encoding=&height=500&renderKey=2HUHoZbj07yWyaZF1RMt9tqRCz5EbUWI&timeout=15&timezone=&url=http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2FL4WNBvtZz%2Flog-alerting%3ForgId%3D1%26panelId%3D2%26render%3D1&width=1000\": context deadline exceeded"
t=2020-06-18T11:45:28+0000 lvl=info msg="Executing slack notification" logger=alerting.notifier.slack ruleId=21 notification=devs
t=2020-06-18T11:45:28+0000 lvl=eror msg="Failed to get render key from cache" logger=rendering renderer=http error="cache item not found"
t=2020-06-18T11:45:28+0000 lvl=info msg="Request Completed" logger=context userId=0 orgId=0 uname= method=GET path=/api/alerts/states-for-dashboard status=401 remote_addr=127.0.0.1 time_ms=0 size=32 referer="http://localhost:3000/d-solo/L4WNBvtZz/log-alerting?orgId=1&panelId=2&render=1"

Environment:

torkelo commented 4 years ago

Is it completely random? What graph are you trying to render? How long does the query take?

languitar commented 4 years ago

I couldn't spot a pattern so far :/ It happened for different alerts defined on different dashboards. Queries are usually pretty simple.

languitar commented 4 years ago

Maybe this is of interest: we use Victoria Metrics as the backend. However, this was already the case when we were still on Grafana 6, and rendering worked without any issues there.

languitar commented 4 years ago

I just had a look at the reported error messages and can confirm that various plots on different dashboards are affected:

❯ cat ~/Downloads/log.err | sed 's/.*\(http.*\)/\1/' | cut -d "\\" -f 1 | sort
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2F09cVU_9Wk%2Fdatabase-consistency-alerting%3ForgId%3D1%26panelId%3D10%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2F09cVU_9Wk%2Fdatabase-consistency-alerting%3ForgId%3D1%26panelId%3D10%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2F09cVU_9Wk%2Fdatabase-consistency-alerting%3ForgId%3D1%26panelId%3D10%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2F09cVU_9Wk%2Fdatabase-consistency-alerting%3ForgId%3D1%26panelId%3D10%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2F09cVU_9Wk%2Fdatabase-consistency-alerting%3ForgId%3D1%26panelId%3D11%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2F09cVU_9Wk%2Fdatabase-consistency-alerting%3ForgId%3D1%26panelId%3D11%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2F7bOegv2Zz%2Fairflow%3ForgId%3D1%26panelId%3D4%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2F7bOegv2Zz%2Fairflow%3ForgId%3D1%26panelId%3D4%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2FCkNLSPrZz%2Fgdprlog-service%3ForgId%3D1%26panelId%3D4%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2FCkNLSPrZz%2Fgdprlog-service%3ForgId%3D1%26panelId%3D4%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2FL4WNBvtZz%2Flog-alerting%3ForgId%3D1%26panelId%3D2%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2FL4WNBvtZz%2Flog-alerting%3ForgId%3D1%26panelId%3D2%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2FL4WNBvtZz%2Flog-alerting%3ForgId%3D1%26panelId%3D2%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2FL4WNBvtZz%2Flog-alerting%3ForgId%3D1%26panelId%3D2%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2FL4WNBvtZz%2Flog-alerting%3ForgId%3D1%26panelId%3D2%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2FL4WNBvtZz%2Flog-alerting%3ForgId%3D1%26panelId%3D2%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2FL4WNBvtZz%2Flog-alerting%3ForgId%3D1%26panelId%3D2%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2FL4WNBvtZz%2Flog-alerting%3ForgId%3D1%26panelId%3D2%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2FL4WNBvtZz%2Flog-alerting%3ForgId%3D1%26panelId%3D2%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2FL4WNBvtZz%2Flog-alerting%3ForgId%3D1%26panelId%3D2%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2FL4WNBvtZz%2Flog-alerting%3ForgId%3D1%26panelId%3D2%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2FL4WNBvtZz%2Flog-alerting%3ForgId%3D1%26panelId%3D2%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2FL4WNBvtZz%2Flog-alerting%3ForgId%3D1%26panelId%3D2%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2FL4WNBvtZz%2Flog-alerting%3ForgId%3D1%26panelId%3D2%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2FQOfuckAZk%2Fapi-events%3ForgId%3D1%26panelId%3D10%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2FQOfuckAZk%2Fapi-events%3ForgId%3D1%26panelId%3D10%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2FQOfuckAZk%2Fapi-events%3ForgId%3D1%26panelId%3D10%26render%3D1&width=1000
http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2FQOfuckAZk%2Fapi-events%3ForgId%3D1%26panelId%3D10%26render%3D1&width=1000
arjunnayak commented 4 years ago

We're getting a similar error with Grafana v7 and image-renderer:latest running as Docker containers in a k8s pod. Hitting the image render URL in the browser works, but it fails via curl:

curl --http2 -v -H "Authorization: Bearer <API_KEY>" --silent -o image.png https://grafana.my.url/render/d-solo/4Uf2RVgZzA/overview?orgId=1&refresh=30s&from=1593547285834&to=1593558085834&panelId=3&width=1000&height=500&tz=America%2FLos_Angeles

Resulting logs from the containers:

renderer {"url":"/render?deviceScaleFactor=1.000000&domain=localhost&encoding=&height=400&renderKey=7D19N8pAxkj0a7A5l9u2sxkjVlmR97Ge&timeout=60&timezone=&url=http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2F4Uf2RVgZzA%2Fapi-gateway-overview%3ForgId%3D1%26render%3D1&width=800","stack":"TimeoutError: waiting for function failed: timeout 60000ms exceeded\n    at new WaitTask (/usr/src/app/node_modules/puppeteer/lib/DOMWorld.js:549:28)\n    at DOMWorld.waitForFunction (/usr/src/app/node_modules/puppeteer/lib/DOMWorld.js:454:12)\n    at Frame.waitForFunction (/usr/src/app/node_modules/puppeteer/lib/FrameManager.js:657:28)\n    at Page.waitForFunction (/usr/src/app/node_modules/puppeteer/lib/Page.js:1144:29)\n    at Browser.<anonymous> (/usr/src/app/build/browser/browser.js:174:24)\n    at Generator.next (<anonymous>)\n    at fulfilled (/usr/src/app/build/browser/browser.js:5:58)\n    at runMicrotasks (<anonymous>)\n    at processTicksAndRejections (internal/process/task_queues.js:97:5)","level":"error","message":"Requestfailed"}

grafana t=2020-07-01T00:08:39+0000 lvl=eror msg="Failed to send request to remote rendering service." logger=rendering renderer=http error="Get \"http://localhost:8081/render?deviceScaleFactor=1.000000&domain=localhost&encoding=&height=400&renderKey=bikU7J4IfyIJrPbHsPFDM4zUMKOLY637&timeout=60&timezone=&url=http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2F4Uf2RVgZzA%2Fapi-gateway-overview%3ForgId%3D1%26render%3D1&width=800\": context deadline exceeded"

grafana t=2020-07-01T00:08:39+0000 lvl=eror msg="Rendering failed." logger=context userId=0 orgId=1 uname= error="Failed to send request to remote rendering service. Get \"http://localhost:8081/render?deviceScaleFactor=1.000000&domain=localhost&encoding=&height=400&renderKey=bikU7J4IfyIJrPbHsPFDM4zUMKOLY637&timeout=60&timezone=&url=http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2F4Uf2RVgZzA%2Fapi-gateway-overview%3ForgId%3D1%26render%3D1&width=800\": context deadline exceeded"

grafana t=2020-07-01T00:08:39+0000 lvl=eror msg="Request Completed" logger=context userId=0 orgId=1 uname= method=GET path=/render/d-solo/4Uf2RVgZzA/api-gateway-overview status=500 remote_addr=10.0.14.97 time_ms=62027 size=1723 referer=
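
For context: the TimeoutError in the renderer log above comes from Puppeteer's page.waitForFunction hitting its 60000 ms limit because the dashboard page never satisfies the readiness condition the renderer waits on; Grafana's own HTTP request to the renderer then gives up with "context deadline exceeded". A rough sketch of that kind of wait in TypeScript (the readiness flag and function names are assumptions for illustration, not the renderer's actual source):

import puppeteer from 'puppeteer';

async function renderPanel(url: string, outPath: string, timeoutMs: number): Promise<void> {
  const browser = await puppeteer.launch({ args: ['--no-sandbox'] });
  try {
    const page = await browser.newPage();
    await page.setViewport({ width: 1000, height: 500, deviceScaleFactor: 1 });
    await page.goto(url, { waitUntil: 'networkidle0', timeout: timeoutMs });

    // Wait for the dashboard to report that all panels have finished rendering.
    // If that never happens (slow data source, panel stuck loading), this call
    // rejects with "waiting for function failed: timeout ...ms exceeded".
    await page.waitForFunction(
      () => (window as any).__panelsRendered === true, // hypothetical readiness flag
      { timeout: timeoutMs }
    );

    await page.screenshot({ path: outPath, type: 'png' });
  } finally {
    await browser.close();
  }
}

So anything that keeps the panel from finishing inside the headless browser within the timeout window will surface as the pair of errors shown above.
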
marefr commented 3 years ago

For reference, it seems to fail here, but I don't understand why: https://github.com/grafana/grafana-image-renderer/blob/ce1f81438e5f69c7fd7c73ce08bab624c4c92e25/src/service/http-server.ts#L49
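
The secondary TypeError ("Cannot read property 'statusCode' of undefined") looks like the error handler dereferencing a field that no longer exists once the client has aborted the request. Purely as an illustration of the idea (a sketch, not the actual source), a more defensive Express error handler could guard that access:

import express from 'express';

const app = express();

// Error-handling middleware sketch: when Grafana hits its deadline and aborts
// the request, Express surfaces a "Request aborted" error that may not carry a
// Boom-style output object, so guard the statusCode access instead of assuming
// err.output exists (field names here are assumptions for illustration).
app.use((err: any, req: express.Request, res: express.Response, next: express.NextFunction) => {
  const status = err?.output?.statusCode ?? err?.statusCode ?? 500;
  console.error('Request failed', { url: req.url, message: err?.message });

  if (res.headersSent) {
    // The response was already (partially) sent or the socket is gone; let
    // Express's default handler clean up instead of writing again.
    return next(err);
  }

  res.status(status).json({ error: err?.message ?? 'Internal server error' });
});

That would only hide the secondary crash, though; the underlying problem is still that Grafana aborts the request before the renderer finishes.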

I suggest enabling debug/verbose logging; see https://github.com/grafana/grafana-image-renderer/blob/master/docs/remote_rendering_using_docker.md#environment-variables

If using environment variables, use:

LOG_LEVEL=debug
RENDERING_VERBOSE_LOGGING=true
RENDERING_DUMPIO=true

Alternatively, you could mount a configuration file in the container; for instructions, see https://github.com/grafana/grafana-image-renderer/blob/master/docs/remote_rendering_using_docker.md#configuration-file

AgnesToulet commented 3 years ago

Is this still an issue with the latest version of the plugin? If so, could you provide detailed logs by enabling the options mentioned above?

zebrapurring commented 2 years ago

I'm seeing the same issue on grafana/grafana-image-renderer:latest (image renderer version 3.4.2) deployed via Grafana Helm Chart version 6.29.5.

Environment variables:

HTTP_HOST="0.0.0.0"
HTTP_PORT="8081"
LOG_LEVEL="debug"
RENDERING_ARGS="--no-sandbox,--disable-gpu,--window-size=1280x758"
RENDERING_DUMPIO="true"
RENDERING_MODE="clustered"
RENDERING_VERBOSE_LOGGING="true"

Log output:

{"filePath":"/tmp/10156b63.png","level":"debug","message":"Taking screenshot"}
{"level":"error","message":"Request failed","stack":"Error: Request aborted\n    at onaborted (/usr/src/app/node_modules/express/lib/response.js:1030:15)\n    at Immediate._onImmediate (/usr/src/app/node_modules/express/lib/response.js:1072:9)\n    at processImmediate (internal/timers.js:464:21)","url":"/render?deviceScaleFactor=1.000000&domain=grafana.monitoring&encoding=&height=500&renderKey=<REDACTED>&timeout=15&timezone=&url=<REDACTED>"}
TypeError: Cannot read property 'statusCode' of undefined
    at /usr/src/app/build/service/http-server.js:135:42
    at Layer.handle_error (/usr/src/app/node_modules/express/lib/router/layer.js:71:5)
    at trim_prefix (/usr/src/app/node_modules/express/lib/router/index.js:321:13)
    at /usr/src/app/node_modules/express/lib/router/index.js:284:7
    at Function.process_params (/usr/src/app/node_modules/express/lib/router/index.js:341:12)
    at next (/usr/src/app/node_modules/express/lib/router/index.js:275:10)
    at next (/usr/src/app/node_modules/express/lib/router/route.js:127:14)
    at /usr/src/app/build/service/http-server.js:44:21
    at /usr/src/app/node_modules/express/lib/response.js:439:22
    at onaborted (/usr/src/app/node_modules/express/lib/response.js:1032:5)
Kot-o-pes commented 1 year ago

Same issue with Grafana 8.5.15 and renderer v3.4.2 (running in a separate pod in k8s).