knyar / nginx-lua-prometheus

Prometheus metric library for Nginx written in Lua

initializing metrics in init_worker_by_lua noticeably blocks worker initialization #107

Closed: chet closed this issue 3 years ago

chet commented 4 years ago

Hey!

We've been troubleshooting some intermittent request latencies. After adding some timing, we ultimately tracked them down to our move away from the deprecated usage of initializing metrics in init_by_lua_block and over to init_worker_by_lua_block.

As we store more and more metrics, the init call within init_worker_by_lua_block becomes slower and slower, causing north of 5 seconds of request-blocking delay whenever a new nginx worker is spawned. In a highly dynamic environment (which nginx can only be so good at), this unfortunately happens pretty frequently, and customers see it as ~5 second delays in making connections.

Moving our require("prometheus").init("prometheus_metrics") line back to init_by_lua_block has fixed the issue, but since that usage is deprecated, I'm concerned it will come back to bite us later and we'll have to move it back to a worker-blocking initialization call (which isn't really going to work for us at this point).
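
To make this concrete, here's roughly what the relevant part of our config looks like, trimmed down (the shared dict size and the counter definition are just illustrative):

```nginx
lua_shared_dict prometheus_metrics 10M;

# The recommended pattern, which is what made worker start-up slow for us
# once we had accumulated a large number of metrics:
init_worker_by_lua_block {
  prometheus = require("prometheus").init("prometheus_metrics")
  metric_requests = prometheus:counter(
    "nginx_http_requests_total", "Number of HTTP requests", {"host", "status"})
  -- ...many more metric definitions...
}
```

The deprecated pattern we moved back to is identical, except the same code sits in init_by_lua_block instead, so the registration work runs once when nginx loads the config rather than in every worker.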

FWIW, I'm mostly just curious:

  1. Were you aware of this? There's a good chance it doesn't come up a lot; not judging at all.
  2. Is it okay if we keep it in init_by_lua_block?
  3. Are there any perf improvements you can think of off the top of your head?

But, above all else, I mainly just wanted you to be aware it's a thing. We love this module and how easy it is to use, but I don't think we'll be using nginx for much longer at this point; it's been years of trying to hack around its shortcomings in dynamic environments, and it's exhausting!

knyar commented 4 years ago

Thanks for the report! I have not heard about this library slowing down worker start-up before.

I assume what you mean by 'dynamic environment' is that you regularly change nginx configuration and reload the process by sending the HUP signal, which starts a bunch of new workers.

The only operation that I believe could cause delays when new workers start while a running nginx process has already accumulated a very large number of metrics is syncing the per-worker metric name cache (implemented in 9ec4a6259ec78dcbdfe1ca36272135978eb80c56). But I am not sure how moving init back to init_by_lua_block would fix that.

To answer your questions:

  1. It should be fine to do most initialization in init_by_lua_block and only call init_worker from init_worker_by_lua_block (see the sketch after this list). This should still work now, but I don't think we have any tests to validate this behaviour, so I guess it could break in the future.
  2. Can you try running the library with the metric name cache disabled (just reverting 9ec4a6259ec78dcbdfe1ca36272135978eb80c56 should be sufficient) and see if this has any impact on observed performance?
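
For (1), something like the sketch below is what I have in mind. I have not tested this recently, and I'm writing init_worker as a method on the prometheus object without double-checking the exact entry point, so treat it as a rough illustration rather than a supported configuration (the dict size and the example counter are placeholders):

```nginx
lua_shared_dict prometheus_metrics 10M;

init_by_lua_block {
  -- Register the library and all metrics once, while nginx loads the config,
  -- so the (potentially slow) registration is not repeated for every worker.
  prometheus = require("prometheus").init("prometheus_metrics")
  metric_requests = prometheus:counter(
    "nginx_http_requests_total", "Number of HTTP requests", {"host", "status"})
}

init_worker_by_lua_block {
  -- Only the per-worker setup runs here, which should stay cheap
  -- regardless of how many metrics have been registered.
  prometheus:init_worker()
}
```
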
knyar commented 3 years ago

Hey @chet, do you happen to be in a position to provide more details about this issue? I would like to understand whether there are any changes (to code or documentation) I should make.

knyar commented 3 years ago

Seeing no response from @chet, I don't think this is actionable at this point, so I will have to close this issue. Please feel free to reopen if you can provide more details.