prometheus / client_python

Prometheus instrumentation library for Python applications
Apache License 2.0

Add usage example for Celery #902

Open · Radiergummi opened this issue 1 year ago

Radiergummi commented 1 year ago

I recently wanted to collect custom application metrics from a rather large Celery app. I tried several approaches with varying degrees of awkwardness, until finally settling on a "sidecar" HTTP server process for all Celery nodes.
This works pretty well, but took me a while to get right. To spare others from wasting as much time as I did, I would like to propose adding a section on getting Celery metrics set up with the Prometheus client. I wrote up the solution on StackOverflow: https://stackoverflow.com/a/75799358/2532203

If you're interested, I can properly summarise it and open a PR?

Also, if you look at the code, I needed to copy whole swaths of the library because the built-in Prometheus WSGI server is hard-coded to daemonise its thread; this causes trouble when running inside an already-daemonised process. Maybe this could be made optional with another keyword argument that defaults to True?
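
For context, a minimal sketch of what such a sidecar can look like (function name and port are illustrative, and it assumes PROMETHEUS_MULTIPROC_DIR is set for all Celery processes): it serves the multiprocess registry from a plain wsgiref server instead of the built-in start_http_server, side-stepping the daemon-thread problem.

# Minimal sidecar sketch; names and port are illustrative.
# Assumes PROMETHEUS_MULTIPROC_DIR points at a directory writable
# by all Celery processes, as required by multiprocess mode.
from wsgiref.simple_server import make_server

from prometheus_client import CollectorRegistry, make_wsgi_app, multiprocess


def serve_metrics(port=9090):
    # Aggregate the per-process metric files into one registry.
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    app = make_wsgi_app(registry)
    # Blocks in the current (non-daemonised) process until terminated.
    make_server("", port, app).serve_forever()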

NicoCaldo commented 1 year ago

Thanks a lot for this, it is exactly what I was looking for.

Should it work with Celery in Django as well?

Radiergummi commented 1 year ago

@NicoCaldo That's exactly my setup. It works fine because the metrics HTTP server doesn't rely on Django, so there's nothing additional to configure.

Andrew-Cha commented 1 year ago

I ran into this myself. While your solution looks correct, I found this to also work:

# Requires PROMETHEUS_MULTIPROC_DIR to point at a writable directory
# shared by all Celery processes (multiprocess mode).
from celery import signals
from prometheus_client import CollectorRegistry, multiprocess, start_http_server

# Whatever hook you want; you can do this outside of a hook too.
@signals.celeryd_init.connect
def name_this_what_you_want(sender=None, conf=None, **kwargs):
    # Aggregate metrics from every pool process into one registry...
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)

    # ...and expose them from a single HTTP endpoint on the worker host.
    start_http_server(8000, registry=registry)

This way the processes in the celery pool can send custom metrics (as in, non-host metrics) back to the one and only http_server. This avoids using a pushgateway.
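
To make that concrete, a hypothetical task metric might look like this (metric, task, and broker names are illustrative, not from this thread); counters incremented inside the pool processes land in the multiprocess directory and get aggregated by the collector above:

from celery import Celery
from prometheus_client import Counter

# The broker URL and app name are placeholders.
app = Celery("tasks", broker="redis://localhost:6379/0")

# Incremented from each pool process; MultiProcessCollector sums it.
TASKS_PROCESSED = Counter(
    "tasks_processed_total", "Tasks processed, by name", ["task_name"]
)


@app.task
def process_item(item_id):
    TASKS_PROCESSED.labels(task_name="process_item").inc()
    return item_id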

Radiergummi commented 1 year ago

> This way the processes in the celery pool can send custom metrics (as in, non-host metrics) back to the one and only http_server. This avoids using a pushgateway.

You wouldn't need a pushgateway with my solution either, but your approach uses way less code, so that definitely is an improvement!

Edit: looking back, I actually tried it this way and ran into multiprocessing issues, although I'm no longer quite sure which. Have you used this code in production, with multiple workers and multiple nodes?

Andrew-Cha commented 1 year ago

> > This way the processes in the celery pool can send custom metrics (as in, non-host metrics) back to the one and only http_server. This avoids using a pushgateway.
>
> You wouldn't need a pushgateway with my solution either, but your approach uses way less code, so that definitely is an improvement!
>
> Edit: looking back, I actually tried it this way and ran into multiprocessing issues, although I'm no longer quite sure which. Have you used this code in production, with multiple workers and multiple nodes?

Not quite extensively, though so far I haven't run into issues. You could be right; I will write back if I ever run into any.

Andrew-Cha commented 11 months ago

Reporting back for those who are interested: the code I provided has not caused any issues so far. Metric collection works fine, and 32 concurrent cores writing to Prometheus through my setup have proven not to crash. I should note that we do multithreading within a worker, and it works.

Andrew-Cha commented 11 months ago

@Radiergummi I highly recommend adding an example that shows your methodology and mine. It would have saved me a day of headaches. Feel free to @ me.

Folyd commented 8 months ago

> I ran into this myself. While your solution looks correct, I found this to also work:
>
>     # Whatever hook you want, you can do this outside of a hook too.
>     @signals.celeryd_init.connect
>     def name_this_what_you_want(sender=None, conf=None, **kwargs):
>         registry = CollectorRegistry()
>         multiprocess.MultiProcessCollector(registry)
>
>         start_http_server(8000, registry=registry)
>
> This way the processes in the celery pool can send custom metrics (as in, non-host metrics) back to the one and only http_server. This avoids using a pushgateway.

An error occurred for me when Celery received a new task and started processing it; I'm not sure why:

objc[7678]: +[NSCharacterSet initialize] may have been in progress in another thread when fork() was called.
objc[7678]: +[NSCharacterSet initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
[2024-02-22 16:02:08,420: ERROR/MainProcess] Process 'ForkPoolWorker-8' pid:7678 exited with 'signal 6 (SIGABRT)'
[2024-02-22 16:02:08,434: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 6 (SIGABRT) Job: 0.')
Traceback (most recent call last):
  File "/Users/folyd/UTA/chatbot/chatbot/.venv/lib/python3.11/site-packages/billiard/pool.py", line 1264, in mark_as_worker_lost
    raise WorkerLostError(
billiard.einfo.ExceptionWithTraceback: 
"""
Traceback (most recent call last):
  File "/Users/folyd/UTA/chatbot/chatbot/.venv/lib/python3.11/site-packages/billiard/pool.py", line 1264, in mark_as_worker_lost
    raise WorkerLostError(
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 6 (SIGABRT) Job: 0.

Radiergummi commented 7 months ago

@Folyd That looks unrelated to the metrics code, at least at first glance. Have you tried my original code from StackOverflow, too? The fork issue vaguely looks similar to my issue back then.
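
For anyone hitting the same abort: that objc message is macOS's Objective-C fork-safety check firing in the forked pool worker, not anything in the metrics code. Two common workarounds, offered as suggestions rather than something verified in this thread (the app name "proj" is a placeholder):

# macOS-only workarounds (suggestions, not verified in this thread):
OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES celery -A proj worker --loglevel=INFO
# or avoid fork() entirely by using a non-forking pool:
celery -A proj worker --pool=threads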

deathwebo commented 2 weeks ago

Sorry for reviving a very old thread, but this thread and your StackOverflow answer are the only examples I could find covering this issue, @Radiergummi. I have one question: the documentation mentions some limitations when putting the client library into multiprocess mode:

> To handle this the client library can be put in multiprocess mode. This comes with a number of limitations:
>
> - Registries can not be used as normal, all instantiated metrics are exported
>   - Registering metrics to a registry later used by a MultiProcessCollector may cause duplicate metrics to be exported
> - Custom collectors do not work (e.g. cpu and memory metrics)
> - Info and Enum metrics do not work
> - The pushgateway cannot be used
> - Gauges cannot use the pid label
> - Exemplars are not supported
> - Remove and Clear of labels are currently not supported in multiprocess mode.

Have you run into these issues yourself with your code? Thanks again for your code!

Radiergummi commented 2 weeks ago

> Have you run into these issues yourself with your code? Thanks again for your code!

@deathwebo it's been running smoothly in production since then, and we haven't encountered any obvious issues (yet). The limitations stated in the documentation sound like a lot, but they didn't actually impact our use case. You'll have to evaluate this for your application yourself. Assuming you register metrics to the default registry, don't try to measure system resource usage in multiple threads (just use node exporter for that), don't create fancy custom metrics, and use gauges instead of info metrics, you should be fine.

All those limitations really stem from the fact that you want multiple processes to contribute to a single counter. If you can picture that in your head and account for the consequences, you're golden.

Let me know if you hit any specific roadblocks!
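
One concrete example of those limitations (the metric name is mine, not from this thread): in multiprocess mode a Gauge has to declare how per-process values get merged, via the client library's multiprocess_mode argument:

from prometheus_client import Gauge

# 'livesum' sums the values reported by processes that are still alive;
# other modes include 'all' (the default, one series per pid), 'liveall',
# 'min', 'max', and 'sum'.
IN_PROGRESS = Gauge(
    "tasks_in_progress",  # hypothetical metric name
    "Tasks currently being worked on",
    multiprocess_mode="livesum",
)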

deathwebo commented 1 week ago

> > Have you run into these issues yourself with your code? Thanks again for your code!
>
> @deathwebo it's been running smoothly in production since then, and we haven't encountered any obvious issues (yet). The limitations stated in the documentation sound like a lot, but they didn't actually impact our use case. You'll have to evaluate this for your application yourself. Assuming you register metrics to the default registry, don't try to measure system resource usage in multiple threads (just use node exporter for that), don't create fancy custom metrics, and use gauges instead of info metrics, you should be fine.
>
> All those limitations really stem from the fact that you want multiple processes to contribute to a single counter. If you can picture that in your head and account for the consequences, you're golden.
>
> Let me know if you hit any specific roadblocks!

Nice! I just set it up for prod, where we run multiple nodes and workers, and it's all starting to flow into Prometheus. Thanks again for this thread and the information.