sashgorokhov / scrapy_prometheus

Exporting scrapy stats as prometheus metrics through pushgateway service
MIT License

Suggestion - allow hosting of Prometheus endpoint to support metric pull workflow #3

Open lukeplausin opened 1 year ago

lukeplausin commented 1 year ago

Hi there,

I've been trying out this module to push Prometheus metrics from Scrapy to a Pushgateway. I also tried this other project: https://github.com/rangertaha/scrapy-prometheus-exporter

There were things I liked and disliked about both approaches.

This project is great because it embraces the concept of Scrapy stats. By wrapping the default Scrapy stats interface, it can ingest any custom metrics without any extra configuration. There is also no duplication of metrics between the default Scrapy stats object and this implementation.

On the downside, this project only supports the push workflow, and metrics only get pushed at the end of the run. This isn't great for watching long-running workflows.

The other project (https://github.com/rangertaha/scrapy-prometheus-exporter) supports the pull workflow, but does not support custom metrics. The implementation is also a bit flaky in my opinion, as all metrics are stored twice within their stats object and copied by a sync function every 10 seconds. It doesn't support Pushgateway either, so some data could be lost at the end of the run.

So, here is what I've done: I've made a lot of changes on my fork, trying to integrate the best parts of both implementations into one single metrics module. It supports both push and pull workflows, has no duplication of data and no sync jobs, exposes a metrics endpoint, and supports custom metrics with no additional config.
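To illustrate the pull side without a sync job, here is a rough sketch (not the code from the fork) that uses only the Python standard library. It assumes a hypothetical `get_stats` callable that returns the live stats dict (in Scrapy this could be something like `crawler.stats.get_stats`), and it renders that dict on every scrape, so nothing is stored twice. The `scrapy_` prefix and the helper names are made up for the example.

```python
import http.client
import re
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_metrics_server(get_stats, port=0):
    """Build an HTTP server that renders get_stats() on every GET /metrics.

    get_stats is a hypothetical callable returning the live stats dict.
    port=0 lets the OS pick a free port.
    """

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            lines = []
            for key, value in sorted(get_stats().items()):
                # Skip non-numeric stats such as start_time or finish_reason.
                if isinstance(value, bool) or not isinstance(value, (int, float)):
                    continue
                # Stats keys like "downloader/request_count" are not valid
                # Prometheus metric names, so replace unsupported characters.
                name = "scrapy_" + re.sub(r"[^a-zA-Z0-9_]", "_", key)
                lines.append(f"{name} {value}")
            body = ("\n".join(lines) + "\n").encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the crawl log free of scrape noise

    return HTTPServer(("127.0.0.1", port), MetricsHandler)


def serve_in_background(server):
    """Run the server in a daemon thread so it does not block the crawl."""
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return thread


def scrape(server):
    """Fetch /metrics from the given server, the way Prometheus would."""
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
    try:
        conn.request("GET", "/metrics")
        return conn.getresponse().read().decode("utf-8")
    finally:
        conn.close()
```

Because the handler reads the stats dict at request time, any custom stat a spider sets shows up on the next scrape with no extra config.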

The only downside compared to the original that I can see is that I had to deactivate the segregation of Prometheus metric registries per spider. I've changed it so that there is only one registry, and all metrics appear on the same page with the spider name as a label. I understand that this is closer to Prometheus best practice in any case.
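As a minimal sketch of what "one registry, spider name as a label" means for the output (again stdlib only, and not the fork's actual code), the function below renders stats from several spiders into a single page in the Prometheus text exposition format. The function names, the `scrapy_` prefix, and the choice to expose everything as a gauge are assumptions made for the example.

```python
import re


def metric_name(stat_key, prefix="scrapy_"):
    """Turn a Scrapy stats key into a valid Prometheus metric name."""
    return prefix + re.sub(r"[^a-zA-Z0-9_]", "_", stat_key)


def escape_label(value):
    """Escape a label value as required by the text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_registry(stats_by_spider):
    """Render {spider_name: stats_dict} as one Prometheus text page."""
    samples = {}  # metric name -> list of (spider, value)
    for spider, stats in stats_by_spider.items():
        for key, value in stats.items():
            # Only numeric stats can be exported as samples.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue
            samples.setdefault(metric_name(key), []).append((spider, value))

    lines = []
    for name in sorted(samples):
        # All samples of one metric are grouped under a single TYPE line;
        # the spider is distinguished by a label, not by a separate registry.
        lines.append(f"# TYPE {name} gauge")
        for spider, value in sorted(samples[name]):
            lines.append(f'{name}{{spider="{escape_label(spider)}"}} {value}')
    return "\n".join(lines) + "\n"
```

With this layout a query can aggregate across spiders or filter on `spider="..."`, which is not possible when each spider has its own registry.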

Let me know what you think and if you would be inclined to accept this as a PR. This version is a bit rough and probably needs a bit more testing and documentation.