Notes on Prometheus-related documentation

lzh2nix commented 1 year ago

Over the past month I have been reading Prometheus-related articles nonstop, so it is worth writing them up:


lzh2nix commented 1 year ago

001 Time to give up on Pushgateway (2023.5.31)

https://prometheus.io/docs/practices/pushing/

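Drawbacks of Pushgateway:

When multiple instances push metrics to one Pushgateway, the Pushgateway becomes a single point of failure, and its performance can become a bottleneck
You can no longer use the up metric to tell whether an instance has gone down
Metrics pushed to the Pushgateway are kept indefinitely, until someone explicitly deletes them
A further consequence of the previous point: when a metric's labels or name change, the old and new versions both remain, so Prometheus ends up with multiple copies of the data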
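Because pushed metrics stay until they are removed, cleanup has to be explicit. A minimal sketch of building the delete request for one metric group, assuming the Pushgateway's `/metrics/job/<job>/<label>/<value>` path layout; the host name, job name, and the helper `build_delete_request` are hypothetical, and the sketch assumes label values contain no `/`:

```python
from urllib.parse import quote
from urllib.request import Request


def build_delete_request(base_url, job, grouping=None):
    """Build (but do not send) a DELETE request for one Pushgateway metric group."""
    path = "/metrics/job/" + quote(job, safe="")
    # Each grouping label becomes a /<name>/<value> pair in the path.
    for name, value in (grouping or {}).items():
        path += "/" + quote(name, safe="") + "/" + quote(value, safe="")
    return Request(base_url.rstrip("/") + path, method="DELETE")


# Hypothetical host and job, for illustration only.
req = build_delete_request(
    "http://pushgateway.example:9091", "nightly_backup", {"instance": "db-1"}
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would actually send it; the sketch stops short of that so it can run without a Pushgateway.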

Alternative: https://github.com/prometheus-community/PushProx


lzh2nix commented 1 year ago

002 Prometheus capacity planning (2023.06.01)

https://prometheus.io/docs/prometheus/latest/storage/

needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample

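where:
retention_time_seconds = how long data is kept before old data is removed (defaults to 15d)
ingested_samples_per_second = rate of ingested samples
bytes_per_sample = average size of one stored sample (the Prometheus storage docs cite 1-2 bytes per sample on average)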
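A worked example of the formula, as a rough sketch. The ingestion rate of 100,000 samples per second is an assumed figure for illustration, and 2 bytes per sample is taken as the upper end of the 1-2 byte average from the Prometheus storage docs:

```python
def needed_disk_space_bytes(retention_days, samples_per_second, bytes_per_sample):
    """Apply needed_disk_space = retention_time_seconds * rate * bytes_per_sample."""
    retention_time_seconds = retention_days * 24 * 60 * 60
    return retention_time_seconds * samples_per_second * bytes_per_sample


# Assumed workload: default 15d retention, 100k samples/s, 2 bytes per sample.
estimate = needed_disk_space_bytes(15, 100_000, 2)
print(f"{estimate / 1e9:.1f} GB")  # prints "259.2 GB"
```

This only covers the stored samples, so it is a starting point for sizing and real deployments should leave headroom on top of it.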


lzh2nix commented 1 year ago

003 The art of alerting (2023.06.01)

https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit

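Most of my energy lately has gone into our alerting system, trying all sorts of ways to cut down alert noise. The link above is a write-up of alerting practices from a Google SRE.

Alerts must be urgent, important, actionable, and real
They must describe something happening on the production system right now (metrics have to be real-time; a metric that only shows up 10 minutes later is unacceptable)
Alert noise must be reduced, otherwise it becomes the boy who cried wolf and the real alerts end up unhandled
Alert on symptoms, not causes (the cause belongs inside the alert, i.e. what triggered this symptom) --> Monitor for your users
The four golden signals (errors, traffic, latency, saturation) + some business-specific alerts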
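One common way to apply the "symptom, not cause" and noise-reduction points is to page only when a user-visible symptom persists across several evaluations, similar in spirit to the `for` clause of a Prometheus alerting rule. A minimal sketch; the function name, the 5% threshold, and the hold count are illustrative assumptions, not values from the linked document:

```python
def should_page(error_ratios, threshold=0.05, hold_evaluations=3):
    """Page only if the most recent `hold_evaluations` error ratios all exceed the threshold."""
    if len(error_ratios) < hold_evaluations:
        return False  # not enough history yet to call the symptom sustained
    recent = error_ratios[-hold_evaluations:]
    return all(ratio > threshold for ratio in recent)


# A single spike is ignored; a sustained error ratio pages.
print(should_page([0.01, 0.20, 0.01]))
print(should_page([0.06, 0.07, 0.08]))
```

The input is the user-facing error ratio (a symptom), so the alert fires regardless of which underlying cause produced the errors.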


lzh2nix commented 1 year ago

004 Introducing Prometheus Agent Mode, an Efficient and Cloud-Native Way for Metric Forwarding(2023.06.03)

https://prometheus.io/blog/2021/11/16/agent/#history-of-the-forwarding-use-case

Some organizations' blogs are absolute treasures. This article was written by one of the Prometheus maintainers, who is also a co-author of Thanos

single reason why the Prometheus project is so successful, it is this: Focusing the monitoring community on what matters.

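Two reasons Prometheus introduced agent mode:

The rise of cloud native has brought many serverless services with very short lifetimes; they may already be gone before Prometheus gets around to scraping them, so the pull model fits poorly here
In cloud-edge scenarios, it is preferable for edge metrics to be written directly to a global Prometheus

This also had to be done without breaking the existing collection model, hence Prometheus agent mode (an agent that implements remote write)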
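The first reason above, short-lived targets vanishing between scrapes, can be illustrated with a little arithmetic. A sketch assuming scrapes happen at a fixed interval starting from some offset; the function name and the numbers are hypothetical:

```python
import math


def is_scraped(start, lifetime, scrape_interval, first_scrape=0.0):
    """Return True if any scrape lands inside the target's lifetime [start, start + lifetime)."""
    # Index of the first scrape at or after the target's start time.
    k = max(0, math.ceil((start - first_scrape) / scrape_interval))
    next_scrape = first_scrape + k * scrape_interval
    return next_scrape < start + lifetime


# A job that lives 5s between two 15s scrapes is never seen.
print(is_scraped(start=1, lifetime=5, scrape_interval=15))
# The same job starting just before a scrape is caught.
print(is_scraped(start=12, lifetime=5, scrape_interval=15))
```

Whenever the lifetime is shorter than the scrape interval, whether the target is ever observed depends purely on timing, which is the gap a push-style forwarding path closes.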


lzh2nix commented 1 year ago

005 Storage (2023.06.05)

https://prometheus.io/docs/prometheus/latest/storage/

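This article gives a rough overview of Prometheus's storage system. Going through it once more also serves as a lead-in to reading the source code.

Conceptually it is no different from other databases (I read DDIA earlier this year, which covers single-node storage systems).

The most recent 2 hours of data are held in memory first and not persisted as a block. To avoid losing that data on a crash, incoming samples are written to the WAL first, so the data can be recovered by replaying the log after a restart.

Because Prometheus is a single-node store, it does not provide high availability by itself; that has to be handled externally. The two recommended approaches are:

Deploy on RAID-backed disks and take regular snapshots as backups
Use remote write to send the data to a larger storage system (keep an eye on the performance overhead)

The detailed format of each file will be covered later: https://github.com/prometheus/prometheus/tree/release-2.44/tsdb/docs/format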
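The write-ahead idea can be sketched in a few lines. This is a toy illustration only: the class `TinyWAL` and its one-JSON-record-per-line format are hypothetical and unrelated to Prometheus's actual WAL segment format.

```python
import json
import os
import tempfile


class TinyWAL:
    """Toy write-ahead log: one JSON record per line (hypothetical format)."""

    def __init__(self, path):
        self.path = path
        self.memory = {}  # series name -> list of (timestamp, value)

    def append(self, series, timestamp, value):
        # 1. Persist the record first, so a crash after this point loses nothing.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps([series, timestamp, value]) + "\n")
            f.flush()
            os.fsync(f.fileno())
        # 2. Only then update the in-memory state.
        self.memory.setdefault(series, []).append((timestamp, value))

    def replay(self):
        # Rebuild the in-memory state from the log after a restart.
        self.memory = {}
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                series, timestamp, value = json.loads(line)
                self.memory.setdefault(series, []).append((timestamp, value))


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "wal.log")
        wal = TinyWAL(path)
        wal.append("up", 1, 1.0)
        wal.append("up", 2, 0.0)
        # Simulate a crash: a fresh object starts with empty memory.
        restarted = TinyWAL(path)
        restarted.replay()
        return restarted.memory


print(demo())
```

The ordering inside `append` is the point: the log write completes before memory is touched, so replay can always reconstruct whatever memory held at the time of the crash.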

