Open lzh2nix opened 1 year ago
https://prometheus.io/docs/practices/pushing/
Some drawbacks of the Pushgateway:
You can no longer rely on the up metric to detect whether an instance is down.
https://prometheus.io/docs/prometheus/latest/storage/
needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample
where:
retention_time_seconds = how long data is kept before old data is removed (defaults to 15d)
ingested_samples_per_second = rate of ingested samples
bytes_per_sample = size of each sample in bytes
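To make the formula above concrete, here is a small worked example. The ingestion rate of 100,000 samples per second is a made-up figure for illustration; the 2 bytes per sample comes from the Prometheus storage docs, which say a sample averages 1-2 bytes. On a running server, the real ingestion rate can be read with a query such as rate(prometheus_tsdb_head_samples_appended_total[1h]).

```python
def needed_disk_space(retention_time_seconds, ingested_samples_per_second, bytes_per_sample):
    """Rough capacity estimate in bytes, following the formula from the storage docs."""
    return retention_time_seconds * ingested_samples_per_second * bytes_per_sample


# Default retention of 15d, expressed in seconds.
RETENTION_SECONDS = 15 * 24 * 60 * 60

# Assumed workload (hypothetical numbers, not taken from the notes above).
SAMPLES_PER_SECOND = 100_000
BYTES_PER_SAMPLE = 2

estimate_bytes = needed_disk_space(RETENTION_SECONDS, SAMPLES_PER_SECOND, BYTES_PER_SAMPLE)
print(f"{estimate_bytes / 1e9:.1f} GB")
```

This is only a first-order estimate; it ignores the WAL and any headroom needed for compaction.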
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
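Lately I have also been spending most of my effort on our alerting system, trying all sorts of ways to cut down alert noise. The document above is a write-up of alerting practices from Google SRE.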
https://prometheus.io/blog/2021/11/16/agent/#history-of-the-forwarding-use-case
Some organizations' blogs are an absolute treasure trove. This article was written by a Prometheus maintainer who is also a co-author of Thanos.
"... single reason why the Prometheus project is so successful, it is this: Focusing the monitoring community on what matters."
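Two reasons Prometheus introduced agent mode:
At the same time, the existing collection model must not be broken, which is how Prometheus agent mode came about (an agent that implements the remote write feature).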
https://prometheus.io/docs/prometheus/latest/storage/
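This page gives an overview of Prometheus's storage architecture. I am going through it once more as a lead-in to reading the source code.
Conceptually it is no different from other databases (I read DDIA at the start of the year, and it covers single-node storage systems).
The most recent 2 hours of data are held in memory first and are not yet persisted. To avoid losing that data after a crash, incoming samples are also written to the WAL (write-ahead log) first, so that after a restart the data can be recovered by replaying the log.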
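The write-ahead idea described above can be sketched in a few lines. This is a toy illustration only: the TinyHead class, the JSON-lines log format, and the file name are all assumptions made for the example and have nothing to do with Prometheus's actual WAL format or code.

```python
import json
import os
import tempfile


class TinyHead:
    """Toy in-memory sample store protected by a write-ahead log (hypothetical)."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.samples = {}  # series name -> list of (timestamp, value)
        self._replay()     # rebuild in-memory state from any existing log
        self._wal = open(wal_path, "a", encoding="utf-8")

    def _replay(self):
        """Re-apply every logged record to memory, in the order it was written."""
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                self.samples.setdefault(rec["series"], []).append((rec["ts"], rec["value"]))

    def append(self, series, ts, value):
        """Write the record to the log and sync it to disk before touching memory."""
        self._wal.write(json.dumps({"series": series, "ts": ts, "value": value}) + "\n")
        self._wal.flush()
        os.fsync(self._wal.fileno())
        self.samples.setdefault(series, []).append((ts, value))

    def close(self):
        self._wal.close()


wal_file = os.path.join(tempfile.mkdtemp(), "wal.log")

head = TinyHead(wal_file)
head.append("up", 1, 1.0)
head.append("up", 2, 0.0)
head.close()  # stand-in for a crash: the in-memory dict is discarded

recovered = TinyHead(wal_file)  # a fresh instance replays the log on startup
print(recovered.samples)
recovered.close()
```

The ordering is the point of the sketch: a sample only becomes visible in memory once its log record is on disk, so anything held in memory at crash time can be rebuilt from the log.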
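Because Prometheus uses single-node storage, it does not provide high availability on its own; that has to be handled externally. The two recommended approaches are:
The detailed format of each file is something I will go through carefully later: https://github.com/prometheus/prometheus/tree/release-2.44/tsdb/docs/format
Over the past month I have been binge-reading Prometheus-related articles, so it is worth organizing them: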
Content