minio / docs

MinIO Object Storage Documentation
https://docs.min.io/minio/baremetal
Creative Commons Attribution 4.0 International

Additional recommended alerts #1135

Open ravindk89 opened 5 months ago

ravindk89 commented 5 months ago

Summary

From an internal discussion, we should expand the alerting page to include the following list of recommended metrics:

| Metric | Description |
| --- | --- |
| minio_node_drive_free_bytes | Total storage available on a drive. |
| minio_node_drive_free_inodes | Total free inodes. |
| minio_node_drive_latency_us | Average last minute latency in µs for drive API storage operations. |
| minio_node_drive_offline_total | Total drives offline in this node. |
| minio_node_drive_online_total | Total drives online in this node. |
| minio_node_drive_total | Total drives in this node. |
| minio_node_drive_total_bytes | Total storage on a drive. |
| minio_node_drive_used_bytes | Total storage used on a drive. |
| minio_node_drive_errors_timeout | Total number of drive timeout errors since server start |
| minio_node_drive_errors_availability | Total number of drive I/O errors, permission denied and timeouts since server start |
| minio_node_drive_io_waiting | Total number I/O operations waiting on drive |

There are a lot of metrics here, and the page already has some examples, so I'm thinking we can use a tab setup, something like:

| Example Alerts | Recommended Alerts |

This would help constrain the default length of the procedure.
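As a rough illustration of how a couple of the metrics listed above could back a recommended alert, the sketch below evaluates two conditions over already-scraped values: low free space per drive (from `minio_node_drive_free_bytes` and `minio_node_drive_total_bytes`) and any offline drives (from `minio_node_drive_offline_total`). The 10% threshold, the `DriveSample` type, and the `evaluate_alerts` function are assumptions made up for this example, not recommendations from the discussion; in a real deployment these conditions would more likely be written as Prometheus alerting rules.

```python
from dataclasses import dataclass


@dataclass
class DriveSample:
    """Hypothetical container for one drive's scraped values."""
    drive: str
    free_bytes: float   # value of minio_node_drive_free_bytes
    total_bytes: float  # value of minio_node_drive_total_bytes


def evaluate_alerts(drives, offline_total, min_free_ratio=0.10):
    """Return a list of human-readable alert strings.

    min_free_ratio is an assumed threshold, not a documented default.
    """
    alerts = []
    if offline_total > 0:
        alerts.append(f"{int(offline_total)} drive(s) offline on this node")
    for d in drives:
        if d.total_bytes <= 0:
            continue  # skip drives that report no capacity
        ratio = d.free_bytes / d.total_bytes
        if ratio < min_free_ratio:
            alerts.append(f"{d.drive}: only {ratio:.1%} free")
    return alerts


if __name__ == "__main__":
    # Made-up values for illustration; only the metric names come from the list.
    healthy = DriveSample("/disk1/data", 3.9e10, 4.3e10)
    nearly_full = DriveSample("/disk2/data", 2.0e9, 4.3e10)
    for line in evaluate_alerts([healthy, nearly_full], offline_total=0):
        print(line)
```

Keeping the threshold as a parameter makes it easy to try different values before settling on one for the docs.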

Goals

List the in-scope goals

Non-Goals

Extensive testing of Prometheus + Alert Manager w/ the above metrics

Additional context Add any other context or screenshots about the feature request here.

ravindk89 commented 4 months ago

@kannappanr some assistance:

curl --retry 10 -L -X GET https://play.min.io/minio/v2/metrics/cluster | grep -E '^minio_[\s a-z _]*_drive'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

minio_cluster_drive_offline_total{server="play.min.io:9000"} 0
minio_cluster_drive_online_total{server="play.min.io:9000"} 4
minio_cluster_drive_total{server="play.min.io:9000"} 4
minio_cluster_health_erasure_set_healing_drives{pool="0",server="play.min.io:9000",set="0"} 0
minio_cluster_health_erasure_set_online_drives{pool="0",server="play.min.io:9000",set="0"} 4

Most of the recommended list as discussed does not appear in cluster metrics.

They do appear for the node endpoint:

minio_node_drive_errors_availability{drive="/disk1/data",server="play.min.io:9000"} 0
minio_node_drive_errors_availability{drive="/disk2/data",server="play.min.io:9000"} 0
minio_node_drive_errors_availability{drive="/disk3/data",server="play.min.io:9000"} 0
minio_node_drive_errors_availability{drive="/disk4/data",server="play.min.io:9000"} 0
minio_node_drive_errors_timeout{drive="/disk1/data",server="play.min.io:9000"} 0
minio_node_drive_errors_timeout{drive="/disk2/data",server="play.min.io:9000"} 0
minio_node_drive_errors_timeout{drive="/disk3/data",server="play.min.io:9000"} 0
minio_node_drive_errors_timeout{drive="/disk4/data",server="play.min.io:9000"} 0
minio_node_drive_free_bytes{drive="/disk1/data",server="play.min.io:9000"} 3.9700221952e+10
minio_node_drive_free_bytes{drive="/disk2/data",server="play.min.io:9000"} 4.0129953792e+10
minio_node_drive_free_bytes{drive="/disk3/data",server="play.min.io:9000"} 4.0129642496e+10
minio_node_drive_free_bytes{drive="/disk4/data",server="play.min.io:9000"} 4.013072384e+10
minio_node_drive_free_inodes{drive="/disk1/data",server="play.min.io:9000"} 2.0950584e+07
minio_node_drive_free_inodes{drive="/disk2/data",server="play.min.io:9000"} 2.0950777e+07
minio_node_drive_free_inodes{drive="/disk3/data",server="play.min.io:9000"} 2.0950777e+07
minio_node_drive_free_inodes{drive="/disk4/data",server="play.min.io:9000"} 2.0950777e+07
minio_node_drive_io_waiting{drive="/disk1/data",server="play.min.io:9000"} 0
minio_node_drive_io_waiting{drive="/disk2/data",server="play.min.io:9000"} 0
minio_node_drive_io_waiting{drive="/disk3/data",server="play.min.io:9000"} 0
minio_node_drive_io_waiting{drive="/disk4/data",server="play.min.io:9000"} 0
minio_node_drive_latency_us{api="storage.CreateFile",drive="/disk1/data",server="play.min.io:9000"} 3600
minio_node_drive_latency_us{api="storage.CreateFile",drive="/disk2/data",server="play.min.io:9000"} 3868
minio_node_drive_latency_us{api="storage.CreateFile",drive="/disk3/data",server="play.min.io:9000"} 3454
minio_node_drive_latency_us{api="storage.CreateFile",drive="/disk4/data",server="play.min.io:9000"} 4263
minio_node_drive_latency_us{api="storage.Delete",drive="/disk1/data",server="play.min.io:9000"} 35
minio_node_drive_latency_us{api="storage.Delete",drive="/disk2/data",server="play.min.io:9000"} 34
minio_node_drive_latency_us{api="storage.Delete",drive="/disk3/data",server="play.min.io:9000"} 32
minio_node_drive_latency_us{api="storage.Delete",drive="/disk4/data",server="play.min.io:9000"} 45
minio_node_drive_latency_us{api="storage.DiskInfo",drive="/disk1/data",server="play.min.io:9000"} 30
minio_node_drive_latency_us{api="storage.DiskInfo",drive="/disk2/data",server="play.min.io:9000"} 38
minio_node_drive_latency_us{api="storage.DiskInfo",drive="/disk3/data",server="play.min.io:9000"} 25
minio_node_drive_latency_us{api="storage.DiskInfo",drive="/disk4/data",server="play.min.io:9000"} 39
minio_node_drive_latency_us{api="storage.ListVols",drive="/disk1/data",server="play.min.io:9000"} 1000
minio_node_drive_latency_us{api="storage.ListVols",drive="/disk2/data",server="play.min.io:9000"} 615
minio_node_drive_latency_us{api="storage.ListVols",drive="/disk3/data",server="play.min.io:9000"} 643
minio_node_drive_latency_us{api="storage.ListVols",drive="/disk4/data",server="play.min.io:9000"} 2280
minio_node_drive_latency_us{api="storage.ReadFileStream",drive="/disk2/data",server="play.min.io:9000"} 58
minio_node_drive_latency_us{api="storage.ReadFileStream",drive="/disk3/data",server="play.min.io:9000"} 64
minio_node_drive_latency_us{api="storage.ReadXL",drive="/disk1/data",server="play.min.io:9000"} 58
minio_node_drive_latency_us{api="storage.ReadXL",drive="/disk2/data",server="play.min.io:9000"} 60
minio_node_drive_latency_us{api="storage.ReadXL",drive="/disk3/data",server="play.min.io:9000"} 49
minio_node_drive_latency_us{api="storage.ReadXL",drive="/disk4/data",server="play.min.io:9000"} 71
minio_node_drive_latency_us{api="storage.RenameData",drive="/disk1/data",server="play.min.io:9000"} 802
minio_node_drive_latency_us{api="storage.RenameData",drive="/disk2/data",server="play.min.io:9000"} 1039
minio_node_drive_latency_us{api="storage.RenameData",drive="/disk3/data",server="play.min.io:9000"} 868
minio_node_drive_latency_us{api="storage.RenameData",drive="/disk4/data",server="play.min.io:9000"} 1075
minio_node_drive_latency_us{api="storage.StatVol",drive="/disk1/data",server="play.min.io:9000"} 41
minio_node_drive_latency_us{api="storage.StatVol",drive="/disk2/data",server="play.min.io:9000"} 60
minio_node_drive_latency_us{api="storage.StatVol",drive="/disk3/data",server="play.min.io:9000"} 20
minio_node_drive_latency_us{api="storage.StatVol",drive="/disk4/data",server="play.min.io:9000"} 33
minio_node_drive_latency_us{api="storage.WalkDir",drive="/disk1/data",server="play.min.io:9000"} 234
minio_node_drive_latency_us{api="storage.WalkDir",drive="/disk2/data",server="play.min.io:9000"} 329
minio_node_drive_latency_us{api="storage.WalkDir",drive="/disk3/data",server="play.min.io:9000"} 465
minio_node_drive_latency_us{api="storage.WalkDir",drive="/disk4/data",server="play.min.io:9000"} 632
minio_node_drive_offline_total{server="play.min.io:9000"} 0
minio_node_drive_online_total{server="play.min.io:9000"} 4
minio_node_drive_total{server="play.min.io:9000"} 4
minio_node_drive_total_bytes{drive="/disk1/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_total_bytes{drive="/disk2/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_total_bytes{drive="/disk3/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_total_bytes{drive="/disk4/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_used_bytes{drive="/disk1/data",server="play.min.io:9000"} 3.228479488e+09
minio_node_drive_used_bytes{drive="/disk2/data",server="play.min.io:9000"} 2.798747648e+09
minio_node_drive_used_bytes{drive="/disk3/data",server="play.min.io:9000"} 2.799058944e+09
minio_node_drive_used_bytes{drive="/disk4/data",server="play.min.io:9000"} 2.7979776e+09

We had previously discussed de-emphasizing the node-level metrics because they should be included in the cluster endpoint as a rollup. Is this a bug? cc @donatello @shtripat, as I think you both have some experience here.
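To make the comparison between the two endpoints repeatable, a small script can diff the metric names in saved output from each. The sketch below is one possible approach, assuming the output of both endpoints has already been saved as text; the function names are hypothetical, and matching a node metric to a cluster metric by swapping the `minio_node_` prefix for `minio_cluster_` is only a heuristic, since a rollup may be published under a different name.

```python
import re

# Matches the metric name at the start of a Prometheus text-format sample line.
_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def metric_names(exposition_text):
    """Collect the distinct metric names in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def node_only_drive_metrics(node_text, cluster_text):
    """List node drive metrics with no same-suffix cluster counterpart."""
    cluster = metric_names(cluster_text)
    missing = []
    for name in sorted(metric_names(node_text)):
        if not name.startswith("minio_node_drive_"):
            continue
        # Heuristic: assume a rollup keeps the same suffix.
        counterpart = name.replace("minio_node_", "minio_cluster_", 1)
        if counterpart not in cluster:
            missing.append(name)
    return missing


if __name__ == "__main__":
    cluster_sample = """\
minio_cluster_drive_offline_total{server="play.min.io:9000"} 0
minio_cluster_drive_online_total{server="play.min.io:9000"} 4
minio_cluster_drive_total{server="play.min.io:9000"} 4
"""
    node_sample = """\
minio_node_drive_io_waiting{drive="/disk1/data",server="play.min.io:9000"} 0
minio_node_drive_offline_total{server="play.min.io:9000"} 0
minio_node_drive_total{server="play.min.io:9000"} 4
"""
    print(node_only_drive_metrics(node_sample, cluster_sample))
```

Any name this reports still needs a manual check against the metrics list before it is called a missing rollup.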

ravindk89 commented 4 months ago

https://github.com/minio/minio/blob/master/docs/metrics/prometheus/list.md#drive-metrics

Basically, very few of these seem to roll up properly.

bh4t commented 3 months ago

@kannappanr can you please assist here?

ravindk89 commented 3 months ago

This might be somewhat resolved with metrics v3, but until we've had enough time for customers to roll past that, we will need to maintain both:

And then fixups to ensure that node-level metrics are rolled up appropriately

allanrogerr commented 3 months ago

On metrics v3, these node metrics do not roll up to any cluster metrics:

Total used inodes on a drive
Total free inodes on a drive
Total inodes available on a drive
Average last minute latency in µs for drive API storage operations
Total timeout errors on a drive
Total availability errors (I/O errors, timeouts) on a drive
Total waiting I/O operations on a drive

Node metric "Total storage available on a drive in bytes" rolls up to cluster metrics:

    Total cluster usable storage capacity in bytes
    Total cluster raw storage capacity in bytes

Node metric "Total storage free on a drive in bytes" rolls up to cluster metrics:

    Total cluster usable storage free in bytes
    Total cluster raw storage free in bytes

Node metric "Total storage used on a drive in bytes" rolls up to cluster metric:

    Total cluster usage in bytes

Node metric "Count of offline drives" rolls up to cluster metric:

    Count of offline drives in the cluster

Node metric "Count of online drives" rolls up to cluster metric:

    Count of online drives in the cluster

Node metric "Count of all drives" rolls up to cluster metric:

    Count of all drives in the cluster

ravindk89 commented 2 months ago

@kannappanr @anjalshireesh was there still progress on addressing the metrics v2 rollups above, or should we just proceed with documenting the node-level ones for now?

Otherwise we can just focus on the cluster rollups that do work and drop the rest until v3 stabilizes.
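For the byte-count and drive-count metrics that are reported to roll up, one way to sanity-check a cluster value is to sum the per-drive node samples by hand. The sketch below does that over saved node endpoint output; the function name and the regular expression are assumptions for illustration, and the result is only a raw sum, which may legitimately differ from a cluster metric that reports usable rather than raw capacity. Summing is not meaningful for non-additive metrics such as `minio_node_drive_latency_us`.

```python
import re
from collections import defaultdict

# name{labels} value  (the label block is optional in the text format).
# Assumes label values contain no "}" character.
_SAMPLE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)")


def sum_by_metric(exposition_text, prefix="minio_node_drive_"):
    """Sum all sample values per metric name for names with the given prefix."""
    totals = defaultdict(float)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match and match.group(1).startswith(prefix):
            totals[match.group(1)] += float(match.group(2))
    return dict(totals)


if __name__ == "__main__":
    # Samples copied from the node endpoint output earlier in this thread.
    node_sample = """\
minio_node_drive_used_bytes{drive="/disk1/data",server="play.min.io:9000"} 3.228479488e+09
minio_node_drive_used_bytes{drive="/disk2/data",server="play.min.io:9000"} 2.798747648e+09
minio_node_drive_used_bytes{drive="/disk3/data",server="play.min.io:9000"} 2.799058944e+09
minio_node_drive_used_bytes{drive="/disk4/data",server="play.min.io:9000"} 2.7979776e+09
"""
    totals = sum_by_metric(node_sample)
    print(totals["minio_node_drive_used_bytes"])
```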

feorlen commented 1 month ago

Re: v2 rollup, a customer reported that these metrics were "missing" after an upgrade because they are now found under minio/v2/metrics/node:

minio_cluster_replication_link_offline_duration_seconds
minio_cluster_replication_link_online
minio_cluster_replication_current_active_workers
minio_cluster_replication_current_link_latency_ms
minio_cluster_replication_recent_backlog_count
minio_cluster_replication_last_minute_queued_count
minio_cluster_replication_credential_errors
minio_cluster_replication_current_transfer_rate
minio_cluster_replication_last_minute_queued_bytes
minio_cluster_replication_max_queued_count

ravindk89 commented 1 month ago

@kannappanr @anjalshireesh are we generally going to leave metrics v2 as-is for now, then, and focus on metrics v3? Our attempt to document the recommended alerts gets flaky because we do not list the /node metrics at all, since historically those have not been recommended for use.

feorlen commented 1 month ago

see also https://github.com/minio/minio/pull/19932