Open ravindk89 opened 5 months ago
@kannappanr some assistance:
curl --retry 10 -L -X GET https://play.min.io/minio/v2/metrics/cluster | grep -E '^minio_[\s a-z _]*_drive'
minio_cluster_drive_offline_total{server="play.min.io:9000"} 0
minio_cluster_drive_online_total{server="play.min.io:9000"} 4
minio_cluster_drive_total{server="play.min.io:9000"} 4
minio_cluster_health_erasure_set_healing_drives{pool="0",server="play.min.io:9000",set="0"} 0
minio_cluster_health_erasure_set_online_drives{pool="0",server="play.min.io:9000",set="0"} 4
Most of the recommended metrics we discussed do not appear in the cluster metrics output.
They do appear on the node endpoint (minio/v2/metrics/node):
minio_node_drive_errors_availability{drive="/disk1/data",server="play.min.io:9000"} 0
minio_node_drive_errors_availability{drive="/disk2/data",server="play.min.io:9000"} 0
minio_node_drive_errors_availability{drive="/disk3/data",server="play.min.io:9000"} 0
minio_node_drive_errors_availability{drive="/disk4/data",server="play.min.io:9000"} 0
minio_node_drive_errors_timeout{drive="/disk1/data",server="play.min.io:9000"} 0
minio_node_drive_errors_timeout{drive="/disk2/data",server="play.min.io:9000"} 0
minio_node_drive_errors_timeout{drive="/disk3/data",server="play.min.io:9000"} 0
minio_node_drive_errors_timeout{drive="/disk4/data",server="play.min.io:9000"} 0
minio_node_drive_free_bytes{drive="/disk1/data",server="play.min.io:9000"} 3.9700221952e+10
minio_node_drive_free_bytes{drive="/disk2/data",server="play.min.io:9000"} 4.0129953792e+10
minio_node_drive_free_bytes{drive="/disk3/data",server="play.min.io:9000"} 4.0129642496e+10
minio_node_drive_free_bytes{drive="/disk4/data",server="play.min.io:9000"} 4.013072384e+10
minio_node_drive_free_inodes{drive="/disk1/data",server="play.min.io:9000"} 2.0950584e+07
minio_node_drive_free_inodes{drive="/disk2/data",server="play.min.io:9000"} 2.0950777e+07
minio_node_drive_free_inodes{drive="/disk3/data",server="play.min.io:9000"} 2.0950777e+07
minio_node_drive_free_inodes{drive="/disk4/data",server="play.min.io:9000"} 2.0950777e+07
minio_node_drive_io_waiting{drive="/disk1/data",server="play.min.io:9000"} 0
minio_node_drive_io_waiting{drive="/disk2/data",server="play.min.io:9000"} 0
minio_node_drive_io_waiting{drive="/disk3/data",server="play.min.io:9000"} 0
minio_node_drive_io_waiting{drive="/disk4/data",server="play.min.io:9000"} 0
minio_node_drive_latency_us{api="storage.CreateFile",drive="/disk1/data",server="play.min.io:9000"} 3600
minio_node_drive_latency_us{api="storage.CreateFile",drive="/disk2/data",server="play.min.io:9000"} 3868
minio_node_drive_latency_us{api="storage.CreateFile",drive="/disk3/data",server="play.min.io:9000"} 3454
minio_node_drive_latency_us{api="storage.CreateFile",drive="/disk4/data",server="play.min.io:9000"} 4263
minio_node_drive_latency_us{api="storage.Delete",drive="/disk1/data",server="play.min.io:9000"} 35
minio_node_drive_latency_us{api="storage.Delete",drive="/disk2/data",server="play.min.io:9000"} 34
minio_node_drive_latency_us{api="storage.Delete",drive="/disk3/data",server="play.min.io:9000"} 32
minio_node_drive_latency_us{api="storage.Delete",drive="/disk4/data",server="play.min.io:9000"} 45
minio_node_drive_latency_us{api="storage.DiskInfo",drive="/disk1/data",server="play.min.io:9000"} 30
minio_node_drive_latency_us{api="storage.DiskInfo",drive="/disk2/data",server="play.min.io:9000"} 38
minio_node_drive_latency_us{api="storage.DiskInfo",drive="/disk3/data",server="play.min.io:9000"} 25
minio_node_drive_latency_us{api="storage.DiskInfo",drive="/disk4/data",server="play.min.io:9000"} 39
minio_node_drive_latency_us{api="storage.ListVols",drive="/disk1/data",server="play.min.io:9000"} 1000
minio_node_drive_latency_us{api="storage.ListVols",drive="/disk2/data",server="play.min.io:9000"} 615
minio_node_drive_latency_us{api="storage.ListVols",drive="/disk3/data",server="play.min.io:9000"} 643
minio_node_drive_latency_us{api="storage.ListVols",drive="/disk4/data",server="play.min.io:9000"} 2280
minio_node_drive_latency_us{api="storage.ReadFileStream",drive="/disk2/data",server="play.min.io:9000"} 58
minio_node_drive_latency_us{api="storage.ReadFileStream",drive="/disk3/data",server="play.min.io:9000"} 64
minio_node_drive_latency_us{api="storage.ReadXL",drive="/disk1/data",server="play.min.io:9000"} 58
minio_node_drive_latency_us{api="storage.ReadXL",drive="/disk2/data",server="play.min.io:9000"} 60
minio_node_drive_latency_us{api="storage.ReadXL",drive="/disk3/data",server="play.min.io:9000"} 49
minio_node_drive_latency_us{api="storage.ReadXL",drive="/disk4/data",server="play.min.io:9000"} 71
minio_node_drive_latency_us{api="storage.RenameData",drive="/disk1/data",server="play.min.io:9000"} 802
minio_node_drive_latency_us{api="storage.RenameData",drive="/disk2/data",server="play.min.io:9000"} 1039
minio_node_drive_latency_us{api="storage.RenameData",drive="/disk3/data",server="play.min.io:9000"} 868
minio_node_drive_latency_us{api="storage.RenameData",drive="/disk4/data",server="play.min.io:9000"} 1075
minio_node_drive_latency_us{api="storage.StatVol",drive="/disk1/data",server="play.min.io:9000"} 41
minio_node_drive_latency_us{api="storage.StatVol",drive="/disk2/data",server="play.min.io:9000"} 60
minio_node_drive_latency_us{api="storage.StatVol",drive="/disk3/data",server="play.min.io:9000"} 20
minio_node_drive_latency_us{api="storage.StatVol",drive="/disk4/data",server="play.min.io:9000"} 33
minio_node_drive_latency_us{api="storage.WalkDir",drive="/disk1/data",server="play.min.io:9000"} 234
minio_node_drive_latency_us{api="storage.WalkDir",drive="/disk2/data",server="play.min.io:9000"} 329
minio_node_drive_latency_us{api="storage.WalkDir",drive="/disk3/data",server="play.min.io:9000"} 465
minio_node_drive_latency_us{api="storage.WalkDir",drive="/disk4/data",server="play.min.io:9000"} 632
minio_node_drive_offline_total{server="play.min.io:9000"} 0
minio_node_drive_online_total{server="play.min.io:9000"} 4
minio_node_drive_total{server="play.min.io:9000"} 4
minio_node_drive_total_bytes{drive="/disk1/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_total_bytes{drive="/disk2/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_total_bytes{drive="/disk3/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_total_bytes{drive="/disk4/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_used_bytes{drive="/disk1/data",server="play.min.io:9000"} 3.228479488e+09
minio_node_drive_used_bytes{drive="/disk2/data",server="play.min.io:9000"} 2.798747648e+09
minio_node_drive_used_bytes{drive="/disk3/data",server="play.min.io:9000"} 2.799058944e+09
minio_node_drive_used_bytes{drive="/disk4/data",server="play.min.io:9000"} 2.7979776e+09
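To make the gap concrete, here is a minimal Python sketch that compares the two scrapes by metric name. It assumes the output of both endpoints has already been saved as text; the helper names (`metric_names`, `node_drive_metrics_without_rollup`) are made up for illustration and are not part of MinIO. It only checks whether a `minio_node_drive_*` metric has an identically named `minio_cluster_*` counterpart, so it will not detect rollups that are published under a different name.

```python
import re

# Prefixes used by the v2 metrics shown in this thread.
NODE_PREFIX = "minio_node_"
CLUSTER_PREFIX = "minio_cluster_"


def metric_names(exposition_text):
    """Return the set of metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = re.match(r"[a-zA-Z_:][a-zA-Z0-9_:]*", line)
        if match:
            names.add(match.group(0))
    return names


def node_drive_metrics_without_rollup(node_text, cluster_text):
    """List node drive metrics that have no same-named cluster metric."""
    node = {
        name[len(NODE_PREFIX):]
        for name in metric_names(node_text)
        if name.startswith(NODE_PREFIX + "drive_")
    }
    cluster = {
        name[len(CLUSTER_PREFIX):]
        for name in metric_names(cluster_text)
        if name.startswith(CLUSTER_PREFIX)
    }
    return sorted(NODE_PREFIX + suffix for suffix in node - cluster)


# A few lines copied from the output above, standing in for full scrapes.
node_sample = """\
minio_node_drive_errors_timeout{drive="/disk1/data",server="play.min.io:9000"} 0
minio_node_drive_free_bytes{drive="/disk1/data",server="play.min.io:9000"} 3.9700221952e+10
minio_node_drive_online_total{server="play.min.io:9000"} 4
"""
cluster_sample = """\
minio_cluster_drive_online_total{server="play.min.io:9000"} 4
"""

print(node_drive_metrics_without_rollup(node_sample, cluster_sample))
# ['minio_node_drive_errors_timeout', 'minio_node_drive_free_bytes']
```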
We had previously discussed de-emphasizing the node-level metrics because they should be included in the cluster endpoint as a rollup - is this a bug? cc/ @donatello @shtripat as I think you both have some experience here
https://github.com/minio/minio/blob/master/docs/metrics/prometheus/list.md#drive-metrics
basically very few of these seem to roll up properly
@kannappanr can you please assist here?
This might be somewhat resolved with metrics v3, but until we've had enough time for customers to roll past that, we will need to maintain both:
And then fixups to ensure that node-level metrics are rolled up appropriately
On metrics v3: These node metrics do not roll up to any cluster metrics:
Total used inodes on a drive
Total free inodes on a drive
Total inodes available on a drive
Average last minute latency in µs for drive API storage operations
Total timeout errors on a drive
Total availability errors (I/O errors, timeouts) on a drive
Total waiting I/O operations on a drive
These node metrics do roll up to cluster metrics:
Node metric: Total storage available on a drive in bytes
  rolls up to cluster metrics:
    Total cluster usable storage capacity in bytes
    Total cluster raw storage capacity in bytes
Node metric: Total storage free on a drive in bytes
  rolls up to cluster metrics:
    Total cluster usable storage free in bytes
    Total cluster raw storage free in bytes
Node metric: Total storage used on a drive in bytes
  rolls up to cluster metric:
    Total cluster usage in bytes
Node metric: Count of offline drives
  rolls up to cluster metric:
    Count of offline drives in the cluster
Node metric: Count of online drives
  rolls up to cluster metric:
    Count of online drives in the cluster
Node metric: Count of all drives
  rolls up to cluster metric:
    Count of all drives in the cluster
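For reference, a rough Python sketch of what a raw rollup of the per-drive byte metrics looks like when done by hand. This is an assumption about the simplest possible aggregation (a plain sum over all drives in one node scrape), not a description of how MinIO computes its cluster metrics; usable capacity in particular depends on erasure coding parity and is not derivable this way. The function name `sum_metric` is hypothetical.

```python
import re

# Matches "name{labels} value" or "name value" lines in Prometheus text format.
# Lines with a trailing timestamp are ignored by this simplified pattern.
SAMPLE_LINE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$"
)


def sum_metric(exposition_text, metric_name):
    """Sum every sample of metric_name in the given scrape output."""
    total = 0.0
    for line in exposition_text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if match and match.group("name") == metric_name:
            total += float(match.group("value"))
    return total


# Values copied from the node endpoint output earlier in this thread.
node_sample = """\
minio_node_drive_free_bytes{drive="/disk1/data",server="play.min.io:9000"} 3.9700221952e+10
minio_node_drive_free_bytes{drive="/disk2/data",server="play.min.io:9000"} 4.0129953792e+10
minio_node_drive_free_bytes{drive="/disk3/data",server="play.min.io:9000"} 4.0129642496e+10
minio_node_drive_free_bytes{drive="/disk4/data",server="play.min.io:9000"} 4.013072384e+10
minio_node_drive_total_bytes{drive="/disk1/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_total_bytes{drive="/disk2/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_total_bytes{drive="/disk3/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_total_bytes{drive="/disk4/data",server="play.min.io:9000"} 4.292870144e+10
"""

raw_capacity = sum_metric(node_sample, "minio_node_drive_total_bytes")
raw_free = sum_metric(node_sample, "minio_node_drive_free_bytes")
print(int(raw_capacity))  # 171714805760
print(int(raw_free))      # 160090542080
```

For a multi-node deployment the same sum would need the scrape output of every node, which is the main reason a working cluster-level rollup is preferable to doing this client side.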
@kannappanr @anjalshireesh was there still progress on addressing the metrics v2 rollups above, or should we just proceed with documenting the node-level ones for now?
Otherwise we can just focus on the cluster rollups that do work and drop the rest until v3 stabilizes.
Re: v2 rollup, a customer reported that these metrics were "missing" after upgrade because they are now found under minio/v2/metrics/node:
minio_cluster_replication_link_offline_duration_seconds
minio_cluster_replication_link_online
minio_cluster_replication_current_active_workers
minio_cluster_replication_current_link_latency_ms
minio_cluster_replication_recent_backlog_count
minio_cluster_replication_last_minute_queued_count
minio_cluster_replication_credential_errors
minio_cluster_replication_current_transfer_rate
minio_cluster_replication_last_minute_queued_bytes
minio_cluster_replication_max_queued_count
@kannappanr @anjalshireesh are we generally going to leave metrics v2 as-is for now then, and focus on metrics v3? Our attempt to document the recommended alerts gets flaky because we do not list the /node metrics at all, since historically those are not recommended for use.
Summary
From an internal discussion, we should expand the alerting page to include the following list of recommended metrics:
minio_node_drive_free_bytes
minio_node_drive_free_inodes
minio_node_drive_latency_us
minio_node_drive_offline_total
minio_node_drive_online_total
minio_node_drive_total
minio_node_drive_total_bytes
minio_node_drive_used_bytes
minio_node_drive_errors_timeout
minio_node_drive_errors_availability
minio_node_drive_io_waiting
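As a sanity check for the kind of alert these metrics could back, here is a small Python sketch that flags drives whose free space ratio falls below a threshold, using `minio_node_drive_free_bytes` and `minio_node_drive_total_bytes`. The 10% default is a placeholder, not a MinIO recommendation, and the helper names (`per_drive`, `drives_low_on_space`) are hypothetical. It keys samples by the `drive` label only, which assumes a single-node scrape; a multi-node setup would need to key by `server` and `drive`.

```python
import re

# Matches "name{labels} value" lines in Prometheus text format.
SAMPLE_LINE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$"
)
DRIVE_LABEL = re.compile(r'drive="([^"]*)"')


def per_drive(exposition_text, metric_name):
    """Map each drive label to its value for the given metric."""
    values = {}
    for line in exposition_text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if not match or match.group("name") != metric_name:
            continue
        drive = DRIVE_LABEL.search(match.group("labels"))
        if drive:
            values[drive.group(1)] = float(match.group("value"))
    return values


def drives_low_on_space(exposition_text, min_free_ratio=0.10):
    """Return drives whose free/total ratio is below min_free_ratio."""
    free = per_drive(exposition_text, "minio_node_drive_free_bytes")
    total = per_drive(exposition_text, "minio_node_drive_total_bytes")
    low = []
    for drive, total_bytes in sorted(total.items()):
        if drive in free and total_bytes > 0:
            if free[drive] / total_bytes < min_free_ratio:
                low.append(drive)
    return low


# Two drives copied from the node endpoint output earlier in this thread.
node_sample = """\
minio_node_drive_free_bytes{drive="/disk1/data",server="play.min.io:9000"} 3.9700221952e+10
minio_node_drive_free_bytes{drive="/disk2/data",server="play.min.io:9000"} 4.0129953792e+10
minio_node_drive_total_bytes{drive="/disk1/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_total_bytes{drive="/disk2/data",server="play.min.io:9000"} 4.292870144e+10
"""

print(drives_low_on_space(node_sample))        # [] (both drives are over 90% free)
print(drives_low_on_space(node_sample, 0.93))  # ['/disk1/data'] (artificially high threshold, demo only)
```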
There are a lot of metrics here and the page already has some examples, so I'm thinking we can use a tab setup to help constrain the default length of the procedure.
Non-Goals
Extensive testing of Prometheus + Alert Manager w/ the above metrics