Average latency for IO requests

I found that the dashboard does not provide an equivalent of avio from atop or the lat metric from fio, i.e. the average time it takes to execute an IO request.

Basically, it is the reciprocal of IOPS when the load is single-threaded (iodepth=1). The multiqueue case is more complicated, but this metric still shows the execution time per IO request.
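To make the iodepth=1 case concrete, here is a minimal Python sketch of that reciprocal relationship. The function name `latency_from_iops` is hypothetical and only for illustration. It assumes exactly one request in flight at any time, so it does not apply to deeper queues.

```python
def latency_from_iops(iops: float) -> float:
    """Average seconds per request, assuming a single outstanding request (iodepth=1)."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    # With one request in flight at a time, each request takes 1/IOPS seconds.
    return 1.0 / iops


# Example: 2000 IOPS at iodepth=1 corresponds to 0.5 ms per request.
print(latency_from_iops(2000.0) * 1000, "ms")
```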

I played a bit with the dashboards and Prometheus, ran some fio benchmarks while Prometheus was scraping to verify the results, and my current formula looks like this:

irate(node_disk_io_time_seconds_total{instance="$node",job="$job"}[5m]) /
    (
        irate(node_disk_writes_completed_total{instance="$node",job="$job"}[5m]) +
        irate(node_disk_reads_completed_total{instance="$node",job="$job"}[5m])
    )

In other words: io_time_rate / (write_requests + read_requests).
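The same arithmetic can be sketched outside Prometheus, which may help when cross-checking against fio or atop by hand. This is only an illustration: the `DiskSample` class and `avg_io_latency` function are hypothetical names, and the sketch assumes two consecutive scrapes of the three counters for one device. Because each irate() is a delta divided by the same scrape interval, the interval cancels out and only the counter deltas matter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiskSample:
    """One scrape of the three counters for a single device (hypothetical container)."""
    io_time_seconds: float    # node_disk_io_time_seconds_total
    reads_completed: float    # node_disk_reads_completed_total
    writes_completed: float   # node_disk_writes_completed_total


def avg_io_latency(prev: DiskSample, curr: DiskSample) -> Optional[float]:
    """Seconds of IO time per completed request between two scrapes.

    Returns None when no requests completed, to avoid dividing by zero.
    """
    busy = curr.io_time_seconds - prev.io_time_seconds
    ops = (curr.reads_completed - prev.reads_completed) + (
        curr.writes_completed - prev.writes_completed
    )
    if ops <= 0:
        return None
    return busy / ops


# Example: 0.5 s of IO time spent on 400 reads + 600 writes.
prev = DiskSample(io_time_seconds=100.0, reads_completed=1000, writes_completed=2000)
curr = DiskSample(io_time_seconds=100.5, reads_completed=1400, writes_completed=2600)
print(avg_io_latency(prev, curr))  # 0.0005 (seconds, i.e. 0.5 ms per request)
```

Returning None for an idle interval mirrors what the PromQL expression does in that case: the division yields NaN and the graph shows a gap rather than a misleading zero.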

I checked it with synthetic loads at iodepth=1, iodepth=4, and iodepth=32; the results are all consistent with the fio and atop output.

Should I add one more graph? Does this formula sound right?
