Open leftwo opened 2 years ago
Queue depth stats inspiration: https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/sys/kstat.h#L603-L672
Is this still a thing? @kev507 mentioned today that storage latency metrics are something people are interested in.
I think this is definitely still a thing. In addition to possibly tracking more statistics, we should update the fields on all statistics we do currently track. They ought to include at least:
It's not currently possible to update the schema for the existing timeseries. In this case, I would suggest we completely rename the existing timeseries, from things like `crucible_upstairs:read` to `virtual_disk:read` or similar. We can continue to report Crucible-specific metrics with that existing target if we want, but I think the virtual disk stuff itself makes sense to put somewhere else.
@leftwo let me know if you'd prefer to track the improvements to the field names (and timeseries name, if you agree with that) as a separate issue.
In the spirit of trying to answer the question "My VM IO is slow, why?", we want more metrics.
These metrics will help us either exonerate crucible, or identify where in crucible the problem is.
There are a bunch of DTrace probes in the Crucible upstairs, the Crucible downstairs, and the Crucible volume layer. These DTrace probes should be turned into actual metrics that are collected by Oximeter. In this case Crucible should do the work of creating the histogram buckets and Oximeter would just collect them.
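To make the "Crucible builds the buckets, Oximeter just collects them" idea concrete, here is a rough sketch of what the producer side could look like. Everything in it is hypothetical: the `LatencyHistogram` type, the power-of-two microsecond bucket layout, and the `snapshot` method are illustration only, not existing Crucible or Oximeter APIs.

```rust
use std::time::Duration;

/// Hypothetical latency histogram with power-of-two bucket boundaries,
/// measured in microseconds. Bucket 0 covers [0, 2) us; bucket i (i >= 1)
/// covers [2^i, 2^(i+1)) us; the last bucket also absorbs anything larger.
#[derive(Debug, Clone)]
pub struct LatencyHistogram {
    counts: Vec<u64>,
}

impl LatencyHistogram {
    /// Create a histogram with `n_buckets` buckets (must be non-zero).
    pub fn new(n_buckets: usize) -> Self {
        assert!(n_buckets > 0, "need at least one bucket");
        Self { counts: vec![0; n_buckets] }
    }

    /// Map a latency to its bucket index: floor(log2(us)), clamped to the
    /// last bucket.
    fn bucket_index(&self, latency: Duration) -> usize {
        let us = latency.as_micros();
        let idx = if us < 2 {
            0
        } else {
            (127 - us.leading_zeros()) as usize
        };
        idx.min(self.counts.len() - 1)
    }

    /// Record one completed IO. This is the kind of call that could sit
    /// where a DTrace "done" probe fires today (an assumption).
    pub fn record(&mut self, latency: Duration) {
        let idx = self.bucket_index(latency);
        self.counts[idx] += 1;
    }

    /// Copy of the bucket counts, for a collector to pick up.
    pub fn snapshot(&self) -> Vec<u64> {
        self.counts.clone()
    }
}
```

With this layout, recording samples of 0 us, 3 us, and 1000 us into an 8-bucket histogram lands them in buckets 0, 1, and 7 (the overflow bucket) respectively. A real implementation would pick bucket boundaries to match the latencies we actually expect to see.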
Additional metrics around throughput and queue depth are also desired, with more detail to come.
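For queue depth, one option is to accumulate time-weighted sums in the spirit of the queue statistics described in the kstat.h comment linked in this issue: track how long the queue was non-empty, plus a running sum of depth multiplied by elapsed time. The sketch below is only an illustration under that assumption; `QueueStats` and its fields are made-up names, and timestamps are passed in explicitly to keep it easy to test.

```rust
/// Hypothetical queue-depth accumulator. All times are in nanoseconds.
#[derive(Debug, Default)]
pub struct QueueStats {
    /// Current number of in-flight IOs.
    pub depth: u64,
    /// Timestamp of the last state change.
    pub last_update_ns: u64,
    /// Total time during which depth was greater than zero.
    pub busy_time_ns: u64,
    /// Running sum of (depth * elapsed time) across all intervals.
    pub depth_time_ns: u128,
}

impl QueueStats {
    /// Start accounting at `now_ns` with an empty queue.
    pub fn new(now_ns: u64) -> Self {
        Self { last_update_ns: now_ns, ..Default::default() }
    }

    /// Fold the interval since the last update into the running sums.
    fn advance(&mut self, now_ns: u64) {
        let dt = now_ns.saturating_sub(self.last_update_ns);
        if self.depth > 0 {
            self.busy_time_ns += dt;
            self.depth_time_ns += u128::from(self.depth) * u128::from(dt);
        }
        self.last_update_ns = now_ns;
    }

    /// An IO entered the queue.
    pub fn enter(&mut self, now_ns: u64) {
        self.advance(now_ns);
        self.depth += 1;
    }

    /// An IO left the queue.
    pub fn exit(&mut self, now_ns: u64) {
        self.advance(now_ns);
        self.depth = self.depth.saturating_sub(1);
    }

    /// Average depth over an observation window of `elapsed_ns`.
    pub fn average_depth(&self, elapsed_ns: u64) -> f64 {
        if elapsed_ns == 0 {
            return 0.0;
        }
        self.depth_time_ns as f64 / elapsed_ns as f64
    }
}
```

Because both sums only ever grow, a collector can sample them periodically and compute utilization and average depth for any interval from the difference between two samples.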