oxidecomputer / crucible

A storage service.

Crucible metrics V2 #442

Open leftwo opened 2 years ago

leftwo commented 2 years ago

In the spirit of trying to answer the question "My VM IO is slow, why?", we want more metrics.

These metrics will help us either exonerate crucible, or identify where in crucible the problem is.

There are a bunch of DTrace probes in the crucible upstairs, the crucible downstairs, and the crucible volume layer. These DTrace probes should be turned into actual metrics that are collected by Oximeter. In this case crucible should do the work of creating the histogram buckets, and Oximeter would just collect them.
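To make the "crucible builds the buckets, Oximeter just collects them" idea concrete, here is a minimal sketch of a latency histogram with power-of-two buckets. Everything in it is hypothetical: `LatencyHistogram`, `NUM_BUCKETS`, and the microsecond bucket layout are illustrative assumptions, not Crucible's or Oximeter's actual types.

```rust
use std::time::Duration;

/// Hypothetical bucket count. Bucket `i` counts latencies in
/// [2^i, 2^(i+1)) microseconds; bucket 0 also takes sub-microsecond
/// samples, and the last bucket catches everything larger.
const NUM_BUCKETS: usize = 32;

#[derive(Debug, Clone)]
pub struct LatencyHistogram {
    counts: [u64; NUM_BUCKETS],
    total: u64,
}

impl LatencyHistogram {
    pub fn new() -> Self {
        Self { counts: [0; NUM_BUCKETS], total: 0 }
    }

    /// Map a latency to its bucket index: floor(log2(microseconds)),
    /// clamped to the last bucket.
    pub fn bucket_index(latency: Duration) -> usize {
        let us = latency.as_micros();
        if us == 0 {
            return 0;
        }
        let idx = (127 - us.leading_zeros()) as usize;
        idx.min(NUM_BUCKETS - 1)
    }

    /// Record one completed IO. This would be called where a
    /// "done" probe fires today.
    pub fn record(&mut self, latency: Duration) {
        self.counts[Self::bucket_index(latency)] += 1;
        self.total += 1;
    }

    /// Per-bucket counts, ready to be handed to a collector as-is.
    pub fn counts(&self) -> &[u64] {
        &self.counts
    }

    pub fn total(&self) -> u64 {
        self.total
    }
}
```

With this shape, the IO path only does an increment per completed operation, and the collector reads the whole array of counts on its own schedule.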

Additional metrics around throughput and queue depth are also desired, with more detail to come.

leftwo commented 2 years ago

Queue depth stats inspiration: https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/sys/kstat.h#L603-L672
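For reference, the `kstat_io` accounting in that header keeps a queue count plus two cumulative timers (`wcnt`, `wtime`, `wlentime`, and their run-queue equivalents), updated on every enter and exit. Below is a rough sketch of the same idea in Rust. The names `QueueStats`, `enter`, `exit`, `snapshot`, `average_depth`, and `utilization` are made up for illustration, and timestamps are passed in explicitly as nanoseconds to keep the example self-contained.

```rust
/// Hypothetical kstat-style queue accounting.
#[derive(Debug, Default, Clone, Copy)]
pub struct QueueStats {
    /// Elements currently in the queue.
    pub count: u32,
    /// Cumulative nanoseconds during which the queue was non-empty.
    pub busy_ns: u64,
    /// Cumulative sum of (queue length * elapsed ns), a Riemann sum.
    pub len_time_ns: u64,
    /// Timestamp of the last update, in nanoseconds.
    pub last_update_ns: u64,
}

impl QueueStats {
    /// Fold the time since the last update into the cumulative timers.
    fn advance(&mut self, now_ns: u64) {
        let delta = now_ns.saturating_sub(self.last_update_ns);
        if self.count > 0 {
            self.busy_ns += delta;
            self.len_time_ns += delta * u64::from(self.count);
        }
        self.last_update_ns = now_ns;
    }

    /// An IO entered the queue.
    pub fn enter(&mut self, now_ns: u64) {
        self.advance(now_ns);
        self.count += 1;
    }

    /// An IO left the queue.
    pub fn exit(&mut self, now_ns: u64) {
        self.advance(now_ns);
        self.count = self.count.saturating_sub(1);
    }

    /// Bring the timers up to `now_ns` and return a copy for the collector.
    pub fn snapshot(&mut self, now_ns: u64) -> QueueStats {
        self.advance(now_ns);
        *self
    }
}

/// Average queue depth between two snapshots.
pub fn average_depth(earlier: &QueueStats, later: &QueueStats) -> Option<f64> {
    let elapsed = later.last_update_ns.checked_sub(earlier.last_update_ns)?;
    if elapsed == 0 {
        return None;
    }
    let area = later.len_time_ns.saturating_sub(earlier.len_time_ns);
    Some(area as f64 / elapsed as f64)
}

/// Fraction of the interval during which the queue was non-empty.
pub fn utilization(earlier: &QueueStats, later: &QueueStats) -> Option<f64> {
    let elapsed = later.last_update_ns.checked_sub(earlier.last_update_ns)?;
    if elapsed == 0 {
        return None;
    }
    let busy = later.busy_ns.saturating_sub(earlier.busy_ns);
    Some(busy as f64 / elapsed as f64)
}
```

Because the timers are cumulative, a collector only needs two snapshots to derive average depth and utilization for any interval, without the producer keeping per-interval state.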

david-crespo commented 5 months ago

Is this still a thing? @kev507 mentioned today that storage latency metrics are something people are interested in.

bnaecker commented 5 months ago

I think this is definitely still a thing. In addition to possibly tracking more statistics, we should update the fields on all statistics we do currently track. They ought to include at least:

It's not currently possible to update the schema for the existing timeseries. In this case, I would suggest we completely rename the existing timeseries, from things like crucible_upstairs:read to virtual_disk:read or similar. We can continue to report Crucible-specific metrics with that existing target if we want, but I think the virtual disk stuff itself makes sense to put somewhere else.

@leftwo let me know if you'd prefer to track the improvements to the field names (and timeseries name, if you agree with that) as a separate issue.