Open dswarbrick opened 3 days ago
@sbates130272 Can you perhaps confirm that the nvme smart-log
command is indeed controller-specific, not namespace-specific? I noticed that nvme-cli will work with either e.g. /dev/nvme0
or /dev/nvme0n1
, but as far as I can tell, the data returned are controller-specific. I don't think counters such as data_units_written
are tracked per namespace.
I would like to refactor a few of the metrics that we currently expose as "per-device" (which is a misnomer), and instead expose them per controller. This will unfortunately affect about 15 metrics, but I feel that it is important to get this right.
For example, this:
nvme_power_on_hours_total{device="nvme0n1"} 16170.0
would become:
nvme_power_on_hours_total{controller="nvme0"} 16170.0
... as I don't think it's possible for a namespace to be powered on for a different number of hours than another namespace on the same controller.
The metric / labels implemented by this collector hark back to the original nvme_metrics.sh collector, and were based on the wrong assumption that all NVMe devices are single-namespace - probably due to only having consumer-grade SSDs to test it on at the time.
@dswarbrick I did a quick check and depending on the NVMe device it may support SMART log pages on a per namespace basis or it may not. The ability to do this on a per namespace basis is advertised via a field in the controller. Here is the help information for the nvme smart-log
command.
NVME-SMART-LOG(1) NVMe Manual NVME-SMART-LOG(1)
NAME
nvme-smart-log - Send NVMe SMART log page request, returns result and log
SYNOPSIS
nvme smart-log <device> [--namespace-id=<nsid> | -n <nsid>]
[--raw-binary | -b]
[--output-format=<fmt> | -o <fmt>] [--verbose | -v]
DESCRIPTION
Retrieves the NVMe SMART log page from an NVMe device and provides the returned structure.
The <device> parameter is mandatory and may be either the NVMe character device (ex: /dev/nvme0), or a namespace block device (ex: /dev/nvme0n1).
On success, the returned smart log structure may be returned in one of several ways depending on the option flags; the structure may parsed by the program and
printed in a readable format or the raw buffer may be printed to stdout for another program to parse.
OPTIONS
-n <nsid>, --namespace-id=<nsid>
Retrieve the SMART log for the given nsid. This is optional and its success may depend on the device’s capabilities to provide this log on a per-namespace
basis (see the NVMe Identify Controller for this capability). The default nsid to use is 0xffffffff for the device global SMART log.
-b, --raw-binary
Print the raw SMART log buffer to stdout.
-o <fmt>, --output-format=<fmt>
Set the reporting format to normal, json or binary. Only one output format can be used at a time.
-v, --verbose
Increase the information detail in the output.
EXAMPLES
• Print the SMART log page in a human readable format:
# nvme smart-log /dev/nvme0
• Print the raw SMART log to a file:
I will be honest and say I have no idea what percentage of drives in the field or on the market support the smart-log information on a per namespace basis.
@dswarbrick is it worth sending an email to the linux-nvme mailing list so that the people on that list can chime in on this?
@sbates130272 What I am a little unclear about is whether there is any difference between passing e.g. /dev/nvme0
and /dev/nvme0n1
as the device to the nvme smart-log
command. I get the impression that nvme-cli derives the controller, e.g. nvme0
, regardless of whether /dev/nvme0
or /dev/nvme0n1
is specified. This is of course also assuming that the nsid is not specified with --namespace-id
.
Basically the issue is whether we are making a wrong assumption by tagging such metrics with a device="nvme0n1"
label, when in fact they should be tagged with controller="nvme0"
.
In other words, I think it's sufficient for the collector to only fetch the global SMART log for now, but that we should reflect that in the labeling convention. Tagging the SMART metrics with device="nvme0n1"
gives the impression that they are specific to that namespace, whereas nvme smart-log /dev/nvme0n1
is to my understanding not the same as nvme smart-log /dev/nvme0 --namespace-id 1
.
@dswarbrick I took a look at the latest version of the NVMe specification and the relevant code in nvme-cli and it looks like nvme-cli will return a controller-based SMART page unless the --namespace-id
is set. In which case it will return a namespace-based SMART log (or an error if the device does not support namespace-based SMART log pages). So I concur with your assumption that a controller-based SMART log page is always returned unless --namespace-id
is set.
@sbates130272 Thanks for researching that. So I think that, as I alluded to with the FIXME in the current code, it would make sense to just fetch the global SMART log once per controller, i.e., without specifying a namespace, and change the device
label to controller
. If there is a strong need in future, we could fetch it per namespace, with the addition of a nsid
label in the metric. The key thing is clear up the misunderstanding with the e.g. device="nvme0n1"
labels.
@sbates130272 How does this PR look to you now?
This PR contains breaking changes.
Change
nvme_nvmecli
metric type to Info, which also adds a suffix to the metric name, i.e.,nvme_nvmecli_info
.Refactor
nvme_device_info
metric asnvme_controller_info
, adding atransport
label. For example this:becomes:
Fetch device global SMART log once per controller Change
device
label in controller-specific metrics tocontroller
. This also means that the label value will be the NVMe character device name, e.g. "nvme0", instead of the previously used namespace block device name, e.g. "nvme0n1".Introduce
nvme_namespace_info
metric to facilitate join-relationship between namespace-specific and controller-specific metrics.