openbmc / technical-oversight-forum

6 stars 1 forks source link

Split nvmesensor out from dbus-sensors #30

Closed amboar closed 11 months ago

amboar commented 11 months ago

Design Document

  1. Patch to openbmc/docs
  2. Rendered

Naming

  1. nvmed
  2. nvmesensor

Maintainers

  1. Andrew Jeffery, TOF member
williamspatrick commented 11 months ago

One consideration is on naming. As much as prefixing 'phosphor-' on everything has lead to some grumbling, the original reason for doing that was that we were nearly certain that there would never be a package naming collision. Using nvmed is much more likely that some day someone else makes another nvmed and we get surprised by a Yocto update that adds a recipe that ends up taking precedent over ours.

(I ran into this problem recently in facebook/openbmc with a package we had named libnm that ended up colliding with a package provided by the network-manager recipe.)

I'm not suggesting we have to name this phosphor-foo but we might want to pick a more unique name.

mine260309 commented 11 months ago

I would like to discuss a somehow more "generic" approach for daemon like this. Not only nvme has this issue, other peripherals like nic cards/gpus have the same situation, that likely we will have "nic-sensor, gpu-sensor", but also "nic-daemon", "gpu-daemon".

How about considering the above use cases as well?

williamspatrick commented 11 months ago

Aren’t NICs and GPUs using PLDM? They’re certainly not using NVME-MI.

In my opinion, the vast majority of interesting code is unique to the protocol and not due to “I’m making a sensor” or “I’ve got some inventory”. Those common aspects should, in my opinion, be mostly covered by the generated bindings anyhow.

mine260309 commented 11 months ago

I am not referring to the protocol. The idea here is that for devices like NVMe/NIC/GPU, BMC will need to not only create sensors but also do other management, e.g. inventory, fault monitor, etc.

Such functions do not fit in dbus-sensors, which is why @amboar raises the proposal here. I would like to discuss a generalized approach to handle such use case.

williamspatrick commented 11 months ago

What is the potential code sharing though? I'm not understanding what a 'generalized approach' would look like if the protocols are different and the code makes common dbus interfaces.

My current expectation is that GPU and NICs should/would be supported by openbmc/pldm.

wak-google commented 11 months ago

Why would other management functions not just be done through a whole separate daemon. The original idea with dbus is that you could have multiple daemons providing the same objects with different interfaces. You could still provide the sensors bits here.

amboar commented 11 months ago

I am not referring to the protocol. The idea here is that for devices like NVMe/NIC/GPU, BMC will need to not only create sensors but also do other management, e.g. inventory, fault monitor, etc.

Such functions do not fit in dbus-sensors, which is why @amboar raises the proposal here. I would like to discuss a generalized approach to handle such use case.

This is a bit of a "useful reduction" of my argument in the linked design document. I'll bring the three points I was raising over here:

  1. nvmesensor becomes much more than a sensor daemon
  2. nvmesensor becomes a significant portion of the dbus-sensors codebase
  3. Maintenance of nvmesensor requires significant domain-specific knowledge

Your argument is around point 1, but somewhat disregards point 2 and 3. It's not clear to me whether you're suggesting adding GPU and NIC support into the repository I'm proposing here, but if you are that runs afoul of point 3 by requiring unrelated, significant, domain-specific knowledge to work with it. I'm trying to reduce the scope of the three specific problems I outlined in the points above by isolating the nvmesensor code.

I don't think there are significant issues with reassessing at some later point whether there is infrastructure that can be shared. I'm happy for those patterns to emerge from the state of the ecosystem rather than try to enforce some top-down structure from the start when it's unclear what it is we're trying to generalise. Please create a separate issue for discussion or a vote on your concerns. Let's try to keep the scope tight here - I prefer that "generalising now" is a non-goal for this TOF issue.

drakedog2008 commented 11 months ago

Why would other management functions not just be done through a whole separate daemon. The original idea with dbus is that you could have multiple daemons providing the same objects with different interfaces. You could still provide the sensors bits here.

The NVMe-MI provides the whole stack of functionalities including but beyond the thermal reading. In the other word, the thermal reading is embedded in the NVMe architecture and the health reporting mechanism, which is not suitable to split into multiple daemons.

As for the original NVMe BASIC based nvmesensor. The BASIC protocol is more like a premature and simplified version before NVMe-MI 1.0. But the status report definition is similar to NVMe-MI and worth sharing some of the implementation.

Quoting Andrew's proposal:

Prior to completion and public availability of the NVMe-MI 1.0 specification, we have received feedback that a standardized method is needed by the industry to poll an NVMe device for basic health status information and which only requires SMBus slave support. The intent of the technical note is to publicly release a standardized approach for such a capability.

amboar commented 11 months ago

Why would other management functions not just be done through a whole separate daemon. The original idea with dbus is that you could have multiple daemons providing the same objects with different interfaces. You could still provide the sensors bits here.

@wak-google initially I interpreted your comment here as a response to @mine260309 but in case it wasn't, I would like to avoid a third NVMe stack in the project, as outlined in the design document. That's the motivation for chopping out nvmesensor and extending it (in addition to @drakedog2008 's response above).

amboar commented 11 months ago

One consideration is on naming. As much as prefixing 'phosphor-' on everything has lead to some grumbling, the original reason for doing that was that we were nearly certain that there would never be a package naming collision. Using nvmed is much more likely that some day someone else makes another nvmed and we get surprised by a Yocto update that adds a recipe that ends up taking precedent over ours.

(I ran into this problem recently in facebook/openbmc with a package we had named libnm that ended up colliding with a package provided by the network-manager recipe.)

I'm not suggesting we have to name this phosphor-foo but we might want to pick a more unique name.

@williamspatrick so tackling this concern, we already have phosphor-nvme. I will argue that it takes phopshor-nvmed out as a potential name as they are far too similar. So we'll need something else.

A whimsical approach might be to find a representative word, e.g.:

$ grep -w 'n.*v.*m.*e' /usr/share/dict/words
nevermore

Rumour has it this technique was used to name the samba project. Existing practice for whimsical naming in OpenBMC includes swampd from openbmc/phosphor-pid-control. If others have suggestions, please list them for consideration.

amboar commented 11 months ago

Based on @edtanous' feedback on the proposal document I'm choosing to defer this for now. We can experiment with some of the strategies he outlined and reassess whether this is necessary.

Thanks for the input.