openbmc / dbus-sensors

D-Bus configurable sensor scanning applications
Apache License 2.0
23 stars 44 forks source link

The sensor value override feature is obscure and needs visibility #32

Open Krellan opened 1 month ago

Krellan commented 1 month ago

Currently, the dbus-sensors sensor daemons support the idea of sensor value override, that is, the ability to receive a written value from a D-Bus incoming command, instead of reading the actual value from the underlying sensor hardware. This feature is also known as sensor mocking.

This feature is very useful for debugging, but it is obscure. It has the following limitations:

So, to summarize, let's make an interface with three bools, and add it to each sensor.

This is closely related to the Mutability feature we talked about a while ago. The difference is that of layering: sensors that are mutable are designed to have their values typically externally changeable by end users, such as fan speed settings in manual mode, and such. The sensor mocking feature is more of a development and test feature, however, and typically will not be allowed during production.

So, any thoughts on this?

amboar commented 1 month ago

Three bools gives 8 states. Are they all valid?

I expect it would be better to use a "choice" style string-type property that behaves as an enumeration (and given its OpenBMC-related, needs to be defined in PDI and use the sdbusplus enum support).

edtanous commented 1 month ago

The override is placed into effect when the first D-Bus write command is received, and then, it will remain in effect until the sensor daemon itself is forcefully restarted. There is no way to gracefully cancel the sensor override feature from being in effect.

This was an intentional design decision, the expectation is that when completed running mocks/testing, the user by policy should reboot the bmc when the test completes. Doing anything else is a half-measure that very likely could lead to things not being reset properly, and us debugging code that isn't useful to a product. With that said, if you wanted to add support for something like this and document that it's best effort, feel free.

Enabling or disabling sensor mocking is done at compile time. At runtime, there is no way to see if the sensor mocking feature is available or not, short of trying it to see if you get an error message or not.

The compile time flag is to allow opting out of the feature for platforms that don't want it, or might consider it a potential DOS issue if an attacker were to get access to dbus. Why is returning an error message that a feature is not supported a problem in practice?

So, to summarize, let's make an interface with three bools, and add it to each sensor.

A dbus interface somewhat of defeats the purpose of having a mock, because now other services can code to "is this being mocked" and the behavior of the code could be different when in test mode versus not. I could certainly see the argument that we should log these things to the journal when they're mocked and when they're not, but a formal interface implies that we expect software to use it, which by design, we don't want to have happen.

and typically will not be allowed during production.

What does this mean? There are absolutely test harnesses that run this test during a production build. In fact, that's a lot of the reason for having it, being able to troubleshoot software on a production signed build.

For the logging portion, I think good old journal logs would get us where we need to be in short order, logging when sensors enter and exit mocking mode.

Krellan commented 1 month ago

Three bools gives 8 states. Are they all valid?

I expect it would be better to use a "choice" style string-type property that behaves as an enumeration (and given its OpenBMC-related, needs to be defined in PDI and use the sdbusplus enum support).

Good point. There are less than 8 valid states, so an enum might be preferable. Here's the states that would be valid:

I still think using bools would be clearer than having to come up with strings to describe each of these.

Another reason to use bools, instead of an enum, is to make it impossible to have to deal with the case of an invalid string, and also, to make it self-documenting what choices are available (instead of forcing the user to guess, or read the source, to learn which strings are valid).

Krellan commented 1 month ago

This was an intentional design decision, the expectation is that when completed running mocks/testing, the user by policy should reboot the bmc when the test completes. Doing anything else is a half-measure that very likely could lead to things not being reset properly, and us debugging code that isn't useful to a product. With that said, if you wanted to add support for something like this and document that it's best effort, feel free.

Interesting. Although inconvenient, it does reduce the attack surface for bugs.

The compile time flag is to allow opting out of the feature for platforms that don't want it, or might consider it a potential DOS issue if an attacker were to get access to dbus. Why is returning an error message that a feature is not supported a problem in practice?

Because it's a one-way thing, as you mentioned above, no way to undo it without restarting the sensor services (or rebooting the BMC). If I wanted to probe for the existence of sensor mocking, there's no way to do this safely, as if I try to write a value, even the same value that the sensor is already set to, it will either generate the error message or activate the sensor mocking feature, even if I don't want to activate it (I just wanted to see if it was enabled or not).

A dbus interface somewhat of defeats the purpose of having a mock, because now other services can code to "is this being mocked" and the behavior of the code could be different when in test mode versus not. I could certainly see the argument that we should log these things to the journal when they're mocked and when they're not, but a formal interface implies that we expect software to use it, which by design, we don't want to have happen.

What would be the use case of a "stealth" mocking feature that would be enabled but not having any way for the user to know if it is enabled or not?

As for the interface, we could do the same compromise that was done for phosphor-pid-control informative diagnostic information that is now appearing in the D-Bus objects (and is a great help when trying to tune a running system): use the Debug namespace for the name of the interface. This avoids having to formalize and standardize it.

What does this mean? There are absolutely test harnesses that run this test during a production build. In fact, that's a lot of the reason for having it, being able to troubleshoot software on a production signed build.

Agreed, and one of the tests I wanted to add in our product was a test to properly verify that the sensor mocking feature was disabled and inactive during a production build. Given the above, I can't do this test without inadvertently activating the sensor mocking feature (which is a test failure, but causes me to have to reboot the BMC, because of this). So, this test has to be classified as a disruptive test instead of a passive test, which, among other things, means I can't run it in parallel with other tests (in our particular test framework).

For the logging portion, I think good old journal logs would get us where we need to be in short order, logging when sensors enter and exit mocking mode.

This would be good to add, as a low-hanging fruit. As for entering mocking mode, this would take place upon the first D-Bus write received (unless the sensor is a special one with its own write handlers, such as the fan PWM sensors, which are designed to push sensor writes down to the underlying hardware, to set the fan speed during normal running conditions). As for leaving mocking mode, there is currently no way to do this without a restart, as you mentioned, which already has fairly loud logging.