linux-surface / surface-aggregator-module

Linux ACPI and Platform Drivers for Surface Devices using the Surface Aggregator Module over Surface Serial Hub (Surface Book 2, Surface Pro 2017, Surface Laptop, and Newer)
GNU General Public License v2.0
93 stars 11 forks source link

Support for Thermal Subsystem #59

Open qzed opened 2 years ago

qzed commented 2 years ago

The thermal subsystem (TMP) currently lacks a driver for 7th and later generation models. This should be a SSAM client device driver and needs devices to be added to the registry (e.g. IDs ssam:01:03:<TID>:<IID>:<FUN>).

Since we can actually detect which sensor instances are available, there should probably be a thermal hub device managing (i.e. instantiating) individual sensor devices. It seems that there's also the possibility for having different target IDs ('TID') so it might be a good idea to have one hub per target ID. Hubs and sensor devices can be separated by function ID (FUN), e.g. hub has FUN=0 and sensor FUN=1.

There is some functionality to set sensor trip points, implementing this should also considered but might be something for a later time. First priority should be getting the values to sysfs in a format that e.g. thermald can read. Gen7 devices do have a DPTF config which seems to depend on those sensors, so we'll have to figure out what it expects and how to make that compatible, ideally so that thermald can automatically use that config, which should then result in behavior similar to Windows.

On the gen5 devices, these sensors have some ID string embedded in ACPI, which gives an indication what it is used for. Ideally we'd find a mapping for those on the gen7 devices as well. This might be encoded in the Windows driver, so might need some reverse engineering.

A list of currently known commands and events is provided here: https://github.com/linux-surface/surface-aggregator-module/blob/d85d69a25c8d47e769bad74ada3d5e9a19e0b883/doc/requests.txt#L139-L160

See https://www.kernel.org/doc/html/latest/driver-api/surface_aggregator/client.html#adding-ssam-devices for documentation on client devices and drivers.

gamer191 commented 2 years ago

which should then result in behavior similar to Windows

Wait, is surface-linux currently not behaving (thermal-management-wise) like windows? Is it dangerous/have the potential to cause hardware damage? I'm probably being a bit paranoid, but since I don't really understand that much about Linux's internal etc, I'm a little worried?

qzed commented 2 years ago

Wait, is surface-linux currently not behaving (thermal-management-wise) like windows?

Pretty much no laptop does out of the box. Intel DPTF stuff that describes throttling via ACPI needs at least thermald (and then thermald needs to support the ACPI stuff, which IIRC has also been an issue in the past). On the newer Surfaces we're also missing the thermal sensors so even with thermald at the moment this doesn't work.

Is it dangerous/have the potential to cause hardware damage?

There should be hard safeguards in firmware, like the CPU throttling and being locked to 400MHz (which has been reported on a couple of issues). To avoid these triggering, it's a good idea to have a custom thermald config (like the one provided in the repos) that is a bit more sensible (i.e. have something that throttles gradually before running into those limits, hopefully preventing ever running into those).

Also, as a very late edit to this: CPUs generally have thermal sensors built in and those are not handled via SAM. SAM sensors are mostly skin sensors, and on some devices also peripherals (e.g. battery).

qzed commented 2 years ago

Note: SL3 and SP7 handle temperature sensors via SAN, i.e. still via SAM but indirectly via ACPI and the ACPI bridge, which is already supported. So I've removed those.

qzed commented 2 years ago

I've added some SAM request/interface specifications for the thermal subsystem in https://github.com/linux-surface/surface-aggregator-cmddb/commit/257e4b206465db8e1ae5e8eab752a13d8d96689e. That should provide all the necessary info to implement a Linux driver.

qzed commented 2 years ago

I've added a basic implementation with https://github.com/linux-surface/kernel/commit/5fb7bb8160343cfa11a109c0970cd4ff562723db and https://github.com/linux-surface/kernel/commit/1cdd37f43fe526c259d06702528808e7d6efd66a.

iwanders commented 6 months ago

Following up here from https://github.com/linux-surface/kernel/pull/144#issuecomment-1858999412 .

I pulled the feature/thermal-sensors branch at 004898c290c9f11cd4907902773aa0cc21be9843.

On surface pro 9, the following thermal sensors appear;

ssam_thermal-virtual-0
Adapter: Virtual device
temp1:        +30.8C
temp2:        +30.6C
temp3:        +30.6C
temp4:        +30.5C
temp5:        +30.3C
temp6:        +30.3C
temp7:        +30.6C
temp8:        +30.6C
temp9:        +30.4C
temp10:       +30.4C
temp11:       +30.3C

@qzed , in one of the comments you mentioned it'd be great if we can get names for them.

I had no fear, so went with a polling approach:

``` $ /home/ivor/.nix-profile/bin/python ./ctrl.py request 3 1 1 1 1 08 0c $ /home/ivor/.nix-profile/bin/python ./ctrl.py request 3 1 2 1 1 03 00 00 00 01 00 03 00 $ /home/ivor/.nix-profile/bin/python ./ctrl.py request 3 1 3 1 1 TimeoutError: [Errno 110] ETIMEDOUT $ /home/ivor/.nix-profile/bin/python ./ctrl.py request 3 1 4 1 1 ff 07 $ /home/ivor/.nix-profile/bin/python ./ctrl.py request 3 1 5 1 1 TimeoutError: [Errno 110] ETIMEDOUT $ /home/ivor/.nix-profile/bin/python ./ctrl.py request 3 1 6 1 1 TimeoutError: [Errno 110] ETIMEDOUT $ /home/ivor/.nix-profile/bin/python ./ctrl.py request 3 1 7 1 1 TimeoutError: [Errno 110] ETIMEDOUT $ /home/ivor/.nix-profile/bin/python ./ctrl.py request 3 1 8 1 1 TimeoutError: [Errno 110] ETIMEDOUT $ /home/ivor/.nix-profile/bin/python ./ctrl.py request 3 1 9 1 1 00 $ /home/ivor/.nix-profile/bin/python ./ctrl.py request 3 1 10 1 1 0a 7d 00 00 $ /home/ivor/.nix-profile/bin/python ./ctrl.py request 3 1 11 1 1 TimeoutError: [Errno 110] ETIMEDOUT $ /home/ivor/.nix-profile/bin/python ./ctrl.py request 3 1 12 1 1 TimeoutError: [Errno 110] ETIMEDOUT $ /home/ivor/.nix-profile/bin/python ./ctrl.py request 3 1 13 1 1 02 00 01 00 01 02 03 04 05 06 07 08 09 0a 0b 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 $ /home/ivor/.nix-profile/bin/python ./ctrl.py request 3 1 14 1 1 27 16 01 49 5f 52 54 53 31 00 00 00 00 00 00 00 00 00 00 00 00 $ /home/ivor/.nix-profile/bin/python ./ctrl.py request 3 1 14 2 1 3a 36 01 49 5f 52 54 53 32 00 00 00 00 00 00 00 00 00 00 00 00 $ /home/ivor/.nix-profile/bin/python ./ctrl.py request 3 1 14 3 1 22 d5 01 49 5f 52 54 53 33 00 00 00 00 00 00 00 00 00 00 00 00 ```
$ /home/ivor/.nix-profile/bin/python ./ctrl.py request 3 1 14 1 1
27 16 01 49 5f 52 54 53 31 00 00 00 00 00 00 00 00 00 00 00 00
'I_RTS1������������

So command 14 is the name, but it's pretty underwhelming as you can see, they're just named RTS#, the firmware image (SurfaceSAM_6.101.139.bin.0.img) doesn't contain any other useful temperature names, with strings I can even find the block of I_RTS# sensors;

I_RTS1
I_RTS2
I_RTS3
I_RTS4
I_RTS5
I_RTS6
I_RTS7
I_RTS8
I_VTS1
I_VTS2
I_VTS3

edit; Running a test now, I think something is off.

qzed commented 6 months ago

Thanks! Well, at least it's marginally better than not having names at all. Names look similar on my SPX, and the SB2 does support temperature readouts, but the CID 14 times out (likely because back then it was routed via ACPI and extended functionality like names was hard-coded there).

I think to be able to actually incorporate sensor names/labels we'd have to rewrite the driver as hwmon driver.

qzed commented 6 months ago

Based on a quick look, this could then also be just one single hwmon driver and device instead of a hub and client drivers with multiple sensor device instances.

iwanders commented 6 months ago

The scaling is definitely off:

thermals

Temperatures stay at ~30, while I'm certain my tablet was more than lukewarm... stress -c12 going... With log_sensors.py (but forgot to log the load :sob:), and plot script

Data: thermal_hub_with_speed_load_stress.json.gz

We may be able to 'salvage' the values by just multiplying with 1000 and then bitmasking / shifting as appropriate.

qzed commented 6 months ago

Keep in mind that it also depends on where the sensors are placed. As far as I know, on the SPX at least a bunch (if not all) of those should be skin temp sensors, so not that close to the actual hot components (in particular because CPUs and most bigger components generally have their own temperature sensors).

On the SB2 at least (which uses the same CID for temperature queries via ACPI), the values are not scaled at all (ACPI reports in centi-degree). See e.g. here and here. I somewhat doubt that they would change scaling and "break backwards compatibility" of that function in such a way. From past experiences, I think they'd just add a new function.

iwanders commented 6 months ago

Keep in mind that it also depends on where the sensors are placed

Agreed, but if I have 11 sensors in my device, they're not all going to be at the far end where everything stayed between 30 and 35 celsius, especially not if many spots on the back were hot to touch.

If we apply WMI scaling (idea from here) with https://github.com/iwanders/nixos-surface/commit/a3196593eb51deef04f6e5152f5871b048ace30a so values are in 0.1 Kelvin, we obtain a reasonable plot, especially when we compare it with some acpi temperature sensors.

thermals_kelvin

The pci and wifi sensors are these entries from sensors;

nvme-pci-0200
Adapter: PCI adapter
Composite:    +37.9°C  (low  = -273.1°C, high = +82.8°C)
                       (crit = +84.8°C)
Sensor 1:     +37.9°C  (low  = -273.1°C, high = +65261.8°C)

iwlwifi_1-virtual-0
Adapter: Virtual device
temp1:        +33.0°C  

I presume those come from an ACPI bus.

I somewhat doubt that they would change scaling and "break backwards compatibility" of that function in such a way. From past experiences, I think they'd just add a new function.

I would 100% hope they didn't break backwards compatibility, but these numbers seem to indicate they did :grimacing:

qzed commented 6 months ago

Turns out you are completely correct! It looks like for some reason I misread and misremembered ACPI spec. ACPI says its values are in 1/10 Kelvin, not 1/100 Celsius... So no breakage, I just misunderstood things and the current driver implementation is wrong as a result of that. As you posted it, scaling of the SAM value should be x * 100 - 273150 to get the millidegree Celsius that Linux wants (or x / 10 - 273.15 for degree Celsius).

iwanders commented 6 months ago

Ah excellent news! Glad we don't need to sprinkle if-statements! :tada:

qzed commented 6 months ago

Fixed in 47ffb20dca031af8ce8f61636cb6ecd6f2a43f41. I decided to do the Kelvin subtraction in the original scale (i.e., rounding that to 273.1K, otherwise we'd just have some fixed 0.05 in every reported value.

iwanders commented 6 months ago

Yeah, cherrypicked that into my dev repo (edits to work around lack of thermal_tripless_zone_device_register >_<), can confirm temperatures look good now!

qzed commented 6 months ago

I've changed it to a hwmon-based driver in 09526a2d47711e258ab37ae99f348f5dd915e8a4. I hacked that up in a train, so the quality might not be the best. Also I could only test this on my SB2, so the name/label support isn't implemented yet.

I'll be without my main workstation for the next couple of days, so I'm not sure if I'll be able to compile a kernel for the SPX, meaning the label support might need to wait until I'm back home. Unless you want to have a go at it.

iwanders commented 6 months ago

Took a crack at it; https://github.com/linux-surface/surface-aggregator-module/pull/68

The duplicates do make me worry a bit whether it's really the name, or just a string... :/

qzed commented 6 months ago

I've pushed some more commits to feature/thermal-sensors, if you want to test those. With those, I think it's now in a state where it could go upstream.

iwanders commented 3 months ago

Just capturing this here, I came across the following id's in the SurfaceDiagnosticToolkit_SDT4B.exe binary (from the business diagnostic toolkit found here) when looked at with ilspy.

TouchTemperature2 = 12;
TouchTemperature = 9;
BatteryTemperature = 5;