Open qzed opened 2 years ago
which should then result in behavior similar to Windows
Wait, is surface-linux currently not behaving (thermal-management-wise) like windows? Is it dangerous/have the potential to cause hardware damage? I'm probably being a bit paranoid, but since I don't really understand that much about Linux's internal etc, I'm a little worried?
Wait, is surface-linux currently not behaving (thermal-management-wise) like windows?
Pretty much no laptop does out of the box. Intel DPTF stuff that describes throttling via ACPI needs at least thermald (and then thermald needs to support the ACPI stuff, which IIRC has also been an issue in the past). On the newer Surfaces we're also missing the thermal sensors so even with thermald at the moment this doesn't work.
Is it dangerous/have the potential to cause hardware damage?
There should be hard safeguards in firmware, like the CPU throttling and being locked to 400MHz (which has been reported on a couple of issues). To avoid these triggering, it's a good idea to have a custom thermald config (like the one provided in the repos) that is a bit more sensible (i.e. have something that throttles gradually before running into those limits, hopefully preventing ever running into those).
Also, as a very late edit to this: CPUs generally have thermal sensors built in and those are not handled via SAM. SAM sensors are mostly skin sensors, and on some devices also peripherals (e.g. battery).
Note: SL3 and SP7 handle temperature sensors via SAN, i.e. still via SAM but indirectly via ACPI and the ACPI bridge, which is already supported. So I've removed those.
I've added some SAM request/interface specifications for the thermal subsystem in https://github.com/linux-surface/surface-aggregator-cmddb/commit/257e4b206465db8e1ae5e8eab752a13d8d96689e. That should provide all the necessary info to implement a Linux driver.
I've added a basic implementation with https://github.com/linux-surface/kernel/commit/5fb7bb8160343cfa11a109c0970cd4ff562723db and https://github.com/linux-surface/kernel/commit/1cdd37f43fe526c259d06702528808e7d6efd66a.
Following up here from https://github.com/linux-surface/kernel/pull/144#issuecomment-1858999412 .
I pulled the feature/thermal-sensors
branch at 004898c290c9f11cd4907902773aa0cc21be9843.
On surface pro 9, the following thermal sensors appear;
ssam_thermal-virtual-0
Adapter: Virtual device
temp1: +30.8C
temp2: +30.6C
temp3: +30.6C
temp4: +30.5C
temp5: +30.3C
temp6: +30.3C
temp7: +30.6C
temp8: +30.6C
temp9: +30.4C
temp10: +30.4C
temp11: +30.3C
@qzed , in one of the comments you mentioned it'd be great if we can get names for them.
I had no fear, so went with a polling approach:
$ /home/ivor/.nix-profile/bin/python ./ctrl.py request 3 1 14 1 1
27 16 01 49 5f 52 54 53 31 00 00 00 00 00 00 00 00 00 00 00 00
'I_RTS1������������
So command 14 is the name, but it's pretty underwhelming as you can see, they're just named RTS#
, the firmware image (SurfaceSAM_6.101.139.bin.0.img
) doesn't contain any other useful temperature names, with strings
I can even find the block of I_RTS#
sensors;
I_RTS1
I_RTS2
I_RTS3
I_RTS4
I_RTS5
I_RTS6
I_RTS7
I_RTS8
I_VTS1
I_VTS2
I_VTS3
edit; Running a test now, I think something is off.
Thanks! Well, at least it's marginally better than not having names at all. Names look similar on my SPX, and the SB2 does support temperature readouts, but the CID 14 times out (likely because back then it was routed via ACPI and extended functionality like names was hard-coded there).
I think to be able to actually incorporate sensor names/labels we'd have to rewrite the driver as hwmon driver.
Based on a quick look, this could then also be just one single hwmon driver and device instead of a hub and client drivers with multiple sensor device instances.
The scaling is definitely off:
Temperatures stay at ~30, while I'm certain my tablet was more than lukewarm... stress -c12
going... With log_sensors.py (but forgot to log the load :sob:), and plot script
Data: thermal_hub_with_speed_load_stress.json.gz
We may be able to 'salvage' the values by just multiplying with 1000 and then bitmasking / shifting as appropriate.
Keep in mind that it also depends on where the sensors are placed. As far as I know, on the SPX at least a bunch (if not all) of those should be skin temp sensors, so not that close to the actual hot components (in particular because CPUs and most bigger components generally have their own temperature sensors).
On the SB2 at least (which uses the same CID for temperature queries via ACPI), the values are not scaled at all (ACPI reports in centi-degree). See e.g. here and here. I somewhat doubt that they would change scaling and "break backwards compatibility" of that function in such a way. From past experiences, I think they'd just add a new function.
Keep in mind that it also depends on where the sensors are placed
Agreed, but if I have 11 sensors in my device, they're not all going to be at the far end where everything stayed between 30 and 35 celsius, especially not if many spots on the back were hot to touch.
If we apply WMI scaling (idea from here) with https://github.com/iwanders/nixos-surface/commit/a3196593eb51deef04f6e5152f5871b048ace30a so values are in 0.1 Kelvin, we obtain a reasonable plot, especially when we compare it with some acpi temperature sensors.
The pci and wifi sensors are these entries from sensors
;
nvme-pci-0200
Adapter: PCI adapter
Composite: +37.9°C (low = -273.1°C, high = +82.8°C)
(crit = +84.8°C)
Sensor 1: +37.9°C (low = -273.1°C, high = +65261.8°C)
iwlwifi_1-virtual-0
Adapter: Virtual device
temp1: +33.0°C
I presume those come from an ACPI bus.
I somewhat doubt that they would change scaling and "break backwards compatibility" of that function in such a way. From past experiences, I think they'd just add a new function.
I would 100% hope they didn't break backwards compatibility, but these numbers seem to indicate they did :grimacing:
Turns out you are completely correct! It looks like for some reason I misread and misremembered ACPI spec. ACPI says its values are in 1/10 Kelvin, not 1/100 Celsius... So no breakage, I just misunderstood things and the current driver implementation is wrong as a result of that. As you posted it, scaling of the SAM value should be x * 100 - 273150
to get the millidegree Celsius that Linux wants (or x / 10 - 273.15
for degree Celsius).
Ah excellent news! Glad we don't need to sprinkle if-statements! :tada:
Fixed in 47ffb20dca031af8ce8f61636cb6ecd6f2a43f41. I decided to do the Kelvin subtraction in the original scale (i.e., rounding that to 273.1K, otherwise we'd just have some fixed 0.05 in every reported value.
Yeah, cherrypicked that into my dev repo (edits to work around lack of thermal_tripless_zone_device_register
>_<), can confirm temperatures look good now!
I've changed it to a hwmon-based driver in 09526a2d47711e258ab37ae99f348f5dd915e8a4. I hacked that up in a train, so the quality might not be the best. Also I could only test this on my SB2, so the name/label support isn't implemented yet.
I'll be without my main workstation for the next couple of days, so I'm not sure if I'll be able to compile a kernel for the SPX, meaning the label support might need to wait until I'm back home. Unless you want to have a go at it.
Took a crack at it; https://github.com/linux-surface/surface-aggregator-module/pull/68
The duplicates do make me worry a bit whether it's really the name, or just a string... :/
I've pushed some more commits to feature/thermal-sensors
, if you want to test those. With those, I think it's now in a state where it could go upstream.
The thermal subsystem (
TMP
) currently lacks a driver for 7th and later generation models. This should be a SSAM client device driver and needs devices to be added to the registry (e.g. IDsssam:01:03:<TID>:<IID>:<FUN>
).Since we can actually detect which sensor instances are available, there should probably be a thermal hub device managing (i.e. instantiating) individual sensor devices. It seems that there's also the possibility for having different target IDs ('TID') so it might be a good idea to have one hub per target ID. Hubs and sensor devices can be separated by function ID (
FUN
), e.g. hub hasFUN=0
and sensorFUN=1
.There is some functionality to set sensor trip points, implementing this should also considered but might be something for a later time. First priority should be getting the values to sysfs in a format that e.g. thermald can read. Gen7 devices do have a DPTF config which seems to depend on those sensors, so we'll have to figure out what it expects and how to make that compatible, ideally so that thermald can automatically use that config, which should then result in behavior similar to Windows.
On the gen5 devices, these sensors have some ID string embedded in ACPI, which gives an indication what it is used for. Ideally we'd find a mapping for those on the gen7 devices as well. This might be encoded in the Windows driver, so might need some reverse engineering.
A list of currently known commands and events is provided here: https://github.com/linux-surface/surface-aggregator-module/blob/d85d69a25c8d47e769bad74ada3d5e9a19e0b883/doc/requests.txt#L139-L160
See https://www.kernel.org/doc/html/latest/driver-api/surface_aggregator/client.html#adding-ssam-devices for documentation on client devices and drivers.