linux-surface / kernel

Linux kernel with modifications for Microsoft Surface devices.
Other
118 stars 33 forks source link

Surface fan module #144

Closed iwanders closed 7 months ago

iwanders commented 9 months ago

Updated Description / Summary

This pull request has quite a lengthy amount of comments with analysis, updating the description with a short recap to get people from the future up to speed:

Original description:

First of; thanks for the great work to support Linux on these devices, I wouldn't have bought a Surface had it not been for this project.

I ended up compiling a kernel on my new Surface and found that it got quite warm. After seeing a few issues about thermal and fans without real resolutions I decided to investigate this over the course of this week and weekend.

A coarse write up of my analysis can be found in this file, roughly, my current understanding is the following:

I tried to write a kernel module that does the following:

The value set with CID 11 unfortunately gets overwritten as soon as the surface tablet temperature exceeds roughly 40C and the onboard controller kicks in. Big disappointment there, I think the fan profile needs to be switched, but I cannot figure out how to achieve that, I tried sending integers to various CID's to no avail. I tried to setup Irpmon to sniff the commands sent at boot, but I could not get that working. I only discovered this after I basically finished this module. But I'm not sure now how much value there is in providing the functionality to set the value at the moment. As long as the device stays below 40 degrees C we can control the fan, above that we lose control. I'm really not sure what else I can do to try to figure out how to change the profile, hopefully someone else has an idea.

About the actual PR

Footnote [1]: CID 11 captured from Windows.

``` 533:{"ctrl": {"type": 64, "len": 0, "pad": 0, "seq": 138}}, {"ctrl": {"type": 128, "len": 10, "pad": 0, "seq": 15}, "cmd": {"type": 128, "tc": 5, "sid": 0, "tid": 1, "iid": 1, "rqid_lo": 56, "rqid_hi": 7, "cid": 11}, "payload": [210, 11], "time": "2023-12-03 1:55:16 AM"}, 538:{"ctrl": {"type": 64, "len": 0, "pad": 0, "seq": 140}}, {"ctrl": {"type": 128, "len": 10, "pad": 0, "seq": 18}, "cmd": {"type": 128, "tc": 5, "sid": 0, "tid": 1, "iid": 1, "rqid_lo": 59, "rqid_hi": 7, "cid": 11}, "payload": [204, 11], "time": "2023-12-03 1:55:17 AM"}, 553:{"ctrl": {"type": 64, "len": 0, "pad": 0, "seq": 146}}, {"ctrl": {"type": 128, "len": 10, "pad": 0, "seq": 25}, "cmd": {"type": 128, "tc": 5, "sid": 0, "tid": 1, "iid": 1, "rqid_lo": 66, "rqid_hi": 7, "cid": 11}, "payload": [0, 0], "time": "2023-12-03 1:55:21 AM"}, ```

Footnote [2]: CID 14 together with the platform profile

``` {"ctrl": {"type": 64, "len": 0, "pad": 0, "seq": 124}}, {"ctrl": {"type": 128, "len": 12, "pad": 0, "seq": 12}, "cmd": {"type": 128, "tc": 3, "sid": 0, "tid": 1, "iid": 0, "rqid_lo": 53, "rqid_hi": 7, "cid": 3}, "payload": [3, 0, 0, 0], "time": "2023-12-03 12:42:53 AM"}, {"ctrl": {"type": 64, "len": 0, "pad": 0, "seq": 13}}, {"ctrl": {"type": 128, "len": 9, "pad": 0, "seq": 14}, "cmd": {"type": 128, "tc": 5, "sid": 0, "tid": 1, "iid": 1, "rqid_lo": 55, "rqid_hi": 7, "cid": 14}, "payload": [3], "time": "2023-12-03 12:42:53 AM"}, {"ctrl": {"type": 64, "len": 0, "pad": 0, "seq": 126}}, {"ctrl": {"type": 128, "len": 12, "pad": 0, "seq": 17}, "cmd": {"type": 128, "tc": 3, "sid": 0, "tid": 1, "iid": 0, "rqid_lo": 58, "rqid_hi": 7, "cid": 3}, "payload": [4, 0, 0, 0], "time": "2023-12-03 12:42:58 AM"}, {"ctrl": {"type": 64, "len": 0, "pad": 0, "seq": 18}}, {"ctrl": {"type": 128, "len": 9, "pad": 0, "seq": 19}, "cmd": {"type": 128, "tc": 5, "sid": 0, "tid": 1, "iid": 1, "rqid_lo": 60, "rqid_hi": 7, "cid": 14}, "payload": [4], "time": "2023-12-03 12:42:58 AM"}, {"ctrl": {"type": 64, "len": 0, "pad": 0, "seq": 130}}, {"ctrl": {"type": 128, "len": 12, "pad": 0, "seq": 24}, "cmd": {"type": 128, "tc": 3, "sid": 0, "tid": 1, "iid": 0, "rqid_lo": 65, "rqid_hi": 7, "cid": 3}, "payload": [1, 0, 0, 0], "time": "2023-12-03 12:43:01 AM"}, {"ctrl": {"type": 64, "len": 0, "pad": 0, "seq": 25}}, {"ctrl": {"type": 128, "len": 9, "pad": 0, "seq": 26}, "cmd": {"type": 128, "tc": 5, "sid": 0, "tid": 1, "iid": 1, "rqid_lo": 67, "rqid_hi": 7, "cid": 14}, "payload": [2], "time": "2023-12-03 12:43:01 AM"}, ```

Tested on Surface Pro 9, i7-1255U, Windows 11 & NixOS 23.11 with kernel from this repo.

qzed commented 9 months ago

Thanks for figuring all that out! I don't have time right now to look into this in detail, but I'll try to do that on the weekend. I'll also try to check if this works on my SB2, but I doubt that the interface exists there since the FAN0/PNP0C0B device is not present on that.

Regarding your implementation: For a proper upstreamable module, we will need to figure something out to avoid the PNP0C0B clash. I have honestly no idea how Windows does that, or why they use the generic PNP0C0B match for a more or less custom device in the first place.

Apart from that, I have one very small comment on your code, but I only had a quick glance yet. Again, I'll try to have a look on the weekend and maybe see if I can figure out why it sometimes won't set up correctly.

iwanders commented 9 months ago

Thanks for figuring all that out! I don't have time right now to look into this in detail, but I'll try to do that on the weekend. I'll also try to check if this works on my SB2, but I doubt that the interface exists there since the FAN0/PNP0C0B device is not present on that.

Absolutely no rush!

I'd really like to control the fan above the 40 degrees point, so I'm going to be exploring some more around that, but it may involve figuring out how to get IRPMon to log during boot, I may just try to hack up that driver and hardcode it for that uart bus. I'm not a Windows developer though so I expect that'll take some time.

Regarding your implementation: For a proper upstreamable module, we will need to figure something out to avoid the PNP0C0B clash. I have honestly no idea how Windows does that, or why they use the generic PNP0C0B match for a more or less custom device in the first place.

Oh interesting... I expected them to have a rule that is something like; if there is a resource in a common bus, the custom driver should still claim that. If we don't want the clash, I think we can easily remove the ACPI bus stuff, in which case I think we can make this into a ssam client driver instead (also, kudo's for the docs!).

Apart from that, I have one very small comment on your code, but I only had a quick glance yet. Again, I'll try to have a look on the weekend and maybe see if I can figure out why it sometimes won't set up correctly.

:+1: no pressure, I'll try to change it to not use the ACPI bus anymore.

iwanders commented 9 months ago

If we don't want the clash, I think we can easily remove the ACPI bus stuff, in which case I think we can make this into a ssam client driver instead (also, kudo's for the docs!).

Ok, well, I didn't quite get this working yet. I feel the setup should be very similar to the surface_platform_profile module. With the setup basically reducing to:

https://github.com/linux-surface/kernel/blob/593ced7efc8c5c7cb6ab7d89ffadfe2c9102076a/drivers/platform/surface/surface_fan.c#L242-L257

But this seems to never enter the probe, current commit 593ced7efc8c5c7cb6ab7d89ffadfe2c9102076a is this ssam_device_driver approach, so not working yet, not sure if we somehow have ensure the FAN entity on the bus is a known entity, given that it doesn't talk by itself we currently may just not 'discover' the hardware? (Just a wild guess, all of this is new to me.) Last working version (that does the ACPI device capture) is at 3ab20d15992af3e250341d01c45f85442fb85420.

I have honestly no idea how Windows does that, or why they use the generic PNP0C0B match for a more or less custom device in the first place.

I did a brief survey of strings in some of the .sys files over the weekend, one of them (thermal policy?) had some strings in it that led me to the 'PEP for ACPI' page. Which seems to allow implementing ACPI methods with custom drivers.

qzed commented 9 months ago

But this seems to never enter the probe, current commit https://github.com/linux-surface/kernel/commit/593ced7efc8c5c7cb6ab7d89ffadfe2c9102076a is this ssam_device_driver approach, so not working yet, not sure if we somehow have ensure the FAN entity on the bus is a known entity, given that it doesn't talk by itself we currently may just not 'discover' the hardware? (Just a wild guess, all of this is new to me.)

Unfortunately, that's kind of a thing with the SAM devices. None of them are really auto-discoverable (at least reliably). You need to add a corresponding entry to the device registry. Specifically, you need to add a new software_node declaration and then link/include that in the SP9 node group.

I did a brief survey of strings in some of the .sys files over the weekend, one of them (thermal policy?) had some strings in it that led me to the 'PEP for ACPI' page. Which seems to allow implementing ACPI methods with custom drivers.

Urgh... that's not a good sign. Essentially, you are correct: PEPs allow overriding/adding ACPI functions with custom drivers. And there currently is no support for this on Linux. (It's a mechanism that Qualcomm uses heavily for their SoCs on Windows and one of the reasons those devices generally don't work out of the box with Linux, and need quite a lot of work).

So based on this, I assume the SAM driver is overriding/extending the ACPI spec for the generic fan somehow. So at least it could contain some useful calls. What file did you find the strings in?

iwanders commented 9 months ago

Specifically, you need to add a new software_node declaration and then link/include that in the SP9 node group.

Ah I see! Thanks for that pointer! Can't seem to be able to load that module with an out of tree build as it has some unknown symbol, so I'm building a new kernel now with modifications :roll_eyes: :crossed_fingers:

Urgh... that's not a good sign. Essentially, you are correct: PEPs allow overriding/adding ACPI functions with custom drivers

I tried to find something to look at the available ACPI methods on Windows, but couldn't find anything useful yet. I was hoping there was an 'enable/disable' method accessible through it or something, or even a 'reinit' that we could trigger with irpmon running. But so far haven't even found anything for Windows that allows me to list the acpi devices.

What file did you find the strings in?

I downloaded the SurfacePro9_Win11_22621_23.103.34762.0.msi file, then used msitools to extract it. Then it is in SurfaceUpdate/surfaceacpiplatformextension/SurfaceAcpiPlatformExtensionDriver.sys, I haven't spend any real time with a disassembler, but strings shows lots of mentions of DMF and things like AcpiPepDeviceFan.

iwanders commented 9 months ago

Specifically, you need to add a new software_node declaration and then link/include that in the SP9 node group.

It works! 7cbc1fae638aa76488d0dca7e1f2fc587be8875a did exactly what you recommended, after that the fan module correctly registers using the ssam_device. Very high just works factor :+1:

qzed commented 9 months ago

Ah I see! Thanks for that pointer! Can't seem to be able to load that module with an out of tree build as it has some unknown symbol, so I'm building a new kernel now with modifications 🙄 🤞

I assume it complained while running insmod and not during build? I'm not entirely sure why that happened in your case, but this can happen if the SAM core module hasn't been loaded. You could try unloading the built-in modules with make modprobe-unload and then run make insmod to load the out-of-tree compiled ones in the correct order.

Also: Nice work!

iwanders commented 9 months ago

I assume it complained while running insmod and not during build? I'm not entirely sure why that happened in your case, but this can happen if the SAM core module hasn't been loaded. You could try unloading the built-in modules with make modprobe-unload and then run make insmod to load the out-of-tree compiled ones in the correct order.

Yep, error occurred when running insmod on the newly built aggregator. I first unloaded the old aggregator and anything that relied on it. I use this makefile to build out of tree. So different directory, with surface_fan.c and surface_aggregator_registry.c copied to it and both added to obj-m in the Makefile. Either way, if I run into it next time I'll investigate more, easiest now was just to tell NixOS to build the entire kernel, which was a good end to end test I would've done anyway.

iwanders commented 9 months ago

Brief end of day update, finally some progress worth sharing. The past few days I've been trying in various ways to get more information from the Windows side of things.

The TL;DR of failed attempts:

Like I said in the original description, irpmon claims it can log during boot, but starting it and then rebooting didn't do it for me. I expected nothing told the driver what to trace and was considering forking it and just hardcoding what to record. I first used the procedure described here to get a boot trace to confirm irpmon comes up before the things we're interested in: This showed me that the irpmndrv starts at 2.5s into the boot, while the uart comes up at 5.5s, so hacking up irpmon seemed hopeful.

After some reading in the source code I discovered that it can read settings at boot from the registry. Then, reading the docs on the commandline do point out that flag on the --save-settings=1 to write it to the registry. So with those bits of information I finally got it working.

So, the steps to log the UART2 bus with irpmon during windows' boot are the following, use the gui (as administrator) to ensure the driver is setup to start at boot.

To enable run the following, written logs go to C:\Windows\

irpmonc.exe --input=D:\\.\irpmndrv  --hook-driver=ICD:\Driver\iaLPSS2_UART2_ADL --boot-log=1 --save-settings=1

Writing happens in chunks, current log may be zero bytes until flushed, or just disable it and reboot.

To disable, run:

irpmonc.exe --input=D:\\.\irpmndrv  --unhook-driver=\Driver\iaLPSS2_UART2_ADL --boot-log=0 --save-settings=1 

From the first boot log (50 mb .log file with above commands), I got the following for grepping on tc": 5:

``` {"ctrl": {"type": 64, "len": 0, "pad": 0, "seq": 30}}, {"ctrl": {"type": 128, "len": 12, "pad": 0, "seq": 2}, "cmd": {"type": 128, "tc": 5, "sid": 0, "tid": 1, "iid": 1, "rqid_lo": 43, "rqid_hi": 0, "cid": 8}, "payload": [0, 0, 0, 0], "time": "2023-12-08 12:16:08 AM"}, {"ctrl": {"type": 64, "len": 0, "pad": 0, "seq": 2}}, {"ctrl": {"type": 128, "len": 9, "pad": 0, "seq": 31}, "cmd": {"type": 128, "tc": 5, "sid": 1, "tid": 0, "iid": 1, "rqid_lo": 43, "rqid_hi": 0, "cid": 8}, "payload": [0], "time": "2023-12-08 12:16:08 AM"}, {"ctrl": {"type": 64, "len": 0, "pad": 0, "seq": 36}}, {"ctrl": {"type": 128, "len": 8, "pad": 0, "seq": 8}, "cmd": {"type": 128, "tc": 5, "sid": 0, "tid": 1, "iid": 1, "rqid_lo": 49, "rqid_hi": 0, "cid": 1}, "payload": [], "time": "2023-12-08 12:16:08 AM"}, {"ctrl": {"type": 64, "len": 0, "pad": 0, "seq": 8}}, {"ctrl": {"type": 128, "len": 10, "pad": 0, "seq": 37}, "cmd": {"type": 128, "tc": 5, "sid": 1, "tid": 0, "iid": 1, "rqid_lo": 49, "rqid_hi": 0, "cid": 1}, "payload": [0, 0], "time": "2023-12-08 12:16:08 AM"}, {"ctrl": {"type": 64, "len": 0, "pad": 0, "seq": 40}}, {"ctrl": {"type": 128, "len": 12, "pad": 0, "seq": 12}, "cmd": {"type": 128, "tc": 5, "sid": 0, "tid": 1, "iid": 1, "rqid_lo": 53, "rqid_hi": 0, "cid": 8}, "payload": [0, 0, 1, 0], "time": "2023-12-08 12:16:08 AM"}, {"ctrl": {"type": 64, "len": 0, "pad": 0, "seq": 13}}, {"ctrl": {"type": 128, "len": 9, "pad": 0, "seq": 42}, "cmd": {"type": 128, "tc": 5, "sid": 1, "tid": 0, "iid": 1, "rqid_lo": 53, "rqid_hi": 0, "cid": 8}, "payload": [0], "time": "2023-12-08 12:16:08 AM"}, ```

So cid with payload [0, 0, 1, 0] as a new command... unfortunately, that again doesn't seem to do anything :(

But I'm pretty sure something breaks in the irpmon_to_json.py script. I get

warning: expected TER, skipping data until next SYN
data dropped: 0xff
warning: expected SYN, skipping data until next SYN
data dropped: 0x55 0x40 0x0 0x0 0x48 0x90 0x23 0xff 0xff 0xff
warning: expected SYN, skipping data until next SYN

And there's an enormous blob of data in one command... (json is 5 mb);

cmd": {"type": 255, "tc": 255, "sid": 85, "tid": 170, "iid": 128, "rqid_lo": 70, "rqid_hi": 0, "cid": 73}, "payload": [24, 47, 128, 12, 0, 1, 0, 79, 0, 14, 255, 255, 4, 0, 1, 3, 0, 2, 2, 0, 3, 1, 0, 3, 9, 0, 1, 6, 0, 1, 255, 255, 255, 0, 0, 0, 0, 0, 0, 0, 0, 170, 85, 128, 8, 0, 39, 220, 164, 128, 2, 1, 0, 1, 80, 0, 13, 43, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 27, 248, 170, 85, 64, 0, 0, 39, 217, 190, 255, 255, 170, 85, 64, 0, 0, 73, 177, 51, 255, 255, 170, 85, 128, 8, 0, 40, 51, 85, 128, 1, 1, 0, 1, 81, 0, 64, 240, 102, 170, 85, 128, 12, 0, 74, 23, 197, 128, 2, 0, 1, 1, 80, 0, 13, 1, 0, 0, 0, 66, 159, 170, 85, 64, 0, 0, 40, 54, 79, 255, 255, 170, 85, 64, 0, 0, 74, 210, 3, 255, 255, 170, 85, 128, 17, 0, 41, 224, 152, 128, 12, 1, 0, 1, 82, 0, 12, 1, 0, 0, 0, 0, 0, 128, 0, 0, 209, 150, 170, 85, 64, 0, 0, 41, 23, 95, 255, 255, 170, 85, 128, 12, 0, 75, 54, 213, 128, 1, 0, 1, 1, 81, 0, 64, 3, 0, 0, 0, 163, 130, 170, 85, 64, 0, 0, 75, 243, 19, 255, 255, 170, 85, 128, 8, 0, 42, 113, 117, 128, 2, 1, 0, 1, 83, 0,

That payload does contain the magic aa 55 bytes multiple times. It's probably best to take that to an issue in surface-aggregator-module repo? But given that I've been doing all this in an attempt to make the fan rotate faster I thought I'd share my progress here already. I'll probably take a stab at fixing the conversion script over the weekend and then we should be able to confirm that we actually got more data on the commands being sent at startup.

qzed commented 9 months ago

But I'm pretty sure something breaks in the irpmon_to_json.py script.

That is very much possible. Maybe I should try and rewrite it from scratch at this point... If I remember correctly, irpmon sometimes also dropped some data however, which didn't make this easier.

Based on your description, I assume that it misinterprets/reads the wrong payload length of some command, causing it to
just read in way too much, including other commands. And then it probably ends up somewhere in the middle of another command, causing the warning: expected TER, skipping data until next SYN.

iwanders commented 9 months ago

Based on your description, I assume that it misinterprets/reads the wrong payload length of some command, causing it to just read in way too much, including other commands. And then it probably ends up somewhere in the middle of another command, causing the warning: expected TER, skipping data until next SYN.

Yeah, something like that, I saw more than STATUS_SUCCESS as well;

    # STATUS_NOT_SUPPORTED
    # STATUS_TIMEOUT
    # STATUS_PENDING
    # STATUS_CANCELLED
    # STATUS_SUCCESS

Those were all seen in the bootlog in the IOSB.Status constant field, so I think it's something around tracking each request by IRP address and discarding / using appropriately. The expected TER, skipping data until next SYN was something I already hacked into the script, because the current exits with an error if that TER isn't found, so at first I got nothing, then I got this. Like I said, still some work to be done, but just obtaining the log during the boot procedure felt like a win.

I should have time tomorrow to look into this in more depth.

iwanders commented 9 months ago

Ok, so this is a bit of a let down, but I tried replaying all commands that I recorded as being sent to the fan system when booting Windows.

This script, has both the recorded commands as well as the code that should replay them. Unfortunately, I don't achieve the fan control that Windows has above 30C. I wonder if the actual command to enable the override is not sent to the fan category but instead sent to another category. Or perhaps the values we have to send to CID 8 need to change each time. I'm at a loss :disappointed:.

I'll probably let this (overriding the fan control) rest for now, there's some other things I want to shift my focus to.

We can merge this as is, leave it open in the hopes someone figures it out, or remove the 'set' functionality.

qzed commented 9 months ago

I guess there could be some subsystem-interplay at work, as you suggest. I assume you played around with the platform profiles and checked whether it behaves differently for e.g. the best-performance one?

I'll probably let this (overriding the fan control) rest for now, there's some other things I want to shift my focus to.

Sure, no worries.

We can merge this as is, leave it open in the hopes someone figures it out, or remove the 'set' functionality.

I would vote for removing or guarding the set functionality for now. Leaving that in would probably confuse users as to why they can't set fan speed and would probably lead to a bunch of new issues. And seeing as reading the fan speed should work in any case (right?), I don't see why we should not have at least that part in. Maybe an alternative could be to feature-guard the set function with some (off-by-default) module parameter?

iwanders commented 9 months ago

And seeing as reading the fan speed should work in any case (right?), I don't see why we should not have at least that part in

Yes, that always works [^1]

I assume you played around with the platform profiles and checked whether it behaves differently for e.g. the best-performance one?

Yeah, I tried playing around with many things, including this.

This does make me wonder about my original statement;

When changing platform profiles in Windows, CID 14 is sent with an integer denoting the new profile (slightly different mapping). This does not appear affect fan speed. Footnote [2].

I'm going to do a more rigorous analysis of this, that should allow us to conclusively state if the fan's speed or ramp is affected by the fan profile as sent down.

[^1]: I did discover that either cid 0x33 or 0x34 to the SAM target results in the battery and fan module becoming unresponsive to requests, I think those two commands affect sleep modes in the controller.

qzed commented 9 months ago
  1. I did discover that either cid 0x33 or 0x34 to the SAM target results in the battery and fan module becoming unresponsive to requests, I think those two commands affect sleep modes in the controller.

The SAM 33/34 CIDs are D0 exit/entry commands (the request parameters we know are documented here) and generally sent when the device enters or exits some kind of sleep state. I'm not entirely sure what those do, but 15/16 are similar commands and silence SAM/EC events (at least on my SB2). So quite likely something is shut down or silenced on the EC with those as well. So I would say that that part is expected.

iwanders commented 9 months ago

I'm going to do a more rigorous analysis of this, that should allow us to conclusively state if the fan's speed or ramp is affected by the fan profile as sent down.

Ok, well, clear difference when doing a properly set-up test.

The log_sensors.py script is used to record data, it also contains capture of the irp log from WIndows when switching between the three profiles, from that we obtain:

performance profile -> fan profile
better performance, 3 -> 3
best performance, 4 -> 4
normal, 1 -> 2

Plots with this script.

Test 1, fan best

/home/ivor/.nix-profile/bin/python ./ctrl.py request 5 1 14 1 0 4

Recorded data sensors_log_platform_profile_performance_fan_best_2023_12_10__13_33.json.gz. I cancelled the stress -c 12 invocation after the fan went on full power for a bit, I wasn't too comfortable with it.

best For larger viewing, rightclick & download as, looks like Github's svg viewer fails with this matplotlib image.

Test2, fan normal

After roughly a 30 minute cooldown period.

/home/ivor/.nix-profile/bin/python ./ctrl.py request 5 1 14 1 0 2

Recorded data; platform_profile_performance_fan_normal_2023_12_10__14_30.json.gz. I cancelled the stress invocation later here, because I was more like; well, it should be able to handle this.

normal

Windows

edit; Well, I don't know how to get hard numbers on windows really. In best performance mode, using cpustress with 12 workers at 100% utilisation does not result in the fan going at max RPM in a test duration of 15 minutes. Rebooting from this hot state to linux I did not notice a change in fan speed, fan was going at about 5500 and decreasing. I expect Windows throttles the cpu to stay below the firmware's overheat handling? edit2; Ugh, I should've captured the IRP requests going to the fan to get the actual speeds. I may do that later this week.

Comparison

If we overlay the two fan profiles: fans

We see that the fan: best profile has a much higher plateau, and the first 'overheat event' where the fan spikes to max and the CPU gets drastically throttled is later in the recording.

We should probably add sending this to the fan module when we switch profiles from the platform profile module?

iwanders commented 9 months ago

Maybe an alternative could be to feature-guard the set function with some (off-by-default) module parameter?

I went with this for now, in 99e90e78d131101821e3356c22f4e7bc699027ce that is implemented, with this as a parameter is allows for easy experimentation.

In hwmon we can hide the fan1_target file easily depending on the flag, which is nice:

[root@papyrus:/sys/class/hwmon/hwmon3]# ls
device  fan1_input  fan1_max  fan1_min  name  power  subsystem  uevent

For the cooling device the best we can do is giving a permission error I think:

[root@papyrus:/sys/class/thermal/cooling_device15]# echo 1000 > cur_state 
bash: cur_state: Permission denied

I considered completely removing the registration for the thermal cooling device, but the ACPI fan is also in there and non functional, so it probably doesn't hurt. This way the fan can still be inspected through cur_state in the thermal subsystem, providing information for any consumers of those sysfs entries.

iwanders commented 8 months ago

Ok, I'm back with more analysis, I did a bunch of tests with IRPMon logging running on windows. I think Windows does not command the fan directly, it merely sets the fan profile and that makes it appear to do a better job than linux.

Basically, I did three tests:

Plotting the FAN rpms for these measurements, and the original two from linux:

fans_linux_windows svg

The fan rise curve is of identical slope in Windows when compare to Linux if the fan is in the 'performance mode / best' performance profile. From the battery test, we only depict the section on AC power here. Markers on the lines denote a message with that information from the IRPMon recording, so relatively sparse.

So, to my surprise, neither of the AC-only tests had CID 11 in it. The test that started on the battery did have that command in it, so I did a more thorough analysis of that.

Here's an annotated graph of that recording, including the section on battery power:

fans_on_battery_test_annotated svg

Based on these findings, I propose the following:

  1. I remove the enable_control parameter and the __ssam_fan_set functionality completely.
  2. I remove HWMON_F_TARGET from the hwmon system. We keep min, max and input to be able to query the fan speed and read them from sensors.
  3. Might as well remove it from the thermal subsystem thermal_cooling_device_register and all things associate to it? The hardware monitoring system seems more suited for the only thing we can do; monitor.
  4. I file a second PR that adds setting the fan profile to the surface_platform_profile module, ensuring the fan is in the same profile as the platform should help with cooling it.

That should make it clear there's only inspection functionality to all systems and users. It's unfortunate we can't offer more control, but by setting the fan profile we should at least be on par with Windows' cooling capacity.

qzed commented 8 months ago

Nice work!

  • I failed at recording the temperatures through Get-WmiObject on Windows, but I did identify two responses in the irp capture that follow what looks like the temperature, TMP category command 1, for instance id 9 and 5. I don't know fully how to scale them, but 9 appears to be close to the cpu while 5 is further away. Combining with the sensor readings from Linux may help figure out scaling / offset (or we discover it's just a duplicate of an acpi sensor).

AFAIK those values are just divided by 100 to get °C. On older devices, those are integrated into ACPI via the SAN driver (directly as _TMP methods), but newer devices don't use the SAN driver any more. Might be worth testing the thermal drivers I wrote a while ago for the SPX on the SP9 and finally upstream them... See here and here if you want to have a look at them (the first is the actual sensor driver, the second is a hub driver to instantiate all the sensor devices, because for that subsystem they're actually discoverable; btw if you figure out a way to get actual meaningful sensor names from that subsystem, that would be quite nice).

I think your plan sounds good. There's the question of how to deal with the platform profile integration though. In particular dealing with the difference between devices that have a fan and ones that do not (ideally, I don't want to rely on querying that by just seeing if the command fails). We might have to handle that via some device property and set that in the registry.

iwanders commented 8 months ago

AFAIK those values are just divided by 100 to get °C. On older devices, those are integrated into ACPI via the SAN driver (directly as _TMP methods), but newer devices don't use the SAN driver any more. Might be worth testing the thermal drivers I wrote a while ago for the SPX on the SP9 and finally upstream them... See here and here if you want to have a look at them (the first is the actual sensor driver, the second is a hub driver to instantiate all the sensor devices, because for that subsystem they're actually discoverable; btw if you figure out a way to get actual meaningful sensor names from that subsystem, that would be quite nice).

Oh nice! I may circle back to this when I get around to making a thermald config for my surface. There's quite a few items on my todo list though (currently picking away at making the on screen keyboard better).

I think your plan sounds good.

The last four commits implement this, the module is now very lean. I updated the Kconfig file to represent there's no fan control, and discovered there's a .clang-format file in the root of the repo and ran a format on the whole thing as well. I feel this is now in a much better state for a closer review, like I said before, I've never contributed anything to a C codebase, so I'm happy to incorporate any suggestions / recommendations.

In particular dealing with the difference between devices that have a fan and ones that do not (ideally, I don't want to rely on querying that by just seeing if the command fails). We might have to handle that via some device property and set that in the registry.

Yes, I like not performing the request if we don't know if we have a fan :+1: Took a bit of a cursory look, I don't have a full grasp of how the systems all interact, so I may be totally of the mark here. Initially I thought in __ssam_register_clients we assign the parent, we can grab the parent and (probably??) enumerate the children, if the platform profile finds the fan, we know we're good to set the fan profile.

But then I read some more and looked at way things are defined in the surface registry, like the platform profile, that software_node struct has a field called properties, the property_entry seems to be very well suited for what we need to convey. I'm thinking in the registry, we create two instances of the platform profile entry, one that specifies it has a property stating it has a fan, another one where we don't state that (so two flavours of ssam_node_tmp_pprof, one with a property). Then in the ssam_node_group_sp9 group, we use the platform profile with the fan property set? And I think, because of this line we can probably already access the fwnode_handle, which probably allows us to call fwnode_property_present? So I think practically all functionality is already in place to facilitate answering a question like this? I'll probably do a quick exploration of this approach over the weekend, if that goes anywhere I'll file a draft PR and we can iterate there. Edit; couldn't wait, so I did a quick test before work; 38475da36aaae4f620836a0e4ecb82a0568266ab this works, it prints 0 or 1 depending on whether ssam_node_tmp_pprof_with_fan or ssam_node_tmp_pprof is in the SP9 group. I'll flesh this out over the weekend with the proper logic to set the fan mode and file a draft PR for iteration on this front.

StollD commented 8 months ago

The last four commits implement this, the module is now very lean. I updated the Kconfig file to represent there's no fan control, and discovered there's a .clang-format file in the root of the repo and ran a format on the whole thing as well. I feel this is now in a much better state for a closer review, like I said before, I've never contributed anything to a C codebase, so I'm happy to incorporate any suggestions / recommendations.

For upstreaming, running scripts/checkpatch.pl is a good idea too.

qzed commented 8 months ago

But then I read some more and looked at way things are defined in the surface registry, like the platform profile, that software_node struct has a field called properties, the property_entry seems to be very well suited for what we need to convey. I'm thinking in the registry, we create two instances of the platform profile entry, one that specifies it has a property stating it has a fan, another one where we don't state that (so two flavours of ssam_node_tmp_pprof, one with a property). Then in the ssam_node_group_sp9 group, we use the platform profile with the fan property set? And I think, because of this line we can probably already access the fwnode_handle, which probably allows us to call fwnode_property_present?

Correct. That looks like the most reliable option. I'd suggest device_property_read_bool(), which is essentially again the exact same as the fwnode equivalent but does the dev_fwnode() call to get the fwnode from the device for you. So more or less something like device_property_read_bool(&sdev->dev, "has_fan").

And as @StollD said: Running scripts/checkpatch.pl is generally a good idea (at least once before upstream submission). This checks for known anti-patterns as well as formatting. Also there's a "strict" mode (so something like scripts/checkpatch.pl --strict <*.patch or file.c>), which nags a bit more about formatting and which some upstream people seem to prefer. So probably a good idea to use that flag right from the start.

iwanders commented 8 months ago

6f276682784a6f80efe5c6616d753a2c96e7ec7d Incorporates changes proposed by

$ git diff v6.5-surface-devel > /tmp/our_patch.diff
$ checkpatch.pl -strict /tmp/our_patch.diff 

Thanks for the tip!

iwanders commented 8 months ago

One thing I also want to mention: In the end (like right at the end after all changes are done/reviewed) it would be great if you could squash the changes into two commits: One for the registry stuff and one for the actual driver + KConfig etc. (with proper upstream-conform commit messages). That's then more or less the package that you can submit upstream.

Happy to squash and split them out when we're done iterating, until then I'll keep the commits separate for a clear changelog & review.

qzed commented 8 months ago

6f27668 Incorporates changes proposed by

$ git diff v6.5-surface-devel > /tmp/our_patch.diff
$ checkpatch.pl -strict /tmp/our_patch.diff 

For the final commit messages, you can use git format-patch -2 (or similar) and then checkpatch on the generated patches. It will then also check the commit messages (and once we're done here you can send those patches upstream).

qzed commented 8 months ago

I think for the final submission there should also be a MAINTAINERS entry for the new driver. Feel free to add my name/mail as maintainer (M) to that as well, then I'll get notified if that driver ever gets some patches in the future.

iwanders commented 8 months ago

In 213098012e9ef02082da387834945c8db0764425 I have moved the driver over to the hwmon directory. I also took a look at how to get your patch accepted into the hwmon subsystem, based on that I:

qzed commented 8 months ago

Apart from the comments above, this looks good to me.

iwanders commented 8 months ago

Apart from the comments above, this looks good to me.

Cool, I'll await @StollD 's final comments and remarks before reading up on the commit formatting rules and doing the squashing and splitting of the commits.

qzed commented 8 months ago

Not sure if you saw it (I added it a bit late), so I'll link it here: https://github.com/linux-surface/kernel/pull/144#discussion_r1428970212.

qzed commented 8 months ago

Awesome, thanks!

iwanders commented 8 months ago

c72e609b441b1000729ae14fd972559f97274e90 adds a tiny fix that swaps the file order in the MAINTAINERS file, and 19630ce2499e4186110a87c5028037c6147b7eb5 adds a license to the surface_fan.rst. That brings checkpatch to:

$ ./scripts/checkpatch.pl -strict /tmp/our_patch.diff 
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#14: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 218 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

/tmp/our_patch.diff has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Which I think is as clean as it is going to get.

iwanders commented 8 months ago

Hmm, https://github.com/linux-surface/kernel/pull/144#pullrequestreview-1785168343 is still marked as 'change requested' by github, even though all threads in it are marked as resolved. I requested a re-review from you Maximilian, maybe that fixes the 'changes requested' state.

qzed commented 8 months ago

Hadn't marked it as approved yet because of the squashing stuff. I think I have to manually mark it as approved for that to go away. From my side this looks good now.

iwanders commented 8 months ago

Oh I see, I was hoping there was a 'mark review as resolved' from your side. Ok, will wait Dorian's final review before making a copy of the branch and squashing & splitting on this one.

qzed commented 8 months ago

Btw, I've just pushed the temperature sensor drivers to https://github.com/linux-surface/surface-aggregator-module/tree/feature/thermal-sensors, if you want to try that (but it's probably best to open a separate issue for that).

iwanders commented 8 months ago

Okay, I've gone over the submitting patches guide and spent some time reading in lore how contributions look. If I understand it correctly, the actual subject of the email and email body become the git commit summary and body. This is a bit of a surprise to me, after reading it all though, your (@qzed) comment about making this into two commits finally makes more sense :D

So I did the whole squash and then split it out again. The original branch with all the individual commits is on surface_fan_module-before-squash, git diff surface_fan_module surface_fan_module-before-squash is empty. I added Closes: and Links: to the commit with links back to this issue, such that anyone ever looking at the commits can understand how we arrived at this state. I'll update the description to reflect that ultimately no fan control was achieved (edit; done, added a small recap at the top just above the original description).

I split this in two commits as you suggested, a few questions, since you have experience with this:

4d895a9b0d0eaa1a08f044c6b063131141261edb that makes the changes to the registry, I followed the subject conventions I saw you use in this contribution, this commit is simple and probably needs little explanation. This should probably go to platform-driver-x86?

8019b4f3a877cfbbd3658504a711617390f698ee is the other more meaty counterpart, that adds the hwmon driver. This probably has to go to linux-hwmon? Currently the description of this commit is very short, there are many short description additions on the mailing list. I'm not sure what best practices are here, do we need to make a case why this should be considered for inclusion? Do we need to explain how it works technically with the whole ssam_driver stuff? I'd have expected that to go into the email body, not into the actual commit that's being proposed?

Given that the two patches go to different lists, they won't get the 1/2 and 2/2 numbering right? Both changes are independently merge-able without any problems, so that's probably not a problem?

qzed commented 8 months ago

If I understand it correctly, the actual subject of the email and email body become the git commit summary and body.

Correct. Sorry, should probably have explained that a bit better. So generally each commit that you want to send upstream is converted into a patch, which is sent as an email (and then converted back / applied as a commit). Generally for multi-patch series there's also a cover-letter, which is the very first email in that chain and contains a bit more info about the whole patchset. But you also want to make sure that the commit messages contain all relevant information.

You can generate the patches and cover letter with git format-patch -2 --cover-letter (the -2 is for the number of commits, you can also use a hash instead; and there are also some other useful options of that command to look at). Then (after filling in the cover letter) you can use ./scripts/checkpatch.pl to check them and git send-email (with the appropriate options like --cc/--to if you haven't specified/added those in the patches) to send them in.

I added Closes: and Links: to the commit with links back to this issue, such that anyone ever looking at the commits can understand how we arrived at this state

I'm not sure if Closes: will be accepted upstream, but for our repo here I think it's fine. Link: you can generally submit upstream too, so you could just convert that.

4d895a9 that makes the changes to the registry, I followed the subject conventions I saw you use in this contribution, this commit is simple and probably needs little explanation. This should probably go to platform-driver-x86?

Correct. I'd actually send both patches to platform-driver-x86. I don't think that the first one needs to go to the hwmon list, but the cover letter should go to both.

Do we need to explain how it works technically with the whole ssam_driver stuff?

I don't think that's needed here. You can add details like this to the cover letter if you want.

What might be useful to know in the commit message is that the fan speed is read-only and controlled by the EC, i.e. that we don't have direct control over it and can only read its RPM (but not much more details than that).

Given that the two patches go to different lists, they won't get the 1/2 and 2/2 numbering right? Both changes are independently merge-able without any problems, so that's probably not a problem?

I'd send them in as a pair. In particular because having the driver actually do something kind of depends on the device being there. So having this driver upstream doesn't make sense if there is no device that uses it. You can (and probably should) add a note that the patches can be applied individually to the cover letter. The maintainers will then generally pick their individual patch and apply them to their trees after the series has been reviewed.

iwanders commented 8 months ago

:+1: thanks so much for this write up, I really appreciate the help here, obviously I'd also like to upstream the followup PR for the fan profiles, so it's good to understand this now.

Generally for multi-patch series there's also a cover-letter

This was exactly the missing piece for me. It's not really documented in the submitting patches page, I was very unclear of the 'how do I add a message to go with this' and this --cover-letter flag solves that, I may try to submit a patch to add that explanation to that page after I've gone through it.

I'm not sure if Closes: will be accepted upstream, but for our repo here I think it's fine. Link: you can generally submit upstream too, so you could just convert that.

Closes is from the docs, but that states it is for bug fixes. I've switched it over to Link: instead.

if you haven't specified/added those in the patches

This small sentence also explains how it's possible to send it as one patchset but have different to addresses for individual patches.

What might be useful to know in the commit message is that the fan speed is read-only and controlled by the EC, i.e. that we don't have direct control over it and can only read its RPM (but not much more details than that).

Yes, good idea to state this explicitly! I've updated the commit message.

Check patch is clean as well. I think I understand it well enough to write a cover letter and send it out over the next day or two.

iwanders commented 8 months ago

Submitted the patches upstream, I can find my email in on the archive. The pointers you provided were really helpful, thank you @qzed.

qzed commented 8 months ago

Nice! If @StollD has no further comments, I'd merge this. If you want to, you can push further changes to another PR, but I can also just pull in the final version after it's been accepted upstream (fyi: for sending in new versions, you can use the -vN option of git format-patch where N stands for the version number). I don't expect there to be any major changes, so I think me just pulling in the final version should be fine. Otherwise (or if you want me to look at something in particular) just let me know.

StollD commented 8 months ago

In your cover letter you write

Both patches can be applied independently of each other.

Can they? My understanding is that without the second patch the first one will apply but the driver won't actually get probed. Or am I wrong? I don't really understand the nuances of the SSAM subsystem.

Regarding the actual code I don't have anything to add. :)

qzed commented 8 months ago

Both patches can be applied independently of each other.

That is meant as in they will compile without errors if applied independently, but you are correct. Maybe this could need some more clarification there.

Also I think it wouldn't be that bad of a thing to also send the second patch to the pdx86 list.

iwanders commented 8 months ago

I don't expect there to be any major changes, so I think me just pulling in the final version should be fine

Apparently I misunderstood the intent of fan_min and fan_max, according to the feedback they are intended to be writable attributes to the device, which our device doesn't support. They're not intended to communicate constants like absolute lower operation limit and upper operation limit. So there will be changes, only HWMON_F_INPUT will remain. We don't lose any functionality really, besides just not being able to communicate the constants.

I followed the hwmon:submitting patches page to the letter, and that it should at least support all limits combined with no intent or explanation in the hwmon sysfs attributes about the _min/_max attributes made me believe those were just the lower and upper bound of the fan's operating envelope, not configurable limits. I'll probably propose improvements to the documentation, then work on a v2 that eliminates all but HWMON_F_INPUT, should be a matter of just removing code :)

As for merging this, I would just wait for the version accepted (if at all) by upstream. I feel communicating the constant lower bound and upper bound to user space has value, but if we can't do that upstream, we shouldn't do that in our version either?

Also I think it wouldn't be that bad of a thing to also send the second patch to the pdx86 list.

Will do with v2.

qzed commented 8 months ago

Yeah I saw that comment (and as you might have gathered, the kernel maintainers are usually quite direct, so don't get discouraged by that).

As for merging this, I would just wait for the version accepted (if at all) by upstream. I feel communicating the constant lower bound and upper bound to user space has value, but if we can't do that upstream, we shouldn't do that in our version either?

Yeah, we shouldn't diverge from upstream unless absolutely necessary. As for merging, we could merge it earlier as it might take some time until it gets applied upstream. I don't mind updating it later on.

I followed the hwmon:submitting patches page to the letter, and that it should at least support all limits combined with no intent or explanation in the hwmon sysfs attributes about the _min/_max attributes made me believe those were just the lower and upper bound of the fan's operating envelope, not configurable limits. I'll probably propose improvements to the documentation, then work on a v2 that eliminates all but HWMON_F_INPUT, should be a matter of just removing code :)

Maybe also let Guenter know that (e.g. with a small comment/reply to the thread there). If the general rule is that the limits should not be there if they can't be programmed, the doc should reflect that, and I'm sure he'd also appreciate a patch for it.

iwanders commented 8 months ago

As for merging, we could merge it earlier as it might take some time until it gets applied upstream. I don't mind updating it later on.

Yeah, that's fine then, I'll keep this thread up to date on the upstream progress.

Maybe also let Guenter know that (e.g. with a small comment/reply to the thread there). If the general rule is that the limits should not be there if they can't be programmed, the doc should reflect that, and I'm sure he'd also appreciate a patch for it.

Yup, that's my plan, submit a patch to improve the docs and then reply to the feedback that I've proposed an improvement to the docs and will incorporate the proposed change in v2. I see at the bottom of the lore entry there's a command that should allow me to send a plaintext reply, that probably just takes a text file with From, Subject and body, judging from formatting of the other files. I'll try to do this this evening.

qzed commented 8 months ago

Depending on the email program you use, you can also just reply directly via that. At least in Thunderbird, there's an option to send plaintext-only messages. And generally the git send-email command should send a copy to your inbox as well (either as CC or To).

Also, there's a --dry-run option for git send-email, in case you missed that and if you want to reply using a text file.

iwanders commented 8 months ago

Yeah, I saw it's possible to get Thunderbird to use text only, but the wording in the documentation still made me a bit wary of it. So I want to stick to using the cli such that I'm more certain what's being sent :)

I managed to reply from the commandline and submitted this improvement to the documentation. It all sounds really daunting at first, but with the tips you gave and reading documentation & manuals I did seem to get it all right :)

I'll get started on V2 now.

qzed commented 8 months ago

FWIW I'm using Thunderbird and no one has complained yet. Make sure the "Compose messages in HTML format" option in "Composition & Addressing" is unchecked and you should be good.

A side note: New versions are generally not sent as direct reply to the older versions. But I guess in some cases (especially if it's just a single patch) it might be okay. Also it's usually a good idea to wait a bit before submitting a new version and give the maintainers some time to get around and have a look at it.

Also you should generally include a sort of changelog of what changed in-between different versions (keeping and extending that across all versions that you submit). You can add that (and any further patch-specific comments) after a line with ---. So something like

Subject: [Patch v3] The Subject
To: xyz
CC: abc

[... commit text and tags ...]

---
Changes in v2:
  - some change
  - some other change

Changes in v3:
  - more changes

---
[... patch contents ...]

Basically, anything after that first --- and before the actual diff/patch contents will be ignored when applying the patch, so that part won't show up in the git log.

iwanders commented 8 months ago

A side note: New versions are generally not sent as direct reply to the older versions. But I guess in some cases (especially if it's just a single patch, it might be okay).

Yeah, I wasn't too sure on it, I saw it in some of the lists and it seemed like a reasonable thing to do to me, but if it's not common, I won't do it :)

Also it's usually a good idea to wait a bit before submitting a new version and give the maintainers some time to get around and have a look at it.

Good to know, thank you.

Also you should generally include a sort of changelog of what changed in-between different versions (keeping and extending that across all versions that you submit). You can add that (and any further patch-specific comments) after a line with ---. So something like

Oh, so you don't put that changelog in the cover letter but instead in the patch itself? :O So like;

From 9ebf1d196147e68b34255b3c70f7838f4b6f2b00 Mon Sep 17 00:00:00 2001
From: Ivor Wanders <ivor@iwanders.net>
Date: Thu, 30 Nov 2023 20:20:24 -0500
Subject: [PATCH v2 2/2] hwmon: add fan speed monitoring driver for Surface
 devices

Adds a driver that provides read only access to the fan speed for Microsoft
Surface Pro devices. The fan speed is always regulated by the EC and cannot
be influenced directly.

Signed-off-by: Ivor Wanders <ivor@iwanders.net>
Link: https://github.com/linux-surface/kernel/pull/144
---
Changes in v2:
  - Removed min and max fan attributes.
---
 Documentation/hwmon/index.rst       |   1 +
 Documentation/hwmon/surface_fan.rst |  25 +++++++
 MAINTAINERS                         |   8 +++
 drivers/hwmon/Kconfig               |  13 ++++
 drivers/hwmon/Makefile              |   1 +
 drivers/hwmon/surface_fan.c         | 105 ++++++++++++++++++++++++++++
 6 files changed, 153 insertions(+)
 create mode 100644 Documentation/hwmon/surface_fan.rst
 create mode 100644 drivers/hwmon/surface_fan.c

diff --git a/Documentation/hwmon/index.rst b/Documentation/hwmon/index.rst
index 042e1cf95..4dfb3b9bd 100644
--- a/Documentation/hwmon/index.rst
+++ b/Documentation/hwmon/index.rst
@@ -202,6 +202,7 @@ Hardware Monitoring Kernel Drivers
    smsc47m1
    sparx5-temp
    stpddc60
+   surface_fan
... more of the actual patch

And when submitting v2, do you refer back to the original cover letter in lore in the new cover letter? And describe changes both in the patch as well as the cover letter I expect then?

Edit, just force pushed to this branch with the v2 state, I should wait a bit more before sending that in to give other reviewers time to chime in and provide feedback on v1?