Closed HeipKo closed 1 year ago
Hi HeipKo,
Apologies, the fan-control.log that the script currently generates mostly records the script's own activity, so it doesn't capture anything helpful for troubleshooting. That said, the service is a very simple control wrapper that just executes the script, so if it's failing it's either a permissions problem (unlikely on Proxmox unless the installer failed; I should probably add some checks...) or some other execution issue.
What happens if you try to execute the script directly?
cd /root/fan-control
python3 fan-control.py
It should generate some messages when run manually. If this is a fresh Proxmox install, I think I know what the issue is likely to be. It's probably missing ipmitool, which can be quickly fixed with the following:
apt install ipmitool
Thinking back, I remember having a note to make sure the install script installs it, but looking at the script it's not there, so that might be the problem. I don't recall whether ipmitool is part of the native Proxmox install or whether I had to install it separately myself. If that fixes it for you, please do let me know and I'll update the installer to avoid confusion in the future.
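A pre-flight check along these lines could catch a missing tool before the service starts. This is only a minimal sketch, not part of the installer: the `REQUIRED_TOOLS` mapping and the `missing_tools` helper are hypothetical names, and the list of commands is an assumption based on the tools named in this thread (ipmitool here, sensors and smartctl further down).

```python
import shutil

# Assumed mapping of command name -> Debian package that provides it.
# Only these three commands are mentioned in this thread; the real script
# may need others.
REQUIRED_TOOLS = {
    "ipmitool": "ipmitool",
    "sensors": "lm-sensors",
    "smartctl": "smartmontools",
}


def missing_tools(required=REQUIRED_TOOLS):
    """Return {command: package} for every command that is not on PATH."""
    return {cmd: pkg for cmd, pkg in required.items() if shutil.which(cmd) is None}


if __name__ == "__main__":
    missing = missing_tools()
    for cmd, pkg in missing.items():
        print(f"missing {cmd}: try 'apt install {pkg}'")
    if not missing:
        print("all required tools found")
```

Run it once before starting the service; anything it lists can be installed with apt as shown above.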
Hi jp-powers,
Yes, as you suspected, installing ipmitool fixed the problem; I was able to install normally via install.sh and the service started. But the fan curve of my Dell R730xd does not change, no matter what unusual values I enter in the config. What could be the reason?
I know you said the log file doesn't help, but maybe it will help anyway, because it shows a query running in a loop all the time. The log file is too big to attach, hence the screenshot.
The log does help there. Since the script is running now it's capturing some data from the Python error captures.
It looks like sensors isn't available either. Interesting, I'll need to add that to the install script as well. I thought more of this was baked into Proxmox, but it looks like I had it installed from my previous fan control scripts. Without sensors the script can't pick up CPU temps, so it gets null values for the CPU temp, which causes the math to fail. The following should do the trick.
apt install lm-sensors
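To illustrate why null readings break the math, and one way a script could guard against them, here is a small sketch. It is not the code in fan-control.py: `average_temp`, `safe_fan_speed`, and the choice to fall back to 100% when nothing can be read are all assumptions made for the example.

```python
from statistics import mean


def average_temp(readings):
    """Average only the numeric readings; return None if none are usable."""
    valid = [r for r in readings if isinstance(r, (int, float))]
    return mean(valid) if valid else None


def safe_fan_speed(readings, speed_for_temp, fallback_speed=100):
    """Use speed_for_temp(avg) when a temperature exists, else a safe fallback.

    fallback_speed=100 is an assumed fail-safe: with no temperature data,
    running the fans at full speed is the conservative choice.
    """
    avg = average_temp(readings)
    if avg is None:
        return fallback_speed
    return speed_for_temp(avg)


if __name__ == "__main__":
    # sum([None, None]) would raise TypeError; the guard avoids that.
    print(safe_fan_speed([None, None], lambda t: 20))
    print(safe_fan_speed([40, None, 50], lambda t: 20 if t < 50 else 40))
```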
Thank you. After installing lm-sensors the fans of my server are now regulated, but there is another problem. I only have SSDs in my server and the temperatures are apparently read incorrectly, since the log shows an average of over 70C and therefore panic mode is triggered.
Let's see what the temperatures are. This isn't included in the main package, but I have a separate little bash script to quickly dump the CPU and disk temps. Look for the line that starts with DISKS and create a list similar to the one you made in the config for the main script. NOTE: Python and Bash handle lists differently, so note the formatting differences:
#!/bin/bash
## cpu
sensors
## disks
#DISKS=( sda sdb sdc sdd sde sdf sdg sdh sdi sdj )
DISKS=( sda sdb sdc sdd sde sdf sdg sdh sdi sdj )
for disk in "${DISKS[@]}"
do
echo "disk $disk"
smartctl -A /dev/$disk | grep "194 Temp"
done
Copy/paste or screenshot the output of that here.
Also, as a separate point, it looks like you might not have made any changes to the gen-config.py script, so here are some thoughts on how to approach them.
Now, a couple of caveats to keep in mind: My server rack is in my office, stuck in a corner with a sound dampening panel in front and a bookcase on the rack's otherwise open side with sound dampening foam on its back side, so I can far more comfortably handle the higher noise of the higher fan speeds. I also live in Las Vegas, where even now on December 31 it's 62F outside, and in the summer it regularly breaches 100F if not 110F. My office is oddly shaped, so I have a couple of extra fans around the room just to move air around as best I can, but the ambient temperature around my server rack is generally quite warm.
What I'm getting at is that I have to run my Dell R730xd's fans faster just to keep temps in a mostly comfortable place, and I've tried to build around the server rack in a way that makes the noise more manageable. Having the pfSense and TrueNAS servers in the same rack really amps up the noise, so a difference of 5% fan speed on the Dell is hardly noticeable to me, but if you just have the Dell... it'll take some work to find a fan curve that is comfortable in terms of noise while keeping the temps in a good range.
You almost certainly don't need to target as high a "base line" as I do, so you might be able to get away with dropping the fan speeds by 5% across the board. I suggest having a base line speed of at least 15% starting at 0C to ensure the fans are always moving some air. I don't currently, but I previously ran 4 1TB Crucial MX500 SSDs, and since their operating temp range is 0 to 70C, and SATA SSDs are (generally speaking) less prone to issues from operating at slightly higher temps compared to spinning HDDs, I'm comfortable running my temps a bit higher, bumping my panic temp up to 50C, and dropping the panic addition. The biggest problem I've had is that the SSDs tend to see "micro burst" temperature jumps. They'll average a good low temperature, then when a VM hits them hard their temp shoots up during the operation and drops back down. That's why I target a higher base line temp, so there's more cushioning for those micro bursts. Without that cushioning the fans will randomly burst up to much higher speeds and then drop back down right away.
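To make the curve, panic temp, and panic addition concrete, here is a minimal sketch of how such a table could be turned into a fan speed. It is an illustration under assumptions, not the logic in fan-control.py: the step lookup (use the highest breakpoint at or below the temperature), the `>=` panic comparison, and the helper names `curve_speed` and `hdd_speed` are all hypothetical. The breakpoints are the hdd curve posted in this thread.

```python
# [temperature in C, fan speed in %], copied from the hdd curve in this thread.
HDD_CURVE = [
    [0, 20], [25, 25], [35, 25], [40, 30], [45, 35], [50, 40],
    [60, 50], [70, 70], [80, 80], [90, 100], [100, 100],
]


def curve_speed(curve, temp):
    """Step lookup: speed of the highest breakpoint whose temperature <= temp."""
    points = sorted(curve)
    speed = points[0][1]  # below the first breakpoint, use its speed
    for point_temp, point_speed in points:
        if temp >= point_temp:
            speed = point_speed
        else:
            break
    return speed


def hdd_speed(curve, avg_temp, max_temp=50, panic_addition=5):
    """Curve speed plus the panic addition once avg_temp reaches max_temp."""
    speed = curve_speed(curve, avg_temp)
    if avg_temp >= max_temp:  # assumed comparison; the real script may differ
        speed += panic_addition
    return min(speed, 100)  # never ask for more than 100%


if __name__ == "__main__":
    for temp in (30, 42, 77):
        print(temp, hdd_speed(HDD_CURVE, temp))
```

Under these assumptions a reported average in the high 70s lands on the 70C breakpoint and picks up the panic addition on top, which is why a misread temperature keeps the fans loud.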
All that said, if the temperatures really are as high as the script is reporting, that's beyond comfortable range. If you've been using anything else to keep the fans off/very low, it might be a matter of letting the script run for a while to get the temps down to range. Otherwise, I'm not sure what else to think besides environmental issues (high ambient temperature, broken/out of spec fans, the disks are reporting bad temperatures which I can't really do much about, etc).
I threw together a quick gen-config.py for you, which I'm pasting below. It includes my current fan curve/max temp adjustments and the full list of your SATA SSDs. Currently I don't have anything implemented for pulling NVMe temperatures, but it's on my to-do list. I don't use NVMe in my servers at the moment, so it'll be difficult to test and is pretty far back on the list. That said, copy/paste this into /root/fan-control, chmod 755 it, and execute it. It'll generate a new config.ini for you to try. After letting it run for a bit, try editing the fan curve section until you're comfortable with it.
#!/usr/bin/python3
##
## config_gen.py
##
## Purpose: create config file for fan-control.py to use
##
## Notes: All available options are listed so you can simply change commented lines. The only things that "require" changing values are the disks list, the CPU/HDD fan curves, and the hdd_panic values.
from configparser import ConfigParser
#Get the configparser object
config_object = ConfigParser()
# defining system info, including OS and hardware platform type
config_object["system_info"] = {
### system_os is the operating system this script is running on. Will determine how certain temperature detection is run.
"system_os": "Proxmox",
# "system_os": "TrueNAS",
# "system_os": "pfSense",
## ipmi_type is the hardware platform. Will determine how ipmitool raw commands are executed.
"ipmi_type": "iDRAC_Gen08",
# "ipmi_type": "SM_X10",
## single_zone is a boolean to define if the fan zones should be treated as linked or not. Depends on chassis/fan zone layout.
"single_zone": True,
# "single_zone": False,
## disk_list is the list of disks to monitor. da# is FreeBSD based, sdX is debian based
# "disks": ["da0", "da1", "da2", "da3", "da4", "da5", "da6", "da7", "da8", "da9", "da10", "da11", "da12", "da13"],
"disks": ["sda", "sdb", "sdc", "sdd", "sde", "sdf", "sdg", "sdh", "sdi", "sdj", "sdk", "sdl", "sdm", "sdn"],
}
### Fan Curve(s)
# left (the key) is the detected temperature, right (the value) is the fan speed percentage
config_object["fan_curve"] = {
"cpu": [
[0, 0],
[25, 20],
[35, 20],
[40, 25],
[45, 30],
[50, 40],
[60, 50],
[70, 60],
[80, 100],
[90, 100],
[100, 100],
],
"hdd": [
[0, 20],
[25, 25],
[35, 25],
[40, 30],
[45, 35],
[50, 40],
[60, 50],
[70, 70],
[80, 80],
[90, 100],
[100, 100],
],
}
# Desired maximum HDD temp, and how much fan speed percentage to add if reached
config_object["hdd_panic"] = {
"max_temp": 50,
"panic_addition": 5
}
# timers, in seconds, for how frequently to check the temperatures and adjust fan speeds if needed
config_object["detect_timers"] = {
"cpu_timer": 1,
"hdd_timer": 30
}
# Logging configuration
config_object["log_config"] = {
"file_name": "/root/fan-control/fan-control.log",
"format": '%%(asctime)s %%(levelname)s: %%(message)s',# Due to how ConfigParser works, double your %'s to escape them properly. It will look weird in the .ini but should read fine.
"date_format": '%%Y/%%m/%%d %%I:%%M:%%S %%p',# Due to how ConfigParser works, double your %'s to escape them properly. It will look weird in the .ini but should read fine.
# "frequency": "Every"
"frequency": "On_Change"
# "frequency": "On_Panic"
}
#Write the above sections to config.ini file
with open('config.ini', 'w') as conf:
    config_object.write(conf)
print("Configuration file successfully written")
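The comments about doubling the % signs, and the fact that lists and booleans are handed to ConfigParser directly, can be checked with a small round trip. This is a sketch of standard-library behavior only; how fan-control.py actually reads the values back is not shown in this thread, so the use of `ast.literal_eval` and `getboolean` here is an assumption.

```python
import ast
import io
from configparser import ConfigParser

# Write a cut-down config the same way the generator script does.
writer = ConfigParser()
writer["system_info"] = {"single_zone": True, "disks": ["sda", "sdb"]}
writer["log_config"] = {"format": "%%(asctime)s %%(levelname)s: %%(message)s"}

buffer = io.StringIO()
writer.write(buffer)  # values are stored as strings, e.g. "['sda', 'sdb']"
ini_text = buffer.getvalue()

# Read it back as a consumer might.
reader = ConfigParser()
reader.read_string(ini_text)

single_zone = reader.getboolean("system_info", "single_zone")
disks = ast.literal_eval(reader.get("system_info", "disks"))
log_format = reader.get("log_config", "format")                 # %% collapses to %
raw_format = reader.get("log_config", "format", raw=True)       # as stored in the .ini

if __name__ == "__main__":
    print(ini_text)
    print(single_zone, disks)
    print(log_format)
```

The .ini on disk keeps the doubled %%, which is the "weird" look the comments mention; the interpolated read returns a normal logging format string.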
I wish you a happy new year
Here are the values read from your bash script. The temperatures look great, the server is not yet running under load because I want to set up the fan control first. But something is still not right, because the script still recognizes a temp average that is much too high and therefore activates the panic mode again.
I live in Germany and the temperature isn't the problem here, the server is in my basement and it's nice and cool there even in summer. The problem I have is that I don't use Dell-certified hard drives and the server automatically increases the fan speed to at least 42% and that's totally unnecessary because it draws unnecessary power and is unnecessarily loud.
I also use non-certified PCIe cards in my server, but there is an option for this via IPMI Disable Third-Party PCIe Card Default Cooling Response. I did that too and it works. I just haven't found a way to do this for non-certified hard drives. That's why I want to regulate it via a fan controller.
OK, I see the problem. The Value (column after 0x0022), not the Raw Value (last column), is what the script uses to determine temp. Across the board your drives are reporting 75 to 79C. I'm not sure why your drives would be reporting such wildly different temperatures for the two values.
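For reference, the two columns can be pulled out of a `smartctl -A` attribute line as sketched below. The sample line is made up for illustration in the standard column layout (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE); it is not taken from the screenshot, and `parse_attribute_line` is a hypothetical helper rather than the function in fan-control.py. One possible explanation for the gap, not confirmed for these drives, is that some SSD firmware reports the normalized value as 100 minus the temperature.

```python
# Illustrative line only; real output varies by drive and firmware.
SAMPLE_LINE = (
    "194 Temperature_Celsius     0x0022   075   060   000    "
    "Old_age   Always       -       25 (Min/Max 20/40)"
)


def parse_attribute_line(line):
    """Return (normalized_value, raw_value) from one smartctl -A attribute line."""
    fields = line.split()
    normalized = int(fields[3])  # VALUE column, the one right after the flag
    raw = int(fields[9])         # first token of RAW_VALUE; extras are ignored
    return normalized, raw


if __name__ == "__main__":
    normalized, raw = parse_attribute_line(SAMPLE_LINE)
    print(f"normalized value: {normalized}, raw value: {raw}")
```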
I can make some changes to the script to use the raw value instead, but I'll need some time to do it. Right now I'm doing it in a fairly simple way and I might need to do something a bit more complex to get it consistently...
OK, I updated the function that discovers the temperature. In the process of correcting for this it actually ended up a bit simpler than it was previously, and I believe it should be more consistent and potentially a touch faster.
Grab an updated version and re-run the installer and it should use the new method. If you'd rather make the change yourself, you can refer to the changes shown in the GitHub diff here: https://github.com/jp-powers/fan-control/commit/5164e22ed4ad91f6f777394c7c63c7c75ae5f4c5
Let me know how that works for you. I've tested it against my Proxmox, TrueNAS, and pfSense boxes and they seem to be working well with the change.
Everything is working fine now, thank you very much.
Hello, I have the problem that service-control.service cannot start. fan-control.log