Open edrose opened 4 years ago
Hello edrose, the RK3399 does an emergency shutdown at ~85°C.
We could say that 80°C is the maximum temperature it should ever reach; it should start throttling at 80.
But throttling means that you have a very poor active thermal system. With a good active thermal system, the system never throttles.
In my case the CPU always stays between 55 and 58°C and never throttles. If for some reason a system with active cooling starts throttling, your thermal system is not good enough.
The trip points for thermal throttling are the last resort before an emergency shutdown at 85°C on the rk3399.
Active cooling is the first mechanism to cool your system. Thermal throttling only exists for the case where the active thermal solution is not good enough.
I decided on 70°C as the MAX because there are people like me whose rk3399 is always at full load, 24x365. In a situation like this you can't have a CPU operating continuously at 70°C; it will reduce its lifespan a lot (water starts boiling at 80°C).
I understand that the CPU can temporarily reach more than 70°C, even 80°C, for short periods of time. But it's dangerous to operate in the higher zone of temps: an emergency shutdown can destroy even your SD card, because it cuts power abruptly and there will be spikes that could even damage other hardware. It's not a good situation to live with.
So I was shutting down at 70, but via the operating system, where it brings the system down gracefully. Maybe 75°C would be a more common-sense value for that. The idea is to prevent the abrupt shutdown that occurs at 85°C, and the operating system shutdown still needs some seconds to complete.
The RK3399 does an emergency shutdown at ~85°C.
The RK3399 (when running Linux) has two different thermal shutdown trip points. First there is a software trip point, at which the kernel performs a clean shutdown. Then there is the hardware trip point, at which the chip kills the power.
The software thermal shutdown temperature is defined in the file /sys/class/thermal/thermal_zone0/trip_point_2_temp. On my system (Ubuntu 18.04 with ayufan kernel) this defaults to 100°C. You can edit it and set it to whatever value you like.
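To see what the kernel is currently configured with, the trip points can be read straight from sysfs. This is a minimal sketch, assuming the zone lives at /sys/class/thermal/thermal_zone0 as on my board and that values are stored in millidegrees Celsius; the function name read_trip_points is just for illustration.

```python
from pathlib import Path


def read_trip_points(zone_dir="/sys/class/thermal/thermal_zone0"):
    """Return {index: (type, temp_in_celsius)} for every trip point in a zone.

    Assumes the sysfs layout trip_point_<N>_temp / trip_point_<N>_type and
    temperatures stored in millidegrees Celsius.
    """
    zone = Path(zone_dir)
    trips = {}
    for temp_file in sorted(zone.glob("trip_point_*_temp")):
        # File name looks like "trip_point_2_temp"; the index is the third field.
        index = int(temp_file.name.split("_")[2])
        millideg = int(temp_file.read_text().strip())
        type_file = zone / f"trip_point_{index}_type"
        trip_type = type_file.read_text().strip() if type_file.exists() else "unknown"
        trips[index] = (trip_type, millideg / 1000.0)
    return trips


if __name__ == "__main__":
    for index, (trip_type, celsius) in sorted(read_trip_points().items()):
        print(f"trip point {index}: {trip_type} at {celsius:.1f} °C")
```

Reading does not need root. Other boards or kernels may number the zones and trip points differently, so check what the script prints before relying on a particular index.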
The hardware thermal shutdown is set in the device-tree using this parameter. For the mainline kernel this is defined here to be 95°C, and on the ayufan kernel it's set to 110°C here. So your absolute maximum is at a minimum 95°C, but could be 110°C depending on your kernel. You can of course edit the device tree to your preferences.
I did some testing, and I edited the software value and set it to 60°C. When the CPU reached that temperature, the system performed a clean shutdown. It is not dangerous to the SD card to let the system reach this temperature.
I also did some tests on my board without a heatsink to see how hot I could get it. I managed to get it to 80°C, at which point the CPU throttled and could not go any higher, despite the CPU governor on both clusters being performance, stress-ng running on all six cores, and there not being a heatsink attached. I couldn't even get it to 85°C to test whether it cut off.
Note that 85°C is really quite cold in CPU terms. When I run a big compile job, like compiling a complete embedded Linux image, my computer's CPU will be at 99°C for about half an hour with both active and passive cooling preventing it from getting any hotter. For a smaller chip like the RK3399, keeping it below that point is preferable, but I can even get my smartphone CPU above 70°C when running AnTuTu.
But throttling means that you have a very poor active thermal system. With a good active thermal system, the system never throttles.
Very true. However a thermal shutdown is a last resort. Active cooling should happen first, then CPU throttling, and then a thermal shutdown. When running ats, the thermal shutdown occurs before CPU throttling, whereas CPU throttling should prevent a thermal shutdown from occurring in 99% of cases.
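The ordering I'm describing can be written down as a small check. This is only a sketch: the Thresholds class, the action_for function and the example numbers are made up for illustration and are not taken from ats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical thresholds in °C, in the order they should escalate."""
    fan_on_c: float
    throttle_c: float
    shutdown_c: float

    def validate(self):
        # Active cooling first, then throttling, then shutdown as a last resort.
        if not (self.fan_on_c < self.throttle_c < self.shutdown_c):
            raise ValueError(
                "expected fan_on_c < throttle_c < shutdown_c, got "
                f"{self.fan_on_c}, {self.throttle_c}, {self.shutdown_c}"
            )


def action_for(temp_c, thresholds):
    """Return the strongest measure that applies at temp_c."""
    thresholds.validate()
    if temp_c >= thresholds.shutdown_c:
        return "shutdown"
    if temp_c >= thresholds.throttle_c:
        return "throttle"
    if temp_c >= thresholds.fan_on_c:
        return "fan"
    return "idle"
```

With a shutdown at 70 and throttling at 80, validate() raises, which is the situation above: the shutdown fires before throttling ever gets a chance.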
Because there are people like me whose rk3399 is always at full load, 24x365. In a situation like this you can't have a CPU operating continuously at 70°C; it will reduce its lifespan a lot (water starts boiling at 80°C).
I'm also going to be running mine 24/7, and I wouldn't want mine running continuously at 70°C either. However if it spikes to 70°C for a short period of time then that wouldn't be an issue - the chip can take temperature spikes of up to 125°C! (from the datasheet here, page 62) It would be more of an issue if my server went offline because it reached that temperature and I'm on the other side of the country unable to restart it.
Water boils at 100°C.
Maybe 75°C would be a more common-sense value for that. The idea is to prevent the abrupt shutdown that occurs at 85°C, and the operating system shutdown still needs some seconds to complete.
When the OS shuts down (at least when calling shutdown -h now), one of two things will happen:
It will immediately kill the process that is heating up the CPU, meaning that the CPU will immediately cool down.
The process will prevent the shutdown (i.e. the "A stop job is running for ..." message) and it won't cool down until it either stops or the systemd stop job timer kills it, which could be a few minutes.
So the time it takes for a shutdown to occur is largely irrelevant. It'll either fix the problem immediately, or it'll fix it after the service stops, which could be a few minutes. To perform an 'emergency' shutdown you should use systemctl poweroff --force, which will not allow services to interrupt the shutdown.
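If a user-space tool does have to trigger the shutdown, the forced variant could be wrapped roughly like this. A minimal sketch: the function names and the dry_run parameter are invented for illustration, and the only real command involved is systemctl poweroff --force.

```python
import subprocess


def poweroff_command(force=True):
    """Build the argv for a systemd power-off, optionally forced."""
    cmd = ["systemctl", "poweroff"]
    if force:
        # --force skips waiting on services, so a hung stop job cannot delay it.
        cmd.append("--force")
    return cmd


def emergency_poweroff(dry_run=True):
    """Run the forced power-off; with dry_run=True only return the command."""
    cmd = poweroff_command(force=True)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd
```

The dry_run default is there so the function can be exercised without actually powering the board off; a real caller would pass dry_run=False and needs sufficient privileges.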
You're also constrained by your polling interval. If ats is only going to check every 2 minutes when cool, you could reach >70°C before ats even notices. This is why performing this action from user-space is a bad idea. This is something the kernel should handle.
So my recommendation for handling this, if you really want your board to never go above 70°C, would be to write new values into the kernel trip points when ats starts. Not only trip_point_2, but also trip_point_0 and trip_point_1. Trip points 0 & 1 control throttling. When the CPU reaches trip point 0 it's basically going to immediately start cooling down - at that throttled clock speed and at room temperature (even without a heatsink) the chip won't heat up any more. At trip point 1 it'll really start cooling down, and thermal shutdown occurs at trip point 2.
So to prevent it going above 70°C set trip_point_0_temp to 70000. Your CPU won't heat up above that point unless its ambient temperature is very high. Set trip_point_1_temp to 75000 and trip_point_2_temp to 80000 for emergencies and to give some leeway against whatever maximum you want to use. After that there is no need to check and perform a shutdown from user-space - the kernel will handle it far quicker than ats will. Please make any behaviour like this configurable.
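Writing those three values at startup could look something like the sketch below. It assumes the trip point files are writable on your kernel (that is a build-time kernel option, so it may not be the case everywhere) and that the process runs as root; the names set_trip_points and TRIP_POINTS_MILLIDEG are hypothetical.

```python
from pathlib import Path

# Values from the suggestion above, in millidegrees Celsius.
TRIP_POINTS_MILLIDEG = {0: 70000, 1: 75000, 2: 80000}


def set_trip_points(trips, zone_dir="/sys/class/thermal/thermal_zone0"):
    """Write each trip point temperature into trip_point_<N>_temp.

    Refuses to write anything unless the temperatures strictly increase
    with the trip point index, so throttling always comes before shutdown.
    """
    indices = sorted(trips)
    values = [trips[i] for i in indices]
    if any(a >= b for a, b in zip(values, values[1:])):
        raise ValueError("trip point temperatures must increase with index")
    zone = Path(zone_dir)
    for index in indices:
        (zone / f"trip_point_{index}_temp").write_text(f"{trips[index]}\n")


if __name__ == "__main__":
    set_trip_points(TRIP_POINTS_MILLIDEG)
```

The values do not survive a reboot, which is why I'd write them each time ats starts rather than once.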
Hello edrose,
125°C is the "storage temperature", and it's a maximum value. That applies to the SoC when it is not operating (just the SoC, not connected to anything). It is also the maximum junction temperature allowed for the SoC; at that value the SoC starts to break apart by itself, at the transistor junction connections.
Storage temperatures are [-20°C, 120 (or 125)°C]. The operating temperatures are [0, 80°C]. Emergency shutdown: 85°C.
3.1 Absolute Maximum Ratings
Storage Temperature (Tstg): 125 ℃
Max Conjunction Temperature (Tj): 125 ℃
"Absolute maximum ratings specify the values beyond which the device may be damaged permanently. Long-term exposure to absolute maximum ratings conditions may affect device reliability."
3.2 Recommended Operating Conditions
Take a look at page 64 of the datasheet you mention, at the beginning of it ;)
Ambient Operating Temperature 1) (Ta): min 0, typ 25, max 80 ℃
"Notes: 1) Symbol name is same as the pin name in the IO descriptions 2) with the reference software setup, the reference software will limit the chipset temperature about 80℃"
Which means operating temperatures of [0, 80]°C.
But I decided to "break the rules" and set the operating temperature range to [-20, 80]°C, which by itself is already dangerous, because -20°C is the lower limit for junction temperature disruption (the same way as +125°C). To do that, you will notice that below 0°C ATS starts to consume more CPU, and below -10°C it accelerates CPU consumption, with the purpose of heating up to safe values. That's why I included -20°C as the minimum instead of the recommended 0°C.
Well, I agree that it could take a lot of time to shut down. That is the initial motive for the max at 70°C (so it has some room in time before it gets to 80-85°C). I agree that 'systemctl poweroff --force' is maybe the best option here.
I was reluctant to go outside the vendor operating temperatures. I don't feel safe about that. I don't even feel good about setting the minimum limit to -20°C (because this value has the same effect on the rk3399 as +125°C), but since I am at the same time heating up the CPU to get it to safer values, I accepted -20°C as the absolute minimum.
If I make the maximum ratings available in '/etc/ats.conf', people will start doing nasty things and some boards will fry, and later these people will come to blame me, don't you think?
In general I agree with you that they should be set by the owner of the board in '/etc/ats.conf' at their own responsibility. But I fear that people will set out-of-limit values and then blame me :(
I think that if they change the values, it's their responsibility. But then we know that a lot of not-so-techy people use these boards too, and they could have strange ideas. By definition the limits should be in '/etc/ats.conf'; the idea behind these limits is that on some boards people change trip points at will and go outside the limits, which could be dangerous, and they could come to blame me because "they thought it was safe".
Which means operating temperatures of [0, 80]°C.
It's all in the terminology.
The Junction temperature is the temperature of the chip itself whilst operating. The Storage temperature is the maximum temperature that the chip can be stored at when not operating. The Ambient temperature is the temperature of the air around the chip. Notice how they are three distinct values in the datasheet.
There are only absolute maximum values defined for the junction and storage temperatures, which is 125°C. The ambient temperature specification is a recommendation. Intel defines the ambient temperature as the temperature of the air entering the thermal cooling solution, and they only rate their processors up to around 50°C ambient even though they can run at 100°C quite happily. The rk3399 is rated higher, presumably since it's designed as a smartphone chip where it could be running inside a small case with a higher ambient temperature. Either way, it is not the operating temperature of the chip - it's the temperature of the air immediately around the chip. The absolute maximum operating temperature of the chip is 125°C.
"2) with the reference software setup, the reference software will limit the chipset temperature about 80℃"
I can't actually find the section in the datasheet where reference 2) is - it seems to have been missed out. However, let's assume it refers to the ambient temperature part.
It's unclear what the "reference software" is in this case, but it's almost certainly the Rockchip Linux branch. They're explaining that the chip may lose performance on their reference software if it reaches this temperature, not that any software running has to implement this.
It also doesn't mean that an emergency shutdown should occur at 80℃. The hardware shutdown temperature is 120°C on the rockchip version of Linux. It means that throttling should occur to keep the chip within this range. This is exactly what happens when running mainline/ayufan Linux - the chip would not go over 80℃ when I tried, even without a heatsink.
Nothing else needs to be done to meet this constraint. I expect the rockchip version of Linux was merged into mainline Linux at some point, which is why they're the same.
To do that, you will notice that below 0°C ATS starts to consume more CPU, and below -10°C it accelerates CPU consumption, with the purpose of heating up to safe values. That's why I included -20°C as the minimum instead of the recommended 0°C.
As I mentioned above, the 0°C value is for ambient temperature, not junction temperature. I would be very careful about implementing something that increases the junction temperature at low ambient temperatures. ats may be advanced but it can't increase the temperature of the room ;)
In all seriousness though, the components that are likely to be suffering from the cold are the electrolytic capacitors, which you won't be heating up by increasing CPU usage. You've also got to consider thermal expansion. If you're on a cold PCB and you heat up just the chip, then thermal expansion has the potential to crack the solder joints under the chip, especially if it cycles on and off, allowing it to heat and cool repeatedly. I'd be very careful about doing stuff like that in ats.
If I make the maximum ratings available in '/etc/ats.conf', people will start doing nasty things and some boards will fry, and later these people will come to blame me, don't you think?
If you're worried about it from a legal point of view, you should include an open-source licence such as the GPLv3 (it's always a good idea anyway). Clause 16 of the GPLv3 would absolve you of any liability. Other licences have similar clauses.
If you're worried from a more social point of view, then I'd recommend adding a config option to allow overriding the absolute maximum. Something really obvious like I_ACCEPT_RESPONSIBILITY_LET_ME_OVERRIDE. Then there is no excuse for anyone to blame you.
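Something along these lines is what I have in mind. It's only a sketch: I'm treating the parsed config as a plain dict, the 80°C default ceiling is an assumed number, and effective_max_temp is a made-up name; how '/etc/ats.conf' is actually parsed is up to you.

```python
# Assumed built-in ceiling in °C; the real default is the maintainer's call.
DEFAULT_HARD_MAX_C = 80.0
OVERRIDE_FLAG = "I_ACCEPT_RESPONSIBILITY_LET_ME_OVERRIDE"


def effective_max_temp(config):
    """Return the maximum temperature to enforce, in °C.

    A requested value above the built-in ceiling is only honoured when the
    override flag is explicitly set to True; otherwise it is clamped.
    """
    requested = float(config.get("ABSOLUTE_MAX_THERMAL_TEMP", DEFAULT_HARD_MAX_C))
    if requested <= DEFAULT_HARD_MAX_C:
        return requested
    if config.get(OVERRIDE_FLAG) is True:
        return requested
    return DEFAULT_HARD_MAX_C
```

Anyone who ends up above the ceiling has had to type that flag name themselves, so the default stays safe for people who never touch the file.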
Hello edrose, you have some valid points here :)
Thanks for sharing your thoughts. ats definitely will not be able to heat the room, and I agree about expansion and contraction, even more so because of the BGA soldering under the SoC :(
Thanks again for that. I looked at the upper limit and forgot about the lower one. I will consider including hard limits in '/etc/ats.conf', and also a flag, like you suggest, about responsibility for changes. :+1:
Thermal shutdown of a processor is not usually the job of something running in user-space. This is handled either in hardware or by the kernel.
This seems to be defined in the device-tree here, so the chip should shut down at 110 degrees centigrade. The absolute maximum temperature of the chip is 120 degrees, and the maximum ambient operating temperature is 80 degrees.
Thermal throttling on the Rockpro64 also occurs at 80 degrees. Switching off at 70 degrees seems very premature, especially since it isn't even given a chance to throttle before that happens! I don't want my home server to shut itself down whilst I'm away, since I wouldn't be able to restart it without pressing the physical button.
I think the thermal shutdown feature in ats should be configurable to allow it to be disabled. I also think that the ABSOLUTE_MAX_THERMAL_TEMP should be configurable, so I can set my MAX_CONTINUOUS_THERMAL_TEMP to be higher than 69.