google-coral / edgetpu

Coral issue tracker (and legacy Edge TPU API source)
https://coral.ai
Apache License 2.0

Possible overheating while reported temperature is normal #843

Closed csgabor closed 6 months ago

csgabor commented 7 months ago

Description

I recently got an M.2 A+E key Coral TPU to use with Frigate for detection. The system is a custom-built home server in a Mini-ITX case with plenty of airflow and an ASRock J3455-ITX motherboard, with an 80mm fan added on top of the passive CPU heatsink.

After anywhere from a couple of hours to a day it stops working: the attached log lines appear in dmesg, and the reported temperature at /sys/bus/pci/devices/0000\:02\:00.0/apex/apex_0/temp drops to -87.9˚C and stays there. I've monitored the temperature in Home Assistant and it was usually between 46-50˚C; the max was 53˚C.
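For anyone wanting to catch the bogus reading automatically, here is a minimal sketch of a check on that sysfs file. It assumes the file holds an integer in millidegrees Celsius (worth confirming on your own system), and the plausibility window is an arbitrary choice of mine, not something taken from the driver. The path is the one from this report without the shell escapes; the PCI address will differ per machine.

```python
from pathlib import Path

# Path from this report (no shell escaping needed in Python); adjust the PCI address.
TEMP_PATH = Path("/sys/bus/pci/devices/0000:02:00.0/apex/apex_0/temp")

# Assumed window for a believable reading, in degrees Celsius (arbitrary choice).
PLAUSIBLE_RANGE_C = (0.0, 110.0)


def parse_temp_c(raw: str) -> float:
    """Convert the raw sysfs string (assumed millidegrees Celsius) to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def is_plausible(temp_c: float) -> bool:
    """Return True if the reading falls inside the assumed plausibility window."""
    low, high = PLAUSIBLE_RANGE_C
    return low <= temp_c <= high


def read_temp_c(path: Path = TEMP_PATH) -> float:
    """Read and convert the current temperature from sysfs."""
    return parse_temp_c(path.read_text())


# Only touch sysfs when run as a script on a machine that actually has the device.
if __name__ == "__main__" and TEMP_PATH.exists():
    temp = read_temp_c()
    status = "ok" if is_plausible(temp) else "implausible, device may have stopped responding"
    print(f"{temp:.1f} C ({status})")
```

A reading like the -87.9˚C described above falls outside the window and would be flagged, while the usual 46-50˚C readings would pass.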

[Image: Screenshot_20240425_140026]

I've tried all the proposed solutions I could find, including removing the device and rescanning PCIe, disabling ASPM, and disabling sleep on the device itself, but none had any effect.
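For reference, the remove-and-rescan step can be sketched as below using the standard Linux sysfs PCI files (`remove` under the device directory and `rescan` under `/sys/bus/pci`). The function name and the `sysfs_pci` parameter are my own, added so the logic can be exercised against a scratch directory; on a real system it needs root.

```python
from pathlib import Path

SYSFS_PCI = Path("/sys/bus/pci")


def remove_and_rescan(address: str, sysfs_pci: Path = SYSFS_PCI) -> None:
    """Detach one PCI device, then ask the kernel to re-enumerate the bus."""
    # Writing "1" to <device>/remove detaches the device from the kernel.
    (sysfs_pci / "devices" / address / "remove").write_text("1")
    # Writing "1" to /sys/bus/pci/rescan triggers a rescan of all PCI buses.
    (sysfs_pci / "rescan").write_text("1")
```

On the machine in this report the call would be `remove_and_rescan("0000:02:00.0")`, run as root. As noted above, this did not help here.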

The CPU fan usually runs at a very low RPM, so as a test I cranked it up to its max RPM to get more airflow to the TPU (the M.2 slot is close to the CPU). This resulted in reported temps between 40-42˚C with a peak of 47˚C. This seems to have solved the issue, as it's been running for more than 24 hours, which is the longest so far.

All this makes me think there's a thermal issue, but I don't understand how, when the reported temps are nowhere near high enough to throttle the frequency, let alone cause a thermal shutdown. Is it possible that the reported temps are lower than they really are?

### Issue Type

Bug

### Operating System

Linux

### Coral Device

M.2 Accelerator A+E

### Other Devices

_No response_

### Programming Language

_No response_

### Relevant Log Output

```shell
apex 0000:02:00.0: RAM did not enable within timeout (12000 ms)
apex 0000:02:00.0: Error in device open cb: -110
apex 0000:02:00.0: Apex performance not throttled due to temperature <- this keeps repeating until rebooting
```

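To spot this failure in a longer dmesg capture, a rough sketch like the one below could count the signatures from the log above. The matched strings are copied from this report; the function name and the shape of the summary are my own. On Linux, error code -110 corresponds to `ETIMEDOUT`, which fits the RAM enable timeout on the line before it.

```python
import re

# Substrings copied from the log output in this report.
RAM_TIMEOUT = "RAM did not enable within timeout"
OPEN_ERROR = re.compile(r"Error in device open cb: (-?\d+)")


def summarize_apex_log(lines):
    """Count the failure signatures from this report in an iterable of dmesg lines."""
    summary = {"ram_timeouts": 0, "open_errors": []}
    for line in lines:
        # Skip anything not logged by the apex driver.
        if "apex" not in line:
            continue
        if RAM_TIMEOUT in line:
            summary["ram_timeouts"] += 1
        match = OPEN_ERROR.search(line)
        if match:
            # Keep the numeric error code, e.g. -110.
            summary["open_errors"].append(int(match.group(1)))
    return summary
```

Feeding it the output of `dmesg` split into lines gives a quick yes/no on whether the device has hit the same state.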
csgabor commented 6 months ago

I put the TPU into an M.2 -> PCIe adapter and it's been running for 11 days now with no issues, so I guess it might have been a power issue.
