I recently got an M.2 A+E key Coral TPU to use with Frigate for detection.
The system is a custom-built home server in a Mini-ITX case with plenty of airflow and an ASRock J3455-ITX motherboard, with an 80mm fan added on top of the passive CPU heatsink.
After anywhere from a couple of hours to a day, the TPU stops working: the attached log lines appear in dmesg, and the temperature reported at `/sys/bus/pci/devices/0000\:02\:00.0/apex/apex_0/temp` drops to -87.9˚C and stays there.
I've monitored the temperature in Home Assistant and it was usually between 46-50˚C, the max was 53˚C.
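For anyone who wants to reproduce the monitoring outside Home Assistant, here is a minimal sketch of reading the same sysfs file. It assumes the file holds an integer in millidegrees Celsius, and the "plausible" range is my own illustrative choice, not something taken from the driver. The PCI address is the one from my system and will differ elsewhere.

```python
from pathlib import Path

# Path from this report; the PCI address (0000:02:00.0) is system-specific.
TEMP_PATH = Path("/sys/bus/pci/devices/0000:02:00.0/apex/apex_0/temp")

# Assumption: anything outside this range is treated as a dead sensor
# rather than a real temperature. The bounds are illustrative only.
PLAUSIBLE_RANGE_C = (0.0, 125.0)


def parse_temp_c(raw: str) -> float:
    """Convert a raw sysfs reading (assumed millidegrees Celsius) to degrees."""
    return int(raw.strip()) / 1000.0


def is_plausible(temp_c: float) -> bool:
    """Return True if the reading falls inside the assumed sane range."""
    low, high = PLAUSIBLE_RANGE_C
    return low <= temp_c <= high


def read_temp_c(path: Path = TEMP_PATH) -> float:
    """Read the temperature file once and return degrees Celsius."""
    return parse_temp_c(path.read_text())


if __name__ == "__main__":
    try:
        temp = read_temp_c()
    except (OSError, ValueError) as exc:
        print(f"could not read {TEMP_PATH}: {exc}")
    else:
        status = "ok" if is_plausible(temp) else "implausible, TPU may have hung"
        print(f"{temp:.1f} C ({status})")
```

Running this from cron or a systemd timer and logging the output would show whether the temperature climbs before the failure or simply jumps to the bogus negative value.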
I've tried all the proposed solutions I could find, including removing the device and rescanning the PCIe bus, disabling ASPM, and disabling sleep on the device itself, but none had any effect.
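For reference, the remove-and-rescan attempt can be sketched roughly as below. It relies on the standard Linux PCI sysfs files `remove` and `rescan`; the helper names and the settle delay are my own, and it has to run as root. It did not bring the TPU back on my system.

```python
import time
from pathlib import Path

PCI_ROOT = Path("/sys/bus/pci")


def remove_path(pci_address: str) -> Path:
    """Sysfs file that detaches one PCI device when '1' is written to it."""
    return PCI_ROOT / "devices" / pci_address / "remove"


def rescan_path() -> Path:
    """Sysfs file that triggers a rescan of all PCI buses."""
    return PCI_ROOT / "rescan"


def remove_and_rescan(pci_address: str, settle_seconds: float = 1.0) -> None:
    """Detach the device, wait briefly, then ask the kernel to rescan.

    Requires root. settle_seconds is an arbitrary pause, not a documented value.
    """
    remove_path(pci_address).write_text("1")
    time.sleep(settle_seconds)
    rescan_path().write_text("1")


# Example (as root), using the address from this report:
# remove_and_rescan("0000:02:00.0")
```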
The CPU fan usually runs at a very low RPM, so as a test I cranked it up to its max RPM to get more airflow to the TPU (the M.2 slot is close to the CPU). This brought the reported temps down to 40-42˚C with a peak of 47˚C. It seems to have solved the issue: the TPU has now been running for more than 24 hours, which is the longest so far.
All this makes me think there's a thermal issue, but I don't understand how, when the reported temps are nowhere near high enough to trigger frequency throttling, let alone a thermal shutdown.
Is it possible that the reported temps are lower than they really are?
### Issue Type
Bug
### Operating System
Linux
### Coral Device
M.2 Accelerator A+E
### Other Devices
_No response_
### Programming Language
_No response_
### Relevant Log Output
```shell
apex 0000:02:00.0: RAM did not enable within timeout (12000 ms)
apex 0000:02:00.0: Error in device open cb: -110
apex 0000:02:00.0: Apex performance not throttled due to temperature <- this keeps repeating until rebooting
```