robmarkcole opened this issue 5 years ago:
OK, I notice this has happened again and the white LED on the stick is blinking. According to the docs:
Edge TPU running | Pulse (breathe)
In the logs:
2019-08-15 04:57:14,046 INFO werkzeug MainThread : * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
2019-08-15 04:57:16,412 INFO werkzeug Thread-1 : 192.168.1.133 - - [15/Aug/2019 04:57:16] "POST /v1/vision/detection HTTP/1.1" 200 -
2019-08-15 05:06:39,330 INFO werkzeug Thread-2 : 127.0.0.1 - - [15/Aug/2019 05:06:39] "POST /v1/vision/detection HTTP/1.1" 200 -
2019-08-15 05:08:02,383 INFO werkzeug Thread-3 : 192.168.1.133 - - [15/Aug/2019 05:08:02] "POST /v1/vision/detection HTTP/1.1" 200 -
2019-08-15 05:17:25,939 INFO werkzeug Thread-4 : 192.168.1.164 - - [15/Aug/2019 05:17:25] "POST /v1/vision/detection HTTP/1.1" 200 -
2019-08-15 05:17:31,909 INFO werkzeug Thread-5 : 192.168.1.164 - - [15/Aug/2019 05:17:31] "POST /v1/vision/detection HTTP/1.1" 200 -
2019-08-15 05:17:32,442 INFO werkzeug Thread-6 : 192.168.1.164 - - [15/Aug/2019 05:17:32] "POST /v1/vision/detection HTTP/1.1" 200 -
2019-08-15 07:00:31,265 INFO werkzeug Thread-7 : 192.168.1.164 - - [15/Aug/2019 07:00:31] "POST /v1/vision/detection HTTP/1.1" 200 -
2019-08-15 07:32:01,268 INFO werkzeug Thread-8 : 192.168.1.164 - - [15/Aug/2019 07:32:01] "POST /v1/vision/detection HTTP/1.1" 200 -
2019-08-15 07:33:57,254 INFO werkzeug Thread-9 : 192.168.1.164 - - [15/Aug/2019 07:33:57] "POST /v1/vision/detection HTTP/1.1" 200 -
2019-08-15 07:35:06,279 INFO werkzeug Thread-10 : 192.168.1.164 - - [15/Aug/2019 07:35:06] "POST /v1/vision/detection HTTP/1.1" 200 -
2019-08-15 07:36:01,290 INFO werkzeug Thread-11 : 192.168.1.164 - - [15/Aug/2019 07:36:01] "POST /v1/vision/detection HTTP/1.1" 200 -
2019-08-15 07:39:55,270 INFO werkzeug Thread-12 : 192.168.1.164 - - [15/Aug/2019 07:39:55] "POST /v1/vision/detection HTTP/1.1" 200 -
2019-08-15 10:03:06,260 INFO werkzeug Thread-13 : 192.168.1.164 - - [15/Aug/2019 10:03:06] "POST /v1/vision/detection HTTP/1.1" 200 -
I then hammered it with 240 requests without issue.
Perhaps use a timeout on the request.
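For illustration only (not code from this project): a client-side timeout with the requests library could look something like the sketch below. The host, image path, and the multipart field name "image" are assumptions; adjust them to match the actual server.

```python
# Hypothetical client sketch: POST an image to the detection endpoint with a
# hard timeout so a hung TPU doesn't block the caller indefinitely.
import requests

URL = "http://localhost:5000/v1/vision/detection"  # adjust host/port as needed
IMAGE_PATH = "test.jpg"                            # any local test image

with open(IMAGE_PATH, "rb") as f:
    image_bytes = f.read()

try:
    # The multipart field name "image" is an assumption; match the server's API.
    response = requests.post(
        URL,
        files={"image": image_bytes},
        timeout=10,  # seconds; raises requests.exceptions.Timeout if exceeded
    )
    print(response.status_code, response.text)
except requests.exceptions.Timeout:
    print("Detection request timed out - the TPU/server may be hung")
```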
I am speaking with Manoj (manojkum@google.com) at Google about this issue, but there is no resolution yet. These are the actions he suggested, with my answers in bold:
Update from Manoj, 21/8: "We have filed a bug with our development team regarding this issue and hope to get a response soon."
Just a bit more info from my experience. When my Coral USB TPU gets into this state, I don't need to unplug/replug it - I can just restart the container (and in effect the Flask process) that has the device open. The replacement process can then pick up and continue processing as usual.
In a way I'm glad this isn't a problem unique to me - it's very likely not a hardware fault, or a specific artifact of running in a Docker container.
I guess one workaround is to periodically restart the app, but hopefully Google can find and fix the bug so this isn't required.
Latest advice is to try a 5 V @ 3 A power supply.
Hmm... I'm using the Coral TPU on an Intel i5 NUC "clone" in a USB 3 port, not a Raspberry Pi. I'll check the power supply voltage on one of the USB ports to see if it looks out of spec.
As a workaround, I've built a healthcheck into the Dockerfile to detect when the container running the Flask server stops responding. I had expected Docker to automagically restart my container for me, but that's not the case, whether I start it directly with docker run ... or bring it up with docker-compose. Then I found https://hub.docker.com/r/willfarrell/autoheal/ which looks for unhealthy containers and restarts them. I just implemented that; hopefully it will work around the problem in the near term. A rough sketch of the setup is below.
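Not the actual files from this setup - the original healthcheck lives in the Dockerfile - but a compose-level sketch of the same idea could look roughly like this. The probe command, port, and USB passthrough are assumptions; the probe assumes curl exists in the image and the server answers a plain GET on port 5000.

```yaml
# docker-compose.yml sketch: the healthcheck marks the Flask container
# unhealthy when it stops responding, and willfarrell/autoheal restarts
# containers labelled autoheal=true once Docker flags them unhealthy.
version: "3"
services:
  coral:
    build: .
    devices:
      - /dev/bus/usb:/dev/bus/usb   # pass the USB TPU through (assumption)
    healthcheck:
      # Adjust the probe to an endpoint the server actually serves.
      test: ["CMD-SHELL", "curl -fs http://localhost:5000/ || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
    labels:
      - autoheal=true
  autoheal:
    image: willfarrell/autoheal
    restart: always
    environment:
      - AUTOHEAL_CONTAINER_LABEL=autoheal   # watch containers with this label
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
```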
I wonder if the problem is some sort of concurrency issue with multiple requests landing at the same time? I have 4 cameras and I'm grabbing frames from them every 10 to 30 seconds or so. I've not investigated the Coral API and associated Python libraries to see whether they're thread-safe in that regard. Of course, it could be that Flask is processing requests serially, in which case this can't happen.
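The werkzeug log lines above do show a separate thread per request (Thread-1, Thread-2, ...), so concurrent inferences are at least possible. This is not the actual server code, just a minimal sketch of how access to the TPU could be serialized with a lock to rule concurrency in or out; the handler body and field name are placeholders.

```python
# Hypothetical sketch: serialize Edge TPU access behind a module-level lock so
# only one request at a time performs an inference, even with threaded Flask.
import threading
from flask import Flask, jsonify, request

app = Flask(__name__)
tpu_lock = threading.Lock()

def run_inference(image_bytes):
    # Placeholder: whatever model/engine the real server uses goes here.
    return {"success": True, "predictions": []}

@app.route("/v1/vision/detection", methods=["POST"])
def detect():
    image_bytes = request.files["image"].read()   # field name is an assumption
    with tpu_lock:                                # one inference at a time
        results = run_inference(image_bytes)
    return jsonify(results)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000, threaded=True)
```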
I ran a script that hammered the server and it didn't fall over. It appears to fail when running for > 12 hours, regardless of load - in my case I am only doing about an image an hour. I guess we could remove Flask from the equation by having a script that periodically performs an inference and seeing when/if that fails; a rough sketch of such a script is below.
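A soak test along these lines could take Flask out of the loop entirely. This is a sketch, not project code: it assumes tflite_runtime and libedgetpu are installed, and the model path and interval are placeholders.

```python
# Standalone soak test (sketch): run an inference on the Edge TPU every few
# minutes and log each attempt, so a hang shows up as a gap in the output
# even with no web server involved.
import time
import numpy as np
from tflite_runtime.interpreter import Interpreter, load_delegate

MODEL_PATH = "model_edgetpu.tflite"   # any Edge TPU-compiled model (placeholder)
INTERVAL_S = 300                      # 5 minutes between inferences

interpreter = Interpreter(
    model_path=MODEL_PATH,
    experimental_delegates=[load_delegate("libedgetpu.so.1")],
)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()

while True:
    # Feed random data shaped/typed to match the model's input tensor; we only
    # care whether invoke() returns, not what it detects.
    shape = input_details[0]["shape"]
    dtype = input_details[0]["dtype"]
    dummy = np.random.randint(0, 255, size=shape).astype(dtype)
    interpreter.set_tensor(input_details[0]["index"], dummy)

    start = time.time()
    interpreter.invoke()
    print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} inference ok "
          f"({(time.time() - start) * 1000:.1f} ms)", flush=True)
    time.sleep(INTERVAL_S)
```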
I purchased a 3 A supply and this appears to resolve the issue - 3 days and no timeout. I will close this issue if it goes a week.
@lmamakos are you still experiencing this issue?
I believe so, though I haven't checked recently. I built a health check for the container and it gets restarted when it hangs... so out of sight, out of mind. I will look at the log when I return from my business travel later today and see what it's been up to. I'll get a view of the current state of things, then update the Home Assistant component, move to the latest Home Assistant release, and watch it going forward as well.
As a reminder, I have the Coral USB stick plugged into an Intel i5 NUC-like system, not a Raspberry Pi. I'll put a voltmeter on the USB port to check, but it seems unlikely that power is a problem in my case. There's only an SSD in this system (no spinning rust, and no spinning fans either), so I'd expect plenty of headroom in the power supply. The Coral USB stick is plugged in using their provided USB cable.
Just took a look at the auto-restart log:
19-09-2019 04:45:48 Container /coral (9ec449c1390b) found to be unhealthy - Restarting container now with 10s timeout
19-09-2019 05:01:26 Container /coral (9ec449c1390b) found to be unhealthy - Restarting container now with 10s timeout
19-09-2019 07:54:54 Container /coral (9ec449c1390b) found to be unhealthy - Restarting container now with 10s timeout
22-09-2019 21:35:22 Container /coral (9ec449c1390b) found to be unhealthy - Restarting container now with 10s timeout
24-09-2019 10:14:44 Container /coral (9ec449c1390b) found to be unhealthy - Restarting container now with 10s timeout
25-09-2019 01:33:51 Container /coral (9ec449c1390b) found to be unhealthy - Restarting container now with 10s timeout
25-09-2019 18:16:20 Container /coral (9ec449c1390b) found to be unhealthy - Restarting container now with 10s timeout
30-09-2019 02:53:56 Container /coral (9ec449c1390b) found to be unhealthy - Restarting container now with 10s timeout
30-09-2019 03:14:36 Container /coral (9ec449c1390b) found to be unhealthy - Restarting container now with 10s timeout
01-10-2019 01:10:37 Container /coral (9ec449c1390b) found to be unhealthy - Restarting container now with 10s timeout
01-10-2019 01:41:21 Container /coral (9ec449c1390b) found to be unhealthy - Restarting container now with 10s timeout
01-10-2019 20:51:38 Container /coral (9ec449c1390b) found to be unhealthy - Restarting container now with 10s timeout
02-10-2019 11:13:26 Container /coral (9ec449c1390b) found to be unhealthy - Restarting container now with 10s timeout
02-10-2019 17:24:48 Container /coral (9ec449c1390b) found to be unhealthy - Restarting container now with 10s timeout
02-10-2019 18:15:29 Container /coral (9ec449c1390b) found to be unhealthy - Restarting container now with 10s timeout
03-10-2019 04:48:10 Container /coral (9ec449c1390b) found to be unhealthy - Restarting container now with 10s timeout
03-10-2019 07:35:35 Container /coral (9ec449c1390b) found to be unhealthy - Restarting container now with 10s timeout
04-10-2019 13:53:03 Container /coral (9ec449c1390b) found to be unhealthy - Restarting container now with 10s timeout
So a failure/hang about daily, more or less.
I connected my 20,000-count Fluke voltmeter to the USB interface on my NUC (via a chopped-off USB cable), and over a few hours of activity measured a low of 5.107 V and a high of 5.132 V, with it normally hanging out around 5.113 V or so. Curiously, as the system is loaded, the USB voltage actually bumps up a hundredth of a volt or two, probably because the switching regulator is having to push harder on the CPU core voltage or something.
I ran a ZFS "scrub" operation, which tends to bang on the SATA and PCIe I/O (for the SATA and M.2 storage that I have) as well as at least one of the CPU cores, since it does checksum verification of all the allocated disk blocks. I wanted to load the system a little bit.
During the course of a few hours, the Coral stick didn't hang, so that's a little inconclusive. There seems to be plenty of margin in the power supply, but I've not captured the voltages when it enters this hung state. I'll have to try this again; I'm a little uncomfortable leaving the test cable plugged in unattended, what with the bare wires and curious cats. My Fluke meter samples about 3 times per second, which also might not be frequent enough to capture a brief voltage sag. I don't have any sort of real data logger in my bag of tricks... even triggering my oscilloscope on a low voltage isn't really useful if there's no timestamp I can compare with the observed hangs.
Or it might just be haunted? In the US, Halloween is coming up later this month; perhaps the additional ghosts and spirits will mix things up a bit.
Well, I assume Google is aware of this issue and its root cause, since they suggested the power supply fix, so hopefully they will fix it in due course. In the meantime your restart procedure is working well enough; perhaps you could add it to the readme?
I just ran into this issue with a Raspberry Pi 4. I'm hitting it with both a 2 A power supply and the CanaKit 3.5 A power supply. It works fine on my desktop, though.
I'm thinking about trying an externally powered USB hub to see if that helps.
Are other people still experiencing this issue?
It hangs pretty fast for me (< 10 minutes), and restarting the process temporarily fixes it.
Using
sudo systemctl start coral.service
the app appears to die after 12 hours - no errors in the logs, it just stops responding to requests.
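Since the process hangs rather than exits, Restart= on its own never fires. One blunt workaround (a sketch, not part of this project, assuming systemd >= 229 and a unit named coral.service) is a drop-in that caps the service's runtime so systemd kills and restarts it before the ~12-hour mark:

```ini
# /etc/systemd/system/coral.service.d/restart.conf  (hypothetical drop-in)
# Cap the runtime so the service is forcibly restarted before the observed
# hang window; Restart=always also covers the timeout-triggered termination.
[Service]
Restart=always
RuntimeMaxSec=6h
```

Apply it with sudo systemctl daemon-reload && sudo systemctl restart coral.service. Crude, but it keeps the detector alive until the underlying bug is fixed.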