Closed DD4WH closed 2 years ago
A serious issue is that after running for one hour or a bit longer, BirdNET-Pi seems to hang and no files are being processed any more. The log shows this:
Traceback (most recent call last): File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner self.run() File "/usr/lib/python3.9/threading.py", line 892, in run self._target(*self._args, **self._kwargs) File "/usr/local/bin/server.py", line 315, in handle_client writeResultsToFile(detections, min_conf, args.o) File "/usr/local/bin/server.py", line 204, in writeResultsToFile with open(path, 'w') as rfile: FileNotFoundError: [Errno 2] No such file or directory: '/home/pi/BirdSongs/February-2022/08-Tuesday/2022-02-08-birdnet-23:57:08.wav.csv
I experienced this three times yesterday and this night. So it seems it is a reproducable issue.
The error in log excerpt above is indicating that there is something wrong with the birdnet_recording.service
. The log excerpt ends the traceback with the FileNotFoundError
, which means the birdnet_recording.service
stopped for some reason. The next log excerpt supports this conclusion, as the "Restart ALL Services" script restarts the birdnet_recording.service
and the FileNotFound
error disappears.
After rebooting, log and analysis still seem to be stuck (the date and time of the file still shows a very old file):
with open(path, 'w') as rfile: FileNotFoundError: [Errno 2] No such file or directory: '/home/pi/BirdSongs/February-2022/08-Tuesday/2022-02-08-birdnet-23:57:08.wav.csv birdnet_server.service: Main process exited, code=killed, status=9/KILL birdnet_server.service: Failed with result 'signal'. Stopped BirdNET Analysis Server. birdnet_server.service: Consumed 18min 48.826s CPU time. Started BirdNET Analysis Server. INFO: Created TensorFlow Lite XNNPACK delegate for CPU. INFO: Created TensorFlow Lite delegate for select TF ops. INFO: TfLiteFlexDelegate delegate: 1 nodes delegated out of 182 nodes with 1 partitions.
If you checked these logs right after the reboot, then the portion at the top of the log excerpt above is reporting on the last thing the service did as the system was shutting down. Basically, after any reboot, if you check logs very soon thereafter (for any service on the system), you'll see what it was reporting as it was shutting down for the reboot. I can't say for certain that that is what is indicated in the log excerpt above, but it would explain the old timestamp and the successful service behavior that followed.
After RESTART ALL SERVICES it seems it has resumed analysis and shows the usual output, which confirms it really takes the real time recordings and processes without lag:
('Perdix perdix_Rebhuhn', 0.049771078) ('Perdix perdix_Rebhuhn', 0.0641239) ('Perdix perdix_Rebhuhn', 0.04529362) ('Perdix perdix_Rebhuhn', 0.051011525) [NEW CONNECTION] ('127.0.0.1', 33360) connected. [ACTIVE CONNECTIONS] 1 READING AUDIO DATA... DONE! READ 12 CHUNKS. FULL FILENAME: -/home/pi/BirdSongs/February-2022/09-Wednesday/2022-02-09-birdnet-08:50:33.wav- Date-time: 2022-02-09 08:50:33
Keeping with my original supposition, the "Restart ALL Services" script worked again in this scenario because the issues seemed to be in birdnet_recording.service
getting stuck.
The next time you see the FileNotFound
error, try restarting the recording service ("System" > "Tools" > "Manage Services" > "Restart Recording").
And if you have the ability to access the terminal when things are malfunctioning, could you please send the output of:
journalctl -n50 --no-pager --no-hostname -u birdnet_recording.service
Thank you for providing such great information on this issue (as I have come to expect from you :smile:) -- it is truly necessary when trying to troubleshoot that I have as much information as possible, and the more log out put I can get my hands on, the better, so thanks!
I've stopped testing the RPi0W2 for today so that I can work on other things (still only have one microphone!) so let me know what you observer as you observer it.
Surprisingly, the RPi Zero 2W has had no more issues since the reboot this morning . . . for 6 hours 50min now . . . Will report as soon as I recognize something strange :-).
I love it! Thanks for keeping me posted!
I have ordered thermal glue (my old glue is hard as concrete by now :-) for a heatsink to be glued on the Zero CPU. It gets really hot:
Yes I've noticed that right when you start things up, it performs very well, then the heat sets in and everything starts to get sluggish. The heatsink will be great. I wonder if there would be a way to thermally attach your heatsink to your outdoor metal enclosure? And if that enclosure were aluminum, perhaps even adhere the CPU itself to the enclosure with thermal paste? Just a tinkerer's thought.
Hmm, maybe it is the microUSB cable?
Does that mean it is undervoltaged ?
Nope 0x0
means everything is running perfectly well -- no throttling no voltage issues and no CPU limits
Ah, yes, thanks! Pretty clear! 0x0 means that bit 0 is NOT set . . . [bit 0 seems to be the indicator for undervoltage as far as I remember]
You just gave me a good idea -- I'm going to add vcgencmd get_throttled
to the network_info.sh
script that displays output on the "System" > "Info" page. Maybe I'll change network_info.sh
to just info.sh
and add more system info there.
Yes, that would be perfect to have that kind of indicator on the System info page!
Ah, yes, thanks! Pretty clear! 0x0 means that bit 0 is NOT set . . . [bit 0 seems to be the indicator for undervoltage as far as I remember]
A 0
bit may indeed indicate an undervoltage error, but these are not bits -- this is a Hex code. The bit translation for 0
(you only use the second half of the hex for the bit indication) is:
0000
Then you need to identify which bit represents which monitored state:
From the vcgencmd
manual page:
get_throttled
Returns the throttled state of the system. This is a bit pattern - a bit being set indicates the following
meanings:
Bit Meaning
──── ────────────────────────────────────
0 Under-voltage detected
1 Arm frequency capped
2 Currently throttled
3 Soft temperature limit active
16 Under-voltage has occurred
17 Arm frequency capping has occurred
18 Throttling has occurred
19 Soft temperature limit has occurred
A value of zero indicates that none of the above conditions is true.
To find if one of these bits has been set, convert the value returned to binary, then number each bit
along the top. You can then see which bits are set. For example:
0x50000 = 0101 0000 0000 0000 0000
Adding the bit numbers along the top we get:
19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
From this we can see that bits 18 and 16 are set, indicating that the Pi has previously been throttled due
to under-voltage, but is not currently throttled for any reason.
Then that means:
00000000000000000000
|||| ||||_ Under-voltage detected
|||| |||_ Arm frequency capped
|||| ||_ Currently throttled
|||| |_ Soft temperature limit active
||||_ Under-voltage has occurred since last reboot
|||_ Arm frequency capped has occurred
||_ Throttling has occurred
|_ Soft temperature limit has occurred
Notice, none of the bits are "on", so your system is in tip top shape!
I just realized I put all that together because I misread your last comment :) oh well, overkill I suppose
For some reason, I'm seeing the birdnet_recording.service
stop on my installation for no apparent reason :( this is no good!
I FOUND THE BUG!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! in birdnet_analysis.sh
19 ...
20 make_thisrun &> /dev/null
21 if ! diff ${LAST_RUN} ${THIS_RUN};then
22 echo "The birdnet.conf file has changed"
23 echo "Reloading services"
24 cat ${THIS_RUN} > ${LAST_RUN}
25 sudo systemctl stop birdnet_recording.service
26 sudo rm -rf ${RECS_DIR}/$(date +%B-%Y/%d-%A)/*
27 fi
28 ...
I forgot to restart the birdnet_recording.service
!!!!!!!!!!!!
I pushed the fix to forms
and am going to pull it into main now since it is a true bug
I'm going to keep this open while we continue to test, but I'm pretty confident that was the reason for all of this trouble
Sounds really good Will do an update and report.
heads up, the update takes a long time on the RPi0W2 -- if I were you, I would run the update via PuTTY SSH update_birdnet.sh
to watch and see the update (for testing purposes) in case anything weird happens. It isn't necessary that you run the update that way, but if you do use the Web GUI for the update, allow it a few more minutes than the default 60 seconds the web page suggests.
Thanks! Thats a good hint, probably one of the reasons why I experienced some strange effects last evening ;-).
BTW, by now even the Raspberry Pi Zero 2W is unobtainable here in Germany . . .
Hmm, it really takes a long time to update. The Pi is now working and trying to catch up with the recordings that have been made, but not yet analyzed. The time for the analysis of a 30seconds recording with 0.5 seconds overlap is on average about 12-14 seconds. However, sometimes it needs MUCH longer, I observed up to 58 seconds for the same recording length . . . in this case (where I captured a screenshot), it needed 39 seconds
Maybe an effect of the RPi still busy with updating ?
Another anecdotal observation:
myinterpreter = tflite.Interpreter(model_path='/home/pi/BirdNET-Pi/model/BirdNET_6K_GLOBAL_MODEL.tflite',num_threads=2)
I haven't had great success yet with getting the RPi0W2 to truly keep up. I haven't tried any recording length over 60 seconds, but even at 60 seconds, things really get backed up once there are detections. It runs really smoothly, until it needs to do its job!
The other consideration is that there are extractions going on for detections which are not doing anything until the analysis service gives it detections. Thus you end up getting backed up when it starts working :/
I'm going to experiment with cutting down on other resources, adding cooling, and maybe overclocking a tiny bit to see if I can get something that works well when the birds get talkative.
I think the reason why the Zero does not manage to keep up is that some analysis events last a huge lot longer than others. Spotted another of those events with 76 seconds instead of the usual 13 seconds !
After two hours, it seems the RPi Zero has kept up and the 10 minute lag is now not existent any more . . . I set overlap to 0.0 sec for that purpose, so 10 chunks instead of 12 chunks have to be processed in every 30second recording. Will now go for a longterm test and observe whether the initial bug occurs again. But I am pretty confident you fixed the issue ;-) ! Thanks a lot, Patrick!
FINGERS ARE CROSSED! :crossed_fingers:
just one observation:
UPDATE in Zero 2W over SSH (putty on Windows and then issue update_birdnet.sh
)
+ apt install -qqy php php-fpm php-mysql php-xml php-zip
Oops, the RPi refused to work any more, it got too hot to touch (the whole PCB), so I pulled the power supply plug . . . Seems the update was too demanding . . . Will try to restart as soon as it cools down
Yikes! Do this next time, before even accessing the web browser:
SSH using PuTTY, not the webterminal (just saw your edit :+1:)into the BirdNET-Pi and issue stop_core_services.sh
THEN do update_birdnet.sh
and see if that helps.
look at this --> 7.2 seconds for an analysis of 30 seconds of audio !!!
whats happening there? :-)
Is this really a function of the CPU temperature ? But why doesnt it tell us about throttling ?
Is this really a function of the CPU temperature ? No, it does not seem to be the case . . .
It is very curious -- with adequate cooling, we would know if that were the issue, without it, it will be a looming factor. It is possible that vcgen_cmd
doesn't report accurately for this new RPi0W2 chipset? It was the case that above 60degrees C throttling led to this exact issue for the RPi3B and placed a ARM frequency limit as a result.
I want to rule it out before I rule it out
hmm, its slow again every time I return from one of the other pages of BirdNET-Pi (and when I stay in the log window a little longer, the time goes down again):
Check the temperature now and see if it has neared 60C
no, it is at 54 degrees right after return from the log where it said 75 seconds for one recording . . . but the 75 seconds were the first figure reported after returning from the system info window --> so I suspect that the page handling / sysinfo is quite demanding!?
the strange thing is: I never had durations as low as 8 seconds. Before the update and overheating I reported, the minimum was about 11-12 seconds. But maybe that is caused by your bugfix ?
Next week I will have thermal glue and some heatsinks at my disposal and I will do exactly the same test WITH heatsink, we will see whether the behaviour changes.
The bugfix just restarted the birdnet_recording.service
after stopping it -- I only added:
sudo systemctl start birdnet_recording.service
after the lines where I stop it and remove truncated recordings.
Assuming this isn't a thermal issue, then this has something to do with other resources taking up CPU, OR something about sending the info in the socket. I've noticed it takes an exceptionally long time to send the message to the socket (you can journalctl --no-hostname -fu birdnet_analysis.service
to watch how long it takes to send each message). This is something I will try to keep a closer eye on as it may reveal when/where exactly the latency is being introduced.
I've noticed that the livestream service takes up a good bit of CPU. You can try disabling it for now:
sudo systemctl disable --now livestream.service
My system seems a little awkward now. I cant write any more to any file ("no permission") will reboot and then stop for today . . . heat issue in my head ;-)
Reboot did not fix it. Stopped services and did an update --> only about two minutes! After that I had to Restart all services, because nothing was running. And then normal function can be observed.
Unfortunately, in the night, the RPi Zero 2W stopped the analysis service again. It is reachable via the usual birdnetpi.local and everything seems normal, except that the log shows the same error message again, that I observed the night before. IDs are not being made.
Will not have much time the next days, so I will stop testing the Zero at this point.
If you have any ideas/directions for further tests with the Zero, I will resume testing next week.
Two things worth noting:
livestream.service
drastically reduces the CPU load outside of analysis.Recent tweaks that have helped a lot:
birdnet_recording.service
to require the server.py
socket be open and ready before collecting data (to mitigate data getting backed up).birdnet_server.service
and birdnet_analysis.service
logs so that the logs are more verbose and show their interaction (as well as the python
call that birdnet_analysis.service
uses to send the data to the server socket).I hope to get my hands on another microphone soon, so hopefully I will be able to keep testing this alongside my main installation to keep this developing.
My best, Patrick
@mcguirepr89 Thanks Patrick for your efforts!
birdnetpi.local
and IP:adress
birdnetpi.local
(and without typing in the IP adress in the Settings! This has never been possible before! Thanks Patrick, for that!UPDATE:
After another 30 minutes of playing with the system, it gets really laggy and inresponsive and then does not respond any more at all.
@DD4WH I really appreciate your efforts.
I'm going to test this intermittently as shifts towards efficiency and reduced resource usage come to mind. It seems SO close to being able to work, but invariably it gets bogged down from normal use. It may be that a solution could be to bypass the database and PHP altogether and have a system more like BirdNET-system wherein there is simply recordings and extractions with a file server. That's a much simpler "stack" and can still have BirdDB.txt
for exporting the data and importing it into another system's database (maybe to have RPi0W2 collecting and sorting data, then a RPi4B houses the database and all its PHP visualizations etc.:thinking:)
Hello!
Update: I am working on a new branch that will switch BirdNET-Pi over to using a single-file SQLite database to do away with the MariaDB server altogether. I have started testing with this and have already noticed that the RPi4B handles the database entries with MUCH more ease using the SQLite database, so I'm very hopeful that this is a step in the right direction in making things work on the RPi0W2.
Sounds very good! Let me know as soon as you want me to test the new branch!
I will! But it won't be for a little while -- I have a lot of testing I'd like to do before I ask any others to join in. In typical crappy-development-praxis-Patrick fashion, I will be working on the SQLite at the same time as a new, simplified installation (the user won't really notice a difference).
:crossed_fingers:
Testing the Zero 2W with the sqlite version of today:
After an hour of playing the Pi gets really slow and unresponsive . . .
It seems it is under 100% load, although the temperature of the CPU does not indicate that:
throttled is still at 0x0
OK, now the Pi is so slow and unresponsive, that it does not make sense to continue further testing.
I think I will now try to glue on some heatsinks onto the CPU and then retry with a cold start.
many problems after restart:
pulseaudio -k
in the web terminal fails:
The Pi is still slow to respond, needs some seconds to load a page, although the temperature is much lower now with the heatsink, has not exceeded 45 degrees
Patrick, thanks again for the accurate installation guide for RPi Zero 2W. HERE I open this issue to have a place specifically for feedback on how BirdNET-Pi v0.11 (forms branch) is performing on this hardware. Feel free to close this issue or move it to another -more appropriate- place :-).
These are my experiences running BirdNET-Pi for half a day now (changed overlap to 0.5 and min conf to 0.5, language to german, set appropriate URL IP adress to successfully connect via Android phone):
A serious issue is that after running for one hour or a bit longer, BirdNET-Pi seems to hang and no files are being processed any more. The log shows this:
I experienced this three times yesterday and this night. So it seems it is a reproducable issue.
After rebooting, log and analysis still seem to be stuck (the date and time of the file still shows a very old file):
After RESTART ALL SERVICES it seems it has resumed analysis and shows the usual output, which confirms it really takes the real time recordings and processes without lag:
Hope this helps in finding the issue. Keep up the excellent work!