Closed takerukoushirou closed 3 years ago
ensure we never inherit any database connection in a fork.
Are you talking about the underlying file descriptor, or the pointer to the sqlite3 struct?
The sqlite3 struct and all the obscure heap-allocated memory contained within. Closing the inherited pointer in the fork would be safe to do.
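To illustrate the idea of never using an inherited connection in a fork, here is a minimal Python sketch. It is not FTL's code (FTL is written in C); the class name `PerProcessDB` and its methods are hypothetical. It only shows the pattern: remember which process opened the handle, and open a fresh one whenever the current PID differs.

```python
import os
import sqlite3


class PerProcessDB:
    """Hand out a SQLite connection that is only used by the process that opened it.

    Hypothetical helper for illustration; not part of FTL or of the sqlite3 module.
    """

    def __init__(self, path):
        self.path = path
        self._conn = None
        self._owner_pid = None

    def get(self):
        pid = os.getpid()
        if self._conn is None or self._owner_pid != pid:
            # Either nothing is open yet, or we are a fork that inherited the
            # parent's handle. Never run queries on the inherited handle;
            # open a new connection that belongs to this process instead.
            self._conn = sqlite3.connect(self.path)
            self._owner_pid = pid
        return self._conn
```

Within one process, repeated calls to `get()` return the same connection; after a fork, the first call in the child replaces the inherited handle with a new one.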
Okay, not a thousand(s)-line change, but still 21 files changed, 624 insertions(+), 656 deletions(-).
This kept me busy for almost two hours, including some testing. I'm fairly optimistic that I haven't made any severe errors, but any further testing from you would again be appreciated. Whatever comes of it, the code is now cleaner and maybe even a bit more performant in one spot or another. So it was worth it.
My system was producing these errors 8-10 times over 6 hours earlier today using FTL v5.7. I applied tweak/memory vDev-0795cf6 and it's been running smoothly for ~2 hours now.
of course. thanks for putting your time into this. going to test it later and report back
It continues to work excellently on my raspberry pi B+.
It's being monitored externally every 15 seconds, and there are some very clear indicators when the problem happens: 5-10 minutes of probes timing out after 5 sec.
I applied the changes around 22:20 and the only timeout since then was at 6 am when the computer rebooted on schedule.
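An external check like the one described could look roughly like the sketch below. The commenter did not say which tool was used, so this is an assumption: a plain UDP DNS probe with a 5-second timeout. The function names, the queried name `pi.hole`, and the fixed query ID are illustrative choices.

```python
import socket
import struct


def build_query(name, query_id):
    """Build a minimal DNS query for an A record (recursion desired)."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(p)]) + p.encode("ascii") for p in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def dns_alive(server, port=53, name="pi.hole", timeout=5.0):
    """Return True if the server answers the probe before the timeout."""
    query = build_query(name, 0x1234)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes socket.timeout
            return False
    # A matching transaction ID is enough for a liveness check.
    return reply[:2] == query[:2]
```

Calling `dns_alive(...)` from a loop or scheduler every 15 seconds would approximate the described setup; a hang shows up as a run of `False` results.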
this update works great. ran for 6 hours and accumulated more than 700,000 queries. haven't had a single problem. I think it's safe to say all of the stability issues are fixed
now to wait for the next official release
After 24 hours I have no complaints. Completely fixed my problem.
now to wait for the next official release
The next release is already coming closer. We're currently waiting on dnsmasq v2.85, which is currently in release-candidate state. There is a reported issue with compiling dnsmasq on Debian Buster. However, as the issue is about a missing linking dependency and none of the related code changed at all, this is very likely just a user error.
I finally ran into the issue since starting debug logging (thus still on release FTL v5.7), took a while this time. As there has been lots of development, this is mainly for reference to check whether this looks like the same cause; it's also lock-related:
[2021-03-26 11:37:00.896 6669/F27456] gravityDB_open(): Setting busy timeout to 1000
[2021-03-26 11:37:00.896 6669/F27456] Initializing new sqlite3_stmt* vector with size 242
[2021-03-26 11:37:00.896 6669/F27456] Initializing new sqlite3_stmt* vector with size 242
[2021-03-26 11:37:00.897 6669/F27456] Initializing new sqlite3_stmt* vector with size 242
[2021-03-26 11:37:00.897 6669/F27456] gravityDB_open(): Setting busy timeout to zero
[2021-03-26 11:37:00.897 6669/F27456] gravityDB_open(): Successfully opened gravity.db
[2021-03-26 11:37:00.898 6669/F27456] Waiting for lock in _FTL_new_query() (/root/project/src/dnsmasq_interface.c:571)
[2021-03-26 11:37:00.898 6669/F27456] Obtained lock for _FTL_new_query() (/root/project/src/dnsmasq_interface.c:571)
[2021-03-26 11:37:00.898 27456M] Waiting for lock in _FTL_CNAME() (/root/project/src/dnsmasq_interface.c:342)
[2021-03-26 11:37:00.898 6669/F27456] **** new TCP query[A] query "r1---sn-mn4vg5aa-5hn6.googlevideo.com" from eth0:fd00::2435:e2c4:b150:d9a0 (ID 2858779, FTL 117191, /root/project/src/dnsmasq/forward.c:2048)
[2021-03-26 11:37:00.898 6669/F27456] getOverTimeID(1616754900): 141
[2021-03-26 11:37:00.899 6672/F27456] TCP worker forked for client fd00::2435:e2c4:b150:d9a0 on interface eth0 with IP fd00::ae6f:333e:41b1:f689
[2021-03-26 11:37:00.899 6672/F27456] gravityDB_open(): Trying to open /etc/pihole/gravity.db in read-only mode
[2021-03-26 11:37:00.901 6672/F27456] gravityDB_open(): Setting location for temporary object to MEMORY
[2021-03-26 11:37:00.901 6669/F27456] r1---sn-mn4vg5aa-5hn6.googlevideo.com is not known
[2021-03-26 11:37:00.901 6669/F27456] Getting sqlite3_stmt** 0x1c489c0[234] --> (nil)
[2021-03-26 11:37:00.901 6669/F27456] Initializing gravity statements for fd00::2435:e2c4:b150:d9a0
[2021-03-26 11:37:00.901 6672/F27456] gravityDB_open(): Preparing audit query
[2021-03-26 11:37:00.901 6669/F27456] Querying gravity database for client with IP fd00::2435:e2c4:b150:d9a0...
[2021-03-26 11:37:00.902 6669/F27456] --> No record for fd00::2435:e2c4:b150:d9a0 in the client table
[2021-03-26 11:37:00.902 6669/F27456] Querying gravity database for MAC address of fd00::2435:e2c4:b150:d9a0...
[2021-03-26 11:37:00.904 6671/F27456] gravityDB_open(): Setting busy timeout to 1000
[2021-03-26 11:37:00.904 6671/F27456] Initializing new sqlite3_stmt* vector with size 242
[2021-03-26 11:37:00.904 6671/F27456] Initializing new sqlite3_stmt* vector with size 242
[2021-03-26 11:37:00.904 6671/F27456] Initializing new sqlite3_stmt* vector with size 242
[2021-03-26 11:37:00.904 6671/F27456] gravityDB_open(): Setting busy timeout to zero
[2021-03-26 11:37:00.904 6671/F27456] gravityDB_open(): Successfully opened gravity.db
[2021-03-26 11:37:00.905 6671/F27456] Waiting for lock in _FTL_new_query() (/root/project/src/dnsmasq_interface.c:571)
[2021-03-26 11:37:00.906 27456/T27460] ---> OK
[2021-03-26 11:37:00.906 27456/T27460] Waiting for lock in parse_neighbor_cache() (/root/project/src/database/network-table.c:1107)
[2021-03-26 11:37:00.910 6672/F27456] gravityDB_open(): Setting busy timeout to 1000
[2021-03-26 11:37:00.911 6672/F27456] Initializing new sqlite3_stmt* vector with size 242
[2021-03-26 11:37:00.911 6672/F27456] Initializing new sqlite3_stmt* vector with size 242
[2021-03-26 11:37:00.911 6672/F27456] Initializing new sqlite3_stmt* vector with size 242
[2021-03-26 11:37:00.911 6672/F27456] gravityDB_open(): Setting busy timeout to zero
[2021-03-26 11:37:00.911 6672/F27456] gravityDB_open(): Successfully opened gravity.db
[2021-03-26 11:37:00.912 6672/F27456] Waiting for lock in _FTL_new_query() (/root/project/src/dnsmasq_interface.c:571)
[2021-03-26 11:38:00.063 27456/T27461] Waiting for lock in GC_thread() (/root/project/src/gc.c:50)
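To check whether a log like the one above shows waiters piling up behind a lock, a small script can pair each "Waiting for lock" line with its "Obtained lock" line. The sketch below is not part of FTL; the regular expression is an assumption derived only from the line format visible in this excerpt.

```python
import re

# Matches e.g. "[2021-03-26 11:37:00.898 6669/F27456] Waiting for lock in _FTL_new_query() (...)"
LOCK_RE = re.compile(
    r"^\[\S+ \S+ (?P<who>\S+)\] "
    r"(?P<event>Waiting for lock in|Obtained lock for) "
    r"(?P<func>\w+)\(\)"
)


def pending_lock_waiters(lines):
    """Return sorted (process/thread tag, function) pairs that waited but never obtained."""
    waiting = {}
    for line in lines:
        match = LOCK_RE.match(line)
        if not match:
            continue
        key = (match["who"], match["func"])
        if match["event"].startswith("Waiting"):
            waiting[key] = line
        else:
            waiting.pop(key, None)
    return sorted(waiting)
```

Applied to the excerpt above, only fork 6669 is ever reported as having obtained the lock; the main process, forks 6671 and 6672, and threads T27460 and T27461 all remain waiting.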
Is it by the way normal that there are about 30 pihole-FTL processes running (this was the case when FTL froze today)?
that's the hang problem. it'll get fixed in the next release. but if you want to fix it now, running pihole checkout ftl tweak/memory will do
@takerukoushirou Even better, try pihole checkout ftl development
Is it by the way normal that there are about 30 pihole-FTL processes running (this was the case when FTL froze today)?
Yes. Internet standards (RFCs) mandate that DNS must be answered not only over UDP but also over TCP. In the latter case, steady connections are kept open to reduce the protocol overhead. For each of these connections, an individual "fork" is created. On Linux they are shown as individual processes even though they are just dependent copies of the original process. Tools like htop show the dependency quite nicely.
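The fork-per-connection model described here can be sketched in a few lines of Python. This is a generic illustration, not how dnsmasq or FTL implement it; the function name `serve_once` and the reply text are made up for the example.

```python
import os
import socket


def serve_once(listener):
    """Accept one TCP connection and hand it to a forked worker; return the worker's PID."""
    conn, _addr = listener.accept()
    pid = os.fork()
    if pid == 0:
        # Child: a copy of the parent that owns exactly this one connection.
        try:
            listener.close()
            data = conn.recv(1024)
            conn.sendall(b"worker %d got %d bytes" % (os.getpid(), len(data)))
            conn.close()
        finally:
            os._exit(0)  # never fall back into the parent's code path
    # Parent: the worker holds its own copy of the socket, so close ours
    # and go back to accepting new connections.
    conn.close()
    return pid
```

Each accepted connection yields one extra process with its own PID, which is why a burst of TCP clients shows up as many pihole-FTL entries in the process list.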
@binary-person is right that this is an issue which manifests under extreme TCP load. Only a few devices actually do TCP lookups; however, some push that really hard. It strongly depends on the particular devices in your network and can be perfectly normal. It is just somewhat uncommon, and FTL wasn't prepared for this in all kinds of complex multi-tasking scenarios (it should be now).
@DL6ER thank you very much for the detailed explanation. Never saw that many forks before, maybe devices switched from UDP to TCP when they couldn't get a response via UDP.
I switched to the development branch this morning, as the issue re-appeared continuously this time. Running all fine since then 😃
I will wait for the next release
Please note that changing branches severely alters your Pi-hole subsystems
Features that work on the master branch, may not on a development branch
This feature is NOT supported unless a Pi-hole developer explicitly asks!
Have you read and understood this? [y/N] ^C
maybe devices switched from UDP to TCP when they couldn't get a response via UDP
No, the issue comes from too many TCP workers combined with a concurrency bug. Maybe this is typical in your network and just goes away quickly, so you never noticed. Well, this time it didn't go away because FTL froze. Or this was the first time you've had so many TCP workers at once and, hence, this triggered the bug.
For the next release we're basically waiting on two more things:
dnsmasq v2.85 is also already in release-candidate state. There hasn't been much movement; however, it also seems there aren't any bugs reported so far.
To confirm, we can checkout master to revert once released, right?
Absolutely, at any time before and after the release. Just note that the issue may reappear when you do it too early ;-)
The development branch FTL crashes with the following log
[2021-03-27 12:28:00.246 1188/T1192] SQLite3 message: API call with invalid database connection pointer (21)
[2021-03-27 12:28:00.246 1188/T1192] SQLite3 message: misuse at line 125209 of [ea80f3002f] (21)
[2021-03-27 12:28:00.246 1188/T1192] ERROR: SQL query "END TRANSACTION" failed: bad parameter or other API misuse
[2021-03-27 12:28:00.246 1188/T1192] SQLite3 message: API call with invalid database connection pointer (21)
[2021-03-27 12:28:00.246 1188/T1192] SQLite3 message: misuse at line 165161 of [ea80f3002f] (21)
[2021-03-27 12:28:00.246 1188/T1192] Error while trying to close database: bad parameter or other API misuse
[2021-03-27 12:28:00.246 1188/T1192] ERROR: Storing devices in network table failed: bad parameter or other API misuse
[2021-03-27 12:28:00.246 1188/T1192] SQLite3 message: API call with invalid database connection pointer (21)
[2021-03-27 12:28:00.246 1188/T1192] SQLite3 message: misuse at line 165161 of [ea80f3002f] (21)
[2021-03-27 12:28:00.246 1188/T1192] Error while trying to close database: bad parameter or other API misuse
[2021-03-27 12:29:00.329 1188/T1192] SQLite3 message: no such column: name in "UPDATE network_addresses SET name = NULL WHERE nameUpdated < 1616655540;" (1)
[2021-03-27 12:29:00.329 1188/T1192] ERROR: SQL query "UPDATE network_addresses SET name = NULL WHERE nameUpdated < 1616655540;" failed: SQL logic error
My environment is a VPS with 2 vCPUs and 2 GB of RAM running Ubuntu 18.04 LTS. I have now switched to the tweak/memory branch, and the system seems to work. I tried this as I was also facing the original issue and was handling the matter by monitoring the IP network packets pending for read. Any buildup there for the pihole-FTL process indicates some kind of lock-up.
Just for reference, I have moved pihole-FTL.db to tmpfs (which means the db is sitting in RAM all the time).
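The commenter did not say how the pending packets were monitored. One possible approach on Linux, shown here purely as an assumption, is to read the receive-queue column of /proc/net/udp (the same figure `ss -u` reports as Recv-Q). The function name and the choice of port 53 are illustrative.

```python
def pending_rx_bytes(proc_net_udp_text, port=53):
    """Sum the receive-queue bytes of all UDP sockets bound to the given local port.

    Expects the text of /proc/net/udp; addresses, ports and queue sizes are hex.
    """
    total = 0
    for line in proc_net_udp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port != port:
            continue
        _tx_hex, rx_hex = fields[4].split(":")
        total += int(rx_hex, 16)
    return total
```

A value that keeps growing between samples suggests the process has stopped reading from its socket. Note that this counts per socket, not per process; attributing a socket to pihole-FTL would need an extra inode lookup that is not shown here.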
@readall Can you say if this was a one-time issue or is it reproducible? Like, does the issue happen again if you switch from tweak/memory back to development?
I checked the code again. Are you sure the lines you posted are complete? I'm asking because there is a database action immediately in front of the END TRANSACTION and it doesn't make sense that the database pointer becomes incorrect throughout the process. There should be more messages.
Hi @DL6ER. There may have been more messages, but I run the system with all logs disabled. I do this for everything, as I want to run zero-log systems. I have switched to the development branch again just now and will monitor for a few hours. Last time the crash happened within just a few minutes; this time it has already run longer than that, so it looks good as of now. If it crashes, I will try to post the logs here.
Update 1 It has been running now almost 20 hours without issues.
Update 2 It is now running smoothly. So Nothing to report.
I have been on the development branch ever since the memory fixes merge (a week ago, I think) and did not experience any issues since then.
I'm looking forward to the next release. :)
Agreed, same here: all is well. In fact, my LAN services (Plex, Emby, etc.) are actually functioning better; the issues connecting to local servers are gone.
Thanks for the feedback. We're waiting for dnsmasq v2.85, which is currently in rc2, but there will also be an rc3.
I also have no problems to report after 9 days on tweak/memory vDev-0795cf6 on my RPI B+.
Good Day!
I was hoping for a build by now but I understand why you guys are waiting. Can someone please point me to a build of the FTL dev-branch, or to high-level instructions how I build my own release out of the FTL dev-branch?
Thanks!
@0schr0eder You can use this command: sudo pihole checkout ftl [branch]
If you are on Ubuntu, you can follow these instructions: https://docs.pi-hole.net/ftldns/compile/
Some pre-built binaries are uploaded here: https://ftl.pi-hole.net/
Also, FTL v5.8 nears release already.
I would need to compile this on the Raspberry but this is a start and I can give it a shot. Thank you very much for the link!
How can I roll back to the previous version, though? Is it possible?
@dvdvideo1234 There are multiple ways to install different versions: you can build from source, use the install script, or download pre-built binaries. You'll have better luck asking on https://discourse.pi-hole.net/
The pihole checkout ftl command should be preferred over self-compiling or downloading binaries manually. It makes sure you only get tested binaries, which are validated to ensure no download error happened. You can even go back, but if you do, it is not guaranteed that everything still works (hint: usually, it does).
Sweet! Would I choose the "development" branch?
Sweet! Would I choose the "development" branch?
Since release/v5.8 is slightly ahead of development, wouldn't it make sense to use it instead?
https://github.com/pi-hole/FTL/compare/development...release/v5.8
Edit: You might want to wait for v5.8.
Speaking of which, is there an ETA for 5.8?
Now.
The next version of FTL has been released. Please update and run pihole checkout master to get back on-track if you switched to a custom branch. The fix/feature branch you switched to will not receive any further updates.
Thanks for helping us to make Pi-hole better for us all!
If you have any issues, please either reopen this ticket or (preferably) create a new ticket describing the issues in further detail and only reference this ticket. This will help us to help you best.
Still it does not blink for updates.. I will wait a bit more
Wait it said there is a new version ... Installing...
Cycles on [i] Testing man page installation
it appears to be stuck, but maybe it needs more time... Done!
Current Pi-hole version is v5.3.1.
Current AdminLTE version is v5.5.
Current FTL version is v5.8.
The whole system went down. Restarting it does nothing and I cannot SSH in. I will reinstall Ubuntu on the weekend.
just spun up a fresh vps. ftl v5.8.1 works perfectly; 158,441 queries without any lagging or any out-of-the-blue hangs. thanks DL6ER for the big bug fix :pray:
@binary-person @DL6ER
VPS... You mean I can get it back to boot without reinstalling ?
Thanks very much, guys. I am going to back up the memory card right now. What is the most efficient way?
I want to export the adlists, allow and blocked information of the old Pi-hole and put it on the new one.
This will not work for me as the Ubuntu does not boot anymore, so I can only copy files from the memory card to restore it.
Thank you !
Versions
Platform
Issue
DNS randomly stops working. Devices can no longer make a connection to pihole-FTL; connection attempts hang.
Last pihole-FTL log messages:
pihole restartdns fails as pihole-FTL does not stop within the time limit. systemd journal:
Manually restarting pihole-FTL via systemctl eventually succeeds.
Steps to reproduce
So far no pattern observed, happens randomly, sometimes within hours, sometimes within weeks.
Debug Token
https://tricorder.pi-hole.net/nkv45bn6cf