pi-hole / FTL

The Pi-hole FTL engine
https://pi-hole.net
Other
1.38k stars 196 forks source link

FTL crashing while DHCP server is enabled #718

Closed d4r1as closed 4 years ago

d4r1as commented 4 years ago

In raising this issue, I confirm the following (please check boxes, eg [X]) Failure to fill the template will close your issue:

How familiar are you with the codebase?:

7


[ISSUE] Expected Behaviour:

After update to vDev-b6364d0, FTL should not crash while DHCP server is enabled.

[BUG] Actual Behaviour:

FTL became unstable after update to vDev-b6364d0. It crashes after 10-30 seconds.

[BUG | ISSUE] Steps to reproduce:

Enable DHCP server. FTL should crash soon after that.

Log file output [if available]

https://tricorder.pi-hole.net/dli98t4why

(gdb) handle SIGHUP nostop SIGPIPE nostop
Signal        Stop      Print   Pass to program Description
SIGHUP        No        Yes     Yes             Hangup
SIGPIPE       No        Yes     Yes             Broken pipe
(gdb) continue
Continuing.
[Detaching after fork from child process 5222]
[New Thread 0x739ff460 (LWP 5242)]
[Thread 0x739ff460 (LWP 5242) exited]
[New Thread 0x739ff460 (LWP 5243)]
[Thread 0x739ff460 (LWP 5243) exited]
(gdb) [Detaching after fork from child process 5244]

Thread 5 "database" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x75369460 (LWP 5191)]
__strchrnul (s=0x2 <error: Cannot access memory at address 0x2>, c_in=37) at strchrnul.c:50
50      strchrnul.c: No such file or directory.
(gdb) backtrace
#0  __strchrnul (s=0x2 <error: Cannot access memory at address 0x2>, c_in=37) at strchrnul.c:50
#1  0x76d75174 in __find_specmb (format=0x2 <error: Cannot access memory at address 0x2>) at printf-parse.h:108
#2  _IO_vfprintf_internal (s=s@entry=0x75368ab8, format=format@entry=0x2 <error: Cannot access memory at address 0x2>, ap=..., ap@entry=...) at vfprintf.c:1315
#3  0x76e1d024 in __GI___vasprintf_chk (result_ptr=result_ptr@entry=0x75368c60, flags=flags@entry=1, format=0x2 <error: Cannot access memory at address 0x2>, format@entry=0x0, args=..., args@entry=...) at vasprintf_chk.c:66
#4  0x76e1cf30 in __asprintf_chk (result_ptr=result_ptr@entry=0x75368c60, flags=flags@entry=1, format=0x2 <error: Cannot access memory at address 0x2>) at asprintf_chk.c:32
#5  0x00448a76 in asprintf (__fmt=0x532f5c "SELECT id FROM network WHERE hwaddr = '%s';", __ptr=0x75368c60) at /usr/arm-linux-gnueabihf/include/bits/stdio2.h:178
#6  parse_neighbor_cache () at src/database/network-table.c:325
#7  0x0044b5e8 in DB_thread (val=<optimized out>) at src/database/database-thread.c:64
#8  0x76e88494 in start_thread (arg=0x75369460) at pthread_create.c:486
#9  0x76e0b578 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:73 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Device specifics

Hardware Type: rPi 3B+ OS: Raspbian Buster

This template was created based on the work of udemy-dl.

DL6ER commented 4 years ago

I'm running FTL (with the exact same commit) as DHCP server locally as well and am not able to reproduce this issue. Could you run the following debug instructions when you observe the crash in the debugger? This should help us narrowing down the issue.

p linebuffer
p num
p ip
p iface
p hwaddr

Thanks a lot!

TylerVigario commented 4 years ago

Same symptoms as above as well as another who posted today on Reddit.

This is in pihole-FTL.log

[2020-03-30 15:52:00.376 14894] !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[2020-03-30 15:52:00.377 14894] ---------------------------->  FTL crashed!  <----------------------------
[2020-03-30 15:52:00.377 14894] !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[2020-03-30 15:52:00.378 14894] Please report a bug at https://github.com/pi-hole/FTL/issues
[2020-03-30 15:52:00.378 14894] and include in your report already the following details:
[2020-03-30 15:52:00.379 14894] FTL has been running for 20 seconds
[2020-03-30 15:52:00.379 14894] FTL branch: release/v5.0
[2020-03-30 15:52:00.380 14894] FTL version: vDev-b6364d0
[2020-03-30 15:52:00.380 14894] FTL commit: b6364d0
[2020-03-30 15:52:00.381 14894] FTL date: 2020-03-29 23:23:02 +0200
[2020-03-30 15:52:00.381 14894] FTL user: started as pihole, ended as pihole
[2020-03-30 15:52:00.382 14894] Compiled for x86_64 (compiled on CI) using gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
[2020-03-30 15:52:00.382 14894] Received signal: Segmentation fault
[2020-03-30 15:52:00.382 14894]      at address: 0x200000001
[2020-03-30 15:52:00.383 14894]      with code: SEGV_MAPERR (Address not mapped to object)
[2020-03-30 15:52:00.384 14894] Backtrace:
[2020-03-30 15:52:00.384 14894] B[0000]: 0x5635dd526189, /usr/bin/pihole-FTL(+0x31189) [0x5635dd526189]
[2020-03-30 15:52:00.385 14894] B[0001]: 0x7f82ed462dc0, /lib64/libpthread.so.0(+0x12dc0) [0x7f82ed462dc0]
[2020-03-30 15:52:00.385 14894] B[0002]: 0x7f82ed19543a, /lib64/libc.so.6(__vasprintf_chk+0x10a) [0x7f82ed19543a]
[2020-03-30 15:52:00.386 14894] B[0003]: 0x7f82ed195313, /lib64/libc.so.6(__asprintf_chk+0x93) [0x7f82ed195313]
[2020-03-30 15:52:00.386 14894] B[0004]: 0x5635dd51bc78, /usr/bin/pihole-FTL(parse_neighbor_cache+0x248) [0x5635dd51bc78]
[2020-03-30 15:52:00.387 14894] B[0005]: 0x5635dd51f87d, /usr/bin/pihole-FTL(DB_thread+0x10d) [0x5635dd51f87d]
[2020-03-30 15:52:00.387 14894] B[0006]: 0x7f82ed4582de, /lib64/libpthread.so.0(+0x82de) [0x7f82ed4582de]
[2020-03-30 15:52:00.387 14894] B[0007]: 0x7f82ed184133, /lib64/libc.so.6(clone+0x43) [0x7f82ed184133]
[2020-03-30 15:52:00.388 14894] ------ Listing content of directory /dev/shm ------
[2020-03-30 15:52:00.388 14894] File Mode User:Group  Filesize Filename
[2020-03-30 15:52:00.389 14894] rwxrwxrwx root:root 300 .
[2020-03-30 15:52:00.390 14894] rwxr-xr-x root:root 4K ..
[2020-03-30 15:52:00.390 14894] rw------- pihole:pihole 4K FTL-per-client-regex
[2020-03-30 15:52:00.390 14894] rw------- pihole:pihole 4K FTL-dns-cache
[2020-03-30 15:52:00.391 14894] rw------- pihole:pihole 12K FTL-overTime
[2020-03-30 15:52:00.391 14894] rw------- pihole:pihole 5M FTL-queries
[2020-03-30 15:52:00.392 14894] rw------- pihole:pihole 4K FTL-upstreams
[2020-03-30 15:52:00.392 14894] rw------- pihole:pihole 340K FTL-clients
[2020-03-30 15:52:00.393 14894] rw------- pihole:pihole 98K FTL-domains
[2020-03-30 15:52:00.393 14894] rw------- pihole:pihole 106K FTL-strings
[2020-03-30 15:52:00.394 14894] rw------- pihole:pihole 12 FTL-settings
[2020-03-30 15:52:00.394 14894] rw------- pihole:pihole 124 FTL-counters
[2020-03-30 15:52:00.395 14894] rw------- pihole:pihole 48 FTL-lock
[2020-03-30 15:52:00.395 14894] rw-r--r-- plex:plex 32 sem.F8EEB26D-AB23-4061-8250-130AB5285A4A
[2020-03-30 15:52:00.396 14894] rw-r--r-- plex:plex 32 sem.9334581e-7251-4ef7-a8ec-5bfe8e89ff68
[2020-03-30 15:52:00.396 14894] ---------------------------------------------------
[2020-03-30 15:52:00.397 14894] Thank you for helping us to improve our FTL engine!
[2020-03-30 15:52:00.397 14894] FTL terminated!
unbekannt3 commented 4 years ago

Same issue with a Pi4 but not running a DHCP server on it

[2020-03-31 02:42:00.180 991] !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[2020-03-31 02:42:00.180 991] ---------------------------->  FTL crashed!  <----------------------------
[2020-03-31 02:42:00.180 991] !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[2020-03-31 02:42:00.180 991] Please report a bug at https://github.com/pi-hole/FTL/issues
[2020-03-31 02:42:00.180 991] and include in your report already the following details:
[2020-03-31 02:42:00.180 991] FTL has been running for 25 seconds
[2020-03-31 02:42:00.180 991] FTL branch: release/v5.0
[2020-03-31 02:42:00.180 991] FTL version: vDev-b6364d0
[2020-03-31 02:42:00.180 991] FTL commit: b6364d0
[2020-03-31 02:42:00.180 991] FTL date: 2020-03-29 23:23:02 +0200
[2020-03-31 02:42:00.180 991] FTL user: started as pihole, ended as pihole
[2020-03-31 02:42:00.180 991] Compiled for armhf (compiled on CI) using arm-linux-gnueabihf-gcc (Debian 6.3.0-18) 6.3.0 20170516
[2020-03-31 02:42:00.181 991] Received signal: Segmentation fault
[2020-03-31 02:42:00.181 991]      at address: 0xb6cea9ec
[2020-03-31 02:42:00.181 991]      with code: SEGV_MAPERR (Address not mapped to object)
[2020-03-31 02:42:00.181 991] Backtrace:
[2020-03-31 02:42:00.181 991] B[0000]: 0x51c618, /usr/bin/pihole-FTL(+0x23618) [0x51c618]
[2020-03-31 02:42:00.181 991] B[0001]: 0xb6d4d130, /lib/arm-linux-gnueabihf/libc.so.6(__default_rt_sa_restorer+0) [0xb6d4d130]
[2020-03-31 02:42:00.181 991] B[0002]: 0x521b50, /usr/bin/pihole-FTL(resolveClients+0xb7) [0x521b50]
[2020-03-31 02:42:00.181 991] B[0003]: 0x521d7e, /usr/bin/pihole-FTL(DNSclient_thread+0x85) [0x521d7e]
[2020-03-31 02:42:00.181 991] ------ Listing content of directory /dev/shm ------
[2020-03-31 02:42:00.182 991] File Mode User:Group  Filesize Filename
[2020-03-31 02:42:00.182 991] rwxrwxrwx root:root 260 .
[2020-03-31 02:42:00.182 991] rwxr-xr-x root:root 4K ..
[2020-03-31 02:42:00.182 991] rw------- pihole:pihole 4K FTL-per-client-regex
[2020-03-31 02:42:00.182 991] rw------- pihole:pihole 4K FTL-dns-cache
[2020-03-31 02:42:00.183 991] rw------- pihole:pihole 29K FTL-overTime
[2020-03-31 02:42:00.183 991] rw------- pihole:pihole 18M FTL-queries
[2020-03-31 02:42:00.183 991] rw------- pihole:pihole 20K FTL-upstreams
[2020-03-31 02:42:00.183 991] rw------- pihole:pihole 41K FTL-clients
[2020-03-31 02:42:00.183 991] rw------- pihole:pihole 66K FTL-domains
[2020-03-31 02:42:00.184 991] rw------- pihole:pihole 45K FTL-strings
[2020-03-31 02:42:00.184 991] rw------- pihole:pihole 12 FTL-settings
[2020-03-31 02:42:00.184 991] rw------- pihole:pihole 124 FTL-counters
[2020-03-31 02:42:00.184 991] rw------- pihole:pihole 28 FTL-lock
[2020-03-31 02:42:00.184 991] ---------------------------------------------------
[2020-03-31 02:42:00.184 991] Thank you for helping us to improve our FTL engine!
[2020-03-31 02:42:00.184 991] FTL terminated!
DL6ER commented 4 years ago

I'm not yet convinced @unbekannt3DE 's bug is the same as in the other two backtraces (which are very likely the exact same thing), however, it's possible that they have the same origin somewhere else. And I wasn't expecting the DHCP server to be the reason, so that matches with @unbekannt3DE's observation.

Could you two also run FTL in the debugger? See https://docs.pi-hole.net/ftldns/debugging/ for more details.

unbekannt3 commented 4 years ago

I restarted my Pi 4 several times tonight and then restarted the FTL service manually about 3 hours ago. It hasn't crashed again since.

d4r1as commented 4 years ago

@DL6ER - Sorry for the late reply.

Here is the output you requested - I hope I did it right :)

[New LWP 20695]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
__GI___poll (timeout=-1, nfds=9, fds=0x1e0c1d8) at ../sysdeps/unix/sysv/linux/poll.c:29
29      ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb) handle SIGHUP nostop SIGPIPE nostop
Signal        Stop      Print   Pass to program Description
SIGHUP        No        Yes     Yes             Hangup
SIGPIPE       No        Yes     Yes             Broken pipe
(gdb) continue
Continuing.
[New Thread 0x72dff460 (LWP 20716)]
[Thread 0x72dff460 (LWP 20716) exited]
[New Thread 0x72dff460 (LWP 20717)]
[Thread 0x72dff460 (LWP 20717) exited]
[New Thread 0x72dff460 (LWP 20718)]
[Thread 0x72dff460 (LWP 20718) exited]
[New Thread 0x72dff460 (LWP 20719)]
[Thread 0x72dff460 (LWP 20719) exited]
[New Thread 0x72dff460 (LWP 20720)]
[Thread 0x72dff460 (LWP 20720) exited]
[New Thread 0x72dff460 (LWP 20721)]
[Thread 0x72dff460 (LWP 20721) exited]
[New Thread 0x72dff460 (LWP 20722)]
[Thread 0x72dff460 (LWP 20722) exited]
[New Thread 0x72dff460 (LWP 20725)]
[Thread 0x72dff460 (LWP 20725) exited]
[New Thread 0x72dff460 (LWP 20726)]
[Thread 0x72dff460 (LWP 20726) exited]
[New Thread 0x72dff460 (LWP 20727)]
[Thread 0x72dff460 (LWP 20727) exited]
[New Thread 0x72dff460 (LWP 20728)]
[Thread 0x72dff460 (LWP 20728) exited]
[New Thread 0x72dff460 (LWP 20729)]
[Thread 0x72dff460 (LWP 20729) exited]
[New Thread 0x72dff460 (LWP 20730)]
[Thread 0x72dff460 (LWP 20730) exited]
[New Thread 0x72dff460 (LWP 20731)]
[Thread 0x72dff460 (LWP 20731) exited]
[New Thread 0x72dff460 (LWP 20732)]
[Thread 0x72dff460 (LWP 20732) exited]
[New Thread 0x72dff460 (LWP 20733)]
[Thread 0x72dff460 (LWP 20733) exited]
[New Thread 0x72dff460 (LWP 20734)]
[Thread 0x72dff460 (LWP 20734) exited]
[New Thread 0x72dff460 (LWP 20735)]
[Detaching after fork from child process 20736]

Thread 5 "database" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x75201460 (LWP 20693)]
__strchrnul (s=0x1 <error: Cannot access memory at address 0x1>, c_in=37) at strchrnul.c:50
50      strchrnul.c: No such file or directory.
(gdb) p linebuffer
$1 = 0x0
(gdb) p num
$2 = num
(gdb) p ip
No symbol "ip" in current context.
(gdb) p iface
No symbol "iface" in current context.
(gdb) p hwaddr
No symbol "hwaddr" in current context.
(gdb)

Edit:

Also, maybe this helps. These strange MAC's were always there, but maybe with the last FTL update something changed ...

1

DL6ER commented 4 years ago

@ionutgalita No worries. Unfortunately, the crash happened in another place this time so the instructions I gave you were useless. Could you try this again maybe once or twice? And please always include a backtrace output as well.

pralor-bot commented 4 years ago

This issue has been mentioned on Pi-hole Userspace. There might be relevant details there:

https://discourse.pi-hole.net/t/ftl-crash-after-update-v4-3-1/29920/28

d4r1as commented 4 years ago

@DL6ER I updated FTL to version vDev-81c4eac and everything seems to work now.

I will report again tomorrow to confirm.

TylerVigario commented 4 years ago

Seems like I got to the same place as above

[root@tylervigario ~]# pihole restartdns
  [✓] Restarting DNS server
[root@tylervigario ~]# sudo gdb -p $(pidof pihole-FTL)
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-6.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 13040
[New LWP 13041]
[New LWP 13042]
[New LWP 13043]
[New LWP 13044]
[New LWP 13045]
[New LWP 13046]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

warning: Loadable section ".note.gnu.property" outside of ELF segments
0x00007f8d809a1211 in poll () from /lib64/libc.so.6
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-72.el8_1.1.x86_64 sssd-client-2.2.0-19.el8.x86_64
(gdb) handle SIGHUP nostop SIGPIPE nostop
Signal        Stop      Print   Pass to program Description
SIGHUP        No        Yes     Yes             Hangup
SIGPIPE       No        Yes     Yes             Broken pipe
(gdb) continue
Continuing.
[Detaching after fork from child process 13069]

Thread 5 "database" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f8d7dd4c700 (LWP 13044)]
0x00007f8d809bd43a in __vasprintf_chk () from /lib64/libc.so.6
(gdb) continue
Continuing.
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments

Thread 5 "database" received signal SIGABRT, Aborted.
0x00007f8d808e78df in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: yum debuginfo-install libblkid-2.32.1-17.el8.x86_64 libgcc-8.3.1-4.5.el8.x86_64 libmount-2.32.1-17.el8.x86_64 libselinux-2.9-2.1.el8.x86_64 libuuid-2.32.1-17.el8.x86_64 pcre2-10.32-1.el8.x86_64 systemd-libs-239-18.el8_1.4.x86_64
(gdb) backtrace
#0  0x00007f8d808e78df in raise () from /lib64/libc.so.6
#1  0x00007f8d808d1cf5 in abort () from /lib64/libc.so.6
#2  0x000056264d42f7f6 in SIGSEGV_handler (sig=<optimized out>, si=<optimized out>, unused=<optimized out>)
    at src/signals.c:80
#3  <signal handler called>
#4  0x00007f8d809bd43a in __vasprintf_chk () from /lib64/libc.so.6
#5  0x00007f8d809bd313 in __asprintf_chk () from /lib64/libc.so.6
#6  0x000056264d424c10 in asprintf (__fmt=0x56264d5816c8 "SELECT id FROM network WHERE hwaddr = '%s';",
    __ptr=0x7f8d00000001) at /usr/include/x86_64-linux-gnu/bits/stdio2.h:178
#7  parse_neighbor_cache () at src/database/network-table.c:382
#8  0x000056264d428e03 in DB_thread (val=<optimized out>) at src/database/database-thread.c:68
#9  0x00007f8d80c802de in start_thread () from /lib64/libpthread.so.0
#10 0x00007f8d809ac133 in clone () from /lib64/libc.so.6
(gdb) p linebuffer
$1 = 0x0
(gdb) p num
No symbol "num" in current context.
(gdb) p ip
No symbol "ip" in current context.
(gdb) p iface
No symbol "iface" in current context.
(gdb) p hwaddr
No symbol "hwaddr" in current context.
(gdb)
d4r1as commented 4 years ago

@MeekLogic

It happend to me also, run pihole -r (repair) and everything should be fixed.

TylerVigario commented 4 years ago

@MeekLogic

It happend to me also, run pihole -r (repair) and everything should be fixed.

Unfortunately, this did not fix the issue for me.

FTL still crashes like crazy and I feel like I am not getting the desired output from the debugger.

d4r1as commented 4 years ago

@MeekLogic

Sorry to hear that. Worked for me (at least for the last ~9 hours)

I am on release/5.0 branch (81c4eac).

In case you use DHCP, try flushing Network table (Settings > Flush network table), restart FTP from cli sudo service pihole-FTL restartand after run again sudo pihole -r.

This is exactly what I did last night and now it seems to work.

TylerVigario commented 4 years ago

@ionutgalita Thanks for the help but I'm pretty sure @DL6ER just needs the right debug output from one of us. He's stated the "core" teams' inability to produce it "in-house", so that means it lands on one of us to work with them.

d4r1as commented 4 years ago

@DL6ER

Yes, I know. Watching closely to see if my installation crashes again.

DL6ER commented 4 years ago

Ah, yeah, sorry. We coded FTL asynchronously multi-threaded so it can do multiple tasks at the same time independently. This is to ensure highest performance and low delays for DNS resolution. Debugging such applications is generally more tricky.

When you see an output like

Thread 5 "database" received signal ...

this indicates in which thread the issue occurred. Thurs does not mean that gdb will also analyze this thread. Extract the number from above's message (5) and switch the context using:

thread 5

Then run the p ... commands again. Debugging multi-threaded code can become much more complex, but let's that we won't need this for now.

pralor-bot commented 4 years ago

This issue has been mentioned on Pi-hole Userspace. There might be relevant details there:

https://discourse.pi-hole.net/t/ftl-crash-after-update-v4-3-1/29920/47

FizzGiGG commented 4 years ago

FWIW I do not run, nor have I ever, DHCP Server. My crashing issue started last night with the update to vDev-81c4eac, I believe.

https://tricorder.pi-hole.net/zo2xrt03ql

TylerVigario commented 4 years ago

@DL6ER Got it, thanks for that information.

If you need further information just let me know. I'll do my best to be thorough and prompt.

Thread 5 "database" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f426655c700 (LWP 1781)]
0x00007f42691cd43a in __vasprintf_chk () from /lib64/libc.so.6
(gdb) thread 5
[Switching to thread 5 (Thread 0x7f426655c700 (LWP 1781))]
#0  0x00007f42691cd43a in __vasprintf_chk () from /lib64/libc.so.6
(gdb) backtrace
#0  0x00007f42691cd43a in __vasprintf_chk () from /lib64/libc.so.6
#1  0x00007f42691cd313 in __asprintf_chk () from /lib64/libc.so.6
#2  0x0000564f91ca0c10 in asprintf (__fmt=0x564f91dfd6c8 "SELECT id FROM network WHERE hwaddr = '%s';",
    __ptr=0x7f4200000001) at /usr/include/x86_64-linux-gnu/bits/stdio2.h:178
#3  parse_neighbor_cache () at src/database/network-table.c:382
#4  0x0000564f91ca4e03 in DB_thread (val=<optimized out>) at src/database/database-thread.c:68
#5  0x00007f42694902de in start_thread () from /lib64/libpthread.so.0
#6  0x00007f42691bc133 in clone () from /lib64/libc.so.6
(gdb) p linebuffer
$1 = 0x0
(gdb) p num
No symbol "num" in current context.
(gdb) p ip
No symbol "ip" in current context.
(gdb) p iface
No symbol "iface" in current context.
(gdb) p hwaddr
No symbol "hwaddr" in current context.
(gdb) continue
Continuing.
[New Thread 0x7f425df34700 (LWP 7260)]

Thread 5 "database" received signal SIGABRT, Aborted.
0x00007f42690f78df in raise () from /lib64/libc.so.6
(gdb) thread 5
[Switching to thread 5 (Thread 0x7f426655c700 (LWP 1781))]
#0  0x00007f42690f78df in raise () from /lib64/libc.so.6
(gdb) backtrace
#0  0x00007f42690f78df in raise () from /lib64/libc.so.6
#1  0x00007f42690e1cf5 in abort () from /lib64/libc.so.6
#2  0x0000564f91cab7f6 in SIGSEGV_handler (sig=<optimized out>, si=<optimized out>, unused=<optimized out>)
    at src/signals.c:80
#3  <signal handler called>
#4  0x00007f42691cd43a in __vasprintf_chk () from /lib64/libc.so.6
#5  0x00007f42691cd313 in __asprintf_chk () from /lib64/libc.so.6
#6  0x0000564f91ca0c10 in asprintf (__fmt=0x564f91dfd6c8 "SELECT id FROM network WHERE hwaddr = '%s';",
    __ptr=0x7f4200000001) at /usr/include/x86_64-linux-gnu/bits/stdio2.h:178
#7  parse_neighbor_cache () at src/database/network-table.c:382
#8  0x0000564f91ca4e03 in DB_thread (val=<optimized out>) at src/database/database-thread.c:68
#9  0x00007f42694902de in start_thread () from /lib64/libpthread.so.0
#10 0x00007f42691bc133 in clone () from /lib64/libc.so.6
(gdb) p linebuffer
$2 = 0x0
(gdb) p num
No symbol "num" in current context.
(gdb) p ip
No symbol "ip" in current context.
(gdb) p iface
No symbol "iface" in current context.
(gdb) p hwaddr
No symbol "hwaddr" in current context.
(gdb) continue
Continuing.
Couldn't get registers: No such process.
Couldn't get registers: No such process.
(gdb) [Thread 0x7f425df34700 (LWP 7260) exited]
[Thread 0x7f426555a700 (LWP 1783) exited]
[Thread 0x7f4265d5b700 (LWP 1782) exited]
[Thread 0x7f426655c700 (LWP 1781) exited]
[Thread 0x7f4266d5d700 (LWP 1780) exited]
[Thread 0x7f4267d5f700 (LWP 1778) exited]
[Thread 0x7f4269acd740 (LWP 1777) exited]

Program terminated with signal SIGABRT, Aborted.
The program no longer exists.
DL6ER commented 4 years ago

Thanks, what you did was almost correct, just don't run a continue in between, because FTL will, well, try to continue and debugging doesn't work because FTL starts to do something else.

d4r1as commented 4 years ago

@DL6ER

I hope is correct now :)

Thread 5 "database" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7513c460 (LWP 17761)]
__strchrnul (s=0x1 <error: Cannot access memory at address 0x1>, c_in=37)
    at strchrnul.c:50
50      strchrnul.c: No such file or directory.
(gdb) thread 5
[Switching to thread 5 (Thread 0x7513c460 (LWP 17761))]
#0  __strchrnul (s=0x1 <error: Cannot access memory at address 0x1>, c_in=37)
    at strchrnul.c:50
50      in strchrnul.c
(gdb) backtrace
#0  __strchrnul (s=0x1 <error: Cannot access memory at address 0x1>, c_in=37)
    at strchrnul.c:50
#1  0x76d44174 in __find_specmb (
    format=0x1 <error: Cannot access memory at address 0x1>)
    at printf-parse.h:108
#2  _IO_vfprintf_internal (s=s@entry=0x7513ba68,
    format=format@entry=0x1 <error: Cannot access memory at address 0x1>,
    ap=..., ap@entry=...) at vfprintf.c:1315
#3  0x76dec024 in __GI___vasprintf_chk (
    result_ptr=result_ptr@entry=0x7513bc60, flags=flags@entry=1,
    format=0x1 <error: Cannot access memory at address 0x1>, format@entry=0x0,
    args=..., args@entry=...) at vasprintf_chk.c:66
#4  0x76debf30 in __asprintf_chk (result_ptr=result_ptr@entry=0x7513bc60,
    flags=flags@entry=1,
    format=0x1 <error: Cannot access memory at address 0x1>)
    at asprintf_chk.c:32
#5  0x004edb2a in asprintf (
    __fmt=0x5d85d4 "SELECT id FROM network WHERE hwaddr = '%s';",
    __ptr=0x7513bc60) at /usr/arm-linux-gnueabihf/include/bits/stdio2.h:178
#6  parse_neighbor_cache () at src/database/network-table.c:382
#7  0x004f0ae0 in DB_thread (val=<optimized out>)
    at src/database/database-thread.c:68
#8  0x76e57494 in start_thread (arg=0x7513c460) at pthread_create.c:486
--Type <RET> for more, q to quit, c to continue without paging--c
#9  0x76dda578 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:73 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) p num
$1 = num
(gdb) p ip
No symbol "ip" in current context.
(gdb) p iface
No symbol "iface" in current context.
(gdb) p hwaddr
No symbol "hwaddr" in current context.
(gdb) p linebuffer
$2 = 0x0
(gdb)
DL6ER commented 4 years ago

Hmm, this was done correctly, but it still doesn't show the information I wanted to see, but, well, this is going to be a bit complicated, could you try:

pihole checkout ftl fix/neighcrash

?

This will also print some more debugging output.

edit Renamed branch. Just in case someone else tries this as well.

d4r1as commented 4 years ago

@DL6ER

pi@pi:~ $ pihole checkout ftl fix/neigh_crash
  Please note that changing branches severely alters your Pi-hole subsystems
  Features that work on the master branch, may not on a development branch
  This feature is NOT supported unless a Pi-hole developer explicitly asks!
  Have you read and understood this? [y/N] y

  [✗] Requested branch "fix/neigh_crash" is not available
  [i] Available branches for FTL are:
      - FTLDNS
      - bughaunting/overTime
      - development
      - feature/deb-and-rpm
      - fix/dnssec-retry-crash
      - fix/issue_template
      - fix/msatter_crazy_IPv6
      - fix/neigh_crash
      - master
      - master-find-missing-reference
      - new/GeoIP
      - new/all_clients_network_table_no-auto
      - new/http
      - release/v5.0
      - revert-689-ltaub
      - tweak/external_blocked_IPs
      - tweak/log_upstream_errors
      - tweak/remove-unused-FTL-var
      - update/dnsmasq
DL6ER commented 4 years ago

You were likely too fast for our automated build system, please try again.

d4r1as commented 4 years ago

@DL6ER

I am .. same result until now

yubiuser commented 4 years ago

I'm not affected by the bug, but I can checkout the branch

nanopi@nanopi:~$ pihole checkout ftl fix/neigh_crash
  Please note that changing branches severely alters your Pi-hole subsystems
  Features that work on the master branch, may not on a development branch
  This feature is NOT supported unless a Pi-hole developer explicitly asks!
  Have you read and understood this? [y/N] y

  [✓] Branch fix/neigh_crash exists
  [✓] Downloading and Installing FTL
  [✓] Restarting pihole-FTL service...
  [✓] Enabling pihole-FTL service to start on reboot...
DL6ER commented 4 years ago

@ionutgalita No need to ping me every time ;-)

[✗] Requested branch "fix/neigh_crash" is not available

but below there is:

  • fix/neigh_crash

Can you check if maybe the underscore was copied as some strange special character? This happens with github from time to time, maybe try

pihole checkout ftl fix/neighcrash

(without underscore), which I just created.

@yubiuser I renamed the branch, the other one will not be maintained, just to mention.

d4r1as commented 4 years ago

@DL6ER

Still no luck ... I can checkout any other branch ... except this one.

DL6ER commented 4 years ago

I will trigger a complete re-build, please standby for 3-5 minutes and then try again! (with fix/neighcrash)

DL6ER commented 4 years ago

Okay,most builds are done, just the musl and docker ones are missing (they take some more time), please try again @ionutgalita

d4r1as commented 4 years ago

@DL6ER

Checkout worked now - but FTL is not crashing anymore 😠 😠 😠

Will keep debugging and let you know

DL6ER commented 4 years ago

Well,

FTL is not crashing anymore

is a good thing ;-)

I do did change something (skip failed memory allocations of the cache parser), even when I haven't expected this to change anything since nothing has changed in this region of the code in weeks/months, but if it fixes it now ... okay with me. I will post this in the other issue thread on Discourse and check if the crash disappears for the others as well.

TylerVigario commented 4 years ago

@DL6ER Here is my results from fix/neighcrash

BTW, it seems to automatically switch to the thread context that raises the error.

[root@tylervigario ~]# sudo gdb -p $(pidof pihole-FTL)
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-6.el8
Attaching to process 24247
[New LWP 24248]
[New LWP 24249]
[New LWP 24250]
[New LWP 24251]
[New LWP 24252]
[New LWP 24253]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

warning: Loadable section ".note.gnu.property" outside of ELF segments
0x00007f8816531211 in poll () from /lib64/libc.so.6
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-72.el8_1.1.x86_64 sssd-client-2.2.0-19.el8.x86_64
(gdb) handle SIGHUP nostop SIGPIPE nostop
Signal        Stop      Print   Pass to program Description
SIGHUP        No        Yes     Yes             Hangup
SIGPIPE       No        Yes     Yes             Broken pipe
(gdb) continue
Continuing.
[New Thread 0x7f88120d9700 (LWP 24371)]
[Thread 0x7f88120d9700 (LWP 24371) exited]
[Detaching after fork from child process 24372]

Thread 5 "database" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f88138dc700 (LWP 24251)]
0x00007f881654d43a in __vasprintf_chk () from /lib64/libc.so.6
(gdb) info threads
  Id   Target Id                                           Frame
  1    Thread 0x7f8816e4d740 (LWP 24247) "pihole-FTL"      0x00007f8816531211 in poll ()
   from /lib64/libc.so.6
  2    Thread 0x7f88150df700 (LWP 24248) "telnet-IPv4"     0x00007f8816819ca7 in accept ()
   from /lib64/libpthread.so.0
  3    Thread 0x7f88148de700 (LWP 24249) "telnet-IPv6"     0x00007f8816819ca7 in accept ()
   from /lib64/libpthread.so.0
  4    Thread 0x7f88140dd700 (LWP 24250) "socket listener" 0x00007f8816819ca7 in accept ()
   from /lib64/libpthread.so.0
* 5    Thread 0x7f88138dc700 (LWP 24251) "database"        0x00007f881654d43a in __vasprintf_chk ()
   from /lib64/libc.so.6
  6    Thread 0x7f88130db700 (LWP 24252) "housekeeper"     0x00007f881653379f in select ()
   from /lib64/libc.so.6
  7    Thread 0x7f88128da700 (LWP 24253) "DNS client"      0x00007f8816531211 in poll ()
   from /lib64/libc.so.6
(gdb) thread 5
[Switching to thread 5 (Thread 0x7f88138dc700 (LWP 24251))]
#0  0x00007f881654d43a in __vasprintf_chk () from /lib64/libc.so.6
(gdb) backtrace
#0  0x00007f881654d43a in __vasprintf_chk () from /lib64/libc.so.6
#1  0x00007f881654d313 in __asprintf_chk () from /lib64/libc.so.6
#2  0x00005580934cec7c in asprintf (__fmt=0x55809362b7c8 "SELECT id FROM network WHERE hwaddr = '%s';",
    __ptr=0x7f8800000002) at /usr/include/x86_64-linux-gnu/bits/stdio2.h:178
#3  parse_neighbor_cache () at src/database/network-table.c:399
#4  0x00005580934d2ed3 in DB_thread (val=<optimized out>) at src/database/database-thread.c:68
#5  0x00007f88168102de in start_thread () from /lib64/libpthread.so.0
#6  0x00007f881653c133 in clone () from /lib64/libc.so.6
(gdb) p linebuffer
$1 = 0x0
(gdb) p num
No symbol "num" in current context.
(gdb) p ip
No symbol "ip" in current context.
(gdb) p iface
No symbol "iface" in current context.
(gdb) p hwaddr
No symbol "hwaddr" in current context.
(gdb)
DL6ER commented 4 years ago

Could you check the last lines of /var/log/pihole-FTL.log when the crash happens? There should be additional output there.

TylerVigario commented 4 years ago

Sure, is this what you are looking for?

[2020-04-01 12:05:00.215 23190] linebuffer: 0x7ff4d4036d00 (linebuffersize: 120, strlen: 55), arpfp: 0x7ff4d4020760
[2020-04-01 12:05:00.219 23190] linebuffer: 0x7ff4d4036d00 (linebuffersize: 120, strlen: 60), arpfp: 0x7ff4d4020760
[2020-04-01 12:05:00.219 23190] linebuffer: 0x7ff4d4036d00 (linebuffersize: 120, strlen: 60), arpfp: 0x7ff4d4020760
[2020-04-01 12:05:00.220 23190] linebuffer: 0x7ff4d4036d00 (linebuffersize: 120, strlen: 55), arpfp: 0x7ff4d4020760
[2020-04-01 12:05:00.220 23190] linebuffer: 0x7ff4d4036d00 (linebuffersize: 120, strlen: 55), arpfp: 0x7ff4d4020760
[2020-04-01 12:05:00.221 23190] linebuffer: 0x7ff4d4036d00 (linebuffersize: 120, strlen: 55), arpfp: 0x7ff4d4020760
[2020-04-01 12:05:00.224 23190] linebuffer: 0x7ff4d4036d00 (linebuffersize: 120, strlen: 56), arpfp: 0x7ff4d4020760
[2020-04-01 12:05:00.225 23190] linebuffer: 0x7ff4d4036d00 (linebuffersize: 120, strlen: 59), arpfp: 0x7ff4d4020760
[2020-04-01 12:05:00.230 23190] linebuffer: 0x7ff4d4036d00 (linebuffersize: 120, strlen: 59), arpfp: 0x7ff4d4020760
[2020-04-01 12:05:00.230 23190] linebuffer: 0x7ff4d4036d00 (linebuffersize: 120, strlen: 55), arpfp: 0x7ff4d4020760
[2020-04-01 12:05:00.231 23190] linebuffer: 0x7ff4d4036d00 (linebuffersize: 120, strlen: 60), arpfp: 0x7ff4d4020760
[2020-04-01 12:05:00.232 23190] linebuffer: 0x7ff4d4036d00 (linebuffersize: 120, strlen: 60), arpfp: 0x7ff4d4020760
[2020-04-01 12:05:00.232 23190] linebuffer: 0x7ff4d4036d00 (linebuffersize: 120, strlen: 55), arpfp: 0x7ff4d4020760
[2020-04-01 12:05:00.233 23190] linebuffer: 0x7ff4d4036d00 (linebuffersize: 120, strlen: 58), arpfp: 0x7ff4d4020760
[2020-04-01 12:05:00.233 23190] linebuffer: 0x7ff4d4036d00 (linebuffersize: 120, strlen: 60), arpfp: 0x7ff4d4020760
[2020-04-01 12:05:00.234 23190] linebuffer: 0x7ff4d4036d00 (linebuffersize: 120, strlen: 56), arpfp: 0x7ff4d4020760
[2020-04-01 12:05:00.235 23190] linebuffer: 0x7ff4d4036d00 (linebuffersize: 120, strlen: 79), arpfp: 0x7ff4d4020760
[2020-04-01 12:05:00.235 23190] linebuffer: 0x7ff4d4036d00 (linebuffersize: 120, strlen: 79), arpfp: 0x7ff4d4020760
[2020-04-01 12:05:00.236 23190] linebuffer: 0x7ff4d4036d00 (linebuffersize: 120, strlen: 79), arpfp: 0x7ff4d4020760
[2020-04-01 12:05:00.236 23190] linebuffer: 0x7ff4d4036d00 (linebuffersize: 120, strlen: 74), arpfp: 0x7ff4d4020760
[2020-04-01 12:05:00.237 23190] linebuffer: 0x7ff4d4036d00 (linebuffersize: 120, strlen: 59), arpfp: 0x7ff4d4020760
[2020-04-01 12:05:00.238 23190] linebuffer: 0x7ff4d4036d00 (linebuffersize: 120, strlen: 72), arpfp: 0x7ff4d4020760
DL6ER commented 4 years ago

Slightly above this, there should be something starting in linebuffer... directly above. edit: You updated your post, I updated my post.

Please update/re-checkout, I pushed another commit with even more debugging output.

Everything immediately above the !!!!!!! line will be of interest.

TylerVigario commented 4 years ago

gdb

Thread 2 "telnet-IPv4" received signal SIGABRT, Aborted.
[Switching to Thread 0x7f260a9af700 (LWP 26610)]
0x00007f260bd478df in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: yum debuginfo-install libblkid-2.32.1-17.el8.x86_64 libgcc-8.3.1-4.5.el8.x86_64 libmount-2.32.1-17.el8.x86_64 libselinux-2.9-2.1.el8.x86_64 libuuid-2.32.1-17.el8.x86_64 pcre2-10.32-1.el8.x86_64 systemd-libs-239-18.el8_1.4.x86_64
(gdb) backtrace
#0  0x00007f260bd478df in raise () from /lib64/libc.so.6
#1  0x00007f260bd31cf5 in abort () from /lib64/libc.so.6
#2  0x00007f260bd9138a in __malloc_assert () from /lib64/libc.so.6
#3  0x00007f260bd9678d in calloc () from /lib64/libc.so.6
#4  0x00005617901e9106 in FTLcalloc (nmemb=nmemb@entry=1, size=size@entry=4,
    file=file@entry=0x56179033e693 "src/api/socket.c",
    function=function@entry=0x56179033ec10 <__FUNCTION__.7502> "telnet_listening_thread_IPv4",
    line=line@entry=460) at src/memory.c:54
#5  0x00005617901e49d6 in telnet_listening_thread_IPv4 (args=<optimized out>) at src/api/socket.c:460
#6  0x00007f260c0e02de in start_thread () from /lib64/libpthread.so.0
#7  0x00007f260be0c133 in clone () from /lib64/libc.so.6
(gdb) p linebuffer
$1 = 0x0
(gdb) p num
No symbol "num" in current context.
(gdb) p ip
No symbol "ip" in current context.
(gdb) p iface
No symbol "iface" in current context.
(gdb) p hwaddr
No symbol "hwaddr" in current context.

pihole-FTL.log

[2020-04-01 12:20:00.355 26404] linebuffer: 0x7f3e880207d0 (linebuffersize: 120, strlen: 55), arpfp: 0x7f3e880206c0
[2020-04-01 12:20:00.356 26404] ip: "0x7f3e95ac3d50", iface: "0x7f3e95ac3e30", hwaddr: "0x7f3e95ac3dc0"
[2020-04-01 12:20:00.356 26404] ip: "192.168.1.41", iface: "enp3s0", hwaddr: "3c:8d:20:53:eb:4e"
[2020-04-01 12:20:00.360 26404] linebuffer: 0x7f3e880207d0 (linebuffersize: 120, strlen: 60), arpfp: 0x7f3e880206c0
[2020-04-01 12:20:00.360 26404] ip: "0x7f3e95ac3d50", iface: "0x7f3e95ac3e30", hwaddr: "0x7f3e95ac3dc0"
[2020-04-01 12:20:00.361 26404] ip: "192.168.1.136", iface: "enp3s0", hwaddr: "44:65:0d:03:18:5e"
[2020-04-01 12:20:00.361 26404] linebuffer: 0x7f3e880207d0 (linebuffersize: 120, strlen: 60), arpfp: 0x7f3e880206c0
[2020-04-01 12:20:00.362 26404] ip: "0x7f3e95ac3d50", iface: "0x7f3e95ac3e30", hwaddr: "0x7f3e95ac3dc0"
[2020-04-01 12:20:00.362 26404] ip: "192.168.1.117", iface: "enp3s0", hwaddr: "5c:49:7d:82:47:a4"
[2020-04-01 12:20:00.363 26404] linebuffer: 0x7f3e880207d0 (linebuffersize: 120, strlen: 55), arpfp: 0x7f3e880206c0
[2020-04-01 12:20:00.363 26404] ip: "0x7f3e95ac3d50", iface: "0x7f3e95ac3e30", hwaddr: "0x7f3e95ac3dc0"
[2020-04-01 12:20:00.364 26404] ip: "192.168.1.25", iface: "enp3s0", hwaddr: "7c:d9:5c:b5:41:2a"
[2020-04-01 12:20:00.364 26404] linebuffer: 0x7f3e880207d0 (linebuffersize: 120, strlen: 59), arpfp: 0x7f3e880206c0
[2020-04-01 12:20:00.365 26404] ip: "0x7f3e95ac3d50", iface: "0x7f3e95ac3e30", hwaddr: "0x7f3e95ac3dc0"
[2020-04-01 12:20:00.365 26404] ip: "192.168.1.40", iface: "enp3s0", hwaddr: "88:3d:24:02:2f:10"
[2020-04-01 12:20:00.366 26404] linebuffer: 0x7f3e880207d0 (linebuffersize: 120, strlen: 55), arpfp: 0x7f3e880206c0
[2020-04-01 12:20:00.366 26404] ip: "0x7f3e95ac3d50", iface: "0x7f3e95ac3e30", hwaddr: "0x7f3e95ac3dc0"
[2020-04-01 12:20:00.366 26404] ip: "192.168.1.77", iface: "enp3s0", hwaddr: "18:b4:30:20:1e:c7"
[2020-04-01 12:20:00.370 26404] linebuffer: 0x7f3e880207d0 (linebuffersize: 120, strlen: 56), arpfp: 0x7f3e880206c0
[2020-04-01 12:20:00.370 26404] ip: "0x7f3e95ac3d50", iface: "0x7f3e95ac3e30", hwaddr: "0x7f3e95ac3dc0"
[2020-04-01 12:20:00.370 26404] ip: "192.168.1.111", iface: "enp3s0", hwaddr: "7c:d9:5c:b9:dd:6e"
[2020-04-01 12:20:00.371 26404] linebuffer: 0x7f3e880207d0 (linebuffersize: 120, strlen: 55), arpfp: 0x7f3e880206c0
[2020-04-01 12:20:00.372 26404] ip: "0x7f3e95ac3d50", iface: "0x7f3e95ac3e30", hwaddr: "0x7f3e95ac3dc0"
[2020-04-01 12:20:00.372 26404] ip: "192.168.1.33", iface: "enp3s0", hwaddr: "10:08:c1:b5:cf:74"
[2020-04-01 12:20:00.378 26404] linebuffer: 0x7f3e880207d0 (linebuffersize: 120, strlen: 59), arpfp: 0x7f3e880206c0
[2020-04-01 12:20:00.378 26404] ip: "0x7f3e95ac3d50", iface: "0x7f3e95ac3e30", hwaddr: "0x7f3e95ac3dc0"
[2020-04-01 12:20:00.379 26404] ip: "192.168.1.24", iface: "enp3s0", hwaddr: "14:b3:1f:0c:e9:bf"
[2020-04-01 12:20:00.379 26404] linebuffer: 0x7f3e880207d0 (linebuffersize: 120, strlen: 55), arpfp: 0x7f3e880206c0
[2020-04-01 12:20:00.380 26404] ip: "0x7f3e95ac3d50", iface: "0x7f3e95ac3e30", hwaddr: "0x7f3e95ac3dc0"
[2020-04-01 12:20:00.380 26404] ip: "192.168.1.47", iface: "enp3s0", hwaddr: "7c:d9:5c:b5:41:2a"
[2020-04-01 12:20:00.381 26404] linebuffer: 0x7f3e880207d0 (linebuffersize: 120, strlen: 56), arpfp: 0x7f3e880206c0
[2020-04-01 12:20:00.381 26404] ip: "0x7f3e95ac3d50", iface: "0x7f3e95ac3e30", hwaddr: "0x7f3e95ac3dc0"
[2020-04-01 12:20:00.382 26404] ip: "192.168.1.114", iface: "enp3s0", hwaddr: "ec:f0:0e:50:7f:87"
[2020-04-01 12:20:00.382 26404] linebuffer: 0x7f3e880207d0 (linebuffersize: 120, strlen: 56), arpfp: 0x7f3e880206c0
[2020-04-01 12:20:00.383 26404] ip: "0x7f3e95ac3d50", iface: "0x7f3e95ac3e30", hwaddr: "0x7f3e95ac3dc0"
[2020-04-01 12:20:00.383 26404] ip: "192.168.1.103", iface: "enp3s0", hwaddr: "40:b4:cd:34:de:d5"
[2020-04-01 12:20:00.384 26404] linebuffer: 0x7f3e880207d0 (linebuffersize: 120, strlen: 55), arpfp: 0x7f3e880206c0
[2020-04-01 12:20:00.385 26404] ip: "0x7f3e95ac3d50", iface: "0x7f3e95ac3e30", hwaddr: "0x7f3e95ac3dc0"
[2020-04-01 12:20:00.385 26404] ip: "192.168.1.31", iface: "enp3s0", hwaddr: "7c:d9:5c:b9:dd:6e"
[2020-04-01 12:20:00.385 26404] linebuffer: 0x7f3e880207d0 (linebuffersize: 120, strlen: 58), arpfp: 0x7f3e880206c0
[2020-04-01 12:20:00.386 26404] ip: "0x7f3e95ac3d50", iface: "0x7f3e95ac3e30", hwaddr: "0x7f3e95ac3dc0"
[2020-04-01 12:20:00.386 26404] ip: "192.168.1.1", iface: "enp3s0", hwaddr: "48:f8:b3:ff:b7:7d"
[2020-04-01 12:20:00.387 26404] linebuffer: 0x7f3e880207d0 (linebuffersize: 120, strlen: 56), arpfp: 0x7f3e880206c0
[2020-04-01 12:20:00.387 26404] ip: "0x7f3e95ac3d50", iface: "0x7f3e95ac3e30", hwaddr: "0x7f3e95ac3dc0"
[2020-04-01 12:20:00.388 26404] ip: "2601:201:4300:79:7ed9:5cff:feb9:dd6e", iface: "enp3s0", hwaddr: "48:f8:b3:ff:b7:7d"
[2020-04-01 12:20:00.389 26404] linebuffer: 0x7f3e880207d0 (linebuffersize: 120, strlen: 56), arpfp: 0x7f3e880206c0
[2020-04-01 12:20:00.389 26404] ip: "0x7f3e95ac3d50", iface: "0x7f3e95ac3e30", hwaddr: "0x7f3e95ac3dc0"
[2020-04-01 12:20:00.390 26404] ip: "2601:201:4300:79:76d4:35ff:fe5e:cf88", iface: "enp3s0", hwaddr: "48:f8:b3:ff:b7:7d"
[2020-04-01 12:20:00.390 26404] linebuffer: 0x7f3e880207d0 (linebuffersize: 120, strlen: 79), arpfp: 0x7f3e880206c0
[2020-04-01 12:20:00.391 26404] ip: "0x7f3e95ac3d50", iface: "0x7f3e95ac3e30", hwaddr: "0x7f3e95ac3dc0"
[2020-04-01 12:20:00.391 26404] ip: "2601:201:4300:79:bc49:94e8:13f3:fc3e", iface: "enp3s0", hwaddr: "38:c9:86:0b:c3:d7"
[2020-04-01 12:20:00.392 26404] linebuffer: 0x7f3e880207d0 (linebuffersize: 120, strlen: 79), arpfp: 0x7f3e880206c0
[2020-04-01 12:20:00.392 26404] ip: "0x7f3e95ac3d50", iface: "0x7f3e95ac3e30", hwaddr: "0x7f3e95ac3dc0"
[2020-04-01 12:20:00.392 26404] ip: "fe80::4af8:b3ff:feff:b77d", iface: "enp3s0", hwaddr: "48:f8:b3:ff:b7:7d"
[2020-04-01 12:20:00.393 26404] linebuffer: 0x7f3e880207d0 (linebuffersize: 120, strlen: 79), arpfp: 0x7f3e880206c0
[2020-04-01 12:20:00.394 26404] ip: "0x7f3e95ac3d50", iface: "0x7f3e95ac3e30", hwaddr: "0x7f3e95ac3dc0"
[2020-04-01 12:20:00.394 26404] ip: "2601:201:4300:79:986c:aaa3:67c7:5942", iface: "enp3s0", hwaddr: "04:4b:ed:a1:08:e5"
DL6ER commented 4 years ago

Memory allocation fails here (but this time somewhere else), can you check how your memory usage looks like? Just before the crash?

TylerVigario commented 4 years ago

Unfortunately, I'm not too sure about that.

Memory usage hovers around 2.1GB (8GB total) even when running Pihole multiple times.

If you have any more precise methods I'd be more than happy to do that.

DL6ER commented 4 years ago

For those of you still seeing the crash, could you try something else as well? We still need to isolate where the code leading to the crash was introduced. I ran pihole-FTL through a full memory leak check and nothing was found, so a leak seems rather unlikely (however, it may still be triggerable only under very specific conditions).

pihole checkout ftl 71e8498

pihole checkout ftl f9476dd

pihole checkout ftl 7759a76

pihole checkout ftl ec0579a

The first is before #708 and #711 were merged, the second is with only #708, the third is with only #711, the fourth one is with both of them.

If one of them fails to check out, they are still being built. I pushed all four at the same time, this will keep our binary building jobs busy for some time. In this case, please try again later.

TylerVigario commented 4 years ago

Did in reverse order

71e8498 did not immediately crash

f9476dd did not immediately crash

7759a76 did not immediately crash

ec0579a

Thread 5 "database" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ff583cdc700 (LWP 1276)]
0x00007ff58694d43a in __vasprintf_chk () from /lib64/libc.so.6
(gdb) backtrace
#0  0x00007ff58694d43a in __vasprintf_chk () from /lib64/libc.so.6
#1  0x00007ff58694d313 in __asprintf_chk () from /lib64/libc.so.6
#2  0x000056488c500c38 in asprintf (__fmt=0x56488c65cd98 "SELECT id FROM network WHERE hwaddr = '%s';",
    __ptr=0x7ff500000001) at /usr/include/x86_64-linux-gnu/bits/stdio2.h:178
#3  parse_neighbor_cache () at src/database/network-table.c:325
#4  0x000056488c50483d in DB_thread (val=<optimized out>) at src/database/database-thread.c:64
#5  0x00007ff586c102de in start_thread () from /lib64/libpthread.so.0
#6  0x00007ff58693c133 in clone () from /lib64/libc.so.6
DL6ER commented 4 years ago

Thanks. Another two then:

pihole checkout ftl 149f656
pihole checkout ftl c3147cc

(the two commits between 7759a76 and ec0579a)

You should try the 149f656 first, as it is the first one that the CI will build binaries for (will be available earlier).

TylerVigario commented 4 years ago

149f656 did not immediately crash

c3147cc crashed immediately

DL6ER commented 4 years ago

This is a really tough bug. The only difference in c3147cc is the introduction of the SQLite3 extension for CIDR filtering. Please test, once again,

pihole checkout ftl fix/neighcrash

I added an explicit initialization of the SQLite3 library here. We never used to do this, however, it's the only thing I could see right now.

TylerVigario commented 4 years ago

The same results, unfortunately.

DL6ER commented 4 years ago

Okay, thanks for your assistance, another try...

This time, I added a new config option:

USE_SQLITE3_EXTENSION

defaulting to true.

The expected behavior is that not setting this (or setting it to true) in /etc/pihole/pihole-FTL.conf crashes while FTL runs stable when setting USE_SQLITE3_EXTENSION=false in said file.

Can you confirm this? If so, then we know that it is the registration of the SQLite3 extension we wrote recently. It's completely unclear to me, so far, how this can crash for some users but not for the majority. And also why it crashes FTL at some very different location in the code.

TylerVigario commented 4 years ago

Both setups crashed the same.

I updated to latest fix/neighcrash

And then tried again with USE_SQLITE3_EXTENSION=false in /etc/pihole/pihole-FTL.conf

DL6ER commented 4 years ago

This is bizarre, could you test again with https://github.com/pi-hole/FTL/commit/149f6568804d06ba2532092de63295413bf3b381 if this really doesn't crash?

TylerVigario commented 4 years ago

This is bizarre, could you test again with 149f656 if this really doesn't crash?

My apologies, I was just able to get 149f656 to crash as well.

I'll test previous versions more thoroughly.

DL6ER commented 4 years ago

Well, so, at least we know now that it is not the SQLite3 extension and I reviewed this code now three times very carefully and added some minor optimizations ;-)

Onto the next one...