sm0svx / svxlink

Advanced repeater system software with EchoLink support for Linux including a GUI, Qtel - the Qt EchoLink client
http://svxlink.org/
Other
435 stars 173 forks source link

SVXLink 17.12.1 --logfile= causes a terminal error #363

Closed f5vmr closed 6 years ago

f5vmr commented 6 years ago

Latest master running on a raspberry Pi B+ on Raspbian Stretch works perfectly when run from the terminal command displaying the journal, and works equally well with --daemon, but the addition of --logfile=/var/log/svxlink.log causes a cessation of work after connecting to aprs/echolink servers. No errors reported, but a sudo killall svxlink reports that no process is running. Danilo Bozic reported this initially but I now confirm the problem exists.

sm0svx commented 6 years ago

Could it be that the svxlink process do not have persmission to write to the log file?

sm0svx commented 6 years ago

Or maybe disk full, read-only filesystem or something like that?

f5vmr commented 6 years ago

No, thought of those. I have applied full access right to the folder chain, and the sdcard is 16 GB with 8 Gb free. I have done nothing different to a Jessie install.

f5vmr commented 6 years ago

I have also tried sudo svxlink --daemon --logfile=svxlink.log - the log is written up to the point of declaring connection to the aprs server, then halts. (I was following by tail -f), and a sudo killall svxlink again reports no process.

sm0svx commented 6 years ago

Did a quick test here and I cannot get the same behavior. Does it do the same thing if you run it in the foreground, without --daemon but with --logfile? If that give the same problem, try:

$ gdb /path/to/svxlink
gdb> run --logfile /path/to/logfile
--- possible crash ---
gdb> bt

If you do not have gdb installed you will have to install it.

sm0svx commented 6 years ago

It is unclear which version you use. The title say 17.12.1 but in the text you say "latest master". Which one is it?

f5vmr commented 6 years ago

It was yesterday's version, but not sure if I beat you to your messages of yesterday or not. I'll load gdb tomorrow and give it a whirl. Thanks Tobias.

f5vmr commented 6 years ago

Mornig Tobias - gdb installed as standard in Stretch - result below (gdb) run --logfile /var/log/svxlink.log Starting program: /usr/bin/svxlink --logfile /var/log/svxlink.log [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1". [New Thread 0xb6640450 (LWP 1584)] [Thread 0xb6640450 (LWP 1584) exited]

Thread 1 "svxlink" received signal SIGABRT, Aborted. __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. (gdb) bt

0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51

1 0xb686a824 in __GI_abort () at abort.c:89

2 0xb6ef6280 in Tcl_PanicVA () from /usr/lib/arm-linux-gnueabihf/libtcl8.5.so

3 0xb6ef62c4 in Tcl_Panic () from /usr/lib/arm-linux-gnueabihf/libtcl8.5.so

4 0xb6ed2220 in Tcl_RegisterChannel ()

from /usr/lib/arm-linux-gnueabihf/libtcl8.5.so

5 0xb6edbfb0 in ?? () from /usr/lib/arm-linux-gnueabihf/libtcl8.5.so

Backtrace stopped: previous frame identical to this frame (corrupt stack?) (gdb)

Could this be an operating system error in Stretch or one of its packages?

f5vmr commented 6 years ago

Running gdb without a log process

Starting program: /usr/bin/svxlink [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1". SvxLink v1.6.99.5 Copyright (C) 2003-2017 Tobias Blomberg / SM0SVX

SvxLink comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to redistribute it in accordance with the terms and conditions in the GNU GPL (General Public License) version 2 or later.

Using configuration file: /etc/svxlink/svxlink.conf --- Using sample rate 48000Hz [New Thread 0xb6640450 (LWP 1877)]

Starting logic: RepeaterLogic Loading RX: Rx1 [Thread 0xb6640450 (LWP 1877) exited] Loading TX: Tx1 RepeaterLogic: Event handler script successfully loaded. Connected to APRS server 62.77.224.245 on port 14580 RepeaterLogic: Sending short identification... Tx1: Turning the transmitter ON Tx1: Turning the transmitter OFF [Inferior 1 (process 1862) exited normally] (gdb) terminated with a q command as the repeater was operating normally.

f5vmr commented 6 years ago

I am left puzzled by this situation. I have built a second card with the absolutely identical set up except for changing the boot up sequence for terminal logon as opposed to desktop configuration. The error I pointed out as above does not occur in this configuration.

f5vmr commented 6 years ago

I suppressed the desktop in the original installation and rebooted, but the svxlink still fails when writing to a log, always at the same point. When run in direct mode i.e. sudo svxlink --logfile=/var/log/svxlink.log the log still gets written up to and including connection to the aprs server, then the svxlink terminates with "Aborted" in a new line before displaying a fresh prompt.

sm0svx commented 6 years ago

Sounds like something is wrong with your other card. Should not have anything to do with desktop. The easiest way is probably to reinstall the card. Have you considered using the package based installation method?

s58DB commented 6 years ago

Hello Tobias,

I have same problems with this version of SVXLink. Chris describe vary well what is problem.

cat /etc/os-release PRETTY_NAME="Raspbian GNU/Linux 9 (stretch)" NAME="Raspbian GNU/Linux" VERSION_ID="9" VERSION="9 (stretch)" ID=raspbian ID_LIKE=debian HOME_URL="http://www.raspbian.org/" SUPPORT_URL="http://www.raspbian.org/RaspbianForums" BUG_REPORT_URL="http://www.raspbian.org/RaspbianBugs"

I try run this gdb and get this lines:

----------------------[Start test]---------------------- gdb /usr/bin/svxlink (gdb) run Starting program: /usr/bin/svxlink [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1". SvxLink v1.6.99.8 Copyright (C) 2003-2017 Tobias Blomberg / SM0SVX

SvxLink comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to redistribute it in accordance with the terms and conditions in the GNU GPL (General Public License) version 2 or later.

Using configuration file: /etc/svxlink/svxlink.conf --- Using sample rate 48000Hz [New Thread 0x76414450 (LWP 840)] [Thread 0x76414450 (LWP 840) exited]

Starting logic: SimplexLogic Loading RX: Rx1 Loading TX: Tx1 Loading module "ModuleHelp" into logic "SimplexLogic" Found /usr/lib/arm-linux-gnueabihf/svxlink/ModuleHelp.so Module Help v1.0.0 starting... Loading module "ModuleEchoLink" into logic "SimplexLogic" Found /usr/lib/arm-linux-gnueabihf/svxlink/ModuleEchoLink.so Module EchoLink v1.4.99.0 starting... [New Thread 0x759d7450 (LWP 841)] Loading module "ModuleTclVoiceMail" into logic "SimplexLogic" [Thread 0x759d7450 (LWP 841) exited] Found /usr/lib/arm-linux-gnueabihf/svxlink/ModuleTcl.so Module Tcl v1.0.1 starting... Loading module "ModuleMetarInfo" into logic "SimplexLogic" Found /usr/lib/arm-linux-gnueabihf/svxlink/ModuleMetarInfo.so Module MetarInfo v1.1.0 starting... Loading module "ModuleParrot" into logic "SimplexLogic" Found /usr/lib/arm-linux-gnueabihf/svxlink/ModuleParrot.so Module Parrot v1.1.1 starting... SimplexLogic: Event handler script successfully loaded. Connected to APRS server 109.72.122.37 on port 14580 EchoLink directory status changed to ON [New Thread 0x759d7450 (LWP 842)] [Thread 0x759d7450 (LWP 842) exited] --- EchoLink directory server message: --- EchoLink Server v2.5.9997

ECHOEC2-1: Herndon, VA USA

[Inferior 1 (process 837) exited normally]

----------------------[END]----------------------

And second line with --logfile

----------------------[START test --logfile]---------------------- gdb /usr/bin/svxlink (gdb) run --logfile /home/pi/svxlink.log Starting program: /usr/bin/svxlink --logfile /home/pi/svxlink.log [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1". [New Thread 0x76414450 (LWP 844)] [Thread 0x76414450 (LWP 844) exited] [New Thread 0x759d7450 (LWP 845)] [Thread 0x759d7450 (LWP 845) exited] [New Thread 0x759d7450 (LWP 846)] [Thread 0x759d7450 (LWP 846) exited]

Thread 1 "svxlink" received signal SIGABRT, Aborted. __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

----------------------[END]----------------------

f5vmr commented 6 years ago

Satisfied that the 8gb card works according to plan, I reformatted the 16gb card, and rebuilt it with the November version of Stretch instead of the updated and upgraded august version I had saved. I then went through the build and git clone of 17.12.1.154 again. On completion I am sorry to report the same results as I reported earlier with gdb. I see this error appears in applications other than svxlink. What I will do tomorrow however is run the expand OS on the raspi-config, although Raspbian stretch is supposed to do it on boot up. However df reports a full size card in use in the table.

s58DB commented 6 years ago

One more report. Same thing

cat /etc/os-release PRETTY_NAME="Raspbian GNU/Linux 8 (jessie)" NAME="Raspbian GNU/Linux" VERSION_ID="8" VERSION="8 (jessie)" ID=raspbian ID_LIKE=debian HOME_URL="http://www.raspbian.org/" SUPPORT_URL="http://www.raspbian.org/RaspbianForums" BUG_REPORT_URL="http://www.raspbian.org/RaspbianBugs"

Starting program: /usr/bin/svxlink --logfile=/var/log/svxlink.log [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1". [New Thread 0x764ca450 (LWP 4732)] [Thread 0x764ca450 (LWP 4732) exited] [New Thread 0x755bb450 (LWP 4733)] [New Thread 0x74d2e450 (LWP 4734)] [Thread 0x74d2e450 (LWP 4734) exited] [New Thread 0x74d2e450 (LWP 4735)] [Thread 0x74d2e450 (LWP 4735) exited]

Program received signal SIGABRT, Aborted. 0x7686ff70 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56 56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.

f5vmr commented 6 years ago

That is a Jessie install! I didn't have that problem with raspbian Jessie, but then I hadn't installed 17.12.1.154 until Raspbian Stretch.

s58DB commented 6 years ago

Look my first post I add os-release. All is make from today updated svxlink folder via GIT. Only on Jessie I have previous install svxlink 15.11 and then updated to newer version.

f5vmr commented 6 years ago

Ok - Something is very odd - new card - new install of Raspbian Stretch, with package based install this time, and the same result except for the uprated package libtcl8.6 instead of 8.5. (gdb) run --logfile=/var/log/svxlink.log Starting program: /usr/bin/svxlink --logfile=/var/log/svxlink.log [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1". [New Thread 0xb6591450 (LWP 1449)] [Thread 0xb6591450 (LWP 1449) exited]

Thread 1 "svxlink" received signal SIGABRT, Aborted. __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. (gdb) bt

0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51

1 0xb690c824 in __GI_abort () at abort.c:89

2 0xb6ed2cfc in Tcl_PanicVA () from /usr/lib/arm-linux-gnueabihf/libtcl8.6.so

3 0xb6ed2d40 in Tcl_Panic () from /usr/lib/arm-linux-gnueabihf/libtcl8.6.so

4 0xb6eac0c0 in Tcl_RegisterChannel ()

from /usr/lib/arm-linux-gnueabihf/libtcl8.6.so

5 0xb6eb6aac in ?? () from /usr/lib/arm-linux-gnueabihf/libtcl8.6.so

Backtrace stopped: previous frame identical to this frame (corrupt stack?) (gdb)

f5vmr commented 6 years ago

https://stackoverflow.com/questions/48278881/gdb-complaining-about-missing-raise but it solves nothing.

f5vmr commented 6 years ago

Well that was a problem overcome. I found that the dirty bit was set, which in my case indicated a formatting error, probably causing a stack overflow error. I flattened the card, reformatting it and rebuilt it. I then tackled the Package Install again, and it works without the error. In view that others have seen the same problem, I recommend they review their formatting and imaging procedure with fsck. I consider that unless anyone else reports an issue I consider this problem closed. Chris

f5vmr commented 6 years ago

I though it was too good to be true. Initially the system was working. After setting up for a full repeater connection, and tail-ing the log, I noticed that the log was displaying that there was a constant restart of the svxlink system. then I run the gdb again and exactly the same fault had developed. I removed the card to a linux laptop and found the dirty bit had been set again. I ran fsck and reset it, and returned the sdcard to the raspberry. This time this fault is show, and the Rx1 is not recognised. (gdb) run --logfile=/var/log/svxlink.log Starting program: /usr/bin/svxlink --logfile=/var/log/svxlink.log [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1". [New Thread 0xb653d450 (LWP 6064)] [Thread 0xb653d450 (LWP 6064) exited] [Inferior 1 (process 6063) exited with code 01] (gdb) bt No stack. (gdb)

f5vmr commented 6 years ago

Stopped the run, rebooted and re-run under gdb a second time. (gdb) run --logfile=/var/log/svxlink.test2 Starting program: /usr/bin/svxlink --logfile=/var/log/svxlink.test2 [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1". [New Thread 0xb653d450 (LWP 1017)] [Thread 0xb653d450 (LWP 1017) exited]

Thread 1 "svxlink" received signal SIGABRT, Aborted. __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. (gdb) bt

0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51

1 0xb690b824 in __GI_abort () at abort.c:89

2 0xb6ed4cfc in Tcl_PanicVA () from /usr/lib/arm-linux-gnueabihf/libtcl8.6.so

3 0xb6ed4d40 in Tcl_Panic () from /usr/lib/arm-linux-gnueabihf/libtcl8.6.so

4 0xb6eae0c0 in Tcl_RegisterChannel () from /usr/lib/arm-linux-gnueabihf/libtcl8.6.so

5 0xb6eb8aac in ?? () from /usr/lib/arm-linux-gnueabihf/libtcl8.6.so

Backtrace stopped: previous frame identical to this frame (corrupt stack?) (gdb) I am really beginning to think the card type is at fault. This was a SanDisk Ultra 16 Gb HC I.

sm0svx commented 6 years ago

Yes it's a really strange behaviour and if the exact same installation work on other cards I'd say it's a hardware problem.

dl1hrc commented 6 years ago

I can confirm this on a "normal" server, no hardware problem with sd cards or something like this. SvxLink is running as root, so it can not be a right issue. It terminates exactly every full minute when started with "--logfile", error:

Program received signal SIGABRT, Aborted. 0x00007ffff4f7d0a7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56 56 ../nptl/sysdeps/unix/sysv/linux/raise.c: Datei oder Verzeichnis nicht gefunden. (gdb)

sm0svx commented 6 years ago

Very strange indeed. It seems to have something with TCL to do but I cannot understand what's happening, and why it's related to the logfile. Hmmm...

dl1hrc commented 6 years ago

I removed all own tcl environment, deleted all specific *.so libraries and reinstalled it. It occurs on these systems: 1.6.99.0, 1.6.99.2 and 1.6.99.9 with different soft-/hardware (bananian + bananapi and opensuse + x86_64)

f5vmr commented 6 years ago

I've just built another card for F5ZGM-R (as the current installation has gone READ-ONLY for some reason,) and this new install has also got the same issue when running --logfile= but runs normally without a log. Again it crashed without --daemon and with --logfile=. With --daemon and no logfile it runs perfectly. So is it a coding issue in associated software? Incidentally the current F5ZGM-R does not have this issue as I haven't upgraded it to Raspbian Stretch, but I noticed that in the root a file dead.letter of 7 Gb size has appeared - an entirely different issue I'm sure. F5ZGM-R does have --logfile= and sends me a daily log by email. I am now attempting a 4 Gb card with Stretch-Lite to see if that generates a problem with svxlink.

f5vmr commented 6 years ago

Report: 4Gb MicroSD Card, Raspbian Stretch-Lite fully updated, darkice installed with the package based install of svxlink, the log creation works, and everything appears to work to normal parameters. In case anyone thinks the 4 Gb card may be a bit small.....

Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/root        3741504       1212136   2344664  35% /
devtmpfs        217972         0               217972   0% /dev
tmpfs              222264         0               222264   0% /dev/shm
tmpfs              222264         3172         219092   2% /run
tmpfs              5120             4               5116       1% /run/lock
tmpfs              222264         0               222264   0% /sys/fs/cgroup
/dev/mmcblk0p1     41853   21328     20525  51% /boot
tmpfs              44452       0     44452   0% /run/user/1000
dl1hrc commented 6 years ago

Have installed the change described in #373 on two of my systems, it seems to work ;)

dh1dm commented 6 years ago

Could be the same problem as issue #373 and the TCL crash bug described in http://svxlink.996268.n3.nabble.com/Unexpected-Segfault-Tcl-RegisterChannel-names-td1443.html It is triggered by calling close(notifier_rd); in destructor ~CppDnsLookupWorker() and only happens when a log file is specified and the console is detached. The "deamon" option has no influence. I'd suppose that the STDIN descriptor 0 is re-used in TCL_RegisterChannel and maybe someone is checking a function (e.g. open) return value for beeing > 0. The problem re-appeared after the introduction of deferred resource release using the runTask initiated task timer driven delete/destructor calls. The problem seems to be solved when inserting the line dup2(open("/dev/null", O_RDONLY), STDIN_FILENO); after close(STDIN_FILENO); in the main function in svxlink.cpp. This keeps the STDIN descriptor allocated. Further tests are still pending...

dh1dm commented 6 years ago

close(STDOUT_FILENO), close(STDERR_FILENO), close(STDIN_FILENO) can be removed since dup2 implements an atomic close/re-open mechanism (see dup2 man page)

f5vmr commented 6 years ago

Interesting - I have recompiled my intended replacement logic with the dup2(ope..etc) line in svxlink.cpp, and it is a solution. The logic doesn't fail and the log writes as it should. I am not enough experienced in C++ to follow the logic, but I cannot see that any changes have been done to this segment of code by Tobias for at least several months. Perhaps the code is referenced by one of the newer changes in the SvxReflector code that may incorporate a global variable that causes the failure?

dh1dm commented 6 years ago

I don't use the svxlink code directly from GIT. Instead, I'm maintaining a dedicated low latency audio streaming variant and take over the GIT code by cherry picking. This also helps to isolate potential problems, since the new code undergoes an additional review and test. As I did not have taken over the SvxReflector code from GIT, it is definitely not involved in the observed behaviour. But because first signs of this problem seem to date back to at least 2011, the cause search space should not be limited to recent changes.

sm0svx commented 6 years ago

That was some very good bughunting, @dh1dm! I don't think that the non-atomic close is a problem since this is done very early in the application but it's clearly unnecessary and conceptually wrong so it should be removed. On the other hand that closing of stdin was very sneaky and I guess that only bad things can come if someone else is assigned that filedescriptor randomly.

Please test the fix in latest master to see if that is the correct solution.

Thanks to everyone who helped solving this!

f5vmr commented 6 years ago

I concur. Well done! Tobias, F5ZGM-R is today in service with the latest master, with the svxlink.cpp single line modification. I installed it this morning. At the same time I offline the 50 MHz repeater for essential maintenance, and I will be doing the same to its logic. Regards to all, Chris F5VMR

s58DB commented 6 years ago

Now working fine on my repeater too. Thanks.

sm0svx commented 6 years ago

Ok, that's good news! @s58DB, are you running lastest master without the one-liner fix? It's important since the fix in master is not exactly the same as the original fix. It's the same solution but written with some additional error handling.

s58DB commented 6 years ago

@sm0svx I am just update svxlink folder with git pull and compile all again.

sm0svx commented 6 years ago

Ok, thanks! I then consider this bug solved.

PU2YYP commented 2 years ago

Hello. I have a problem on my raspberry, using svxlink tbd in a conference room. After some time of operation, this message appeared in the startup of the machine and I cannot connect the conference in the system. Error 100005 appears. Can someone help me? Is it some log file that has reached its maximum capacity?