kersing / lora_gateway

Other
9 stars 26 forks source link

Multi-Tech Conduit hangs occasionally (downlink traffic)? #1

Open terrillmoore opened 7 years ago

terrillmoore commented 7 years ago

This issue documents a problem report that first was reported on www.thethingsnetwork.org/forums: Latest MultiTech Packet Forwarder Stops Sending Packets.

The brief form: after updating to V3.0.0-r5 from V2.x, we observe hangs of the packet router code. Easiest way to detect this is to look at the timestamp on /var/log/lora-pkt-fwd.log:

root@mtcdt:~# ls -lrt /var/log/lora-pkt-fwd.log
-rw-r--r--    1 root     root        384148 Jun 19 12:31 /var/log/lora-pkt-fwd.log
root@mtcdt:~# date
Mon Jun 19 12:53:10 CST 2017

(Note the difference in time.)

The forum thread may describe additional issues; this thread only documents those observed by me. I have seen symptoms like this on other gateways, but only finally caught things in the act yesterday.

Status of the packet forwarder (from ps au) was:

  516 ?        Sl    34:01 /opt/lora/mp_pkt_fwd -c /var/config/lora -l /var/log/

I believe that the Sl flags are significant.

Restarting the packet forwarder does not clear the problem.

root@mtcdt:~# /etc/init.d/ttn-pkt-forwarder restart
Stopping ttn-packet-forwarder: OK
Found MTAC-LORA-915 with MTAC-LORA-1.0 hardware
Starting ttn-packet-forwarder: OK
root@mtcdt:~#

The relevant portion of /var/log/lora-packet-fwd.log (after the above restart) was:

INFO: [main] Starting the concentrator
lgw_connect:532: INFO: no FPGA detected or version not supported (v103)
Note: success connecting the concentrator
ERROR: SPI ERROR DURING REGISTER BURST READ
INFO: tx_start_delay=1497 (1497.000000) - (1497, bw_delay=0.000000, notch_delay=0.000000)

INFO: End of upstream thread
ERROR: SPI ERROR DURING REGISTER WRITE

Rebooting the Conduit with init 6 solved the problem.

During this hang, no traffic was forwarded upstream.

Both observed failures were correlated with downlink (join) traffic. (This application otherwise does not use downlink traffic, and there are no other known applications on the gateway.)

Gateway version info:

root@mtcdt:~# uname -a
Linux mtcdt 3.12.27 #1 Thu May 5 18:53:23 CDT 2016 armv5tejl GNU/Linux

The desperation workaround is to have a daemon watching the timestamp on the log, and if it becomes older than one or two minutes, to reboot the Conduit.

johanstokking commented 6 years ago

@kersing any update on this?

kersing commented 6 years ago

@johanstokking There have been improvements and code (and MultiTech builds) should be more stable. However the issue is still not 100% tackled. I'm still trying to find the cause of the remaining issue. Making progress, but very slowly as it is hard to reproduce.

johanstokking commented 6 years ago

I understand. I really appreciate the work on this.

Anything we, the community, can do?

kersing commented 6 years ago

I've updated the sources of several components in an attempt to solve this issue. RPi users and people creating their own builds should pull updates and rebuild.

An updated build for MultiTech is available from https://raw.github.com/kersing/multitech-installer/master/ (to update conduits running an older version just add a file '/etc/opkg/mp-feed.conf' containing one line: src/gz mpfwd https://raw.github.com/kersing/multitech-installer/master/

run opkg install mp-packet-forwarder to update the software. Restart /etc/init.d/ttn-pkt-forwarder restart after updating.

Lucianovici commented 6 years ago

I have mp-packet-forwarder (3.0.0-r14) installed. And I get this:

00:17:18  INFO: Description configured to ""
00:17:18  INFO: [Transports] Initializing protocol for 1 servers
00:17:19  INFO: [TTN] server "192.168.0.21" connected
00:17:19  INFO: [main] Starting the concentrator
ERROR: FAIL TO CONNECT BOARD
INFO: FPGA supported features: [TX filter]  [Spectral Scan]
00:17:19  ERROR: [main] failed to start the concentrator
00:17:21  *** Multi Protocol Packet Forwarder for Lora Gateway ***
Version: 3.0.10

Is this the same issue?

NOTES:

  1. I have tested the Multitech Conduit with the default network server that comes with it, and works fine.
  2. I tested with Loriot, and I have similar error message. I also wrote a reply on their forum https://openlora.com/forum/index.php/topic,43.msg202.html#msg202
  3. Using Multi-Tech MultiConnect Conduit mLinux MTCDT-246L-US-EU-GB with mCard V1.5 LoRa 868 MTAC-LORA-H-868

Any suggestions are more than welcomed. Thanks!

kersing commented 6 years ago

Not the same issue. Have you used https://www.thethingsnetwork.org/docs/gateways/multitech/ to install the software? If not please use those instructions and if you still have issues go to the TTN forum and request help there first, when requesting help include the full log of the software including the first lines with the version numbers. (I expect your current setup tries to use the USB version of the software where the SPI version is required.)

Lucianovici commented 6 years ago

@kersing Alright. Thanks for your fast response.

I followed https://www.thethingsnetwork.org/docs/gateways/multitech/mlinux.html I already manually configured the internet connection of the gateway, so I just wget that installer.

Is there a difference between that and this? https://github.com/kersing/multitech-installer/blob/master/installer.sh

How can I tell which is which? USB vs SPI versions.

Thanks. I appreciate your help here, even if we're on a different issue thread.

kersing commented 6 years ago

For help with the installer please use the TTN forum. These issues are used for tracking software defects.

efpage commented 5 years ago

Hy all,

we are running the Packet forwarder spf_3.1.0-klk18_4.1.3-klk12_klk_wifc on a Kerlink iFemtocell. As the issue is very similar to this post I try to get some information here.

The SPF is restarted frequently, every 10 minutes to 1 hour.

in the system logs (/var/log/messages) we find: Sep 7 09:29:53 klk-wifc-040187 user.err monit[1138]: 'spf' process is not running Sep 7 09:29:53 klk-wifc-040187 user.info monit[1138]: 'spf' trying to restart Sep 7 09:29:53 klk-wifc-040187 user.info monit[1138]: 'spf' start: '/user/spf/bin/execute_spf.sh' Sep 7 09:29:58 klk-wifc-040187 daemon.debug netctl[919]: Wi-Fi scan finished Sep 7 09:29:58 klk-wifc-040187 daemon.debug netctl[919]: AP found: Funknetz2 (online-69-psk/wps-True) Sep 7 09:29:58 klk-wifc-040187 daemon.info lighttpd[1086]: 192.168.178.94 192.168.178.45 - [07/Sep/2019:09:29:58 +0000] "GET /application/administration/wlan HTTP/1.1" 200 237 "http://192.168.178.45/" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36" Sep 7 09:30:04 klk-wifc-040187 user.info monit[1138]: 'spf' process is running with pid 3325 Sep 7 09:37:15 klk-wifc-040187 local3.info KLK: STATM INFO CPU usage: 0.8 Sep 7 09:37:15 klk-wifc-040187 local3.info KLK: STATM INFO RAM usage: 33.3 Sep 7 09:44:37 klk-wifc-040187 user.err monit[1138]: 'spf' process is not running Sep 7 09:44:37 klk-wifc-040187 user.info monit[1138]: 'spf' trying to restart Sep 7 09:44:37 klk-wifc-040187 user.info monit[1138]: 'spf' start: '/user/spf/bin/execute_spf.sh' Sep 7 09:44:48 klk-wifc-040187 user.info monit[1138]: 'spf' process is running with pid 3353

in the spf.log we find:

Sep 7 09:29:50 klk-wifc-040187 local1.notice spf: INFO: concentrator stopped successfully Sep 7 09:29:50 klk-wifc-040187 local1.notice spf: INFO: Exiting packet forwarder program

which looks like a regular stop.

Any known reason, why the SPF ist stopped??

Best regards, Eckehard