kersing / packet_forwarder

Multi protocol packet forwarder supporting the TTN gateway-connector protocol.
Other
86 stars 56 forks source link

autoquit is not working #28

Closed pe1mew closed 4 years ago

pe1mew commented 5 years ago

The feature "autoquit" does not seem to work. When the forwarder is detecting a link failure on UDP protocol the forwarder does not terminate and cannot reboot.

08:06:51 INFO: [TTN] bridge.eu.thethings.network RTT 26 08:06:51 INFO: [TTN] send status success for bridge.eu.thethings.network 08:07:06 INFO: [down] for server rfseebridge.duckdns.org the last 5 PULL_DATA were not ACKed, exiting down thread for this server. 08:07:06 INFO: End of downstream thread 08:07:06 INFO: End of JIT thread 08:07:06 INFO: End of GPS thread 08:07:06 INFO: End of upstream thread 08:07:06 INFO: End of validation thread 08:07:21 INFO: [TTN] Disconnecting server "bridge.eu.thethings.network"

pe1mew commented 5 years ago

Just found that [ctrl-C] does not terminate the forwarder:

14:09:59  WARNING: [gps] read() returned valu^C
14:10:00  14:10:00  14:10:00  INFO: End of JIT thread
14:10:00  INFO: End of validation thread
INFO: End of GPS thread
14:10:00  INFO: End of downstream thread
INFO: End of upstream thread
14:10:00  INFO: [TTN] Disconnecting server "bridge.eu.thethings.network"
14:10:00  INFO: End of downstream thread
^C^C

Also after killing the process some orphans appear in the process list.

pe1mew commented 4 years ago

I have tested mp_pkt_fwd and have not been able to reproduce the observed error. Here is a desciption of the tests I executed:

Note: For the first 3 test the following note applies:

Source: https://github.com/jpmeijers/ttn-resin-gateway-rpi Note: because original git source of wiringpi was not available this was replaced by a fork: "ttn-resin-gateway-rpi/dev/packet_forwarder/mp_pkt_fwd/build-pi.sh" was modified:

    git clone https://github.com/WiringPi/WiringPi.git || { echo 'Cloning wiringPi failed.' ; exit 1; }
#    git clone git://git.drogon.net/wiringPi  || { echo 'Cloning wiringPi failed.' ; exit 1; }

The gateway is configured to connect to TTN and to a UDP server (test server using packet_forwarder/util_ack)

Test 1: verify if [ctrl-c] termintes mp_pkt_fwd in the correct way.

Configuration:

For this the RPI gateway was accessed with two ssh sessions. Session 1 is running htop filtering for 'mp_pkt_fwd'. It shows 8 processes. When [ctrl-c] is pressed, mp_pkt_fwd terminates and after some time all processes in htop ware gone.

It was not possible to reproduce the problem that mp_pkt_fwd is not ending properly (hanging).

PASS

Test 2: verify if mp_pkt_fwd (Version: 3.0.25) stops when "autoquit_threshold" detects failing acks on UDP link.

Configuration:

Expected result: mp_pkt_fwd stops when 5 times no ack is received from the UDP test server.

result: mp_pkt_fwd (Version: 3.0.25) stops without problems after 5 missing acks.

PASS

Test 3: verify if mp_pkt_fwd stops when "autoquit_threshold" detects failing acks on UDP link.

Configuration:

Expected result: mp_pkt_fwd stops when 5 times no ack is received from the UDP test server. After this the balena script restarts the gateway.

result: mp_pkt_fwd (Version: 3.0.25) stops without problems after 5 missing acks. the Balena script restarts the gateway.

PASS

Test 4: verify if mp_pkt_fwd stops when "autoquit_threshold" detects failing acks on UDP link.

Configuration:

Expected result: mp_pkt_fwd stops when 5 times no ack is received from the UDP test server. After this the balena script restarts the gateway.

result: mp_pkt_fwd (Version: 3.0.25) stops without problems after 5 missing acks. the Balena script restarts the gateway.

PASS

Conclusion:

Because observed behaviour cannot be reproduced: CLOSE issue

pe1mew commented 4 years ago

I have reproduced the issue in a Balena (Resin.io) environment

A gateway in Balena (mp_pkt_fwd gateway) connected to TTN over UDP exclusively.

At TTN Console I see connection is lost: Last Seen: 12/8/2019 21:30:25

AT Balena the Logs show:

08.12.19 21:32:02 (+0100)  main  20:32:02  INFO: [down] for server router.eu.thethings.network the last 5 PULL_DATA were not ACKed, exiting down thread for this server.
08.12.19 21:32:02 (+0100)  main  20:32:02  INFO: End of downstream thread
08.12.19 21:32:02 (+0100)  main  20:32:02  INFO: End of upstream thread
08.12.19 21:32:02 (+0100)  main  20:32:02  INFO: End of GPS thread
08.12.19 21:32:02 (+0100)  main  20:32:02  INFO: End of JIT thread
08.12.19 21:32:02 (+0100)  main  20:32:02  INFO: End of validation thread

The python script does not restart and the gateway is hanging.

Through inspection in Balena I accessed the main container trough a terminal. There I inspected the processes running and found that still a thread was operational:

root@77e597a:/opt/ttn-gateway# ps -e -T
  PID  SPID TTY          TIME CMD
    1     1 pts/0    00:00:00 bash
   13    13 pts/0    00:00:00 python
   41    41 pts/0    00:00:00 mp_pkt_fwd
   41    42 pts/0    00:00:00 mp_pkt_fwd
   49    49 pts/1    00:00:00 bash
   71    71 pts/1    00:00:00 ps

Than I killed process 42: This results in restarting the gateway:

09.12.19 11:46:42 (+0100)  main  10:46:42  INFO: concentrator stopped successfully
09.12.19 11:46:42 (+0100)  main  10:46:42  INFO: Exiting packet forwarder program

now the gateway restarts again!

Conclusion: not all threads terminate!

Question is: how can we identify the hanging process?

Comment: This issue happens to UDP links to both TTN and ChirpStack.

pe1mew commented 4 years ago

Verification: I send a SIGINT trough a kill to the process. this results in good termination of mp_pkt_fwd: When mp_pkt_fwd is running all threads are active

root@77e597a:/opt/ttn-gateway# ps -e -T
  PID  SPID TTY          TIME CMD
    1     1 pts/0    00:00:00 bash
   13    13 pts/0    00:00:00 python
   49    49 pts/1    00:00:00 bash
   76    76 pts/0    00:00:00 mp_pkt_fwd
   76    77 pts/0    00:00:00 mp_pkt_fwd
   76    78 pts/0    00:00:02 mp_pkt_fwd
   76    79 pts/0    00:00:00 mp_pkt_fwd
   76    80 pts/0    00:00:00 mp_pkt_fwd
   76    81 pts/0    00:00:00 mp_pkt_fwd
   76    82 pts/0    00:00:00 mp_pkt_fwd
   76    83 pts/0    00:00:00 mp_pkt_fwd
   84    84 pts/1    00:00:00 ps

After sending SIGINT mp_pkt_fwd terminates OK: root@77e597a:/opt/ttn-gateway# kill -s 2 76 Results in:

09.12.19 11:56:52 (+0100)  main  10:56:52  INFO: End of upstream thread
09.12.19 11:56:52 (+0100)  main  10:56:52  INFO: End of JIT thread
09.12.19 11:56:53 (+0100)  main  10:56:53  INFO: End of GPS thread
09.12.19 11:56:53 (+0100)  main  10:56:53  INFO: End of validation thread
09.12.19 11:56:54 (+0100)  main  10:56:54  INFO: End of downstream thread
09.12.19 11:56:54 (+0100)  main  10:56:54  INFO: concentrator stopped successfully
09.12.19 11:56:54 (+0100)  main  10:56:54  INFO: Exiting packet forwarder program
pe1mew commented 4 years ago

Additional analysis.

An inventarisation of gateways included in the research learns that it is likely that the release of mp_pkt_fwd is suffering from this isue:

GW: Issue: Hardware RPI mp_pkt_fwd version: OS version:
HA* YES IMST IC880a 3 3.0.25 Resin OS 2.13.6+rev1
RAK831 YES RAK831 3 3.0.25 balenaOS 2.32.0+rev1
Rpi1 YES RAK831 1 3.0.25 balenaOS 2.44.0+rev1
JF* YES IMST IC880a 3 3.0.25 Resin OS 2.7.5+rev1
he* YES IMST IC880a 3 3.0.25 balenaOS 2.43.0+rev1
Ho* NO IMST IC880a 3 3.0.20 balenaOS 2.32.0+rev1
Ra* NO IMST IC880a 3 3.0.20 Resin OS 2.12.0+rev1
St* NO IMST IC880a 3 3.0.20 Resin OS 2.12.7+rev1
We* NO IMST IC880a 3 3.0.20 Resin OS 2.7.5+rev1
Ud* NO IMST IC880a 3 3.0.20 balenaOS 2.32.0+rev1
Eu* NO IMST IC880a 3 3.0.20 Resin OS 2.7.5+rev1
pe1mew commented 4 years ago

When the gateway is running on rasbain teh problem can easily be reproduced. Configure one UDP link to either TTN or a ChirpStack server and after a while TTN will stop sending ACKs. Then this output can be observed:

19:14:23  INFO: [down] for server router.eu.thethings.network the last 5 PULL_DATA were not ACKed, exiting down thread for this server.
19:14:23  INFO: End of downstream thread
19:14:23  INFO: End of upstream thread
19:14:23  INFO: End of JIT thread

Trough a second ssh session the gaeway can be inspected:

pi@raspberrypi:~ $ ps -eLF | grep mp_pkt_fwd
root      4098  4088  4098  0    1  2487  3248   0 18:55 pts/4    00:00:00 sudo ./mp_pkt_fwd
root      4103  4098  4103  0    2 11641  2220   1 18:55 pts/4    00:00:00 ./mp_pkt_fwd
root      4103  4098  4104  0    2 11641  2220   0 18:55 pts/4    00:00:00 ./mp_pkt_fwd
pi        5012  2567  5012  0    1  1837   484   0 19:35 pts/2    00:00:00 grep --color=auto mp_pkt_fwd

pi@raspberrypi:~ $ sudo strace -p 4104
strace: Process 4104 attached
recv(3, ^Cstrace: Process 4104 detached
 <detached ...>

pi@raspberrypi:~ $ sudo strace -p 4103
strace: Process 4103 attached
futex(0x76c204c8, FUTEX_WAIT, 4104, NULL^Cstrace: Process 4103 detached
 <detached ...>

By sending a SIGINT to the thread process the packet forwarder wil terminate:

sudo kill -s 2 4104

pe1mew commented 4 years ago

After some discussions with Jac I learned that “push_timeout_ms” shall be in the config file set to make the gateway work normal while autoquit is operational.

Tests with this settings are confirming that teh forwarder is working fine with this setting.

Workaround

Because the current code is not setting “push_timeout_ms” to a default value it is mandatory to set this parameter in the config file to make the gateway work OK.

Permanent solution

Modify the code to set “push_timeout_ms” to a default value of 100. This will prevent the program to malfunction when “push_timeout_ms” is not set in the configuration file.

I recommend to keep the issue open until it is fixed in the code.