Closed pe1mew closed 4 years ago
Just found that [ctrl-C] does not terminate the forwarder:
14:09:59 WARNING: [gps] read() returned valu^C
14:10:00 14:10:00 14:10:00 INFO: End of JIT thread
14:10:00 INFO: End of validation thread
INFO: End of GPS thread
14:10:00 INFO: End of downstream thread
INFO: End of upstream thread
14:10:00 INFO: [TTN] Disconnecting server "bridge.eu.thethings.network"
14:10:00 INFO: End of downstream thread
^C^C
Also after killing the process some orphans appear in the process list.
I have tested mp_pkt_fwd and have not been able to reproduce the observed error. Here is a description of the tests I executed:
Source: https://github.com/jpmeijers/ttn-resin-gateway-rpi
Note: because the original git source of wiringPi was not available, it was replaced by a fork. For this, "ttn-resin-gateway-rpi/dev/packet_forwarder/mp_pkt_fwd/build-pi.sh" was modified:
git clone https://github.com/WiringPi/WiringPi.git || { echo 'Cloning wiringPi failed.' ; exit 1; }
# git clone git://git.drogon.net/wiringPi || { echo 'Cloning wiringPi failed.' ; exit 1; }
The gateway is configured to connect to TTN and to a UDP server (a test server using packet_forwarder/util_ack).
Configuration:
For this test the RPi gateway was accessed through two ssh sessions. Session 1 runs htop, filtered on 'mp_pkt_fwd'; it shows 8 processes. When [ctrl-c] is pressed, mp_pkt_fwd terminates and after some time all processes in htop are gone.
It was not possible to reproduce the problem that mp_pkt_fwd is not ending properly (hanging).
PASS
Configuration:
Expected result: mp_pkt_fwd stops when 5 times no ack is received from the UDP test server.
Result: mp_pkt_fwd (version 3.0.25) stops without problems after 5 missing acks.
PASS
Configuration:
Expected result: mp_pkt_fwd stops when 5 times no ack is received from the UDP test server. After this the balena script restarts the gateway.
Result: mp_pkt_fwd (version 3.0.25) stops without problems after 5 missing acks. The Balena script restarts the gateway.
PASS
Configuration:
Expected result: mp_pkt_fwd stops when 5 times no ack is received from the UDP test server. After this the balena script restarts the gateway.
Result: mp_pkt_fwd (version 3.0.25) stops without problems after 5 missing acks. The Balena script restarts the gateway.
PASS
Because the observed behaviour cannot be reproduced: CLOSE issue.
I have reproduced the issue in a Balena (Resin.io) environment.
Setup: a gateway in Balena (mp_pkt_fwd gateway) connected to TTN over UDP exclusively.
In the TTN Console I see that the connection is lost:
Last Seen: 12/8/2019 21:30:25
In Balena the logs show:
08.12.19 21:32:02 (+0100) main 20:32:02 INFO: [down] for server router.eu.thethings.network the last 5 PULL_DATA were not ACKed, exiting down thread for this server.
08.12.19 21:32:02 (+0100) main 20:32:02 INFO: End of downstream thread
08.12.19 21:32:02 (+0100) main 20:32:02 INFO: End of upstream thread
08.12.19 21:32:02 (+0100) main 20:32:02 INFO: End of GPS thread
08.12.19 21:32:02 (+0100) main 20:32:02 INFO: End of JIT thread
08.12.19 21:32:02 (+0100) main 20:32:02 INFO: End of validation thread
The Python script does not restart the forwarder and the gateway hangs.
Using Balena I opened a terminal in the main container. There I inspected the running processes and found that one thread was still active:
root@77e597a:/opt/ttn-gateway# ps -e -T
PID SPID TTY TIME CMD
1 1 pts/0 00:00:00 bash
13 13 pts/0 00:00:00 python
41 41 pts/0 00:00:00 mp_pkt_fwd
41 42 pts/0 00:00:00 mp_pkt_fwd
49 49 pts/1 00:00:00 bash
71 71 pts/1 00:00:00 ps
Then I killed process 42 (the remaining thread, SPID 42 in the listing above). This results in the gateway restarting:
09.12.19 11:46:42 (+0100) main 10:46:42 INFO: concentrator stopped successfully
09.12.19 11:46:42 (+0100) main 10:46:42 INFO: Exiting packet forwarder program
Now the gateway restarts again!
Conclusion: not all threads terminate!
The question is: how can we identify the hanging thread?
Comment: this issue happens on UDP links to both TTN and ChirpStack.
Verification: I sent a SIGINT to the process using kill. This results in a clean termination of mp_pkt_fwd. While mp_pkt_fwd is running, all threads are active:
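One way to approach the question of which thread is hanging, without relying on ps, is to read the per-thread entries that Linux exposes under /proc. The following is a minimal sketch, assuming a Linux /proc filesystem; the function name `list_threads` is hypothetical and not part of mp_pkt_fwd or the Balena scripts. It reports each thread's ID, name and scheduler state, plus the kernel wait channel where the kernel exposes it.

```python
import os

def list_threads(pid):
    """Return (tid, name, state, wchan) for each thread of `pid` (Linux /proc only)."""
    threads = []
    task_dir = f"/proc/{pid}/task"
    for tid in sorted(os.listdir(task_dir), key=int):
        fields = {}
        try:
            # status holds "Key:\tvalue" lines such as Name and State
            with open(f"{task_dir}/{tid}/status") as fh:
                for line in fh:
                    key, _, value = line.partition(":")
                    fields[key] = value.strip()
        except (FileNotFoundError, ProcessLookupError):
            continue  # the thread exited while we were scanning
        try:
            # wchan names the kernel function the thread sleeps in, if available
            with open(f"{task_dir}/{tid}/wchan") as fh:
                wchan = fh.read().strip() or "?"
        except OSError:
            wchan = "?"  # not readable or not provided by this kernel
        threads.append((int(tid), fields.get("Name", "?"), fields.get("State", "?"), wchan))
    return threads
```

Calling `list_threads(pid)` with the PID of the forwarder lists the threads that are left after the "End of ... thread" messages. A thread that stays in a sleeping state indefinitely is the candidate to inspect further, for example with strace.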
root@77e597a:/opt/ttn-gateway# ps -e -T
PID SPID TTY TIME CMD
1 1 pts/0 00:00:00 bash
13 13 pts/0 00:00:00 python
49 49 pts/1 00:00:00 bash
76 76 pts/0 00:00:00 mp_pkt_fwd
76 77 pts/0 00:00:00 mp_pkt_fwd
76 78 pts/0 00:00:02 mp_pkt_fwd
76 79 pts/0 00:00:00 mp_pkt_fwd
76 80 pts/0 00:00:00 mp_pkt_fwd
76 81 pts/0 00:00:00 mp_pkt_fwd
76 82 pts/0 00:00:00 mp_pkt_fwd
76 83 pts/0 00:00:00 mp_pkt_fwd
84 84 pts/1 00:00:00 ps
After receiving SIGINT, mp_pkt_fwd terminates correctly:
root@77e597a:/opt/ttn-gateway# kill -s 2 76
Results in:
09.12.19 11:56:52 (+0100) main 10:56:52 INFO: End of upstream thread
09.12.19 11:56:52 (+0100) main 10:56:52 INFO: End of JIT thread
09.12.19 11:56:53 (+0100) main 10:56:53 INFO: End of GPS thread
09.12.19 11:56:53 (+0100) main 10:56:53 INFO: End of validation thread
09.12.19 11:56:54 (+0100) main 10:56:54 INFO: End of downstream thread
09.12.19 11:56:54 (+0100) main 10:56:54 INFO: concentrator stopped successfully
09.12.19 11:56:54 (+0100) main 10:56:54 INFO: Exiting packet forwarder program
An inventory of the gateways included in this investigation suggests that the issue depends on the mp_pkt_fwd release: every gateway running 3.0.25 shows the issue, and none of the gateways running 3.0.20 do:
GW | Issue | Hardware | RPi model | mp_pkt_fwd version | OS version |
---|---|---|---|---|---|
HA* | YES | IMST IC880a | 3 | 3.0.25 | Resin OS 2.13.6+rev1 |
RAK831 | YES | RAK831 | 3 | 3.0.25 | balenaOS 2.32.0+rev1 |
Rpi1 | YES | RAK831 | 1 | 3.0.25 | balenaOS 2.44.0+rev1 |
JF* | YES | IMST IC880a | 3 | 3.0.25 | Resin OS 2.7.5+rev1 |
he* | YES | IMST IC880a | 3 | 3.0.25 | balenaOS 2.43.0+rev1 |
Ho* | NO | IMST IC880a | 3 | 3.0.20 | balenaOS 2.32.0+rev1 |
Ra* | NO | IMST IC880a | 3 | 3.0.20 | Resin OS 2.12.0+rev1 |
St* | NO | IMST IC880a | 3 | 3.0.20 | Resin OS 2.12.7+rev1 |
We* | NO | IMST IC880a | 3 | 3.0.20 | Resin OS 2.7.5+rev1 |
Ud* | NO | IMST IC880a | 3 | 3.0.20 | balenaOS 2.32.0+rev1 |
Eu* | NO | IMST IC880a | 3 | 3.0.20 | Resin OS 2.7.5+rev1 |
When the gateway is running on Raspbian the problem can easily be reproduced. Configure one UDP link to either TTN or a ChirpStack server; after a while the server stops sending ACKs. Then this output can be observed:
19:14:23 INFO: [down] for server router.eu.thethings.network the last 5 PULL_DATA were not ACKed, exiting down thread for this server.
19:14:23 INFO: End of downstream thread
19:14:23 INFO: End of upstream thread
19:14:23 INFO: End of JIT thread
Through a second ssh session the gateway can be inspected:
pi@raspberrypi:~ $ ps -eLF | grep mp_pkt_fwd
root 4098 4088 4098 0 1 2487 3248 0 18:55 pts/4 00:00:00 sudo ./mp_pkt_fwd
root 4103 4098 4103 0 2 11641 2220 1 18:55 pts/4 00:00:00 ./mp_pkt_fwd
root 4103 4098 4104 0 2 11641 2220 0 18:55 pts/4 00:00:00 ./mp_pkt_fwd
pi 5012 2567 5012 0 1 1837 484 0 19:35 pts/2 00:00:00 grep --color=auto mp_pkt_fwd
pi@raspberrypi:~ $ sudo strace -p 4104
strace: Process 4104 attached
recv(3, ^Cstrace: Process 4104 detached
<detached ...>
pi@raspberrypi:~ $ sudo strace -p 4103
strace: Process 4103 attached
futex(0x76c204c8, FUTEX_WAIT, 4104, NULL^Cstrace: Process 4103 detached
<detached ...>
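The strace output is consistent with thread 4104 being blocked in recv() without a timeout while the main thread 4103 waits for it to finish (the futex wait on 4104 looks like a thread join). The sketch below is a small Python analogue of that pattern, not the forwarder's C code; the names `receiver` and `stuck_after_stop` are hypothetical. It shows that a receive loop which only checks its exit flag between recv() calls never stops when the socket has no timeout, and stops promptly when it has one.

```python
import socket
import threading

def receiver(sock, stop, entered):
    """Receive loop that only checks the stop flag between recv() calls."""
    while not stop.is_set():
        entered.set()  # signal that we are about to call recv()
        try:
            sock.recv(1024)
        except socket.timeout:
            continue  # nothing arrived in time; go back and re-check the flag

def stuck_after_stop(timeout_s, join_wait_s=1.0):
    """Return True if the receiver is still alive after being asked to stop."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 0))
    sock.settimeout(timeout_s)  # None means recv() blocks indefinitely
    stop, entered = threading.Event(), threading.Event()
    worker = threading.Thread(target=receiver, args=(sock, stop, entered), daemon=True)
    worker.start()
    entered.wait()             # make sure the worker has reached the loop body
    stop.set()                 # ask the worker to exit
    worker.join(join_wait_s)   # like the main thread waiting for the worker
    stuck = worker.is_alive()
    # Cleanup: a datagram wakes a blocked recv() so the worker can see the flag.
    wake = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    wake.sendto(b"x", sock.getsockname())
    wake.close()
    worker.join()
    sock.close()
    return stuck
```

With `stuck_after_stop(None)` the worker is still alive when the join times out, which mirrors the hanging forwarder; with `stuck_after_stop(0.1)` the worker notices the flag after at most one timeout period and exits.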
Sending a SIGINT to the blocked thread makes the packet forwarder terminate:
sudo kill -s 2 4104
After some discussions with Jac I learned that “push_timeout_ms” must be set in the config file for the gateway to work normally while autoquit is enabled.
Tests confirm that the forwarder works fine with this setting.
Because the current code does not set “push_timeout_ms” to a default value, it is mandatory to set this parameter in the config file for the gateway to work correctly.
Proposal: modify the code to set “push_timeout_ms” to a default value of 100. This will prevent the program from malfunctioning when “push_timeout_ms” is not set in the configuration file.
I recommend keeping the issue open until it is fixed in the code.
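The idea of a built-in default can be sketched as follows. This is an illustration in Python, not the forwarder's C configuration parser; the `gateway_conf` section name and the `DEFAULTS` table are assumptions made for the example, and only the key name and the value 100 come from this issue.

```python
import json

# Hypothetical defaults table; 100 ms is the value proposed in this issue.
DEFAULTS = {"push_timeout_ms": 100}

def load_gateway_conf(text):
    """Parse a JSON config and fill in defaults for keys the file omits."""
    conf = json.loads(text).get("gateway_conf", {})  # assumed section name
    merged = dict(DEFAULTS)   # start from the defaults
    merged.update(conf)       # values from the file take precedence
    return merged
```

With this approach a config file that omits “push_timeout_ms” still yields a usable timeout, while a file that sets it keeps its own value.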
The "autoquit" feature does not seem to work. When the forwarder detects a link failure on a UDP link, it does not terminate and therefore cannot be restarted.