LBS handling of Network Server going down and recovering

MaggieSalak commented 2 years ago

We have noticed that when Network Server goes down and then back up after a certain downtime, LBS doesn't seem to handle the situation as expected. Below are the logs from LBS we collected. The scenario is as follows:

09:50:42 The Network Server goes down For approximately 2 min the LBS is trying to connect to the data endpoint. 09:52:57 we see the error Muxs connect failed - URI: ws://192.168.86.33:5000/router-data/B827EBFFFED381B7 After that the LBS is trying to connect to the discovery endpoint. After successfully connecting, all upstream messages that were sent in the meantime are lost.

Could you please provide some insights on the following concerns:

The default TC_TIMEOUT in LBS is 60 sec - if that is the case why is the LBS waiting ~2min before reconnecting to discovery endpoint?
After reconnecting, why is the LBS dropping all messages?

2022-02-11 09:50:42.307 [AIO:ERRO] Recv failed: NET - Connection was reset by peer
2022-02-11 09:50:42.307 [AIO:DEBU] [3] WS connection shutdown...
2022-02-11 09:50:42.307 [TCE:VERB] Connection to MUXS closed in state 4
2022-02-11 09:50:42.307 [TCE:INFO] MUXS reconnect backoff 4s (retry 2)
2022-02-11 09:50:45.248 [any:XDEB] RX mod=LORA f=868500000 bw=125 sz=16 dr=2 4056341202008D0308E926A1B1E91310
2022-02-11 09:52:57.010 [AIO:ERRO] [3] WS connect failed: NET - The connection to the given server / port failed
2022-02-11 09:52:57.010 [AIO:DEBU] [3] WS connection shutdown...
2022-02-11 09:52:57.010 [TCE:ERRO] Muxs connect failed - URI: ws://192.168.86.33:5000/router-data/B827EBFFFED381B7
2022-02-11 09:52:57.010 [any:XDEB] RX mod=LORA f=868500000 bw=125 sz=16 dr=2 4056341202008E03088E898439AF5A6E
2022-02-11 09:52:57.011 [any:XDEB] RX mod=LORA f=868300000 bw=125 sz=16 dr=2 4056341202008F0308DF0C808E3D6C40
2022-02-11 09:52:57.011 [any:XDEB] RX mod=LORA f=868300000 bw=125 sz=16 dr=2 40563412020090030877D405A6FE1AA1
2022-02-11 09:52:57.011 [any:XDEB] RX mod=LORA f=868300000 bw=250 sz=33 dr=2 FBB9A57A403C23F311556CF592204AF4F3D7DEDD95F1FACEC4AAD26E207225309D
2022-02-11 09:52:57.011 [any:XDEB] Dropped frame without CRC or with broken CRC
2022-02-11 09:52:57.012 [any:XDEB] RX mod=LORA f=868100000 bw=125 sz=16 dr=2 4056341202009103087556D9F9055886
2022-02-11 09:52:57.012 [any:XDEB] RX mod=LORA f=868300000 bw=125 sz=16 dr=2 405634120200920308172634DAB8A835
2022-02-11 09:52:57.013 [any:XDEB] RX mod=LORA f=868100000 bw=125 sz=229 dr=2 F696F0F014F4DDBEB110F4E64F9FF3D7B02153422CC29AF1AAD90166893129FEF53BC53CE5278917820FC52683B2BC452C515DD64CF13300F0D227B3C847EEC0BA91C79BDDB770576019B0223A483A5A8DEBC129C94C13A30EE3BB771797C869E0D8A3A777810FF40C7A74C37B4CA5C6E22EB0917A1BB073CC976468CADBA215800BB7B4D66E4DA462CC2C67484C44FF90997E6D7711A4914E5795C4A8B468DC60C85956A104EAAE8D202C373BAD01CE6F0C5DD9DF84E6A0DDD94413E10964473E86E65D4F0F32D7F607D217C0B33F93E41F607D3DAB50D172022-02-11 09:52:57.013 [any:XDEB] Dropped frame without CRC or with broken CRC
2022-02-11 09:52:57.013 [any:XDEB] RX mod=LORA f=868100000 bw=125 sz=16 dr=2 405634120200930308B6735B907ADBBD
2022-02-11 09:52:57.014 [any:XDEB] RX mod=LORA f=868100000 bw=125 sz=16 dr=2 4056341202009403088298548A23704F
2022-02-11 09:52:57.014 [any:XDEB] RX mod=LORA f=868100000 bw=125 sz=16 dr=2 405634120200950308E28DCFE018CF91
2022-02-11 09:52:57.014 [TCE:INFO] INFOS reconnect backoff 10s (retry 1)
2022-02-11 09:52:57.035 [any:XDEB] RX mod=LORA f=868300000 bw=125 sz=16 dr=2 405634120200960308D76EE3526CC364
2022-02-11 09:52:57.035 [SYN:CRIT] SX130X RX time roll over - no update for a long time: xticks=0x32857DB last_xtime=0xDB00000854B747
2022-02-11 09:52:57.035 [any:XDEB] RX mod=LORA f=868500000 bw=125 sz=16 dr=2 4056341202009703087F4D35D484BD5B
2022-02-11 09:52:57.035 [SYN:CRIT] SX130X RX time roll over - no update for a long time: xticks=0x3749B1B last_xtime=0xDB00000854B747
2022-02-11 09:52:57.035 [any:XDEB] RX mod=LORA f=868500000 bw=125 sz=16 dr=2 40563412020098030879310A4C5D0278
2022-02-11 09:52:57.035 [SYN:CRIT] SX130X RX time roll over - no update for a long time: xticks=0x3C0DE6B last_xtime=0xDB00000854B747
2022-02-11 09:52:57.036 [any:XDEB] RX mod=LORA f=867900000 bw=125 sz=57 dr=2 A998EAFA544F9E15CC4900BEC05465E082951FB438E980DB88264F2D94DD6BA5BF91623D6C8A6B24C081D406EF40D5BDD4D0824BE091638B32
2022-02-11 09:52:57.036 [any:XDEB] Dropped frame without CRC or with broken CRC
2022-02-11 09:52:57.036 [any:XDEB] RX mod=LORA f=868300000 bw=125 sz=16 dr=2 405634120200990308BCD9EDEDFCBB09
2022-02-11 09:52:57.036 [SYN:CRIT] SX130X RX time roll over - no update for a long time: xticks=0x40D21BB last_xtime=0xDB00000854B747
2022-02-11 09:52:57.037 [any:XDEB] RX mod=LORA f=868500000 bw=125 sz=16 dr=2 4056341202009A03083D3BF4474DDCC5
2022-02-11 09:52:57.037 [SYN:CRIT] SX130X RX time roll over - no update for a long time: xticks=0x459650B last_xtime=0xDB00000854B747
2022-02-11 09:53:00.189 [any:XDEB] RX mod=LORA f=868300000 bw=125 sz=16 dr=2 405634120200A803081FA54456209CB2
2022-02-11 09:53:05.187 [any:XDEB] RX mod=LORA f=868500000 bw=125 sz=16 dr=2 405634120200A903089A8CB09CBE646E
2022-02-11 09:54:11.899 [AIO:XDEB] [3] ws_connecting state=1
2022-02-11 09:54:11.899 [AIO:XDEB] [3] ws_connecting state=2
2022-02-11 09:54:11.900 [TCE:INFO] Connecting to INFOS: ws://192.168.86.33:5000
2022-02-11 09:54:12.146 [AIO:XDEB] [3] ws_connecting state=3
2022-02-11 09:54:12.147 [AIO:XDEB] [3|WS] > {"router":"b827:ebff:fed3:81b7"}
2022-02-11 09:54:12.315 [AIO:XDEB] [3|WS] < {"router":"b827:ebff:fed3:81b7","muxs":"A483:E7FF:FEC2:490C","uri":"ws://192.168.86.33:5000/router-data/B827EBFFFED381B7"}
2022-02-11 09:54:12.315 [TCE:INFO] Infos: b827:ebff:fed3:81b7 A483:E7FF:FEC2:490C ws://192.168.86.33:5000/router-data/B827EBFFFED381B7
2022-02-11 09:54:12.315 [AIO:DEBU] [3] ws_close reason=1000
2022-02-11 09:54:12.315 [AIO:DEBU] [3|WS] Server sent close: reason=1000
2022-02-11 09:54:12.315 [AIO:DEBU] [3] WS connection shutdown...
2022-02-11 09:54:12.320 [AIO:XDEB] [3] ws_connecting state=1
2022-02-11 09:54:12.320 [AIO:XDEB] [3] ws_connecting state=2
2022-02-11 09:54:12.320 [TCE:VERB] Connecting to MUXS...
2022-02-11 09:54:12.349 [AIO:XDEB] [3] ws_connecting state=3
2022-02-11 09:54:12.349 [TCE:VERB] Connected to MUXS.
2022-02-11 09:54:12.349 [AIO:XDEB] [3|WS] > {"msgtype":"version","station":"2.0.5(rpi/std)","firmware":null,"package":null,"model":"rpi","protocol":2,"features":"rmtsh"}
2022-02-11 09:54:15.040 [AIO:XDEB] [3|WS] < {"msgtype":"router_config","NetID":[1],"JoinEui":[[15899451609549320133,15899451609549320135]],"region":"EU863","hwspec":"sx1301/1","freq_range":[863000000,870000000],"DRs":[[12,125,0],[11,125,0],[10,125,0],[9,125,0],[8,125,0],[7,125,0],[7,250,0]],"sx1301_conf":[{"radio_0":{"enable":true,"freq":867500000},"radio_1":{"enable":true,"freq":868500000},"chan_FSK":{"enable":true,"radio":1,"if":300000},"chan_Lora_std":{"enable":true,"radio":1,"if":-200000,"bandwidt
2022-02-11 09:54:15.040 [AIO:XDEB] [3|WS] . h":250000,"spread_factor":7},"chan_multiSF_0":{"enable":true,"radio":1,"if":-400000},"chan_multiSF_1":{"enable":true,"radio":1,"if":-200000},"chan_multiSF_2":{"enable":true,"radio":1,"if":0},"chan_multiSF_3":{"enable":true,"radio":0,"if":-400000},"chan_multiSF_4":{"enable":true,"radio":0,"if":-200000},"chan_multiSF_5":{"enable":true,"radio":0,"if":0},"chan_multiSF_6":{"enable":true,"radio":0,"if":200000},"chan_multiSF_7":{"enable":true,"radio":0,"if":400000}
2022-02-11 09:54:15.040 [AIO:XDEB] [3|WS] . }],"nocca":true,"nodc":true,"nodwell":true}
2022-02-11 09:54:15.041 [RAL:INFO] Lora gateway library version: Version: 5.0.1;
2022-02-11 09:54:15.041 [RAL:VERB] Connecting to device: /dev/spidev0.0
2022-02-11 09:54:15.041 [RAL:DEBU] SX130x txlut table (0 entries)
2022-02-11 09:54:15.041 [RAL:VERB] SX1301 rxrfchain 0: enable=1 freq=867.5MHz rssi_offset=-166.000000 type=2 tx_enable=1 tx_notch_freq=0
2022-02-11 09:54:15.041 [RAL:VERB] SX1301 rxrfchain 1: enable=1 freq=868.5MHz rssi_offset=-166.000000 type=2 tx_enable=0 tx_notch_freq=0
2022-02-11 09:54:15.041 [RAL:VERB] SX1301 ifchain  0: enable=1 rf_chain=1 freq=-400000 bandwidth=0 datarate=0 sync_word=0/0
2022-02-11 09:54:15.041 [RAL:VERB] SX1301 ifchain  1: enable=1 rf_chain=1 freq=-200000 bandwidth=0 datarate=0 sync_word=0/0
2022-02-11 09:54:15.042 [RAL:VERB] SX1301 ifchain  2: enable=1 rf_chain=1 freq=0 bandwidth=0 datarate=0 sync_word=0/0
2022-02-11 09:54:15.042 [RAL:VERB] SX1301 ifchain  3: enable=1 rf_chain=0 freq=-400000 bandwidth=0 datarate=0 sync_word=0/0
2022-02-11 09:54:15.042 [RAL:VERB] SX1301 ifchain  4: enable=1 rf_chain=0 freq=-200000 bandwidth=0 datarate=0 sync_word=0/0
2022-02-11 09:54:15.042 [RAL:VERB] SX1301 ifchain  5: enable=1 rf_chain=0 freq=0 bandwidth=0 datarate=0 sync_word=0/0
2022-02-11 09:54:15.042 [RAL:VERB] SX1301 ifchain  6: enable=1 rf_chain=0 freq=200000 bandwidth=0 datarate=0 sync_word=0/0
2022-02-11 09:54:15.042 [RAL:VERB] SX1301 ifchain  7: enable=1 rf_chain=0 freq=400000 bandwidth=0 datarate=0 sync_word=0/0
2022-02-11 09:54:15.042 [RAL:VERB] SX1301 ifchain  8: enable=1 rf_chain=1 freq=-200000 bandwidth=2 datarate=2 sync_word=0/0
2022-02-11 09:54:15.042 [RAL:VERB] SX1301 ifchain  9: enable=1 rf_chain=1 freq=300000 bandwidth=0 datarate=0 sync_word=0/0
2022-02-11 09:54:15.042 [RAL:VERB] SX130x LBT not enabled
2022-02-11 09:54:15.042 [RAL:VERB] Station device: /dev/spidev0.0 (PPS capture disabled)
2022-02-11 09:54:17.983 [RAL:VERB] Concentrator started (2s941ms)
2022-02-11 09:54:17.983 [S2E:INFO] Configuring for region: EU863 -- 863.0MHz..870.0MHz
2022-02-11 09:54:17.984 [S2E:VERB]   DR0  SF12/BW125
2022-02-11 09:54:17.984 [S2E:VERB]   DR1  SF11/BW125
2022-02-11 09:54:17.984 [S2E:VERB]   DR2  SF10/BW125
2022-02-11 09:54:17.984 [S2E:VERB]   DR3  SF9/BW125
2022-02-11 09:54:17.984 [S2E:VERB]   DR4  SF8/BW125
2022-02-11 09:54:17.984 [S2E:VERB]   DR5  SF7/BW125
2022-02-11 09:54:17.984 [S2E:VERB]   DR6  SF7/BW250
2022-02-11 09:54:17.984 [S2E:VERB]   DR7  undefined
2022-02-11 09:54:17.984 [S2E:VERB]   DR8  undefined
2022-02-11 09:54:17.984 [S2E:VERB]   DR9  undefined
2022-02-11 09:54:17.984 [S2E:VERB]   DR10 undefined
2022-02-11 09:54:17.984 [S2E:VERB]   DR11 undefined
2022-02-11 09:54:17.984 [S2E:VERB]   DR12 undefined
2022-02-11 09:54:17.984 [S2E:VERB]   DR13 undefined
2022-02-11 09:54:17.984 [S2E:VERB]   DR14 undefined
2022-02-11 09:54:17.984 [S2E:VERB]   DR15 undefined
2022-02-11 09:54:17.984 [S2E:VERB]   TX power: 16.0 dBm EIRP
2022-02-11 09:54:17.984 [S2E:VERB]             27.0 dBm EIRP for 869.4MHz..869.65MHz
2022-02-11 09:54:17.984 [S2E:VERB]   JoinEui list: 10 entries
2022-02-11 09:54:17.984 [S2E:VERB]   NetID filter: 00000000-00000000-00000000-00000002
2022-02-11 09:54:17.984 [S2E:VERB]   Dev/test settings: nocca=1 nodc=1 nodwell=1
2022-02-11 09:54:20.166 [any:XDEB] RX mod=LORA f=868100000 bw=125 sz=16 dr=2 405634120200B803085B9577F37B8259
2022-02-11 09:54:20.166 [S2E:VERB] RX 868.1MHz DR5 SF7/BW125 snr=9.0 rssi=-22 xtime=0x7D00000045FF23 - updf mhdr=40 DevAddr=02123456 FCtrl=00 FCnt=952 FOpts=[] 085B9577 mic=1501723635 (16 bytes)
2022-02-11 09:54:20.166 [AIO:XDEB] [3|WS] > {"msgtype":"updf","MHdr":64,"DevAddr":34747478,"FCtrl":0,"FCnt":952,"FOpts":"","FPort":8,"FRMPayload":"5B9577","MIC":1501723635,"RefTime":0.000000,"DR":5,"Freq":868100000,"upinfo":{"rctx":0,"xtime":35184372093419299,"gpstime":0,"fts":-1,"rssi":-22,"snr":9,"rxtime":1644573260.166286}}

MaggieSalak commented 2 years ago

cc @danigian

beitler commented 2 years ago

Hi @MaggieSalak,

thanks for posting the issue.

The default TC_TIMEOUT in LBS is 60 sec - if that is the case why is the LBS waiting ~2min before reconnecting to discovery endpoint?

LBS is not waiting 2 minutes before reconnecting. In fact, in the first round, LBS directly reconnects to the data endpoint (MUXS). Only after a few of these reconnect attempts fail, LBS resorts back to the discovery endpoint (INFOS), and finally back to the CUPS endpoint (if configured).

In your case it seems like the two minutes come from lower level TCP timeouts (host probably not responding to SYN?):

2022-02-11 09:50:42.307 [AIO:ERRO] Recv failed: NET - Connection was reset by peer
2022-02-11 09:50:42.307 [AIO:DEBU] [3] WS connection shutdown...
2022-02-11 09:50:42.307 [TCE:VERB] Connection to MUXS closed in state 4
2022-02-11 09:50:42.307 [TCE:INFO] MUXS reconnect backoff 4s (retry 2)

Station received a server-side TCP connection reset at 09:50:42. A retry is scheduled for 09:50:46 (4 seconds immediate retry back-off).

2022-02-11 09:52:57.010 [AIO:ERRO] [3] WS connect failed: NET - The connection to the given server / port failed
2022-02-11 09:52:57.010 [AIO:DEBU] [3] WS connection shutdown...

Station reports at 09:52:57 that the connection failed. These are about 2 minutes after the retry attempt. This looks to me that the remote host did not respond to the TCP SYN and the TCP keep alive timeout kicks in. Next attempt is scheduled for 10 seconds later, i.e. 09:03:07:

2022-02-11 09:52:57.014 [TCE:INFO] INFOS reconnect backoff 10s (retry 1)

Finally, one minute after the INFOS reconnection attempt, station connects:

2022-02-11 09:54:11.899 [AIO:XDEB] [3] ws_connecting state=1
2022-02-11 09:54:11.899 [AIO:XDEB] [3] ws_connecting state=2
2022-02-11 09:54:11.900 [TCE:INFO] Connecting to INFOS: ws://192.168.86.33:5000

To get complete clarity, I propose you do a TCP packet scan. But from the logs it looks like you shutdown the target port around 09:50 and didn't restart it until around 09:54:11. That (together with a possible DROP firewall target instead of REJECT) could explain this behavior.

After reconnecting, why is the LBS dropping all messages?

Station drops the messages in your case because it had to go back to the discovery endpoint (INFOS). If the data endpoint retry (MUXS) had been successful, station would have played back all the buffered messages to the LNS. This is certainly something that can be improved. In the future, we are looking into making such behavior more configurable.

On a separate note: I see you are using version 2.0.5. Is there a chance you could move to version 2.0.6?

danigian commented 2 years ago

Hi @beitler, thanks for your answer. The behavior we are encountering is exactly what you described: the target port is being shutdown and the 2 minutes are kicking in. I was able to reproduce the issue with version 2.0.6 of the Basics Station.

Regarding the second part of your question, it would be great to configure the behavior of (not) dropping messages if we are going back to the discovery endpoint. In a scenario where there are two LoRa Network Servers for redundancy (one active, one passive) plus an external discovery service, it is important to have the possibility to switch to the other LNS without dropping potentially useful telemetry collected in the seconds/minutes for the switch.

Do you see any plan for making this happen in a future version of Basics Station? Thanks

smtc-bot commented 2 years ago

Thank you for your inquiry.

Customers are encouraged to submit technical questions via our dedicated support portal at https://semtech.force.com/ldp/ldp_support.

We invite all users to visit the LoRa Developer Portal Forum at https://forum.lora-developers.semtech.com and to join the thriving LoRa development community!

lorabasics / basicstation

LBS handling of Network Server going down and recovering #152