helium / router

router combines a LoRaWAN Network Server with an API for console, and provides a proxy to the Helium blockchain
Apache License 2.0
69 stars 32 forks source link

Device runs into JOIN_REQ and JOIN_ACCEPT Loop #818

Closed RTK221B closed 2 years ago

RTK221B commented 2 years ago

Devices not able to join the network

Some devices just go on a long JOIN_REQ and JOIN_ACCEPT loop before they actually join the network. The loop can last a few minutes to many hours. Happens to new devices joining the network as well as to old pre-registered devices. Here are some examples:

Screen Shot 2022-07-06 at 5 09 53 PM Screen Shot 2022-07-06 at 5 11 04 PM
GrayHatGuy commented 2 years ago

I'm seeing similar issues appears to have manifested this past week. Here is a problem description associated with the RAK/wisblock mappers. Join REQ_ACC looping.pdf

macpie commented 2 years ago

Could we know what is the hotspot in use? Could user also try threw a different hotspot just in case?

bsodmike commented 2 years ago

I just performed a Console upgrade, was previously on v2.2.14 and my device worked fine (albeit there were many grpc crash logs in the router). However, now my device is looping and not being accepted when joining:

Screenshot 2022-07-20 at 10 58 14
bsodmike commented 2 years ago

Update: After about 40 failures, it finally connected.

Tried disconnecting and re-connecting and it is back to looping!

12:52:41.514 -> <AS923> Using TX Frequency 923400000
12:52:41.514 -> <LM> OnRadioTxDone
12:52:41.514 -> <LM> OnRadioTxDone => RX Windows #1 5002 #2 6002
12:52:41.514 -> <LM> OnRadioTxDone => TX was Join Request
12:52:43.809 -> <RADIO> RX window timeout = 3000
12:52:56.456 -> <RADIO> RadioIrqProcess => IRQ_RX_TX_TIMEOUT
12:52:56.456 -> <LM> OnRadioRxTimeout
12:52:56.456 -> <LM> Join network failed 57 time(s)
buzzware commented 2 years ago

This just happened to me for the first time. I've just onboarded 8 same devices, and it only happened to one. event-debug (27).json.zip

UPDATE: out of 20 same devices, one gets stuck in this loop and never joins no matter what I do. I tried deleting the device and re-adding it to the console. UPDATE 2: deleting and re-adding did not work - changing the device EUI on both device & console did work.

eikaramba commented 2 years ago

i have this problem since months. on more than 50 devices it happens for around 20% (could be much higher, i cannot verify that however). Different hotspots, differen locations, sometimes device next to gateway.

MDogUnited commented 2 years ago

Just wanted to confirm we are facing the same issue with the Dragino LPS8. I can steadily reproduce the issue with the Dragino LPS8. We tested both EU868 and US915 versions of the LPS8 to see if that had any impact but both got stuck in the Join loop. It seems that the LPS8 doesn't actually send the downlink. Official Helium gateway seems to work fine.

We sometimes see this in the log. But it actually has the region parameters.

Wed Jul 27 08:44:13 2022 daemon.info helium_gateway[9253]: WARN ignoring downlink, no region params, module: gateway

We will try reverting to older FW of the Dragino to see if that solves anything.

madninja commented 2 years ago

Just wanted to confirm we are facing the same issue with the Dragino LPS8. I can steadily reproduce the issue with the Dragino LPS8. We tested both EU868 and US915 versions of the LPS8 to see if that had any impact but both got stuck in the Join loop. It seems that the LPS8 doesn't actually send the downlink. Official Helium gateway seems to work fine.

We sometimes see this in the log. But it actually has the region parameters.

Wed Jul 27 08:44:13 2022 daemon.info helium_gateway[9253]: WARN ignoring downlink, no region params, module: gateway

We will try reverting to older FW of the Dragino to see if that solves anything.

Try asserting a location for the hotspot with that error? Downlinks need to know the region parameters to regulate maximum tx power. This requires the hotspot to be location asserted

MDogUnited commented 2 years ago

Just wanted to confirm we are facing the same issue with the Dragino LPS8. I can steadily reproduce the issue with the Dragino LPS8. We tested both EU868 and US915 versions of the LPS8 to see if that had any impact but both got stuck in the Join loop. It seems that the LPS8 doesn't actually send the downlink. Official Helium gateway seems to work fine. We sometimes see this in the log. But it actually has the region parameters. Wed Jul 27 08:44:13 2022 daemon.info helium_gateway[9253]: WARN ignoring downlink, no region params, module: gateway We will try reverting to older FW of the Dragino to see if that solves anything.

Try asserting a location for the hotspot with that error? Downlinks need to know the region parameters to regulate maximum tx power. This requires the hotspot to be location asserted

I did what you suggested the gateway shows asserted in my wallet hotspot list. But now I get this error when a JOIN REQ is received. I'm on "helium_gateway 1.0.0-alpha.29" and Dragino LPS8 firmware "lgw-5.4.1655274924"

Fri Jul 29 07:53:53 2022 daemon.info fwd[10936]: PKTUP~ [server] JSON: {"rxpk":[{"jver":1,"tmst":368878588,"chan":3,"rfch":0,"freq":904.500000,"mid":0,"stat":1,"modu":"LORA","datr":"SF10BW125","codr":"4/5","rssis":-40,"lsnr":8.8,"foff":0,"rssi":-40,"size":23,"data":"APxFDBfAYxGkVgAAAMNjEaQBBAyS0rw="}]} Fri Jul 29 07:53:53 2022 daemon.info fwd[10936]: INFO~ [server2-up] received packages from mote: 00000000 (fcnt=0) Fri Jul 29 07:53:53 2022 daemon.info fwd[10936]: PKTUP~ [server2] JSON: {"rxpk":[{"jver":1,"tmst":368878588,"chan":3,"rfch":0,"freq":904.500000,"mid":0,"stat":1,"modu":"LORA","datr":"SF10BW125","codr":"4/5","rssis":-40,"lsnr":8.8,"foff":0,"rssi":-40,"size":23,"data":"APxFDBfAYxGkVgAAAMNjEaQBBAyS0rw="}]} Fri Jul 29 07:53:53 2022 daemon.info helium_gateway[11006]: INFO uplink @368878588 us, 904.50 MHz, Ok(DataRate(SF10, BW125)), snr: 8.8, rssi: -40, len: 23 from A8:40:41:20:39:24:41:50, module: gateway Fri Jul 29 07:53:53 2022 daemon.info fwd[10936]: INFO~ [server-up] PUSH_ACK received in 7 ms Fri Jul 29 07:53:53 2022 daemon.info fwd[10936]: JOIN_REQ:{"Size":23, "Rssi":-40, "snr":9, "AppEUI":"A41163C0170C45FC", "DevEUI":"A41163C300000056"} Fri Jul 29 07:53:54 2022 daemon.info fwd[10936]: INFO~ [server-down] PULL_ACK received in 0 ms Fri Jul 29 07:53:54 2022 daemon.info helium_gateway[11006]: WARN ignoring failed uplink Service(Rpc(Status { code: Unknown, message: "no response", metadata: MetadataMap { headers: {"user-agent": "grpc-erlang/0.1.0", "content-type": "application/grpc+proto", "grpc-encoding": "identity"} }, source: None })), oui: 12, uri: http://54.219.236.122:8080/, pubkey: 112ewJNEUfSg3Jvo276tMjzFC2JzmmZcJJ32CWz2fzYqbyCMMTe1, module: router

macpie commented 2 years ago

@RTK221B We are going to test with a new release of Router that will send errors back to gateway RS

jdgemm commented 2 years ago

@RTK221B making progress, validator component included in release, and now team's working on gateway-rs update.

jdgemm commented 2 years ago

@RTK221B latest gateway-rs release was released to manufacturers yesterday: https://engineering.helium.com/2022/08/18/miner-hotspot-release.html.

specifically this pr: https://github.com/helium/gateway-rs/pull/285

It typically takes a few days for manufacturers to push out the update, next week please start to reach back out folks with join loop concerns to confirm issues addresed.

jdgemm commented 2 years ago

Assume latest release addressed this...

eikaramba commented 2 years ago

still having that same problem grafik

i changed the rx1delay to 3 seconds (was 5) and it seems to work better now. will need to test this further however

michaeldjeffrey commented 2 years ago

Changing RX windows won’t change joining. The Join windows are fixed in the spec.

eikaramba commented 2 years ago

yeah it seems after it has joined a few times it works directly after resetting the device. however if i try it the next day it first runs into the loop issue again before joining after 1 minute or so

I can visually see the device sending the join request (indicated by a rgb led), then helium sending the accept call back and ~ 2 seconds later the join window 1 closes on the device (indicated by a rgb led).

btw i am using heltec cubecell dev board ab-01 with latest firmware. device spec for 868 region would be this: https://github.com/HelTecAutomation/CubeCell-Arduino/blob/master/libraries/LoraWan102/src/loramac/region/RegionEU868.h

i am using a public hotspot so i don't know what specs that has. however this is its url: https://explorer.helium.com/hotspots/112QZTkUTzE3btS8rJ4zGrY4nDDvyWeZ19i3DPcF7HUGCpoSCAqS

grafik

LoRaWAN EU868 Class A start!

trying to join-start
joining...TX on freq 868500000 Hz at DR 5
TX on freq 868500000 Hz at DR 5
Event : Tx Done
RX on freq 868500000 Hz at DR 5
Event : Rx Timeout
RX on freq 869525000 Hz at DR 0
Event : Rx Timeout
TX on freq 868300000 Hz at DR 5
TX on freq 868300000 Hz at DR 5
Event : Tx Done
RX on freq 868300000 Hz at DR 5
Event : Rx Error
RX on freq 869525000 Hz at DR 0
Event : Rx Timeout
TX on freq 868500000 Hz at DR 5
TX on freq 868500000 Hz at DR 5
Event : Tx Done
RX on freq 868500000 Hz at DR 5
Event : Rx Timeout
RX on freq 869525000 Hz at DR 0
Event : Rx Timeout
TX on freq 868500000 Hz at DR 5
TX on freq 868500000 Hz at DR 5
Event : Tx Done
RX on freq 868500000 Hz at DR 5
Event : Rx Timeout
RX on freq 869525000 Hz at DR 0
Event : Rx Timeout
TX on freq 868100000 Hz at DR 5
TX on freq 868100000 Hz at DR 5
Event : Tx Done
RX on freq 868100000 Hz at DR 5
Event : Rx Timeout
RX on freq 869525000 Hz at DR 0
Event : Rx Timeout
TX on freq 868100000 Hz at DR 5
TX on freq 868100000 Hz at DR 5
Event : Tx Done
RX on freq 868100000 Hz at DR 5
Event : Rx Timeout
RX on freq 869525000 Hz at DR 0
Event : Rx Timeout
TX on freq 868500000 Hz at DR 5
TX on freq 868500000 Hz at DR 5
Event : Tx Done
RX on freq 868500000 Hz at DR 5
Event : Rx Timeout
RX on freq 869525000 Hz at DR 0
Event : Rx Timeout
TX on freq 868100000 Hz at DR 4
TX on freq 868100000 Hz at DR 4
Event : Tx Done
RX on freq 868100000 Hz at DR 4
Event : Rx Error
RX on freq 869525000 Hz at DR 0
Event : Rx Timeout
TX on freq 868500000 Hz at DR 5
TX on freq 868500000 Hz at DR 5
Event : Tx Done
RX on freq 868500000 Hz at DR 5
Event : Rx Timeout
RX on freq 869525000 Hz at DR 0
Event : Rx Timeout
TX on freq 868100000 Hz at DR 5
TX on freq 868100000 Hz at DR 5
Event : Tx Done
RX on freq 868100000 Hz at DR 5
Event : Rx Timeout
RX on freq 869525000 Hz at DR 0
Event : Rx Timeout
TX on freq 868500000 Hz at DR 5
TX on freq 868500000 Hz at DR 5
Event : Tx Done
RX on freq 868500000 Hz at DR 5
Event : Rx Timeout
RX on freq 869525000 Hz at DR 0
Event : Rx Timeout
TX on freq 868500000 Hz at DR 5
TX on freq 868500000 Hz at DR 5
Event : Tx Done
RX on freq 868500000 Hz at DR 5
Event : Rx Timeout
RX on freq 869525000 Hz at DR 0
Event : Rx Timeout
TX on freq 868300000 Hz at DR 5
TX on freq 868300000 Hz at DR 5
Event : Tx Done
RX on freq 868300000 Hz at DR 5
Event : Rx Timeout
RX on freq 869525000 Hz at DR 0
Event : Rx Timeout
TX on freq 868300000 Hz at DR 5
TX on freq 868300000 Hz at DR 5
Event : Tx Done
RX on freq 868300000 Hz at DR 5
Event : Rx Timeout
RX on freq 869525000 Hz at DR 0
Event : Rx Timeout
TX on freq 868300000 Hz at DR 5
TX on freq 868300000 Hz at DR 5
Event : Tx Done
RX on freq 868300000 Hz at DR 5
Event : Rx Done
setup
joined
eikaramba commented 2 years ago

now testing whether activating the CF-List option helps for region 868 https://github.com/helium/router/issues/858 need to test tomorrow, as i said, once the device joined a few times it works. i need it to rest and try tomorrow again. should it directly join it is a strong indicator that the underlying problem is actually the problem described in the ticket

michaeldjeffrey commented 2 years ago

I can visually see the device sending the join request (indicated by a rgb led), then helium sending the accept call back and ~ 2 seconds later the join window 1 closes on the device (indicated by a rgb led).

This should not be the case. JOIN_ACCEPT_DELAY1 is always 5s, and JOIN_ACCEPT_DELAY2 is always 6s.

From these logs

TX on freq 868300000 Hz at DR 5 TX on freq 868300000 Hz at DR 5 Event : Tx Done RX on freq 868300000 Hz at DR 5 Event : Rx Error

It appears the device is sending 2 join request on the same frequency 1 right after the other.

eikaramba commented 2 years ago

This should not be the case. JOIN_ACCEPT_DELAY1 is always 5s, and JOIN_ACCEPT_DELAY2 is always 6s. Honestly i can be wrong, i just mention what i see on the device. I'm just using the quite popular heltec cubecell board. So you say that in total there SHOULD be 11 seconds at least between join attempts? I also don't see that behaviour on other lora nodes.

The device only is sending one join request, the logs are just showing it two times. i checked and it is always like this for the heltec logs (also checked all logs public available on the internet)

unfortunately the "bug" is still there, even after activating the CF-List. i will continue to inspect but fear there is not much i can do anymore.