Messages not appearing - network unstable

slavendam commented 2 years ago

Last 8-9 days we are getting increased number of reports about problems with network. Noticed increased number of upplink_dropped_late packets.

Case 1. - Device joins network successfully, sending unconfirmed uplink messages, Multiple packets turned on and set to ALL. Sending on SF12. In Console user gets 2-3 uplinks (counters 0-2) and nothing after that. If he restars device, rejoin, same situation. This is tested with 2 different devices, from 2 different producers (Glamos and RAK). Devices were working correctly before and since then nothing changed. It seems like session is lost and network is throwing away all packets (network can't reckognize device). This user even created video to show what is happening so I can send it in private chat, you can contact me on Discord

Case 2. - Device joins network, Multiple packets set to ALL. Sending on SF12 -> in Console getting 1-3 messages and nothing after that. User tested with Spectrum analysis and he can see that device is transmiting LoRa packets. On January 25th he could reach to 500 hotspots from that location with same antenna, and on Jan 27th and since then he can reach only 50 (same location, same antenna). Lot of Late packets appeared. We tested on SF7 and SF9 and it worked fine for tested 20 messages, but on SF12 it blocked again after few attempts.

Case 3 - In case 2 user is getting splitted messages. counter 11 (65 HS) counter 12 (70 HS), counter 13 (70 HS), counter 12 (3 HS), counter 14 (65 HS). Some hotspots are late and router sees that as a new message.

Case 4 - This was during test of Case 2. We were sending 2x 10 messages, so 20 uplinks in total (counters 0 to 19). At counter 11 uplinks stopped coming to Console and after some time (60+s) we got 1 uplink with counter 12, 7 more join request messages one after other and 7 Join Acc (check photo bellow). So it seems that network lost Join session keys and all Uplinks were seen as Join Req. We were sending messages every 12-14s. You see that cnt10 has time 2:55:06, cnt11 55:14, cnt12 55:23. So network received message at correct time, and that message was hold somewhere before it was showed in Console. You can see on photo that Uplink came AFTER all these Join Req and before Join Acc and number of Join Req (7) is same as number of missing Uplinks (13-19). So definitely there is something wrong.

Other users also reported similar issues! In Discord channel some users were writing about problems with not getting messages or blocking.

From my experience and point of view it seems this issue is happening more often on SF12 which gives higher range and more hotspots. So maybe number of hotspots (multiple packets) is blocking somehow router or console. Other point is that after 2-3 months without dropped_late_uplink this started to happen again. Maybe wait/hold time is not long enough or hotspots are slowed down somehow.

Were there some updates on network in last period which could affect all these problems?

I kindly ask you to take a look on this issues because network is not usable at all for coverage mapping or sensor usage lately.

I'm offering my help for discusses or real field testing, just to solve this ASAP.

Thanks :)

slavendam commented 2 years ago

Hi guys, these problems became really huge. I spent last 2 days in testing network and trying to find out what is pattern of issue.

I would split issue in 2 parts:

Multiple packets issue
Problems of joining and holding keys of device in network

No.1 I found pattern for Multiple packets. There is HUGE amount of upplink_dropped_late packets. It is happening more often in area with more hotspots. This issue was happening 2-3 months before and was solved with increasing hold_time. Every message has ~10 dropped packets (and probably there is some limit in displaying all dropped). Other problem is huge oscillation of number of packets. So device is on the table and sending with some time delay. In photo bellow we were sending messages and for counters 18-20 I've turned off Multiple packets. At that time it worked as it should (just 1 hotspot). But just with turning again Multiple packets you can see oscillations again. In normal times (10-15 days ago), this number was stable (it can oscillate +-3 hotspots).

In area with 500 hotspots, it can see just 50 now.

Other wierd issue is that this same device was receiving just 1-2 packets from same location, and same situation was after trying to restart device and rejoin 3-4 time. What I did is just removed device, added it again, joined with device and we got results with 20-30 hotspots. Same location, same device, same antenna, 30min difference in time.

On other side other device once join and number of hotspots is stable (45-50), and later join with same oscillations.

No.2 Part of this issue is described in previous point. It seems like network put some devices (or join sessions) in unfavorable position. In that case keys for join session are lost, number of hotspots oscillates a lot, joining is not working (and then starts after readding device)... I still didn't find pattern for this issues.

Is it possible to return version of console from 2-3 weeks before (at least on staging) so we can test is there issue in updates since then? I kindly ask you to put this issue No.1 as priority because this is most important feature for testing Helium network. And testing on this way gives really bad look of network.

For my development experience and previous experience with Helium (and updates in Helium), I believe that all these issues can be caused by same thing. Contact me if you need help with fast testing to try figure out what is going on.

Slaven

slavendam commented 2 years ago

Just for the record, most users started noticing this issue with Multiple packets on January 26th or 27th. Just saw that some console release was these days https://engineering.helium.com/2022/01/27/console-updates-2.2.1.html

jdgemm commented 2 years ago

Thank you for taking the time to provide extensive feedback. Also appreciate the willingness to share additional information and help test. The passion and use of the network is definitely appreciated. The title of the issue is less appreciated given that the network is and continues to be usable (i.e., you are still receiving packets as shown in your screenshots).

You've conflated a number of items which makes it challenging to pinpoint the issues, which then makes it difficult to work on them.

In this ticket can you let us know which issue is the highest priority so the team can have a chance to investigate? It would appear that the main issue is:

with multiple packet setting enabled, the number of hotspots sending packets is inconsistent

Also going forward, please focus on 1 issue/posting. This will allow the team to have a better understanding of your situation and also it will be easier for you not only to post, but also to track the progress.

slavendam commented 2 years ago

Problem is that it took some time to separate issues and understand that they are not appearing in the same pattern. After 2 days of intensive testing I'm bit wiser. :)

Yes, this issue "with multiple packet setting enabled, the number of hotspots sending packets is inconsistent" is highest priority. It includes inconsistent number and lot of "late packets".

But as I said, I believe that all of these issues came from update from 27th Jan. And I believe that is can be connected with "Fix ets leaks" commit. But this part you will know better.

macpie commented 2 years ago

Hi slavendam,

Thanks for your report. In order to better figure out what is going on here we will focus on the inconsistency of packets received for each frame count.

Console is currently under going some issue see https://github.com/helium/console/issues/992 where it is exhausting database connections and then cannot update events that you see in that table. This should NOT have any impact on actual data transfer meaning that your integration should still receive everything normally (not including late packet).

That being said, we would suggest checking that, when you only see 1 hotspot in a frame if you integration is receiving 1 hotspot or more. You can use something like https://requestbin.com/r as a simple http integration to see that data and share with us if it is ok with you.

If not related to Console's issue, we are going to need more data to be able to isolate the problem, please note that we have a bug template here. It might not be perfect (feel free to open a PR to improve it if need be) but it really helps us gather minimum information needed to debug.

Your latest screenshot is the one we should probably focus on. It would be helpful to get more data like:

How many late packets did you see for frame 18, 19, 20
RSSI and SNR
Data rate
Hotspot Names
Console Device ID
How often does the device transmit

(If data is not available anymore feel free to reproduce and attach screen shot again)

We will also be happy to do some live debugging if necessary.

Thank for your patience.

slavendam commented 2 years ago

I saw that Console sometimes won't update live all data, but this is not root for this issue. We have Integration, and you can check bellow values. Number of hotspots/gateways is the same as in the Console (our App can show few less because of some filtering).

If not related to Console's issue, we are going to need more data to be able to isolate the problem, please note that we have a bug template here. It might not be perfect (feel free to open a PR to improve it if need be) but it really helps us gather minimum information needed to debug.

Describe the bug https://github.com/helium/router/issues/591#issuecomment-1028348294 Most of things I can tell is here

To Reproduce Let device sending in area of 50+ hotspots on higest SF with 2dBi antenna. Device should be at same place all the time.

Expected behavior Number of hotspots that received message is inconsistent and there is lot of Late packets. (check photo on link above)

Device Info (please complete the following information):

Environment (Production, Staging) -> Production
Console Device ID -> 2aebdf6f-90f3-4fa7-afe8-cd9b051c0802
Device dev eui / app eui -> 1D4A7D0000146FFE / D84BCF431BCCC0E1
How often is device sending data -> We tried with 1 and 2 minute delay
Region -> EU, Belgium
How many late packets did you see for frame 18, 19, 20 -> I checked this and it was 0 (here multiple packets were disabled)
- RSSI and SNR -> it doesn't oscillate, for same hotspot it is near average everytime it reports
- Data rate -> we used SF11 and SF12
- Hotspots (names) used to transfer data -> many of them, check photo bellow. There are no pattern which hotspot will be showed and which one will not, it is random. Sometimes you get report from hotspot X, sometimes you don't get.

slavendam commented 2 years ago

Is it possible to run previous versions of console locally or retrieve previous version? And then go step by step with adding commits and see which one made problem and then find issue in that part?

macpie commented 2 years ago

Thank you, this is very helpful! (Awesome integration dashboard by the way!)

OK so we at least know that this is not related to Console's issue.

Is it possible to run previous versions of console locally or retrieve previous version? And then go step by step with adding commits and see which one made problem and then find issue in that part?

Reverting is not an easy process, especially for a production environment (other fixes have been put in place since then). Could you test on Staging https://staging-console.helium.wtf/ ? If you are able to reproduce there it would be easier to investigate and/or revert and see if that fixes it.

Last but not least, I do see that some hotspots listed are "far away", more than half seemed to be a over 10km. It would not be uncommon for those hotspots to not always pickup your device's frames as conditions out there change.

slavendam commented 2 years ago

(Awesome integration dashboard by the way!)

Thanks! There is good reason why GLAMOS is most advanced testing tool for LoRaWAN network. :)

Could you test on Staging https://staging-console.helium.wtf/ ?

I'll give it a try tomorrow as soon as I get to users

It would not be uncommon for those hotspots to not always pickup your device's frames as conditions out there change.

Signal quality can change (RSSI) and some collision can happen (unlikely). But not case here. One time it picks up all, next time some close missing, next time just far missing. LoRaWAN signal is pretty robust during the short time periods (<1h). If you are looking few days, you can see that signal quality can change which leads to not reaching just to furthest hotspots. In our issue we have random pattern.

When you find issue, how fast can you deploy fix (like in hours, days..)? We have lot of users who are waiting for this fix to test network, to test antennas and locations so I can give them some time period.

macpie commented 2 years ago

The big difference in hotspot numbers is odd, I just thought I would mention that in "real world" differences are to be excepted.

Fixes can be deployed quickly.

slavendam commented 2 years ago

Could you test on Staging https://staging-console.helium.wtf/ ? If you are able to reproduce there it would be easier to investigate and/or revert and see if that fixes it.

Tested - same issues happening in production and staging consoles. One period device gets ~50 HS (with big number of late packets), and then next period 3-5 HS. Tested with 3 different devices, with testing messages every 30s/1min, on SF12 in EU, tried with readding device to console, rejoining, sending more or less in same period.

Additional info we got today. After every rejoin issue situation can be different. Situation 1: Device joins - getting GOOD data some time - then getting BAD data - rejoin - sending good Situation 2: Device joins - sending good longer time ~2h - rejoin - sending bad data from begining

So JOIN can change situation (in most cases on better), but situation can change on worse at some period of working (not with pattern)

At one moment (like 2 messages from 1000) we got ALL HS (~70) without late packets.

I believe number of hotspots and late packets are from same problem in system. What can we proceed in next steps? Can we revert on old version to staging to test it for at least 12h? Version should be from before Jan 27th of Production. Lot of users waiting for fix as this is important feature of network...

macpie commented 2 years ago

We will attempt to revert staging end of this week or early next week. Note that is might not be possible if the chain added chain vars and others

slavendam commented 2 years ago

I kindly ask you to do this ASAP and to publish update about this on Discord because I'm getting lot of questions. And if some of you can go through code as try to check is there some issue that was overlooked. There are no lot of commits that were published on Jan 27th.

Some part of code that can affect late packets (and lost packets) is issue. Maybe commit "fix ets leak" in #537 can cause this because any other commit from that period don't seem to have big changes. Or maybe #564

jdgemm commented 2 years ago

Please be patient, the team is continuing to investigate.

macpie commented 2 years ago

I kindly ask you to do this ASAP and to publish update about this on Discord because I'm getting lot of questions. And if some of you can go through code as try to check is there some issue that was overlooked. There are no lot of commits that were published on Jan 27th.

Some part of code that can affect late packets (and lost packets) is issue. Maybe commit "fix ets leak" in #537 can cause this because any other commit from that period don't seem to have big changes. Or maybe #564

I have taken a look at those changes none of them should cause this issue. I am more leaning towards a potential client side issue here. I will attempt to revert Staging early next week some we can 2x check and maybe do some more live debugging.

slavendam commented 2 years ago

I am more leaning towards a potential client side issue here.

Few different locations, many devices which worked great and then over night issues happend. Would be happy to understand what issue could happen on user side without any change and at the same time on all devices. :)

Let me know time for reverting Staging so I can prepare users so they can make time.

macpie commented 2 years ago

We can also attempt a live debugging session, very hard to debug those things after the fact. You can reach out in our discord channel #console and we can try to schedule something from there.

macpie commented 2 years ago

After testing it does not appear to be a Router issue. Each uplink seems to be received (no frames were missed) with sometimes a big variation in how many hotspots saw the frame.

We have incoming work that will help communication between Hotspots and Routers but even then variations, in a decentralized network, should be expected.

slavendam commented 2 years ago

What part of network is in charge for generating Uplink package which showed in Console and which is sent to Integration?
Why there were no ANY variations since September 2021 until end of January 2022?

Each uplink seems to be received (no frames were missed) with sometimes a big variation in how many hotspots saw the frame.

"Sometimes" would be that 10 messages has stable number of HS, and 1 is different. But here in ALL 10 messages we have 20/70 late packets. And in half we receive only 10-20 HS (of 70), in other half we receive 40-50 of 70.

jdgemm commented 2 years ago

The team has spent many hours investigating and concluded this is not a Router issue.

It's unclear what changes have been made to the coverage in your area as it is a decentralized network.

helium / router

Messages not appearing - network unstable #591