Benchmarking - Githubissues

dpezely commented 2 years ago

Load & Capacity Measurement (some people call this "stress testing")

TL;DR or Executive Summary:

The upper limit for a devaddr 1024 slab seems to be approximately 5800 devices, but results still need to be replicated and bracketed.
Overall uplink/downlink traffic capacity of those 5800 devices:
- There were six organizations, each with 999 devices
- However, the sixth org was unable to see all of its devices Join
- Joins were sent with minimal jitter in batches of 999 devices
- After join, each device transmits every 20 to 40 seconds, with the interval staggered across devices in one-second steps
- Each device was set to re-join after 100 packets +/- 40, with the count staggered in steps of one, in the opposite direction to the transmit intervals
Whatever numbers are determined by router developers, Community members are encouraged to reproduce those results
- By following the same instructions
- Requires basic familiarity with Linux and Bash command-line
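
The staggering described in the list above could be generated roughly as follows. This is a minimal sketch of one possible reading: the function name `stagger` and the exact wrap-around behavior are assumptions for illustration, not taken from the benchmark scripts.

```python
def stagger(device_index: int) -> tuple[int, int]:
    """Return (seconds_between_transmits, packets_before_rejoin) for one device.

    Assumed reading of the description: transmit intervals climb from 20 to 40
    in one-second steps, while re-join counts descend from 140 to 60
    (100 +/- 40) in steps of one, i.e. in the opposite direction.
    """
    secs_between_transmits = 20 + device_index % 21   # 20, 21, ..., 40, then wraps
    packets_before_rejoin = 140 - device_index % 81   # 140, 139, ..., 60, then wraps
    return secs_between_transmits, packets_before_rejoin


# One organization of 999 devices, as in the summary above.
schedule = [stagger(i) for i in range(999)]
```

Because the two sequences wrap at different lengths (21 and 81), devices that share a transmit interval mostly end up with different re-join counts, which spreads Joins out over time.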

Approach:

Use multiple instances of virtual-lorawan-device
- e.g., Debug with sniffer tutorial
Use multiple Organizations, each with a unique NetID
Intentionally induce DevAddr collisions across these different (fake) organizations
Measure load on an isolated instance of Router to determine practical thresholds for LoRaWAN traffic capacity
Attempt to simulate real-world use cases

Tasks:

[x] Create scripts & documentation on benchmark procedure, and create PR on docs repo
[x] Implement on dev router instance and preliminary run as proof of correctness for instructions
[x] Measure devaddr allocation on development router instance with coarse-grained sustained loads
[x] Measure devaddr allocation on development router instance with fine-grained sustained loads
[x] Measure 1024 slab of devaddrs
[x] #759
[x] Measure traffic capacity on dev router instance with heavy sustained load
[x] #760

lthiery commented 2 years ago

If it's of interest, I have an OUI with an 8 devaddr slab running on Hetzner. I can provide access.

dpezely commented 2 years ago

A minor issue: ~~router seems to only permit one gateway coming from a single IP address.~~

EDIT: The connection issues experienced earlier were due to upstream validators (not routers). Existing validators specified within default gateway-rs configurations have yet to be updated, and because of that, they are not listening on the new ports. This leads to lots of connection retries, which shouldn't be an issue in the near future.

This is now optional: The solution could be simply to run gateway-rs from multiple AWS server instances (e.g., spot instances), which would also avoid resource contention from a single host sending all those uplinks. Multiple hosts were initially avoided to reduce complexity.

dpezely commented 2 years ago

Finally have all configured devices utilizing all gateway-rs instances on a single machine. (There was an off-by-one error in a Bash excerpt used for generating IP port number offsets.)
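
The Bash excerpt itself is not shown in this thread, so the following is only an illustration of the kind of port-offset generation where an off-by-one can leave one gateway-rs instance without any devices. The base port of 1680 and the function names are assumptions for this sketch.

```python
BASE_PORT = 1680  # assumed base UDP port for the first gateway-rs instance


def listen_ports(instance_count: int, base: int = BASE_PORT) -> list[int]:
    """Ports for gateway-rs instances 0..instance_count-1 on one host."""
    # Offsets start at 0. Starting them at 1 instead would skip the base port
    # and point one group of devices at a port where nothing is listening.
    return [base + offset for offset in range(instance_count)]


def port_for_org(org_index: int, instance_count: int, base: int = BASE_PORT) -> int:
    """Spread organizations across the available instances round-robin."""
    return base + org_index % instance_count
```

Checking that every port returned by `port_for_org` also appears in `listen_ports` is a cheap guard before starting a long run.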

We also have a few AWS server instances ready for running there, which will reduce Internet latency.

dpezely commented 2 years ago

On the larger run with hundreds of devices from an AWS server instance running gateway-rs and virtual-lorawan-device, the vast majority of devices were unable to Join.

Investigating...

dpezely commented 2 years ago

There seems to be a resource limit on the number of inbound socket connections to gateway-rs, which in turn limits the number of concurrent virtual-lorawan-device instances per gateway instance.

Investigating gateway-rs code and dependencies (e.g., Tokio), but Linux kernel vars seem sufficient (i.e., somaxconn).

madninja commented 2 years ago

huh? why would a single gateway-rs need to handle more than a few connections?

dpezely commented 2 years ago

> huh? why would a single gateway-rs need to handle more than a few connections?

This is for driving artificial load for sake of determining capacity of a particular size AWS server instance running router. Yes, it's far beyond the originally intended use cases for gateway-rs, so 100 devices per organization means hundreds of concurrent connections.

madninja commented 2 years ago

> > huh? why would a single gateway-rs need to handle more than a few connections?
>
> This is for driving artificial load for sake of determining capacity of a particular size AWS server instance running router. Yes, it's far beyond the originally intended use cases for gateway-rs, so 100 devices per organization means hundreds of concurrent connections.

uhh.. isn't that just the virtual lorawan device simulating hundreds of devices but it's still a single connection to gateway-rs for that packet forwarder? What am I missing? It's not like you have 100 packet forwarders connected to the same gateway-rs instance right?

dpezely commented 2 years ago

> virtual lorawan device simulating hundreds of devices

Yes, this looks like the best path. Either way, there will be new Rust code! (I was first exploring the least coding effort, but that's been sufficiently exhausted now.)

madninja commented 2 years ago

> > virtual lorawan device simulating hundreds of devices
>
> Yes, this looks like the best path. Either way, there will be new Rust code! (I was first exploring the least coding effort, but that's been sufficiently exhausted now.)

I'm not sure I understand this.. isn't the virtual LoRaWAN device repo already capable of simulating multiple devices?

dpezely commented 2 years ago

Now that we're using a single instance of virtual-lorawan-device for driving 100 devices per organization, driving load at scale is successful.

This also confirms abuse-prevention measures of the new reputation system, and that gets triggered within a minute or so. (Reputation score exceeded 50 in about 1 minute).

Packet uplink offers are getting dropped by router due to "devaddr not in subnet", so @lthiery it may be time to take you up on that kind offer of using your 8 devaddr slab.

dpezely commented 2 years ago

Clarified my understanding of a few points and updated the docs PR with more to come.

e.g., Location distance only comes into play when the same device sends via different gateways, but we do not yet have the means [for virtual-lorawan-device] to round-robin uplinks per device. I'll put that feature on my to-do list.

Now adding more instances of gateway-rs and asserting their locations to an obviously bogus location, such as six meters above a lake.

dpezely commented 2 years ago

EDIT for correctness: "packet offer rate" is packets per second, not a percentage graph.

On a non-trivial but light run, Grafana indicated 98-99 offers per second for Packet Offer Rate success. This run involved 5 gateways, 2 organizations each with 100 devices all driven from same region within AWS to minimize Internet latency.

The graph dipped briefly to 95-96 offers per second a few times within the first ten minutes and then down towards 75 for the next ten minutes.

Meanwhile, even though router's reputation response was temporarily disabled yet still keeping score, these 5 gateways each have a score in the thousands; e.g., 4k, 5k after only ten minutes; over 10k around 15 minute mark;

Run began at time = Tue Apr 26 22:04:01 2022 GMT and ended at 22:28:26 GMT.

This is NOT using a purely dedicated server for benchmarking yet.

dpezely commented 2 years ago

An issue with benchmark runs is that occasionally a gateway-rs instance or two will disconnect from its upstream validator. When this happens, traffic drops off because devices going through that gateway can't reach router.

That apparently happened with yesterday's run and again just now.

We may have to wait to get real results until after gateway-rs v1.0 is official (still v1.0 alpha at the moment, according to its repo).

There is an upcoming release for Validators, gateway-rs, etc.

My understanding is that newer Validator code will listen on changed/additional IP ports that are not yet open. This means that upon starting gateway-rs today, it will sometimes take an hour to connect because of trying (currently) invalid port numbers.

dpezely commented 2 years ago

Regarding devaddr allocation with an 8 slab:

200 devices in total were fine; a third set of 100 was unable to Join
- 2 separate instances of gateway-rs
- each gateway-rs on a unique Public IP address
- 100 devices per Organization
- each org unique to a single instance of virtual-lorawan-device
- Each device: default_secs_between_transmits = 15
Third org containing 100 saw 100% failure to join
- via new instance of virtual-lorawan-device
- hitting an existing gateway-rs already servicing first org with 100 devices
This total of 200 is not necessarily an upper limit
- occasionally, an entire running instance of virtual-lorawan-device and all its devices fail to Join
- it's currently uncertain whether this is merely a transient issue, software bug in either piece, etc.
After adding the third org, traffic for the first two orgs was apparently unharmed
More fine-grained tests have yet to be run for finding more precise value
- smaller tranche sizes; e.g., 100 -> 50, 25, 15, 10, 5
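
The finer-grained bracketing mentioned in the list above might proceed as in this sketch. Here `join_ok` is a hypothetical callback standing in for "run the benchmark with this many devices and report whether all of them Joined"; the real procedure is manual and noisier than a clean yes/no.

```python
from typing import Callable, Optional


def bracket_limit(
    join_ok: Callable[[int], bool],
    known_good: int,
    tranche_sizes: tuple[int, ...] = (100, 50, 25, 15, 10, 5),
    max_devices: int = 10_000,  # safety ceiling (assumed) so the loop always ends
) -> tuple[int, Optional[int]]:
    """Narrow the device limit to a (highest passing, lowest failing) pair."""
    low, high = known_good, None
    for size in tranche_sizes:
        # Keep adding tranches of this size until one fails or would reach
        # a total that is already known to fail.
        while (high is None or low + size < high) and low + size <= max_devices:
            if join_ok(low + size):
                low += size
            else:
                high = low + size
                break
    return low, high
```

Starting from the 200 devices that were fine, the result brackets the limit to within the smallest tranche size, provided the Join outcome is repeatable at a given device count.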

When using lower time between transmits such as 10 or 5, there were too many late downlinks, so either gateway-rs or virtual-lorawan-device would need to be patched.

Note:

Neither gateway-rs nor virtual-lorawan-device were intended to be used for this use-case, so we may ultimately need to modify one or the other. While some transient errors are expected, I'm exploring a feature of virtual-lorawan-device to steadily increase pre-configured devices (rather than current behavior of all-or-nothing) while remaining within a configurable error rate, defaulting to 1:500 or so.
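
The ramp-up idea could look something like the sketch below. It is written in Python for brevity rather than as a patch to virtual-lorawan-device, and every name here (`next_active_count`, `step`, `error_budget_denominator`) is hypothetical.

```python
def next_active_count(
    active: int,
    configured: int,
    errors: int,
    attempts: int,
    step: int = 10,
    error_budget_denominator: int = 500,  # 1:500, the default suggested above
) -> int:
    """Decide how many pre-configured devices should be running next.

    Holds at the current count while the observed error rate exceeds the
    budget; otherwise activates up to `step` more devices.
    """
    # Integer comparison of errors/attempts > 1/denominator avoids float rounding.
    if errors * error_budget_denominator > attempts:
        return active
    return min(configured, active + step)
```

Calling this once per measurement window would replace the current all-or-nothing start with a gradual climb that pauses whenever transient errors exceed the configured rate.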

dpezely commented 2 years ago

The prior issue is apparently resolved: router-dev had an arbitrarily low ceiling of DC for state channels, so it would become exhausted quickly. This was intentional for exercising certain behavior within router.

Preliminary numbers look great for overloading a devaddr 8 slab:

first 200 devices: 149 No Joins at beginning but then stable
second 200: 117 No Joins, same pattern
RxWindow expired occurred at ratios between 1:3 and 2:5, which is ultimately a side effect of overloading (abusing) individual instances of gateway-rs

dpezely commented 2 years ago

The upper limit seems to be approximately 2250 devices for an 8 slab devaddr, but results still need to be replicated and bracketed.

To sanity-check my thought process here: despite some noise upon starting each batch, once each batch's Joins have all settled, I take that to mean the devaddr 8 slab + MIC disambiguation resolution has absorbed the pool of new devices.
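
For readers unfamiliar with MIC disambiguation, the idea can be sketched as follows: when several devices share one DevAddr, each candidate's session key is tried until the computed MIC matches the one carried in the frame. This is not router's implementation. In particular, the real LoRaWAN MIC is four bytes of AES-128 CMAC, and HMAC-SHA256 is swapped in here purely so the sketch runs with the Python standard library.

```python
import hashlib
import hmac


def stand_in_mic(key: bytes, frame: bytes) -> bytes:
    """4-byte integrity code. NOT the LoRaWAN algorithm (that is AES-128 CMAC);
    HMAC-SHA256 is a stand-in to keep this example dependency-free."""
    return hmac.new(key, frame, hashlib.sha256).digest()[:4]


def resolve_devices(candidates: dict[str, bytes], frame: bytes, mic: bytes) -> list[str]:
    """Return every candidate device whose key reproduces the frame's MIC.

    `candidates` maps a hypothetical device id to its session key, for all
    devices currently sharing the frame's DevAddr.
    """
    return [
        device_id
        for device_id, key in candidates.items()
        if hmac.compare_digest(stand_in_mic(key, frame), mic)
    ]
```

Exactly one match identifies the sender. More than one match is the unresolvable ambiguity case, and the work per uplink grows with the number of devices sharing the address.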

The last 100 devices, of which only about half were able to Join, took dozens of attempts to Join. This is likely due to our own traffic interfering with itself while abusing gateway-rs and the artificial nature of this simulation.

This latest run used the following configuration:

3 hotspots on AWS instance 1
2 hotspots on AWS instance 2
1k devices in first org on instance 1 spraying across those 3 hotspots
1k devices in first org on instance 2 spraying across those 2 hotspots
added another org, each with 100 devices per AWS host, also spraying across hotspots
Upon adding that final 100 devices, it's estimated that half were never able to Join

mfalkvidd commented 2 years ago

Very interesting work, well done!

Did you record cpu usage data for the router? I'm curious whether the bottleneck is cpu power (so larger reuse would be possible by adding more cpu cores, or faster cores if the bottleneck is single threaded), or if the bottleneck is caused by collisions (more than 1 device key matches the mic, resulting in ambiguity that cannot be resolved).

2250 is a much larger amount than I've seen discussed on other LoRaWAN networks. I don't think anyone has performed as extensive testing, so the result is valuable regardless of where the bottleneck is. Big thanks for doing this work.

dpezely commented 2 years ago

@mfalkvidd Thank you! CPU and Erlang stats will be collected and shared from subsequent runs, but neither were limiting factors thus far. That run was still preliminary, as we had previously overlooked local configuration details on router-dev, so that run was just to confirm that we resolved earlier issues.

For benchmarks thus far, this is only measuring capacity of the 8 slab of devaddrs (rather than CPU+RAM consumption, but I'll post those next time as well).

Once our new Message Integrity Check gets merged, we get to do it all again. See payload_mic() in https://github.com/helium/erlang-lorawan/blob/initial-lib/src/lora_core.erl for the new implementation.

In addition to the 8 slab, traffic congestion was a limiting factor. Next runs will increase seconds between transmits per virtual device to 30 seconds and may have to increase from there. virtual-lorawan-device sends UDP datagrams to gateway-rs which provides a good simulation of physical radios (e.g., dropped packets versus RF noise), so this makes for a different kind of Load & Capacity test situation than HTTP requests.

The goal is to get a more precise number than "2250" on each run, and then show each set of numbers from 3 consecutive runs-- ideally within a small margin of error.
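
One way to decide whether consecutive runs agree is sketched below. The 5% tolerance is an assumed placeholder, not a threshold chosen for this benchmark, and the function names are illustrative.

```python
from statistics import mean


def relative_spread(counts: list[int]) -> float:
    """Range of the per-run device counts as a fraction of their mean."""
    return (max(counts) - min(counts)) / mean(counts)


def runs_agree(counts: list[int], tolerance: float = 0.05) -> bool:
    """True when at least 3 runs fall within the tolerated relative spread."""
    return len(counts) >= 3 and relative_spread(counts) <= tolerance
```

Feeding it the maximum joined-device count from each of three consecutive runs gives a simple pass/fail on reproducibility before a number is published.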

More runs are planned for today involving 12 hotspots (instances of gateway-rs running in AWS).

dpezely commented 2 years ago

Status update and highlights of corrections:

Prior benchmark runs had non-conforming App EUI and App Keys for devices (must be unique)-- now fixed
Prior runs intending to have more than zero seconds between device transmits remained at zero-- now fixed
Router's method of tracking unique devices has been revised-- let's just say it's been battle-tested...
Since previous update, use of a multiplexer has been added:
- More accurately reflects real-world use cases because each device's packets go through all routers known to the muxer
- Prior config generation remains in doc but commented-out as optional
Draft doc has been revised (also linked from ticket description)
- https://deploy-preview-846--helium-docs.netlify.app/use-the-network/run-a-network-server/router-benchmarking/

dpezely commented 2 years ago

Update after several successful runs:

First, router-dev does not have an 8 slab devaddr block but instead is a 1024 slab
- @mfalkvidd , I misspoke previously
- This is why the count was so much higher than you expected
The upper limit on concurrent devaddr allocations on router-dev seems to be 5800
- We still need to reproduce for confirmation of the number
- The actual count was 5806
- We highly encourage the Community to attempt reproducing our results and reporting
Interpretation of graphs in Grafana from which we reached that number has been updated in the doc
Final updates to the doc include a few lines of Bash for leveling traffic load on Router and host sending traffic
At time of writing this post, there is an issue with upstream Validators that currently impacts further testing
- Similarly, text and procedures have been added to the doc for diagnosing such issues
- e.g., using API into gateway-rs instead of extracting from its logs for more reliable results
Thank you, everyone, for your patience with this endeavor!
- Planning and executing this is unlike its equivalent for HTTP traffic
- The number of experimental conditions and variables to track with each run is like doing multi-variate analysis or holding a large polynomial equation in one's head, but hopefully the doc accommodates copy/paste
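
As a quick piece of arithmetic on the 5806 figure reported in this update, the following sketch works out how heavily the 1024 slab is shared, assuming all 5806 devices hold allocations from that slab at the same time.

```python
import math

SLAB_SIZE = 1024   # devaddrs in the router-dev slab
DEVICES = 5806     # observed count of concurrent allocations

# Average number of devices sharing each DevAddr.
average_per_devaddr = DEVICES / SLAB_SIZE

# Pigeonhole bound: at least one DevAddr must be shared by this many devices.
minimum_worst_case = math.ceil(DEVICES / SLAB_SIZE)

print(f"{average_per_devaddr:.2f} devices per DevAddr on average")
print(f"at least one DevAddr is shared by {minimum_worst_case} or more devices")
```

So each address is shared by between five and six devices on average, which is the load that MIC disambiguation has to absorb on every uplink.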

helium / router

Benchmarking #677