helium / router

router combines a LoRaWAN Network Server with an API for console, and provides a proxy to the Helium blockchain
Apache License 2.0

Benchmarking #677

Closed dpezely closed 2 years ago

dpezely commented 2 years ago

Load & Capacity Measurement (some people call this "stress testing")

TL;DR or Executive Summary:

Approach:

Tasks:

lthiery commented 2 years ago

It's of interest. I have an OUI with an 8 devaddr slab running on Hetzner, and I can provide access.

dpezely commented 2 years ago

A minor issue: router seems to only permit one gateway coming from a single IP address.

EDIT: The connection issues experienced earlier were due to upstream validators (not routers). Existing validators specified within default gateway-rs configurations have yet to be updated, and because of that, they are not listening on the new ports. This leads to lots of connection retries, which shouldn't be an issue in the near future.

This is now optional: the solution could simply be to run gateway-rs from multiple AWS server instances (e.g., spot instances), which would also avoid resource contention from a single host sending all those uplinks. Multiple hosts were initially avoided to reduce complexity.

dpezely commented 2 years ago

Finally have all configured devices utilizing all gateway-rs instances on a single machine. (There was an off-by-one error in a Bash excerpt used for generating IP port number offsets.)
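For anyone reproducing this setup: the off-by-one is easy to hit when deriving per-instance ports from a base port. Below is a minimal sketch in Python (the original Bash excerpt is not shown in this thread). The base port 1680 and the function name `listen_ports` are assumptions for illustration only.

```python
def listen_ports(base_port: int, instances: int) -> list[int]:
    """Return one UDP listen port per local gateway-rs instance.

    Offsets start at 0, so the first instance gets base_port itself and
    the last instance gets base_port + instances - 1.
    """
    if instances < 0:
        raise ValueError("instances must be non-negative")
    return [base_port + offset for offset in range(instances)]


# Hypothetical values: 5 local instances starting at port 1680.
ports = listen_ports(1680, 5)
print(ports)
```

Generating both the gateway-rs listen ports and the device-side target ports from the same list avoids one side counting from 0 while the other counts from 1.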

We also have a few AWS server instances ready for running there, which will reduce Internet latency.

dpezely commented 2 years ago

On the larger run with hundreds of devices from an AWS server instance running gateway-rs and virtual-lorawan-device, the vast majority of devices were unable to Join.

Investigating...

dpezely commented 2 years ago

There seems to be a resource limit on the number of inbound socket connections per gateway-rs instance, which in turn limits the number of concurrent virtual-lorawan-device instances per gateway instance.

Investigating gateway-rs code and dependencies (e.g., Tokio), but Linux kernel vars seem sufficient (e.g., somaxconn).

madninja commented 2 years ago

huh? why would a single gateway-rs need to handle more than a few connections?

dpezely commented 2 years ago

> huh? why would a single gateway-rs need to handle more than a few connections?

This is for driving artificial load for sake of determining capacity of a particular size AWS server instance running router. Yes, it's far beyond the originally intended use cases for gateway-rs, so 100 devices per organization means hundreds of concurrent connections.

madninja commented 2 years ago

> > huh? why would a single gateway-rs need to handle more than a few connections?
>
> This is for driving artificial load for sake of determining capacity of a particular size AWS server instance running router. Yes, it's far beyond the originally intended use cases for gateway-rs, so 100 devices per organization means hundreds of concurrent connections.

uhh.. isn't that just the virtual lorawan device simulating hundreds of devices but it's still a single connection to gateway-rs for that packet forwarder? What am I missing? It's not like you have 100 packet forwarders connected to the same gateway-rs instance right?

dpezely commented 2 years ago

> virtual lorawan device simulating hundreds of devices

Yes, this looks like the best path. Either way, there will be new Rust code! (I was first exploring the least coding effort, but that's been sufficiently exhausted now.)

madninja commented 2 years ago

> > virtual lorawan device simulating hundreds of devices
>
> Yes, this looks like the best path. Either way, there will be new Rust code! (I was first exploring the least coding effort, but that's been sufficiently exhausted now.)

I'm not sure I understand this.. isn't the virtual LoRaWAN device repo already capable of simulating multiple devices?

dpezely commented 2 years ago

Now that we're using a single instance of virtual-lorawan-device for driving 100 devices per organization, driving load at scale is successful.

This also confirms that the abuse-prevention measures of the new reputation system work: they get triggered within a minute or so (the reputation score exceeded 50 in about 1 minute).

Packet uplink offers are getting dropped by router due to "devaddr not in subnet", so @lthiery it may be time to take you up on that kind offer of using your 8 devaddr slab.

dpezely commented 2 years ago

Clarified my understanding of a few points and updated the docs PR with more to come.

e.g., location distance only comes into play when the same device sends via different gateways, but we do not yet have the means [in virtual-lorawan-device] to round-robin each device's uplinks across gateways. I'll put that feature on my to-do list.

Adding more instances of gateway-rs and asserting their locations to an obviously bogus location, such as six meters above a lake.

dpezely commented 2 years ago

EDIT for correctness: "packet offer rate" is packets per second, not a percentage graph.

On a non-trivial but light run, Grafana indicated 98-99 offers per second for Packet Offer Rate success. This run involved 5 gateways, 2 organizations each with 100 devices all driven from same region within AWS to minimize Internet latency.

The graph dipped briefly to 95-96 offers per second a few times within the first ten minutes and then down towards 75 for the next ten minutes.

Meanwhile, router's reputation response was temporarily disabled but still keeping score, and these 5 gateways each have a score in the thousands: e.g., 4k, 5k after only ten minutes, and over 10k around the 15-minute mark.

Run began at time = Tue Apr 26 22:04:01 2022 GMT and ended at 22:28:26 GMT.

This is NOT using a purely dedicated server for benchmarking yet.

dpezely commented 2 years ago

An issue with benchmark runs is that occasionally a gateway-rs instance or two will disconnect from its upstream validator. When this happens, traffic drops off because devices going through that gateway can't reach router.

That apparently happened with yesterday's run and again just now.

We may have to wait to get real results until after gateway-rs v1.0 is official (still v1.0 alpha at the moment, according to its repo).

There is an upcoming release for Validators, gateway-rs, etc.

My understanding is that newer Validator code will listen on changed/additional IP ports that are not yet open. This means that upon starting gateway-rs today, it will sometimes take an hour to connect because of trying (currently) invalid port numbers.

dpezely commented 2 years ago

Regarding devaddr allocation with an 8 slab:

When using a shorter time between transmits, such as 10 or 5 seconds, there were too many late downlinks, so either gateway-rs or virtual-lorawan-device would need to be patched.

Note:

Neither gateway-rs nor virtual-lorawan-device was intended for this use case, so we may ultimately need to modify one or the other. While some transient errors are expected, I'm exploring a feature for virtual-lorawan-device that steadily increases the number of active pre-configured devices (rather than the current all-or-nothing behavior) while remaining within a configurable error rate, defaulting to 1:500 or so.
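The ramp-up idea could look roughly like the sketch below. This is not virtual-lorawan-device code (that project is Rust, and the feature does not exist yet); the function name, the step size, and the hold-when-over-budget policy are all assumptions for illustration.

```python
def next_device_count(active: int, configured: int, errors: int, attempts: int,
                      step: int = 10, max_error_rate: float = 1 / 500) -> int:
    """Decide how many devices should be active in the next interval.

    active:     devices currently transmitting
    configured: total pre-configured devices available
    errors:     failed uplinks/Joins observed in the last interval
    attempts:   total uplinks/Joins attempted in the last interval
    """
    if attempts == 0:
        return active  # nothing measured yet, so hold steady
    if errors / attempts > max_error_rate:
        return active  # over the error budget: stop adding devices
    # Within budget: bring up another batch, never exceeding the configured pool.
    return min(configured, active + step)


# Hypothetical interval: 100 active devices, no errors in 1000 attempts.
print(next_device_count(100, 2250, errors=0, attempts=1000))
```

Holding at the current count (instead of backing off) is the simplest policy; a real implementation might also shed devices when the error rate stays high.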

dpezely commented 2 years ago

The prior issue is apparently resolved: router-dev had an arbitrarily low ceiling of DC for state channels, so the balance would become exhausted quickly. The low ceiling was intentional, for exercising certain behavior within router.

Preliminary numbers look great for overloading a devaddr 8 slab:

dpezely commented 2 years ago

The upper limit seems to be approximately 2250 devices for an 8 slab devaddr, but results still need to be replicated and bracketed.

To sanity-check my thought process here: despite some noise upon starting each batch, once each batch's Joins have all settled, I take that to mean the devaddr 8 slab + MIC disambiguation resolution has absorbed the pool of new devices.

The last 100 devices-- of which only about half were able to Join-- took dozens of attempts to Join. This is likely due to our own traffic interfering with itself while abusing gateway-rs and the artificial nature of this simulation.

This latest run used the following configuration:

mfalkvidd commented 2 years ago

Very interesting work, well done!

Did you record CPU usage data for the router? I'm curious whether the bottleneck is CPU power (so larger reuse would be possible by adding more CPU cores, or faster cores if the bottleneck is single-threaded), or if the bottleneck is caused by collisions (more than one device key matches the MIC, resulting in ambiguity that cannot be resolved).
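The collision side of that question can be put in rough numbers. The sketch below assumes the 2250 devices are spread evenly over the 8 devaddrs and that the 4-byte LoRaWAN MIC behaves like a uniformly random 32-bit value for a wrong key; both are simplifying assumptions, and the function names are made up for illustration.

```python
MIC_BITS = 32  # the LoRaWAN MIC is 4 bytes


def devices_per_devaddr(devices: int, devaddrs: int) -> float:
    """Average number of devices sharing one devaddr (even spread assumed)."""
    return devices / devaddrs


def expected_false_matches(candidates: float, mic_bits: int = MIC_BITS) -> float:
    """Expected number of wrong keys whose MIC matches a given uplink by chance.

    Every candidate except the true sender is a "wrong" key, and each one
    is assumed to match with probability 2**-mic_bits.
    """
    return max(candidates - 1, 0) / 2 ** mic_bits


candidates = devices_per_devaddr(2250, 8)
print(f"candidate keys per uplink: {candidates}")
print(f"expected chance MIC matches per uplink: {expected_false_matches(candidates):.2e}")
```

Under these assumptions each uplink has roughly 281 candidate keys to try, while a chance MIC match stays below one in ten million per uplink. This is a back-of-the-envelope estimate only, not a measurement from these runs.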

2250 is a much larger amount than I've seen discussed on other LoRaWAN networks. I don't think anyone has performed as extensive testing, so the result is valuable regardless of where the bottleneck is. Big thanks for doing this work.

dpezely commented 2 years ago

@mfalkvidd Thank you! CPU and Erlang stats will be collected and shared from subsequent runs, but neither was a limiting factor thus far. That run was still preliminary: we had previously overlooked local configuration details on router-dev, so it was just to confirm that we had resolved the earlier issues.

For benchmarks thus far, this is only measuring capacity of the 8 slab of devaddrs (rather than CPU+RAM consumption, but I'll post those next time as well).

Once our new Message Integrity Check gets merged, we get to do it all again. See payload_mic() in https://github.com/helium/erlang-lorawan/blob/initial-lib/src/lora_core.erl for the new implementation.

In addition to the 8 slab, traffic congestion was a limiting factor. Next runs will increase seconds between transmits per virtual device to 30 seconds and may have to increase from there. virtual-lorawan-device sends UDP datagrams to gateway-rs which provides a good simulation of physical radios (e.g., dropped packets versus RF noise), so this makes for a different kind of Load & Capacity test situation than HTTP requests.
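To see why the transmit interval matters for congestion, here is a rough offered-load calculation. It assumes every device sends exactly one uplink per interval and ignores Joins, retries, and downlinks; the function name is hypothetical.

```python
def uplinks_per_second(devices: int, seconds_between_transmits: float) -> float:
    """Aggregate uplink rate if each device sends once per interval."""
    if seconds_between_transmits <= 0:
        raise ValueError("seconds_between_transmits must be positive")
    return devices / seconds_between_transmits


# 2250 devices at the intervals mentioned in this thread.
for interval in (5, 10, 30):
    rate = uplinks_per_second(2250, interval)
    print(f"{interval:>2} s between transmits: {rate:.0f} uplinks/s")
```

Going from 5 seconds to 30 seconds between transmits cuts the aggregate rate by a factor of six for the same device count.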

The goal is to get a more precise number than "2250" on each run, and then show each set of numbers from 3 consecutive runs-- ideally within a small margin of error.
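One possible way to report three consecutive runs is the mean plus half the observed range. The sketch below uses placeholder numbers, not results from these runs, and the function name is hypothetical.

```python
from statistics import mean


def summarize_runs(limits: list[int]) -> tuple[float, float, float]:
    """Return (mean, half_range, relative_margin) for per-run device limits."""
    centre = mean(limits)
    half_range = (max(limits) - min(limits)) / 2
    return centre, half_range, half_range / centre


# Placeholder values for illustration only; NOT measured results.
centre, half_range, relative = summarize_runs([2200, 2250, 2300])
print(f"{centre:.0f} devices, +/- {half_range:.0f} ({relative:.1%})")
```

With only three runs, half the range is a blunt but honest measure of spread; more runs would justify a standard deviation instead.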

More runs are planned for today involving 12 hotspots (instances of gateway-rs running in AWS).

dpezely commented 2 years ago

Status update and highlights of corrections:

dpezely commented 2 years ago

Update after several successful runs: