locustio / locust

Write scalable load tests in plain Python 🚗💨
https://locust.cloud
MIT License
25.06k stars · 3k forks

Hatch rate in distributed mode spawns users in batches equal to number of slaves #896

Closed · tortila closed 3 years ago

tortila commented 6 years ago

Description of issue / feature request and actual behavior

It looks like hatch rate behavior depends heavily on the number of slaves in Locust's distributed mode.

As an example: I'm running Locust in distributed mode with a master node and 10 slave nodes. I set the test execution to spawn 100 users with a hatch rate of 1. It seems that instead of spawning 1 user per second, 10 users (1 on each slave) are spawned at once in batches.

screen shot 2018-10-01 at 14 25 10

If I add 5 more slaves (15 slave nodes in total) and start a new test with the same values (100 users with a hatch rate of 1), users are now spawned in batches of 15:

screen shot 2018-10-01 at 14 38 49

Expected behavior

I would expect hatch rate to behave independently of the number of slaves. In the example above, I would expect a smooth increase of 1 user every second.

Environment settings (for bug reports)

Steps to reproduce (for bug reports)

As described above

heyman commented 5 years ago

Yes, your description matches the current implementation: the slave nodes are unaware of each other, and each will get an instruction to launch X users at a hatch rate of Y.

This should only be a potential issue if you have a very low hatch rate (lower than the number of slave nodes), which I don't think is very common.

It could be fixed, but it would add quite a bit of extra complexity, which I currently don't think is justified.
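
To make the batching concrete, here is a minimal sketch (illustrative only, not Locust's actual code) assuming the master splits both the user count and the hatch rate evenly across slaves, and each slave runs an independent "spawn one user, then sleep 1/rate seconds" loop:

```python
# Simplified model of per-slave spawning in distributed mode.
# Assumption: each slave independently hatches at hatch_rate / num_slaves.

def spawn_times(total_users=100, hatch_rate=1.0, num_slaves=10):
    per_slave_users = total_users // num_slaves   # 10 users per slave
    per_slave_rate = hatch_rate / num_slaves      # 0.1 users/sec per slave
    sleep_between_spawns = 1.0 / per_slave_rate   # 10 seconds between spawns

    times = []
    for slave in range(num_slaves):
        for i in range(per_slave_users):
            # every slave spawns its i-th user at the same moment
            times.append(i * sleep_between_spawns)
    return sorted(times)

if __name__ == "__main__":
    ts = spawn_times()
    for t in sorted(set(ts)):
        print(f"t={t:>5.1f}s -> {ts.count(t)} users spawned")
```

With 100 users, a hatch rate of 1, and 10 slaves, this prints 10 users at t = 0 s, 10 s, 20 s, and so on, which matches the batching shown in the screenshots above.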

tortila commented 5 years ago

@heyman thank you for responding.

This should only be a potential issue if you have a very low hatch rate (lower than the number of slave nodes), which I don't think is very common.

When I filed this issue that was indeed the case: we used to run Locust in setups with 300 slaves. The reason was that we aimed for a very large scale and wanted to ramp up slowly, ideally without changing the number of slaves on the fly, as that was very problematic (but that's another story). With this setup, the smallest possible number of users spawned at once was 300, which was not small enough, since 300 users already generated a significant amount of load. To sum up, this feature matters for a narrow use case, but I think it's still important to guarantee a smooth and gradual ramp-up. On top of that, I also see it as surprising and unintuitive behaviour, so if it won't be fixed, it at least deserves proper documentation.

Maybe you can also take a look at https://github.com/locustio/locust/issues/724, as the issue described there is related to how users are distributed between slaves.

heyman commented 5 years ago

The reason behind it was that we aimed for a very large scale, and wanted to ramp-up slowly

Ah, that's a use case I hadn't considered, and I guess it might not be too uncommon. Depending on the implementation, maybe it could be worth fixing after all. And I agree that if we don't fix it, or until we do, the documentation should have a note about it.

heyman commented 5 years ago

Documentation updated in d6d87b4b3cb29b9cec1190bed7586f8b87bb492b

max-rocket-internet commented 5 years ago

we used to run Locust in setups with 300 slaves

We are also doing this. We run on k8s, and it's more cost-effective to scale out with many smaller slaves as opposed to fewer larger slaves.

The current implementation is that each slave simply receives a client count and hatch rate equal to the total client count and hatch rate divided by the number of connected slaves.
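
As a rough illustration of that split (a hedged sketch with hypothetical names, not taken from Locust's source), the division with remainder handling could look like this:

```python
# Hypothetical even split of clients and hatch rate across slaves.

def split_work(total_clients, total_hatch_rate, num_slaves):
    base_clients = total_clients // num_slaves
    remainder = total_clients % num_slaves
    per_slave_hatch_rate = total_hatch_rate / num_slaves

    plans = []
    for i in range(num_slaves):
        # spread the remainder one extra client at a time
        clients = base_clients + (1 if i < remainder else 0)
        plans.append({"num_clients": clients, "hatch_rate": per_slave_hatch_rate})
    return plans

# e.g. split_work(100, 1.0, 15) gives each slave 6-7 clients and a hatch
# rate of ~0.067/s, so each slave spawns one user roughly every 15 seconds.
```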

There are quite a few issues that would be resolved by allowing the Locust master to have much tighter control over the number of users running on slaves. For example, it would enable autoscaling slaves (https://github.com/locustio/locust/issues/1100 https://github.com/locustio/locust/issues/1066 https://github.com/karol-brejna-i/locust-experiments/issues/13) and custom load patterns (https://github.com/locustio/locust/issues/1001)

heyman commented 5 years ago

There are quite a few issues that would be resolved by allowing the Locust master to have much tighter control over the number of users running on slaves.

I'm not opposed to fixing this if we can come up with a good implementation. Here's an idea off the top of my head:

Like I said, it's off the top of my head, and there might be problems with it that I haven't thought of, or there might be better ways to implement it.

Thoughts?

max-rocket-internet commented 5 years ago

That sounds like a good start!

It would be great if the master periodically ran the calculation for the current number of connected slaves and then sent the messages out. Then the number of slaves could be more dynamic, i.e. it could autoscale.

It would also be great if a plan function could be provided to Locust for advanced users who want to replicate traffic shapes that go up and down at specific rates. For example, we are interested in reproducing a shape like our live environment's:

Screen Shot 2019-10-24 at 15 40 48

It would also solve https://github.com/locustio/locust/issues/974
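
As a sketch of what such a plan function might look like (purely hypothetical, not an existing Locust API at the time of this discussion), it could be a function mapping elapsed time to a target user count that the master evaluates periodically:

```python
# Hypothetical user-supplied plan: elapsed seconds -> desired total users.

def plan(elapsed_seconds):
    """Traffic-like shape: ramp up, plateau, ramp down."""
    if elapsed_seconds < 600:                 # 10 min ramp-up
        return int(elapsed_seconds)           # +1 user per second
    if elapsed_seconds < 3000:                # 40 min plateau
        return 600
    if elapsed_seconds < 3600:                # 10 min ramp-down
        return int(3600 - elapsed_seconds)
    return 0
```

For context, later Locust releases introduced a similar concept as `LoadTestShape`, whose `tick()` method returns the desired user count and spawn rate over time.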

heyman commented 5 years ago

It would be great if the master periodically ran the calculation for the current number of connected slaves and then sent the messages out.

Yes, this could be done every time a new slave node connects or disconnects, if the tests are running. (Maybe with some kind of delay, just to let more nodes connect in case many are started at the same time, to avoid rebalancing multiple times in quick succession.)

It would also be great if a plan function could be provided to Locust for advanced users

Good idea.
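
A rough sketch of the rebalancing idea discussed above (hypothetical code, not Locust's implementation): the master recomputes per-slave targets whenever a slave connects or disconnects, with a short debounce delay so that many slaves joining at once trigger only one rebalance.

```python
import threading

class Master:
    """Hypothetical master that rebalances whenever slaves join or leave."""

    def __init__(self, total_users, rebalance_delay=5.0):
        self.total_users = total_users
        self.rebalance_delay = rebalance_delay
        self.slaves = set()
        self._timer = None

    def on_slave_connect(self, slave_id):
        self.slaves.add(slave_id)
        self._schedule_rebalance()

    def on_slave_disconnect(self, slave_id):
        self.slaves.discard(slave_id)
        self._schedule_rebalance()

    def _schedule_rebalance(self):
        # Debounce: restart the timer on every change so that several
        # slaves connecting at once trigger only one rebalance.
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(self.rebalance_delay, self._rebalance)
        self._timer.start()

    def _rebalance(self):
        if not self.slaves:
            return
        base, remainder = divmod(self.total_users, len(self.slaves))
        for i, slave_id in enumerate(sorted(self.slaves)):
            target = base + (1 if i < remainder else 0)
            self._send_spawn_message(slave_id, target)

    def _send_spawn_message(self, slave_id, target):
        # Placeholder for a "run exactly `target` users" message to the slave.
        print("%s: run %d users" % (slave_id, target))
```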

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] commented 3 years ago

This issue was closed because it has been stalled for 10 days with no activity.

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 10 days.

mboutet commented 3 years ago

/remove-lifecycle stale