locustio / locust

Write scalable load tests in plain Python 🚗💨
https://locust.cloud
MIT License

OOM error with master/slaves setup (zeromq, windows) #1372

Closed. mparent closed this issue 3 years ago

mparent commented 4 years ago

Hi!

Describe the bug

An out-of-memory error occurs when ZeroMQ tries to allocate an absurd amount of memory in decoder_allocators, sometimes up to several petabytes. This might very well be a ZeroMQ bug: OUT OF MEMORY (bundled\zeromq\src\decoder_allocators.cpp:89)

I added some logs and recompiled pyzmq to check what's going on. Upon further investigation, _max_counters seems to take on a nonsensical value at some point. See zmq_logs.txt. As you can see, allocator instance 0x0000016A9270F700 is constructed with _max_counters=249, but before the crash its value has changed to 1557249601288, which causes a malloc of several terabytes.

Steps to reproduce

Sorry, I couldn't find a surefire way to reproduce this one. It seems kind of random. It sometimes happens before the test is even started, sometimes when the test is stopped, and sometimes it doesn't happen at all. It does seem to happen more often when stopping a test in the web UI. Simply run the attached ps1 and do some stuff in the web UI.
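For reference, the setup is nothing fancy: one master plus a few worker processes on the same machine, driving a trivial locustfile. A minimal sketch of that kind of setup (written against the Locust 1.x API, with placeholder host and file names, not the exact contents of the attached files):

```python
# locustfile.py - placeholder sketch of a trivial distributed test (Locust 1.x API)
from locust import HttpUser, task, between

class QuickUser(HttpUser):
    wait_time = between(1, 2)
    host = "http://localhost:8080"  # placeholder target

    @task
    def index(self):
        self.client.get("/")

# Launched roughly as (master plus a few workers, all on one Windows box):
#   locust -f locustfile.py --master
#   locust -f locustfile.py --worker    # repeated once per worker process
```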

Environment

I managed to reproduce the bug on two computers: my work computer and my personal computer. Both run Windows 10 with the Python 3.6 that comes with VS2017, but my personal computer has a pristine Python environment; I just ran pip install locustio.

Am I doing something I'm not supposed to?

heyman commented 4 years ago

Interesting. You're not doing anything wrong AFAICT. I suspect this is a Windows-related issue. Is it possible for you to test whether you can reproduce this on a Linux or Mac machine?

mparent commented 4 years ago

Sure thing! I'll try on WSL, if that works for you. Worst case, I can set up a Linux VM.

mparent commented 4 years ago

Ok, I tried several times on Ubuntu 18.04 LTS on WSL, running it the exact same way with PowerShell Core, and I couldn't reproduce the issue.
I can't say for sure that it won't happen on Linux, considering how inconsistent this bug is, but at the very least I can safely say that it is much less likely to happen. Considering it a Windows-related issue does seem plausible.

heyman commented 4 years ago

Ok, good to know! It might be a while before I get the chance to try to reproduce this on a Windows machine. Please keep us updated if you test anything else (e.g. another version of ZeroMQ or Python).

mparent commented 4 years ago

Will do!

anuj-ssharma commented 4 years ago

Tested this on my Windows machine and I can reproduce it (on Locust v1.0.0 and Python 3.8). I used a different locustfile from the one attached above.

However, I couldn't really figure out a pattern to the failures, but there were a few observations:

  1. It always failed for me after all the clients were connected and ready.
  2. The chance of failure increased with the number of workers.
  3. Memory consumption of the Locust workers seemed normal.

Machine specs: Windows 10 Version 1903, Intel(R) Core i7-7700HQ CPU @ 2.80 GHz, 16 GB RAM

cyberw commented 4 years ago

Maybe we should bump the minimum required pyzmq version? Other than that, I don't think we can do much without a clear repro case. @anuj-ssharma @mparent can you check your pyzmq versions?
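For reference, a quick way to check both the binding version and the bundled native library version:

```python
# Print the installed pyzmq (bindings) and libzmq (native library) versions
import zmq

print(zmq.pyzmq_version())  # version of the Python bindings
print(zmq.zmq_version())    # version of the underlying libzmq
```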

mparent commented 4 years ago

@cyberw 19.0.1 for me.

cyberw commented 4 years ago

Ok, that is the latest, so that shouldn't be an issue...

cyberw commented 4 years ago

I don't have any real ideas on how to solve this, and I hardly use Windows at all these days.

If any of you have the time to do some more digging and find a fix, it would be much appreciated (unfortunately, it is unlikely anyone else will fix it for you :-/ )

mparent commented 4 years ago

It's fine, our actual locusts are run on Linux anyway. I'll simply keep working on WSL locally and try to find a fix if I have some time.

bebeo92 commented 4 years ago

@cyberw I also faced this issue when running on Windows.

cyberw commented 4 years ago

@bebeo92 Have you tried updating to the latest pyzmq? Can you find any pattern to when it works and when it doesn't?

With no more details there is nothing we can do, sorry... (and with so few of our users running on Windows, I don't think this issue will get a lot of attention)

Perhaps file an issue with pyzmq?

bebeo92 commented 4 years ago

@cyberw I think it happens when I click the Stop button. Is that expected behaviour? (screenshot attached)

cyberw commented 4 years ago

It should work. Sorry, I don't think I can help you...

bebeo92 commented 4 years ago

@cyberw I think it is still a valid bug; can you contact someone else to verify it?

cyberw commented 4 years ago

I agree, but there is really nobody to contact. This is a project maintained by volunteers.

cyberw commented 4 years ago

Like I said, you may have more luck talking to the maintainers of pyzmq itself.

RichardLions commented 3 years ago

@cyberw this is happening on the project I am working on. When running Locust on a Windows machine in headless mode with several workers (all on the same machine), there is a high chance the master will assert. The chance increases with the number of workers spawned. Note: the assert only starts triggering with 3 or more workers.

The assert always triggers after the master has sent a message to all workers: either at the start, when sending the spawn message, or at the end, when telling them to quit.

The assert: warning: FATAL ERROR: OUT OF MEMORY (C:\projects\libzmq\src\decoder_allocators.cpp:85)

https://pyzmq.readthedocs.io/en/latest/morethanbindings.html#thread-safety

The pyzmq docs mention that C-level crashes can be encountered when calling into the same socket from multiple threads. Looking at the Locust setup, it appears to be using greenlets. The same socket could be called into multiple times, but it would be on the same thread. I am not experienced with Python (I only started using it to get Locust set up), so I am unsure whether this could be causing the issue.
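For illustration, the pattern I mean looks roughly like this: several greenlets sending on one socket through pyzmq's gevent-compatible bindings, all on a single OS thread (a hypothetical sketch, not Locust's actual code):

```python
# Hypothetical sketch (not Locust's actual code): several greenlets sharing
# one PUSH socket via pyzmq's gevent-aware "green" bindings.
import gevent
import zmq.green as zmq  # gevent-compatible wrapper around pyzmq

ctx = zmq.Context()
push = ctx.socket(zmq.PUSH)
push.bind("tcp://127.0.0.1:5557")

pull = ctx.socket(zmq.PULL)
pull.connect("tcp://127.0.0.1:5557")

def sender(i):
    # All greenlets run on the same OS thread, so access to the shared socket
    # is cooperative rather than truly concurrent.
    for n in range(3):
        push.send_string(f"greenlet {i} message {n}")
        gevent.sleep(0)  # yield to the other greenlets

gevent.joinall([gevent.spawn(sender, i) for i in range(5)])

for _ in range(15):
    print(pull.recv_string())
```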

Versions: Python 3.8.10, Locust 1.6.0, pyzmq 22.1.0

Do you have any advice on tracking down what could be triggering this issue?

cyberw commented 3 years ago

Hi! Sorry, I have nothing to add here. You probably already know a lot more than me :)

Matthew--Townsend commented 3 years ago

I also have this issue.

FATAL ERROR: OUT OF MEMORY (C:\projects\libzmq\src\decoder_allocators.cpp:85)

I get it a majority of the time when just creating my master and worker nodes. I'd say 3 out of 4 attempts fail. If that part passes, it sometimes fails with the same error after I start my load test.

Versions: Python 3.9.5, Locust 1.5.3, pyzmq 22.1.0. OS: Windows 10, Windows Server 2016. Free memory at time of failure: 7.3 GB of 16 GB on my Windows 10 box.

I will check out pyzmq, but I wanted to post here for the sake of visibility (i.e., it isn't just a few people getting this error; when I talked with the guy who recommended Locust, he said, "Oh yeah, it does that all the time. I just keep trying until it works."). Personally, I'd rather fix it, so I'll see if the folks at pyzmq have this on their radar already.

Thanks.

cyberw commented 3 years ago

Is there a ticket on pyzmq? If so, maybe link it here.

Matthew--Townsend commented 3 years ago

It is failing in libzmq when trying to allocate the memory needed. I have seen this before when there is available system memory, but it is fragmented (thus not enough available in one contiguous block for the requested size). There is an issue open on pyzmq currently (https://github.com/zeromq/pyzmq/issues/1555), but it was also opened by @RichardLions and has no replies from anyone else who may have seen this. On my end, I'll need to investigate with a memory profiler to see what is filling up (or fragmenting) the available memory. I'll see how much time my project owner will let me spend debugging this and post back here if I find anything. It could be as simple as us somehow creating a small memory leak in our Python code. I usually write in C#, so I am not sure whether that is a common occurrence in Python, but it seems like a possibility.
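For a first pass I'll probably start with the standard-library tracemalloc, something along these lines (a minimal sketch of the instrumentation, and note it only sees Python-level allocations, not libzmq's own):

```python
# Minimal tracemalloc sketch (hypothetical instrumentation, not Locust code):
# take periodic snapshots to see whether Python-level allocations keep growing.
import time
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

for _ in range(5):
    time.sleep(10)  # let the load test run between snapshots
    snapshot = tracemalloc.take_snapshot()
    # Show the ten call sites whose allocations grew the most since baseline
    for stat in snapshot.compare_to(baseline, "lineno")[:10]:
        print(stat)
```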

cyberw commented 3 years ago

Memory leaks are not really a common occurrence, no, and since other people have encountered this issue, it is pretty likely there is a real bug here. Good luck, and let us know if you find the issue or a workaround!

cyberw commented 3 years ago

One possibility is of course that Locust is (for some reason) attempting to send a very (very) big message, and that exceeds some limit on Windows.

Matthew--Townsend commented 3 years ago

@RichardLions Good job updating the other bug (zeromq/pyzmq#1555) and finding a possible cause. Silly me, I thought the error (out of memory) might have something to do with running out of memory. :) I did try running some Python memory profilers, but all I saw was a very flat memory allocation over time and nothing alarming.

Matthew--Townsend commented 3 years ago

@RichardLions has a pull request that fixes this in the pyzmq project. I applied it manually, tested it, and it works. See zeromq/pyzmq#1555

Pull Request: zeromq/pyzmq#1560

cyberw commented 3 years ago

Let's close this when there is a new release of pyzmq that includes the fix and we have bumped the dependency in Locust.

RichardLions commented 3 years ago

pyzmq 22.2.1 has been released containing the fix for this issue.

https://github.com/zeromq/pyzmq/releases/tag/v22.2.1
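For Locust, that presumably just means raising the dependency floor, something like the following (a sketch of the kind of change, not the actual setup.py):

```python
# setup.py sketch (hypothetical excerpt, not the real Locust setup.py):
# raise the minimum pyzmq so new installs pick up the fixed release.
from setuptools import setup

setup(
    name="locust",
    install_requires=[
        "pyzmq>=22.2.1",  # first release containing the decoder_allocators fix
        # ...remaining dependencies unchanged (placeholder)...
    ],
)
```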

cyberw commented 3 years ago

Thanks @RichardLions !