nornir-automation / nornir

Pluggable multi-threaded framework with inventory management to help operate collections of devices
https://nornir.readthedocs.io/
Apache License 2.0

Nornir scalability #264

Closed: dmfigol closed this issue 5 years ago

dmfigol commented 6 years ago

Has anyone run tests on a big network (1000+ devices)? If yes, do you have any numbers?

While threads scale better than the processes used by most DSL automation tools (Ansible, Salt), they may still become a bottleneck once the device count gets large enough, especially considering the Python GIL. Users with big networks may want to scale their nornir runners horizontally across several servers, or at least distribute the load across several CPU cores.

I am not saying we should implement this somehow, but I encourage you to think about whether it even makes sense.

My questions are:

1. Should we implement nornir runners across several CPU cores (using the multiprocessing library)? If yes, what would be the design?
2. Should we implement nornir runners across different machines? If yes, what would be the design?
3. ~~Should we rewrite Nornir in Go?~~ :trollface:

ktbyers commented 6 years ago

Step 1 is to gather a lot more data to see whether there is even a problem.

I have some data from Netmiko using threads against on the order of 1200 devices, and it indicated that threads weren't a problem. But it was a small amount of data, and I don't have good data for the >5000-device category.

dbarrosop commented 6 years ago

The GIL is mostly relevant during CPU-bound operations; waiting on the network or disk shouldn't be a problem. If we were to implement (1) or (2), I think we should do it with a wrapper around nornir, something like nornir-multiprocess. The easiest design would be a library where users shard the workload themselves, something like:

```python
import multiprocessing

nr1 = nr.filter(F(...))
nr2 = nr.filter(F(...))
nr3 = nr.filter(F(...))
nr4 = nr.filter(F(...))

nornir_multicore.run(nr1, nr2, nr3, nr4, workers=multiprocessing.cpu_count())
```

Delegating the responsibility of sharding to the user is important, as you might want to ensure certain groups of devices are operated together. For instance, each process could manage a single datacenter or POD and revert immediately if a device fails, without having to wait for all the other processes to terminate or coordinating between them; see the sketch below.
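For illustration, here is a hypothetical expansion of that pattern: one worker process per datacenter, each building its own nornir instance so no nornir objects ever get pickled. The pool wrapper, site names, config file, and `my_task` are made up for the sketch; only `InitNornir`, `F`, and `run` are real nornir APIs.

```python
import multiprocessing

from nornir import InitNornir
from nornir.core.filter import F

def my_task(task):
    # Placeholder task body; replace with real work.
    return f"ran on {task.host.name}"

def run_site(site):
    # Each process builds its own nornir instance from the config file,
    # so nothing nornir-related crosses the process boundary.
    nr = InitNornir(config_file="config.yaml")
    result = nr.filter(F(groups__contains=site)).run(task=my_task)
    return site, result.failed

if __name__ == "__main__":
    sites = ["dc1", "dc2", "dc3", "dc4"]  # one shard per datacenter/POD
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        for site, failed in pool.imap_unordered(run_site, sites):
            print(f"{site}: {'FAILED' if failed else 'ok'}")
```

Because each shard is independent, a site that fails can be reverted immediately without waiting for, or coordinating with, the other processes.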

A similar wrapper could be implemented for celery, although I am not a fan of it and I'd rather stay far away from it. As a pattern, I'd rather implement a small HTTP service that exposes my nornir code and use that to trigger/coordinate remote jobs; a sketch follows below. Alternatively, coordination could also be achieved with a queue or a pub/sub service. Anything that doesn't involve pickling objects and that uses a stable interface.
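As a rough illustration of that HTTP-service pattern: each worker machine runs a small service over its shard of the inventory, and a coordinator triggers jobs with plain HTTP. The `/run` endpoint, payload shape, and `run_job` body are assumptions for the sketch, not anything nornir ships.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def run_job(site):
    # Placeholder: filter the local inventory to `site`, run nornir tasks,
    # and return a JSON-serializable summary.
    return {"site": site, "failed": []}

class JobHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/run":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(run_job(payload.get("site"))).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), JobHandler).serve_forever()
```

Only JSON crosses the wire, so there is no pickling and the interface stays stable regardless of nornir internals.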

dbarrosop commented 6 years ago

Pinging @ogenstad as I think he might have done some scalability tests.

dmfigol commented 6 years ago

Interesting, thanks David. So if I understand correctly, you feel this would live outside the main library, in some other package, correct?

> The GIL is mostly relevant during CPU-bound operations; waiting on the network or disk shouldn't be a problem.

AFAIK, while the GIL is released during I/O, Python threads still run on one core.

dbarrosop commented 6 years ago

> you feel this would live outside the main library, in some other package, correct?

Yeah, I think so, at least for the multiple-hosts scenario, as different people will prefer different mechanisms. For instance, I'd be totally against celery, while I know other folks would rather use it because they already rely on it for other use cases. For the multiple-cores scenario, we first have to decide if it's worth doing/needed and then discuss it in more detail.

> AFAIK, while the GIL is released during I/O, Python threads still run on one core.

My understanding is that CPython uses OS threads, which can be scheduled across all the available CPUs; however, the GIL prevents two threads from executing Python bytecode at the same time. In our particular case, I think this means we will see very little difference between N threads and 4 processes with N/4 threads each. We should be able to do some naive experiments with containers providing a single HTTP endpoint and adding an artificial delay to the response to simulate latency, i.e. a container that does a random time.sleep(n) and returns "ok". Something like the sketch below:
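A naive version of that benchmark might look like the following. It assumes a fake-device endpoint at localhost:8000 that sleeps a random amount and returns "ok"; the URL and counts are placeholders.

```python
import multiprocessing
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URL = "http://localhost:8000/"  # fake device: random time.sleep(n), returns "ok"

def fetch(_):
    urlopen(URL).read()

def run_threads(n):
    # n concurrent "device connections" in a single process
    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(fetch, range(n)))

if __name__ == "__main__":
    start = time.monotonic()
    run_threads(400)  # N threads, one process
    print("1 process, 400 threads:", time.monotonic() - start)

    start = time.monotonic()
    with multiprocessing.Pool(4) as pool:  # 4 processes, N/4 threads each
        pool.map(run_threads, [100] * 4)
    print("4 processes, 100 threads each:", time.monotonic() - start)
```

If the workload really is I/O-bound, the two timings should come out very close.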

lykinsbd commented 6 years ago

I'd be happy to provide some data on running against large inventories if needed. I'm currently testing Nornir against about 1500 Cisco ASAs, with no problems seen in the current architecture. Most of my problems are SSH-specific ones involving our proxy hosts, or timeouts to devices (going from TX, USA to Hong Kong, with the latency involved, etc.).

I haven't seen any specific structural scaling problems with Nornir yet, but I haven't pointed it at our entire fleet (~20,000 firewalls) for any sort of job, as I wanted to vet it more fully first (and wait for 2.0).

However, I can run some benchmarks on my test subset if you would like.

dbarrosop commented 6 years ago

That's awesome to hear. I am planning to release a beta of nornir 2.0 to pip this weekend. It should be fairly stable, as it's been tested quite a bit; the only reason I am flagging it as beta is that I want to improve the documentation and update some of the old tutorials before an official release. There are also a couple of features we want to add before releasing, but those are additions that shouldn't break any existing code.

Back to your offer: it'd be nice if we could work together to gather some data, and I'd love to see a blog post or something detailing your use case, scale, and results, if that's something you can do :D

lykinsbd commented 6 years ago

I've worked on a draft blog post a few times but never hit publish. It covers our use case, the specific problems I'm solving for, etc.

I'll put a little more work into it this coming week and send you the link when it's done.

dbarrosop commented 6 years ago

Awesome, thanks!

ktbyers commented 6 years ago

@lykinsbd Yes, this kind of data from runs against lots of devices is very helpful.

ogenstad commented 6 years ago

I've used Nornir against 3000+ devices. Unfortunately, I don't have any numbers to share, and I don't work with that client any longer, so I don't have access to that network to try anything.

But in my opinion it worked nicely. If I remember correctly, it took about 2-3 minutes to run the program I wrote, probably with 100 workers.

At another customer I ran into some issues and had to lower the number of workers: their TACACS server didn't much appreciate having all the authentication requests arrive at once.
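For reference, the worker count is a single knob in the configuration; a hedged sketch assuming the nornir 2.x layout (the config file name is a placeholder):

```python
from nornir import InitNornir

# Lowering num_workers throttles concurrent logins so the TACACS/AAA
# server is not hit by every authentication request at once.
nr = InitNornir(
    config_file="config.yaml",
    core={"num_workers": 20},  # e.g. down from 100
)
```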

I'm working in a network with more devices now, but I don't work directly with the boxes. I might be able to test something in the future.

ktbyers commented 6 years ago

One technique I have had to use with Netmiko in threaded contexts (to overcome the AAA-overload issue) is to add a small sleep offset between each thread's start, so that the threads don't all begin simultaneously.

This has generally helped meaningfully with AAA auth issues; a sketch follows below.
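A minimal sketch of that staggered-start technique (host names and the 0.2s offset are placeholders): each thread sleeps index * offset before connecting, so logins reach the AAA server spread out rather than all at once.

```python
import threading
import time

def connect(host, index, offset=0.2):
    time.sleep(index * offset)  # stagger the login burst
    # ... open the SSH session / authenticate / run the task for `host` ...
    print(f"{host}: started after {index * offset:.1f}s delay")

hosts = [f"sw{i}" for i in range(1, 6)]
threads = [threading.Thread(target=connect, args=(h, i)) for i, h in enumerate(hosts)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```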

xavierhardy commented 5 years ago

What about gevent? Coroutines/green threads with non-blocking I/O are known to be a good fit for network-I/O-intensive tasks (https://hackernoon.com/asynchronous-python-45df84b82434), and they are much lighter-weight than threads or processes in terms of memory. In theory, nothing prevents you from combining green threads with multiple processes (which would let you use all of your CPU cores), but there seems to be a bug in gevent: https://github.com/gevent/gevent/issues/993.
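For concreteness, a minimal gevent sketch of that approach (host names and URLs are placeholders): monkey-patch the stdlib so blocking socket calls yield to the event loop, then spawn one cheap greenlet per device.

```python
from gevent import monkey
monkey.patch_all()  # must run before anything else imports sockets

import gevent
from urllib.request import urlopen  # now cooperatively non-blocking

def poll(host):
    return host, urlopen(f"http://{host}/status").read()

hosts = [f"device{i}.example.com" for i in range(1000)]
jobs = [gevent.spawn(poll, host) for host in hosts]  # one greenlet per device
gevent.joinall(jobs, timeout=60)
for job in jobs:
    if job.successful():
        print(job.value[0], "ok")
```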

dbarrosop commented 5 years ago

Honest question: if your project is py3.6+, why would you use gevent instead of asyncio? Or was gevent just an example of coroutines?

Currently there are no reported scalability issues; this issue is just speculative, so at this point we don't have a problem to address. What I mean is that without an actual issue it's hard to pick the right solution :) In any case, I agree coroutines are a good fit for I/O-bound tasks, but at that point I'd rather evaluate Python's asyncio, as it also uses coroutines under the hood and is part of the stdlib; see the sketch below. I also don't think we should change the current model, but extend it instead.
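A rough stdlib-only asyncio equivalent of the same idea (Python 3.7+; the per-device coroutine body is a placeholder): coroutines with a semaphore playing the role of the worker pool.

```python
import asyncio

async def run_on_device(host, sem):
    async with sem:  # cap concurrency, like num_workers
        await asyncio.sleep(0.5)  # stand-in for the real SSH/API I/O
        return f"{host}: ok"

async def main(hosts, workers=100):
    sem = asyncio.Semaphore(workers)
    for line in await asyncio.gather(*(run_on_device(h, sem) for h in hosts)):
        print(line)

asyncio.run(main([f"device{i}" for i in range(1000)]))
```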

xavierhardy commented 5 years ago

This project advertises compatibility with Python 2.7 (in setup.py), so gevent was a valid option. As for making a case for gevent on Python 3.6+, I would argue that gevent's monkey patching lets you write async code with little to no modification, as long as you don't use any C extensions. However, I consider monkey patching a workaround and an anti-pattern. The long-term solution is to rewrite or extend the code to support coroutines with async/await (especially considering that Python 2.7 is deprecated and will stop being maintained in a year). Note that nothing prevents the use of C extensions with coroutines, as long as those extensions allow non-blocking I/O.

dbarrosop commented 5 years ago

I am closing this one as I don't think there is anything to do here; there is a dedicated issue to discuss async: #175