Closed. v00lk closed this issue 3 years ago.
This seems like a performance issue. Are you seeing high CPU utilisation when you run --create? Either way, it seems like pyroute2 is struggling to apply its changes. There are several suggestions in https://github.com/svinota/pyroute2/issues/304, e.g. increasing the SYNC_TIMEOUT or using IPDB(nl_async="process"). But I think the best one would be to upgrade to a higher-spec system.
I changed SYNC_TIMEOUT to 30 and it seems to be working. I run the topology on a pretty powerful server (32-core E5-2660 @ 2.20GHz) and CPU load is low. Please give me some time to check other parameters like I/O and memory. Thank you!
The previous error is gone! Thank you for the solution! By the way, have you ever tested this at scale? I added 3 more devices (I have about 20 devices, fully meshed, now) and got this error:
ERROR:pyroute2.netns.nslink:forced shutdown procedure, clean up netns manually
Traceback (most recent call last):
File "/home/v0lk/testdir/lib/python3.6/site-packages/pyroute2/netns/nslink.py", line 170, in __init__
super(NetNS, self).__init__(trnsp_in, trnsp_out)
File "/home/v0lk/testdir/lib/python3.6/site-packages/pyroute2/iproute/linux.py", line 127, in __init__
super(RTNL_API, self).__init__(*argv, **kwarg)
File "/home/v0lk/testdir/lib/python3.6/site-packages/pyroute2/remote/__init__.py", line 190, in __init__
super(RemoteSocket, self).__init__()
File "/home/v0lk/testdir/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 330, in __init__
self._ctrl_read, self._ctrl_write = os.pipe()
OSError: [Errno 24] Too many open files
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/v0lk/testdir/bin/docker-topo", line 843, in <module>
main()
File "/home/v0lk/testdir/bin/docker-topo", line 778, in main
started = [device.start() == 0 for (name, device) in devices.items()]
File "/home/v0lk/testdir/bin/docker-topo", line 778, in <listcomp>
started = [device.start() == 0 for (name, device) in devices.items()]
File "/home/v0lk/testdir/bin/docker-topo", line 357, in start
self._attach()
File "/home/v0lk/testdir/bin/docker-topo", line 374, in _attach
link.connect(self, interface)
File "/home/v0lk/testdir/bin/docker-topo", line 660, in connect
self.network.connect(device, interface)
File "/home/v0lk/testdir/bin/docker-topo", line 596, in connect
with IPDB(nl=NetNS(ns_name)) as ns:
File "/home/v0lk/testdir/lib/python3.6/site-packages/pyroute2/netns/nslink.py", line 172, in __init__
self.close()
File "/home/v0lk/testdir/lib/python3.6/site-packages/pyroute2/netns/nslink.py", line 201, in close
for close in (self.trnsp_in.close,
AttributeError: 'NoneType' object has no attribute 'close'
I think the maximum I have ever tested was 20 devices. The pyroute2 errors are most likely due to the number of links it's trying to create. If you're looking for horizontal scale, I'd recommend k8s-topo, which was created specifically for that.
I did some debugging here but I can't fix it by myself. I would appreciate any suggestions about where to dig in such a scenario.
The OSError: [Errno 24] Too many open files error is fixed with the ulimit -n 16384 command. But after about 31-32 added links docker-topo pegs one CPU at 100% and execution can't proceed any further.
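As an aside, the same limit can also be raised from inside a Python process at startup, without touching the shell. A minimal sketch using only the stdlib resource module (the 16384 value just mirrors the ulimit command above; this is not part of docker-topo):

```python
import resource

# Read the current soft/hard limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# An unprivileged process may raise the soft limit up to, but not
# beyond, the hard limit; handle the "unlimited" sentinel explicitly.
if hard == resource.RLIM_INFINITY:
    target = 16384
else:
    target = min(16384, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))

print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])
```

This only affects the current process and its children, so it would have to run before any namespaces are opened.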
I added some breakpoints and the freeze happens right after this line of code:
https://github.com/networkop/docker-topo/blob/b0d879f7833285c90295b1d0c3cbd17f66c8da7c/bin/docker-topo#L594
I run it on a quite powerful bare-metal server with 32 cores and 64G of RAM, but at that step one of the CPUs (randomly chosen) is 100% loaded and the process occupies 20G of virtual memory (I believe that is the maximum allowed for Python).
My idea was to add multiprocessing at some point, so I tried the following.
First, basic multiprocessing in the create part of the main loop:
import multiprocessing

manager = multiprocessing.Manager()
started = manager.dict()
procs = []
for (name, device) in devices.items():
    print(name)
    proc = multiprocessing.Process(target=Device.start, args=(device,))
    procs.append(proc)
    proc.start()
for proc in procs:
    proc.join()
Unfortunately this didn't help much. All containers are created in a paused state and all links are created, but when the first link is added to a container every worker stops:
DEBUG:__main__: Connecting(LINK) veth-type link to test2_vla-34d3
DEBUG:urllib3.connectionpool:http://localhost:None "POST /v1.35/containers/93ff29b010ca574bd0fd543a76ef9bff768704e48cc0a459e1b029eaaa34abbd/start HTTP/1.1" 204 0
DEBUG:urllib3.connectionpool:http://localhost:None "GET /v1.35/containers/test2_vla-35d3/json HTTP/1.1" 200 None
DEBUG:__main__: PID for container test2_vla-35d3 is 23386
DEBUG:__main__: Sandbox key for container test2_vla-35d3 is 23386
DEBUG:urllib3.connectionpool:http://localhost:None "POST /v1.35/containers/93ff29b010ca574bd0fd543a76ef9bff768704e48cc0a459e1b029eaaa34abbd/pause HTTP/1.1" 204 0
DEBUG:__main__: Attaching container test2_vla-35d3 interface eth1 to its link
DEBUG:__main__: Connecting(LINK) veth-type link to test2_vla-35d3
DEBUG:__main__:Connecting(VETH) test2_net-64-b to test2_vla-35d3
DEBUG:__main__:Create an interface for container test2_vla-35d3: {'ifname': 'eth1', 'stats': None, 'protinfo': None, 'map': None, 'linkinfo': None, 'stats64': None, 'af_spec': None, 'net_ns_fd': 606, 'event': None, '-ports': set(), '+ports': set(), '-vlans': set(), '+vlans': set(), '-ipaddr': set(), '+ipaddr': set()}
Process Process-2:
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "./bin/docker-topo", line 363, in start
self._attach()
File "./bin/docker-topo", line 380, in _attach
link.connect(self, interface)
File "./bin/docker-topo", line 676, in connect
self.network.connect(device, interface)
File "./bin/docker-topo", line 592, in connect
self._create()
File "./bin/docker-topo", line 583, in _create
return self.ipdb.create(ifname=self.sideA, kind='veth', peer=self.sideB).commit()
File "/home/v0lk/.local/lib/python3.6/site-packages/pyroute2/ipdb/interfaces.py", line 652, in commit
raise CreateException()
pyroute2.ipdb.exceptions.CreateException
Probably this is due to the ordering of adding all these links to namespaces and containers.
I also tried a multiprocessing.Pool:
p = multiprocessing.Pool(5)
print(devices.items())
p.map(Device.start, [device for (name, device) in devices.items()])
Probably I did something wrong, because I got:
File "./bin/docker-topo", line 863, in <module>
main()
File "./bin/docker-topo", line 798, in main
p.map(Device.start, [device for (name, device) in devices.items()])
File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
File "/usr/lib/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
put(task)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/usr/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
TypeError: cannot serialize '_io.TextIOWrapper' object
I would appreciate any suggestions on where I should dig. Maybe it is a good idea to open an issue on the pyroute2 project? I can't understand why so many resources are consumed; maybe some files or fds are not closed correctly. I know the k8s solution is more scalable, but the current server has plenty of resources, and besides, k8s is kind of a different universe for a network engineer :) Thank you!
I can't seem to reproduce the same issue on my laptop. I've used the topology file you attached at the beginning and I'm able to start the full topology on my XPS13 (2017) in less than 30 seconds. It feels like some dependency issue (kernel or library).
- Are you still using the same topology?
- Can you confirm which OS and kernel version you're running?
- Try updating pyroute2 to the latest version (0.5.11)
- Try removing the MLAG workaround (lines 595-601) for now
- Try running strace to see if it hints at anything in particular
I changed my topology a little bit since the last run: I added 2 more spines, so it is now 16 superspines and 3 spines, 19 devices. I've attached it; could you please try to run v0lk-topo.txt? What I've tracked with strace and lsof:
1) The app creates multiple pipe fds like this:
pipe2([1027, 1028], O_CLOEXEC) = 0
After that it tries to read from a pipe like this:
read(1017, "\200\3}q\0(X\5\0\0\0stageq\1X\4\0\0\0initq\2X\5\0"..., 216) = 216
It always fails on reading from pipe 1023:
read(1023
2) I also checked lsof -p and cat /proc/<proc number>/limits.
3) I've added some os.close(fd) calls in the code, but this didn't help.
4) pyroute2 is the latest one (0.5.11) and I reinstalled it. Removing the MLAG hack also didn't help.
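For tracking whether descriptors leak over time, the fd count of a process can be sampled directly from /proc instead of re-running lsof. A small Linux-only helper (not part of docker-topo):

```python
import os

def open_fd_count(pid='self'):
    """Return the number of open file descriptors for a process (Linux)."""
    return len(os.listdir('/proc/{}/fd'.format(pid)))

before = open_fd_count()
r, w = os.pipe()          # each pipe costs two descriptors
after = open_fd_count()
print(after - before)
os.close(r)
os.close(w)
```

Calling it periodically from the main loop (or on another process's PID) would show whether the count grows linearly with the number of links, which is what the pipe2() traces above suggest.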
I use Ubuntu 18.04 with the 4.15.0-47-generic kernel.
I also tried to run the topology on another server (mounted in a different lab) and the problem happened there too, with the same interface. It is 16.04 with the 4.4.0-176-generic kernel.
It would be great if you could run the updated topology in your environment.
The good news is that I could reproduce the bug on my laptop. The bad news is that it looks like there are multiple overlaid issues. One of them is that pyroute2 communicates with the IPRoute API via pipes, so every time it needs to connect to a container's namespace it opens 2 pipes/files. So right now we're hitting the worst-case scenario, where just to bring up the interfaces docker-topo would create (number of containers) x (number of interfaces) files. I've done a simple fix for that, which seems to be working; you can find it in https://github.com/networkop/docker-topo/tree/net-attach-fix
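To put numbers on that worst case: for the ~20-device full mesh mentioned earlier, assuming 2 pipe fds per namespace attach (the figure from the paragraph above), the attach step alone gets close to the default 1024-fd limit before counting sockets, logs, and everything else the process holds open:

```python
# Illustration only: 20 containers in a full mesh, so each container
# has 19 interfaces; "2 fds per attach" is the assumption stated above.
containers = 20
interfaces_per_container = containers - 1

links = containers * interfaces_per_container // 2   # distinct veth pairs
attaches = containers * interfaces_per_container     # per-interface attaches
fds = attaches * 2                                   # ~2 pipe fds per attach

print(links, attaches, fds)
```

The quadratic growth in a full mesh is why the problem only shows up past a certain topology size.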
The second issue is that pyroute2 hangs when trying to open more than 1024 files (I've also tried raising the ulimit to 2048). This looks like something more serious. If I'm able to reproduce it outside of docker-topo, I may open an issue on pyroute2.
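One plausible explanation for a hard wall at exactly 1024, regardless of ulimit, is select(2): it cannot watch descriptors numbered >= FD_SETSIZE (1024 on Linux), which would also fit the strace output stalling on read(1023. Whether this is what pyroute2 actually hits is speculation, but CPython makes the limit easy to demonstrate, since it rejects out-of-range descriptors while building the fd set, before any syscall:

```python
import select

# fd 1024 is the first descriptor number select() cannot represent;
# it does not even need to be an open file for the check to fire.
try:
    select.select([1024], [], [], 0)
    out_of_range = False
except ValueError as exc:
    out_of_range = True
    print(exc)
```

If this is the cause, raising the ulimit cannot help; the fix is either keeping descriptor numbers below 1024 (as the net-attach-fix branch effectively does by opening fewer files) or switching the event loop to poll/epoll, which have no such ceiling.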
I've started to investigate the limits issue on my side as well: https://github.com/svinota/pyroute2/issues/698
> you can find it in https://github.com/networkop/docker-topo/tree/net-attach-fix
Yep, this fix works just fine, thank you! I'll try to add more nodes to the topology. As far as I can see, the pyroute2 issue was also fixed, great!
Hi Michael! Please give me a hand with debugging the following issue. I am trying to emulate a part, or rather a slice, of our production network. I created a topology with 1 spine and 16 superspines and, due to the presence of LACP and other L2 multicast, I use veth. I encountered two issues so far.
pyroute2.ipdb.exceptions.CommitException: target ifname is not set
Here are some outputs from --debug: I tried multiple times and got the same error for the same interface. I attached my topology file as a reference. Feel free to contact me if you have any related questions or need to test something. Regards, Alex. v0lk-topology.txt