gopher-net / docker-ovs-plugin

An Open vSwitch Plugin for Docker's Libnetwork
Apache License 2.0

Failed in prefunc: failed to get link by name "eth0-xxx": Link not found #1

Closed nerdalert closed 9 years ago

nerdalert commented 9 years ago

If you spin up a big batch of containers, a race condition eventually occurs. I'm guessing, but it seems like libnetwork and the plugin are both trying to lock the nspid file handle at the same time. Haven't investigated yet.

To replicate the issue, paste a few dozen docker run commands into your console at once. Eventually containers stop starting and stay in the created status:

docker run -itd busybox
docker run -itd busybox
docker run -itd busybox
# add a couple dozen or so and paste at once
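
As an alternative to pasting the commands by hand, the same burst can be sketched as a small shell function. This is only a sketch: spawn_batch and its two arguments are hypothetical names introduced here, not part of the plugin or of Docker, and backgrounding each docker run is assumed to be close enough to a paste to trigger the same race.

```shell
#!/bin/sh
# spawn_batch is a hypothetical helper (not from the plugin or Docker).
#   $1: number of containers to start (defaults to 30)
#   $2: docker binary to call (defaults to "docker"; pass "echo" for a dry run)
spawn_batch() {
    count="${1:-30}"
    docker_cmd="${2:-docker}"
    i=0
    while [ "$i" -lt "$count" ]; do
        # Same command as in the report, sent to the background so that
        # many container starts overlap.
        "$docker_cmd" run -itd busybox &
        i=$((i + 1))
    done
    # Block until every backgrounded start has returned.
    wait
}
```

Calling spawn_batch 30 starts thirty containers at once; spawn_batch 3 echo only prints the commands it would run. Afterwards, docker ps -a should show which containers stayed in the created status.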

Plugin:

ERRO[0046] Failed to get the nspid from docker Unable to find container: f52e5b76687d
ERRO[0046] Errors encountered adding routes to the port [ eth0-dc0ef ]: bad file descriptor

Docker Daemon:

ERRO[0047] Handler for POST /containers/{name:.*}/start returned error: Cannot start container f52e5b76687dbe8c93750aa57aaaed53595b9d4f22f24771ad702bbec2aa37fe: failed sandbox add: failed to add interface eth0-dc0ef to sandbox: failed in prefunc: failed to get link by name "eth0-dc0ef": Link not found
ERRO[0047] HTTP Error                                    err=Cannot start container f52e5b76687dbe8c93750aa57aaaed53595b9d4f22f24771ad702bbec2aa37fe: failed sandbox add: failed to add interface eth0-dc0ef to sandbox: failed in prefunc: failed to get link by name "eth0-dc0ef": Link not found statusCode=404

OVSDB transaction logs: https://gist.github.com/nerdalert/c0645d9eae392190f5ff

nerdalert commented 9 years ago

Yeah, it's definitely a lock not being released on the namespace file handle. The nspid entries being complained about are the ones with rw permissions below.

Example:
-r--r--r-- 1 root root 0 Jul 10 08:09 14dbf69c1418
-r--r--r-- 1 root root 0 Jul 10 08:09 622f32b72710
-rw-r--r-- 1 root root 0 Jul 10 08:09 1393dd11ecc9
-rw-r--r-- 1 root root 0 Jul 10 08:09 d5d79264d81d

Eventually the two entries above that are locked by another process (-rw-r--r--) get GCed. Will figure it out this weekend. An even better fix would be to work through route adds in Libnetwork. The connected route isn't getting created, so when a gateway is added it is rejected: without a connected route there is no valid route to the gateway on the same network. Fixing that would eliminate the need for the plugin to get a lock on the nsfd, netlink, etc.
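
The ordering dependency between the connected route and the gateway can be illustrated with plain iproute2 commands. This is a rough sketch, not what Libnetwork or the plugin actually runs: run and add_gateway are hypothetical helpers, the arguments are placeholders, and it assumes a named network namespace that ip netns can see (Docker's own namespaces are not exposed that way by default).

```shell
#!/bin/sh
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# add_gateway is a hypothetical helper. Placeholder arguments:
#   $1 named network namespace, $2 interface inside it,
#   $3 interface address in CIDR form, $4 gateway inside that subnet
add_gateway() {
    ns="$1"; ifname="$2"; cidr="$3"; gw="$4"
    run ip netns exec "$ns" ip link set "$ifname" up
    # Assigning an address with a prefix lets the kernel install the
    # connected route for that subnet.
    run ip netns exec "$ns" ip addr add "$cidr" dev "$ifname"
    # The default route is accepted only if the gateway is reachable
    # through a connected route; otherwise the kernel rejects it.
    run ip netns exec "$ns" ip route add default via "$gw"
}
```

With DRY_RUN=1 set, add_gateway ns0 eth0 10.1.1.2/24 10.1.1.1 only prints the three commands. Run for real (as root), skipping the address step is expected to make the final command fail, which matches the rejected gateway described above.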

nerdalert commented 9 years ago

Ditched internal ports in favor of veth pairs to avoid renaming pains. This removed the issue, since the rename no longer needs to lock the nspid and so no longer contends with Libnetwork for the file lock.