jpetazzo / pipework

Software-Defined Networking tools for LXC (LinuX Containers)
Apache License 2.0

errant busybox containers #163

Closed dreamcat4 closed 4 years ago

dreamcat4 commented 9 years ago

Hi again. The new busybox udhcpc method seems greatly improved over the previous dhclient. However, I encountered an issue today: sometimes the busybox container doesn't exit.

This in turn somehow causes no WAN connectivity inside the container (LAN connectivity is fine). I have an example to show here:

docker ps
CONTAINER ID        IMAGE                         COMMAND                CREATED             STATUS              PORTS               NAMES
a7c2e871e781        busybox                       "udhcpc -i eth0 -x h   14 minutes ago      Up 14 minutes                           loving_poitras      
b9f967ec9e51        busybox                       "udhcpc -i eth0 -x h   39 minutes ago      Up 39 minutes                           cocky_hoover        
cfe45c2ea8cf        busybox                       "udhcpc -i eth1 -x h   About an hour ago   Up About an hour                        sad_curie           
ccd685ef3e72        busybox                       "udhcpc -i eth0 -x h   About an hour ago   Up About an hour                        stoic_payne         
840bd189b4ad        busybox                       "udhcpc -i eth0 -x h   About an hour ago   Up About an hour                        stoic_tesla         
a3abf57df7e7        dreamcat4/pipework            "/entrypoint.sh --he   46 hours ago        Up 9 minutes                            pipework            
cb3b16f03f0c        dreamcat4/tvheadend:testing   "/init /entrypoint.s   11 days ago         Up 8 hours                              tvh                 
id@emachines-e520:~/docker-images$ 

And the docker logs of 1 such container is here (at bottom of page):

https://gist.github.com/dreamcat4/d0655834dc358191a979

dreamcat4 commented 9 years ago

BTW:

Stopping these containers manually, then re-running pipework (and restarting each affected client container) can temporarily work around the problem. You have to do it by hand, though.

dreamcat4 commented 9 years ago

I have my suspicions:

Perhaps this happens when multiple containers are started all together. If the previous busybox container hasn't had enough time to complete / exit before the next one starts, then somehow the problem is perpetuated.

Since I start my containers in a loop, it immediately runs the next pipework command. Does the pipework script wait for the busybox container to exit before exiting itself? I don't think so. But that may be a desirable behaviour (for me).
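If pipework gained that waiting behaviour, the startup loop could serialize on the DHCP sidekick with `docker wait`, which blocks until a container exits. A dry-run sketch (commands are echoed, not executed, and the sidekick name is made up):

```shell
# Dry-run helper: print each command instead of executing it, since this
# is only an illustration of the ordering, not working pipework code.
run() { echo "+ $*"; }

run pipework eth0 client1 dhcp
# Hypothetical: block on the DHCP sidekick before starting the next client.
run docker wait pipework-dhcp-sidekick
run pipework eth0 client2 dhcp
```

This would make each pipework invocation effectively foreground, at the cost of slower sequential startups.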

dreamcat4 commented 9 years ago

@jpetazzo This problem hasn't recurred for me recently, so I'm not sure how widespread it actually is. But anyway, here are 3 suggestions for how we might avoid the problem occurring:

1) Run the busybox container in the foreground, so that the busybox udhcpc command must itself exit before the next pipework command can run.

2) Else apply a unique label identifier to it, to check for on successive invocations. For example here:

https://github.com/jpetazzo/pipework/blob/5a46ecb5f8f933fd268ef315f58a1eb1c46bd93d/pipework#L307-L310

We could add an extra argument to docker run, like `--label="jpetazzo/pipework"` or some other appropriate identifier, so that pipework can later query it using `docker ps --filter="label=jpetazzo/pipework"` to clean up any matching hung containers before executing the next busybox dhcp instance (on successive invocations).

However, if the pipework command is being run in quick succession, the previously running container may not be hung at all; it may simply not have finished / exited yet. In which case it is not clear whether pipework should wait for the container or kill it.

Using the same docker ps --filter=label trick, we could also remove spent / exited containers from previous runs, to stop them accumulating too much.

Note: a drawback is that this approach requires docker ps --filter, which is only available in the most recent versions of docker. Of course that is an issue which will solve itself in due time.

3) An alternative approach (instead of labels) is to set the container name to some grep'able string, such as pipework-busybox-$CONTAINER_NAME or pipework-busybox-$RANDOM_UUID. Then the output of docker ps can be grepped without the compatibility issue.
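Suggestions 2 and 3 could be combined into a cleanup step before each launch. A dry-run sketch (the label and naming scheme are illustrative; pipework does not currently set either, and the docker commands are echoed rather than executed):

```shell
# Hypothetical naming/labelling scheme for the DHCP sidekick container,
# so stale instances can be found and removed on later invocations.
GUEST="tvh"
SIDEKICK="pipework-busybox-$GUEST"

# Dry-run helper: echo the docker commands instead of running them.
run() { echo "+ $*"; }

# Clean up any sidekick left over from a previous invocation...
run docker rm -f "$SIDEKICK"
# ...then start a fresh one, named and labelled for later lookup
# (e.g. via `docker ps --filter "label=jpetazzo/pipework"`).
run docker run -d --name "$SIDEKICK" --label "jpetazzo/pipework" \
    --net "container:$GUEST" --cap-add NET_ADMIN \
    busybox udhcpc -i eth0 -x "hostname:$GUEST"
```

The open question from above remains: whether an existing sidekick should be killed or waited on.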

jpetazzo commented 9 years ago

Sorry for the lag!

I think there are two issues.

1) Waiting for the DHCP client to do its job before continuing. It's possible, but hackish. A few ideas:

2) Tagging the DHCP "sidekicks" appropriately. Labels are cool. Probably a "pipework" label to indicate the container "belongs" to pipework, then "pipework.dhcp=ID" to indicate the ID of the other container.

(In theory we should use some reverse FQDN like com.pipework.etc but I'm not in the mood of specifying 42 miles long labels right now, nor purchasing a domain just for that :smirk:)

WDYT?

dreamcat4 commented 9 years ago

I honestly don't mind how to do it. Anything you feel would be an acceptable solution to me.

After reporting the issue, it has not recurred for me. I suspect it only actually happens under certain conditions. For example if the DHCP server happens to be slow to respond to requests, or is offline, but also at the same time when pipework is being run multiple times in sequence, like at system startup.

dreamcat4 commented 9 years ago

This happened again today (from a fresh reboot). 2nd time, same situation / conditions. All containers starting at once.

dreamcat4 commented 9 years ago

udhcpc will work if it's the version from Ubuntu 15.10. That is not the default dhcp provider for pipework, so users will have to specify that option explicitly in their pipework cmds to get the workaround, and only if they are affected. As I haven't yet heard other reports of this issue, just myself, maybe it's worth an errata or troubleshooting FAQ entry somewhere in the docs.

Previously udhcpc was not working for me... but many thanks, that got solved by following the tips in related issue https://github.com/jpetazzo/pipework/issues/47#issuecomment-144525962 (credit to @stoopsj for that one).

dreamcat4 commented 9 years ago

OK hacked in a sleep 2 before launching the busybox container. Unfortunately it had no effect (same error).

+ [ phys = ipoib ]
+ ip link set ph21243eth0 netns 21243
+ ip netns exec 21243 ip link set ph21243eth0 name eth0
+ [ 0a:00:00:03:00:17 ]
+ ip netns exec 21243 ip link set dev eth0 address 0a:00:00:03:00:17
+ sleep 2
+ docker run -d --net container:smb.kodi --cap-add NET_ADMIN busybox udhcpc -i eth0 -x hostname:smb.kodi
+ installed arping
+ command -v arping
+ cut -d/ -f1
+ echo dhcp
+ IPADDR=dhcp
+ ip netns exec 21243 arping -c 1 -A -I eth0 dhcp
+ true
+ rm -f /var/run/netns/21243

Can't really see what's wrong with @jpetazzo's code here though. It looks like it ought to do the right things.

id@emachines-e520:~/docker-images$ docker logs admiring_brown 2>&1 | head
udhcpc (v1.23.2) started
Sending discover...
Read error: Network is down, reopening socket
udhcpc: sendto: Network is down
Sending discover...
udhcpc: sendto: Network is down
Read error: Network is down, reopening socket
Sending discover...
udhcpc: sendto: Network is down
Read error: Network is down, reopening socket
id@emachines-e520:~/docker-images$ 

Ah. Now I see that busybox ifconfig needs the -a (show all) flag, and indeed the network interface is present; before, it didn't show up in the cmd output. So that's not the issue after all...

id@emachines-e520:~/docker-images$ docker exec admiring_brown ifconfig -a
eth0      Link encap:Ethernet  HWaddr 0A:00:00:03:00:17  
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:8 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:480 (480.0 B)  TX bytes:480 (480.0 B)

dreamcat4 commented 9 years ago

OK the busybox's eth0 is not in the up state.

dreamcat4 commented 9 years ago

the busybox's eth0 is not in the up state

Doing ifconfig eth0 up inside the busybox container causes some improvement, in the sense that it's no longer hung. The udhcpc completes, and the container exits.

id@emachines-e520:~/dev$ docker start jackett.id
jackett.id

id@emachines-e520:~/dev$ docker ps
CONTAINER ID        IMAGE                COMMAND                  CREATED             STATUS              PORTS               NAMES
bcfef8293694        busybox              "udhcpc -i eth0 -x ho"   1 seconds ago       Up 1 seconds                            hungry_bhabha
0cdd44694800        dreamcat4/pipework   "/entrypoint.sh --hel"   12 hours ago        Up 12 hours                             pipework
61eebce2f692        dreamcat4/jackett    "/init /entrypoint.sh"   8 weeks ago         Up 4 seconds                            jackett.id

id@emachines-e520:~/dev$ ping -c1 jackett.id
PING jackett.id (192.168.5.6) 56(84) bytes of data.
From emachines-e520.lan (192.168.1.33) icmp_seq=1 Destination Host Unreachable

--- jackett.id ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

id@emachines-e520:~/dev$ docker exec hungry_bhabha ifconfig eth0 up

id@emachines-e520:~/dev$ docker ps
CONTAINER ID        IMAGE                COMMAND                  CREATED             STATUS              PORTS               NAMES
0cdd44694800        dreamcat4/pipework   "/entrypoint.sh --hel"   12 hours ago        Up 12 hours                             pipework
61eebce2f692        dreamcat4/jackett    "/init /entrypoint.sh"   8 weeks ago         Up About a minute                       jackett.

... however for some reason the ping still failed afterwards:

id@emachines-e520:~/dev$ ping -c1 jackett.id
PING jackett.id (192.168.5.6) 56(84) bytes of data.
From emachines-e520.lan (192.168.1.33) icmp_seq=1 Destination Host Unreachable

--- jackett.id ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

id@emachines-e520:~/dev$ 

I will try again.

dreamcat4 commented 9 years ago

Unfortunately, although the busybox udhcpc reports getting a lease, that success is not reflected in the linked container, nor by a ping probe.

Not sure why that is, or why the container's pipework interface was in the DOWN state to begin with. I have been trying certain other things but unfortunately no improvement. I'm going to leave the progress there for the time being, and just switch to the ubuntu 15.10 udhcpc for myself...

But if any others have similar issue please report it.

Dmitriusan commented 9 years ago

I have the same issue. Also running Ubuntu 14.04, udhcpc, and starting multiple containers at once. I recently updated the pipework script from the repo, and ran into this.

Dmitriusan commented 9 years ago

UPD: Managed to solve the issue. Here is a bit more info and a workaround for those who are going to suffer from it.

LONG STORY: This morning I rebooted the server and faced this issue. The issue was arising consistently for me, even after reverting all dockerfiles, pipework, and /etc/ to earlier revisions. Moreover, starting services one by one did not help, nor did waiting 20 secs before running pipework. I was a bit wrong in my previous post saying that I'm using udhcpc. The first thing I did was upgrade the udhcpc package to the Ubuntu 15.04 version, but it did not help. Finally, I decided to try another dhcp client, and noticed that despite the busybox containers being stuck with udhcpc, I was starting pipework with

bash pipework int-br ${NAME} dhcp ${MAC}

That means that the udhcpc program being executed in the container is not the same as the one installed on the host. I changed dhcp to udhcpc (so pipework runs the host version of udhcpc in the container's network namespace), and the issue is gone.

SUSPECTED REASON: I updated the busybox container this morning. The official Docker Busybox image was updated 2 days ago: https://github.com/docker-library/busybox/commits/master . Maybe they changed the dhcp client program or its version, so it cannot understand the options passed by pipework.

PID   USER     TIME   COMMAND
    1 root       0:00 udhcpc -i eth1 -x hostname squid-proxy
   13 root       0:00 /bin/sh
   20 root       0:00 ps aux

WORKAROUND: Run pipework with option udhcpc, and not with dhcp:

    bash pipework int-br ${NAME} udhcpc ${MAC}

Of course, udhcpc should be installed on physical host.

Dmitriusan commented 9 years ago

UPD2: It looks like with the workaround I posted, I could not access the docker container from other machines attached to the bridge. That is because I did not disable the default Docker network on eth0, and it was serving as the default route. I had to add --net=none to all docker run commands in my scripts.
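Putting the two findings together, the full workaround looks something like the following dry-run sketch (the container name, image name, and MAC are placeholders; commands are echoed, not executed):

```shell
# Placeholders for illustration only.
NAME="squid-proxy"
MAC="0a:00:00:03:00:17"

# Dry-run helper: echo each command instead of executing it.
run() { echo "+ $*"; }

# --net=none so docker's default eth0 doesn't become the default route;
# pipework's interface then carries all traffic.
run docker run -d --net=none --name "$NAME" my-squid-image
# udhcpc (not dhcp) so pipework runs the host's udhcpc in the
# container's network namespace.
run bash pipework int-br "$NAME" udhcpc "$MAC"
```

As noted above, udhcpc must be installed on the physical host for this to work.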

dreamcat4 commented 9 years ago

@Dmitriusan +1, I agree. Using the newest version of Ubuntu udhcpc on the host machine (15.04 or newer), and declaring udhcpc in your pipework commands, feels like the easiest workaround for the time being. Probably only needed for those users who are experiencing ^^ these problems.

A better long-term solution would be to make the busybox project aware of the problem, in the hope that they are in a position to fix it. Equally, though, they might not be very interested in docker and such related uses of the dhcp client. I'm not sure what they might feel inclined to do about it; I haven't asked.

As for alternatives (small docker images) to the busybox docker image: well, I did look around, but unfortunately could not find a suitable replacement just for the task of running a simple dhcp client. Maybe I missed / overlooked something. It really surprised me not to find anything else.

jpetazzo commented 8 years ago

Just to make sure I understand correctly: with the latest busybox, does the problem happen always now, or only when starting a bunch of containers at the same time? (Which would hint at some race condition)

dreamcat4 commented 8 years ago

To reproduce, it seems to initially require the 2nd situation: starting multiple containers at the same time. Once the problem starts occurring, it can continue to happen thereafter when starting individual containers.

To clear the problem I reboot the whole computer (perhaps it can be cleared with less than a full system reboot; I don't know).

I don't use the busybox default method anymore, or the other ones. Only the Ubuntu 15.04+ (recent / newest) version of udhcpc. That is the only one that works for me without issues.

pppq commented 8 years ago

Looks like udhcpc tries to run a client script each time a relevant event occurs, and does not make any interface-related changes by itself; see dhcpc.c and the manpage section for udhcpc. The default script, /usr/share/udhcpc/default.script, is not present in the helper container.

dreamcat4 commented 8 years ago

Thanks @pppq!

What does this mean then? Perhaps we could mount the missing script into the busybox image with -v host/script:/path/script?

Or is this more like a general bug in the official busybox image, whereby the missing file should really always be built right into it? (i.e. to also benefit the many other users of the Busybox image).
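The bind-mount idea might look like this dry-run sketch (the host path is where Ubuntu's busybox-static package keeps its example scripts; adjust for your distro, and note the docker command is only echoed here):

```shell
# Host location of the example udhcpc script on Ubuntu (busybox-static
# package); other distros will differ.
HOST_SCRIPT="/usr/share/doc/busybox-static/examples/udhcp/simple.script"

# Dry-run helper: echo the command instead of executing it.
run() { echo "+ $*"; }

# Mount the script at the path udhcpc expects inside the helper container.
run docker run -d --net container:tvh --cap-add NET_ADMIN \
    -v "$HOST_SCRIPT:/usr/share/udhcpc/default.script:ro" \
    busybox udhcpc -i eth0 -x hostname:tvh
```

This avoids rebuilding the image, at the cost of a host-side dependency on the script's location.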

pppq commented 8 years ago

Yes, if I exec into the helper, I'm seeing:

/ # ls -al /usr
total 24
drwxr-xr-x    3 root     root          4096 Jan  2 16:51 .
drwxr-xr-x    1 root     root          4096 Jan  2 22:05 ..
drwxr-xr-x    2 daemon   daemon        4096 Dec  8 16:44 sbin

Ubuntu keeps the example scripts in /usr/share/doc/busybox-static/examples/udhcp (see the filelist), and in Jérôme's rootfs.tar, it's present in the expected location. So the image building process needs to copy it from the appropriate location.

pppq commented 8 years ago

Well, it would be easiest if the official image included the script, but it will not be of great use unless the container is running in privileged mode, or one adds the NET_ADMIN capability, I think. So I'm not sure it is useful for the wider audience of the image. On the other hand, the script is not too big either. :smile:

dreamcat4 commented 8 years ago

Ping @jpetazzo ^^

jpetazzo commented 8 years ago

I see! Let me summon the Powers That Be.

@tianon: the busybox image contains the udhcpc client, but this client depends on a couple of scripts to work correctly. The scripts are invoked by udhcpc once it has obtained a lease, and the scripts are responsible for configuring the network interface. The scripts are currently not included in the busybox image. Do you think we should include them, or should we just tell people to build their own busybox image if they need to? (Which is not too hard since that'd just be FROM busybox and a COPY ./them-scripts/ /to/dat/path/tho/)
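The build-your-own option really is that small; a sketch, assuming simple.script has already been fetched from the busybox source tree (examples/udhcp/) into the build context, with the docker build command echoed rather than run:

```shell
# Set up a minimal build context with a two-line Dockerfile layering the
# udhcp example script onto the official busybox image.
mkdir -p /tmp/busybox-dhcp
cd /tmp/busybox-dhcp
cat > Dockerfile <<'EOF'
FROM busybox
COPY simple.script /usr/share/udhcpc/default.script
EOF

# Dry-run helper: echo the build command instead of executing it.
run() { echo "+ $*"; }
run docker build -t busybox-dhcp .
```

The resulting image could then be used by pipework in place of plain busybox.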

pppq commented 8 years ago

The experimentally inclined can also include it in the image builder. :smile: https://github.com/pppq/docker-busybox/commit/5c200e53e6ead5c2a5ecc7a0895faa2257ad4938

But I agree, it is easier to enable and/or customize this functionality with an additional layer. Also, a regular container instance not started via pipework will not have the required level of access to its veth interface to change the IP address, netmask and default gateway to the received values.

tianon commented 8 years ago

@jpetazzo ah interesting -- there is an "example" configuration referenced as part of the BusyBox source, so that would be really trivial to include (https://git.busybox.net/busybox/tree/examples/udhcp?h=1_24_stable)

My question here would be whether there's a recommendation from BusyBox upstream one way or the other on what the default should be for a generic environment like the one we provide? Do they have any documentation about this script/applet and the recommended usage? (I've done some searching and can't seem to find any. :disappointed:)

Looking at https://git.busybox.net/busybox/log/examples/udhcp?h=1_24_stable is not terribly encouraging (those example scripts haven't been touched since 2014, which likely either means they're unmaintained, or that they're rock-solid).

pppq commented 8 years ago

The only hint I could find is in Config.src. The scripts also don't try to do too much – it looks like simple.script is a one-file combination of all the sample.* scripts that handle dhcpc events individually.

With that said, there are people who come up with alternative implementations, see http://lists.busybox.net/pipermail/busybox/2007-January/059859.html for an example. I don't know if there is a definitive script that should be placed in the default location.
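For reference, a hypothetical minimal client script condensed from the event handling in simple.script: udhcpc invokes the script with the event name as $1 and the lease details in environment variables ($interface, $ip, $subnet, $router). This is an untested illustration written to a file for inspection, not a drop-in replacement:

```shell
# Write the illustrative script to a file; it is a condensed sketch of
# what simple.script does on each udhcpc event, not a tested replacement.
cat > /tmp/default.script <<'EOF'
#!/bin/sh
case "$1" in
    deconfig)
        # Bring the interface up with no address before requesting a lease.
        ifconfig "$interface" 0.0.0.0 up
        ;;
    bound|renew)
        # Apply the leased address/netmask, then the default route.
        ifconfig "$interface" "$ip" ${subnet:+netmask "$subnet"}
        [ -n "$router" ] && route add default gw "$router"
        ;;
esac
exit 0
EOF
chmod +x /tmp/default.script
```

The real simple.script also handles DNS (writing resolv.conf) and multiple routers, which are omitted here.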