nestybox / sysbox

An open-source, next-generation "runc" that empowers rootless containers to run workloads such as Systemd, Docker, Kubernetes, just like VMs.
Apache License 2.0

Network communication failed when running Docker swarm inside a Sysbox system container #250

Open sthiriet opened 3 years ago

sthiriet commented 3 years ago

Hi

Following the test suite, it's possible to launch swarm services in a Docker swarm inside a Sysbox system container, but services on the same network can't communicate with each other.

I initialize a swarm cluster with the help of this test file:

Then I created a network and two services:

docker exec $mgr sh -c "docker network create --attachable=true -d overlay innet"
docker exec $mgr sh -c "docker service create --replicas 1 --network innet --name firstservice alpine ping docker.com"
docker exec $mgr sh -c "docker service create --replicas 1 --network innet --name secondservice alpine ping docker.com"

Then I connect to the firstservice replica and try to ping secondservice:
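
Roughly like this (a reconstruction, since the exact exec command isn't shown above; the task container is looked up by service name):

docker exec $mgr sh -c "docker exec \$(docker ps -qf name=firstservice) ping -c 3 secondservice"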

ping: bad address 'secondservice'

In the manager's log:

time="2021-03-30T19:19:01.030773398Z" level=info msg="shim containerd-shim started" address=/containerd-shim/27a7d45d81c1bb86758304c6f770bc730fa970d03cc47daa673f7626a409deae.sock debug=false pid=11675
time="2021-03-30T19:19:01Z" level=warning msg="file does not exist: /proc/sys/net/ipv6/conf/all/disable_ipv6 : stat /proc/sys/net/ipv6/conf/all/disable_ipv6: no such file or directory Has IPv6 been disabled in this node's kernel?"
time="2021-03-30T19:19:05.966424391Z" level=info msg="NetworkDB stats d288bd289303(248d2ba3518c) - netID:8mnt6217s6bfua4qb9fk2keaj leaving:false netPeers:2 entries:4 Queue qLen:0 netMsg/s:0"
time="2021-03-30T19:19:05.966560293Z" level=info msg="NetworkDB stats d288bd289303(248d2ba3518c) - netID:uau5lrwi4lxmf52tycwtist4r leaving:false netPeers:2 entries:8 Queue qLen:0 netMsg/s:0"
time="2021-03-30T19:19:08.268646522Z" level=info msg="Container 510866142c71261d7dc4eea060c0a50e61b58afc34c28225452ded40d6c1f64e failed to exit within 10 seconds of signal 15 - using the force"
time="2021-03-30T19:19:08.431420306Z" level=info msg="shim reaped" id=510866142c71261d7dc4eea060c0a50e61b58afc34c28225452ded40d6c1f64e
time="2021-03-30T19:19:08.441486528Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
time="2021-03-30T19:19:08.442291287Z" level=warning msg="rmServiceBinding 0c932c01ce0b750e93491916c6494b28b7762560a111299c691229492a5e8711 possible transient state ok:false entries:0 set:false "
time="2021-03-30T19:19:08Z" level=error msg="set up rule failed, [-t mangle -D INPUT -d 10.0.1.2/32 -j MARK --set-mark 256]:  (iptables failed: iptables --wait -t mangle -D INPUT -d 10.0.1.2/32 -j MARK --set-mark 256: iptables: No chain/target/match by that name.\n (exit status 1))"
time="2021-03-30T19:19:08.558023941Z" level=error msg="Failed to delete firewall mark rule in sbox lb_uau5 (lb-inne): reexec failed: exit status 8"
time="2021-03-30T19:19:08.558253463Z" level=error msg="Failed add IP alias 10.0.1.2 to network uau5lrwi4lxmf52tycwtist4r LB endpoint interface eth0: cannot assign requested address"
rodnymolina commented 3 years ago

@sthiriet, thanks for reporting this one.

A few questions below ...

rodnymolina commented 3 years ago

Also, to rule out DNS-resolution issues, try the IP address of each container instead of pinging by name.
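
For instance, something along these lines (a sketch; the service names come from your commands above, and the inspect format string is the standard Docker CLI template syntax):

IP=$(docker exec $mgr sh -c "docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' \$(docker ps -qf name=secondservice)")
docker exec $mgr sh -c "docker exec \$(docker ps -qf name=firstservice) ping -c 3 $IP"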

sthiriet commented 3 years ago

Hello @rodnymolina

@sthiriet, thanks for reporting this one.

A few questions below ...

* Which linux distro are you running at the host level?
uname -a
Linux s-VirtualBox 5.8.0-44-generic #50~20.04.1-Ubuntu SMP Wed Feb 10 21:07:30 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
* Which linux distro are you running at the system container level?

In the first test, a standard Docker Alpine image, but the same behaviour shows up in tests with nestybox/alpine-docker-dbg.

* Have you checked if the testcase on which you are basing your setup is properly working? If not already done, please do the following within your host / VM:

  * $ git clone sysbox
  * $ cd sysbox
  * $ make test-shell (to launch the sysbox-test privileged container)
  * $ bats -t tests/dind/swarm.bats (to execute the swarm testcase)

I've modified the tests as follows in order to reproduce the error:

#!/usr/bin/env bats
#set 3>/dev/tty
#BASH_XTRACEFD=3
#set -x
# Basic tests running docker inside a system container
#

load ../helpers/run
load ../helpers/docker
load ../helpers/net
load ../helpers/sysbox-health

function teardown() {
  sysbox_log_check
}

function basic_test {
   net=$1

   # Launch swarm manager sys container
   local mgr=$(docker_run --rm --name manager --net=$net ${CTR_IMG_REPO}/alpine-docker-dbg:latest tail -f /dev/null)

   # init swarm in manager, get join token
   docker exec -d $mgr sh -c "dockerd > /var/log/dockerd.log 2>&1"
   [ "$status" -eq 0 ]

   wait_for_inner_dockerd $mgr

   docker exec $mgr sh -c "docker swarm init"
   [ "$status" -eq 0 ]

   docker exec $mgr sh -c "docker swarm join-token -q manager"
   [ "$status" -eq 0 ]
   local mgr_token="$output"

   docker exec $mgr sh -c "ip a"
   [ "$status" -eq 0 ]
   local mgr_ip=$(parse_ip "$output" "eth0")

   local join_cmd="docker swarm join --token $mgr_token $mgr_ip:2377"

   # Launch worker node
   local worker=$(docker_run --rm --name worker --net=$net ${CTR_IMG_REPO}/alpine-docker-dbg:latest tail -f /dev/null)

   # Join the worker to the swarm
   docker exec -d $worker sh -c "dockerd > /var/log/dockerd.log 2>&1"
   [ "$status" -eq 0 ]

   wait_for_inner_dockerd $worker

   docker exec $worker sh -c "$join_cmd"
   [ "$status" -eq 0 ]

   # verify worker node joined
   docker exec $mgr sh -c "docker node ls"
   [ "$status" -eq 0 ]

   # The output of the prior command is something like this:
   #
   # ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS      ENGINE VERSION
   # by9ukwes9r9emn3pbozbh6dp6     7f62c95195dc        Ready               Active              Reachable           19.03.12
   # sfgwme7k5vol5ra3hf2jgwlfo *   fc4c806f1598        Ready               Active              Leader              19.03.12

   for i in $(seq 1 2); do
      [[ "${lines[$i]}" =~ "Ready".+"Active" ]]
   done

   # deploy a service
   docker exec $mgr sh -c "docker service create --restart-max-attempts 5 --replicas 4 --name helloworld alpine ping localhost"
   [ "$status" -eq 0 ]

   # verify the service is up
   docker exec $mgr sh -c "docker service ls"
   [ "$status" -eq 0 ]
   [[ "${lines[1]}" =~ "$helloworld".+"4/4" ]]

    # cleanup
   docker_stop $mgr
   docker_stop $worker
}

function service_com_test {
   net=$1

   # Launch swarm manager sys container
   local mgr=$(docker_run --rm --name manager --net=$net ${CTR_IMG_REPO}/alpine-docker-dbg:latest tail -f /dev/null)

   # init swarm in manager, get join token
   docker exec -d $mgr sh -c "dockerd > /var/log/dockerd.log 2>&1"
   [ "$status" -eq 0 ]

   wait_for_inner_dockerd $mgr

   docker exec $mgr sh -c "docker swarm init"
   [ "$status" -eq 0 ]

   docker exec $mgr sh -c "docker swarm join-token -q manager"
   [ "$status" -eq 0 ]
   local mgr_token="$output"

   docker exec $mgr sh -c "ip a"
   [ "$status" -eq 0 ]
   local mgr_ip=$(parse_ip "$output" "eth0")

   local join_cmd="docker swarm join --token $mgr_token $mgr_ip:2377"

   # Launch worker node
   local worker=$(docker_run --rm --name worker --net=$net ${CTR_IMG_REPO}/alpine-docker-dbg:latest tail -f /dev/null)

   # Join the worker to the swarm
   docker exec -d $worker sh -c "dockerd > /var/log/dockerd.log 2>&1"
   [ "$status" -eq 0 ]

   wait_for_inner_dockerd $worker

   docker exec $worker sh -c "$join_cmd"
   [ "$status" -eq 0 ]

   # verify worker node joined
   docker exec $mgr sh -c "docker node ls"
   [ "$status" -eq 0 ]

   # The output of the prior command is something like this:
   #
   # ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS      ENGINE VERSION
   # by9ukwes9r9emn3pbozbh6dp6     7f62c95195dc        Ready               Active              Reachable           19.03.12
   # sfgwme7k5vol5ra3hf2jgwlfo *   fc4c806f1598        Ready               Active              Leader              19.03.12

   for i in $(seq 1 2); do
      [[ "${lines[$i]}" =~ "Ready".+"Active" ]]
   done

   # create an overlay network 
   docker exec $mgr sh -c "docker network create -d overlay test-in-net"
   [ "$status" -eq 0 ]
   # deploy a target service to test network
   docker exec $mgr timeout 60 sh -c "docker service create --network test-in-net --restart-max-attempts 5 --replicas 1 --name target nginx:alpine"
   [ "$status" -eq 0 ]

   # deploy a source service to test network using timeout as docker service create never returns on failure
   docker exec $mgr timeout 60 sh -c "docker service create --network test-in-net --restart-max-attempts 5 --replicas 1 --name source alpine sh -c 'set -e; while true; do wget -q -O - target; sleep 2; done'"
   [ "$status" -eq 0 ]

   # verify the services are up
   docker exec $mgr sh -c "docker service ls"
   [ "$status" -eq 0 ]
   [[ "${lines[1]}" =~ "$source".+"1/1" ]]
   [[ "${lines[2]}" =~ "$target".+"1/1" ]]

   # cleanup
   docker_stop $mgr
   docker_stop $worker
}

@test "swarm-in-docker basic" {
   basic_test bridge
 }

@test "swarm-in-docker custom net" {

   docker network create test-net
   [ "$status" -eq 0 ]

   basic_test test-net

   docker network rm test-net
   [ "$status" -eq 0 ]
}

@test "swarm-in-docker basic service communication test" {
   service_com_test bridge
 }

@test "swarm-in-docker custom net service communication test" {

   docker network create test-net
   [ "$status" -eq 0 ]

   service_com_test test-net

   docker network rm test-net
   [ "$status" -eq 0 ]
}

Results are:

root@sysbox-test:~/nestybox/sysbox# bats -t tests/dind/swarm.bats 
1..4
ok 1 swarm-in-docker basic
ok 2 swarm-in-docker custom net
not ok 3 swarm-in-docker basic service communication test
# (from function `service_com_test' in file tests/dind/swarm.bats, line 142,
#  in test file tests/dind/swarm.bats, line 172)
#   `service_com_test bridge' failed
# docker run --runtime=sysbox-runc -d --rm --name manager --net=bridge ghcr.io/nestybox/alpine-docker-dbg:latest tail -f /dev/null (status=0):
# f89d162272803432230dd87f55fc1fc60f0f471e19702b1b6a1770f0cae84418
# docker ps --format {{.ID}} (status=0):
# f89d16227280
# docker exec -d f89d16227280 sh -c dockerd > /var/log/dockerd.log 2>&1 (status=0):
# 
# docker exec f89d16227280 sh -c docker swarm init (status=0):
# Swarm initialized: current node (twoqefhhn5f663xbogw4hw9kc) is now a manager.
# 
# To add a worker to this swarm, run the following command:
# 
#     docker swarm join --token SWMTKN-1-3h1vuoe3mikx2slnr9fz1wa45vfouy8azu1248rx0tlb3mmy41-0bobvuwih39upg18p0ak7btjv 172.21.0.2:2377
# 
# To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
# docker exec f89d16227280 sh -c docker swarm join-token -q manager (status=0):
# SWMTKN-1-3h1vuoe3mikx2slnr9fz1wa45vfouy8azu1248rx0tlb3mmy41-8iyykpy3qofvl6964572jztzk
# docker exec f89d16227280 sh -c ip a (status=0):
# 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
#     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
#     inet 127.0.0.1/8 scope host lo
#        valid_lft forever preferred_lft forever
# 2: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN 
#     link/ether 02:42:9e:97:2d:94 brd ff:ff:ff:ff:ff:ff
#     inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
#        valid_lft forever preferred_lft forever
# 13: eth0@if14: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP 
#     link/ether 02:42:ac:15:00:02 brd ff:ff:ff:ff:ff:ff
#     inet 172.21.0.2/16 brd 172.21.255.255 scope global eth0
#        valid_lft forever preferred_lft forever
# docker run --runtime=sysbox-runc -d --rm --name worker --net=bridge ghcr.io/nestybox/alpine-docker-dbg:latest tail -f /dev/null (status=0):
# be2e2aac46fafe86a428a70b6aa6235cded0c5098cd924db8665d8d013084785
# docker ps --format {{.ID}} (status=0):
# be2e2aac46fa
# f89d16227280
# docker exec -d be2e2aac46fa sh -c dockerd > /var/log/dockerd.log 2>&1 (status=0):
# 
# docker exec be2e2aac46fa sh -c docker swarm join --token SWMTKN-1-3h1vuoe3mikx2slnr9fz1wa45vfouy8azu1248rx0tlb3mmy41-8iyykpy3qofvl6964572jztzk 172.21.0.2:2377 (status=0):
# This node joined a swarm as a manager.
# docker exec f89d16227280 sh -c docker node ls (status=0):
# ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS      ENGINE VERSION
# vtgvvjh8ksiagtqc4hn5wtlcp     be2e2aac46fa        Ready               Active              Reachable           19.03.12
# twoqefhhn5f663xbogw4hw9kc *   f89d16227280        Ready               Active              Leader              19.03.12
# docker exec f89d16227280 sh -c docker network create -d overlay test-in-net (status=0):
# q00s6ru7h5efde9l73gssl3fh
# docker exec f89d16227280 timeout 60 sh -c docker service create --network test-in-net --restart-max-attempts 5 --replicas 1 --name target nginx:alpine (status=0):
# rwdil8nnkwodtlaojl3qzcs6l
# overall progress: 0 out of 1 tasks
# 1/1:  
# overall progress: 0 out of 1 tasks
# overall progress: 0 out of 1 tasks
...
...
# overall progress: 1 out of 1 tasks
# verify: Waiting 5 seconds to verify that tasks are stable...
....
# verify: Waiting 1 seconds to verify that tasks are stable...
# verify: Service converged
# docker exec f89d16227280 timeout 60 sh -c docker service create --network test-in-net --restart-max-attempts 5 --replicas 1 --name source alpine sh -c 'set -e; while true; do wget -q -O - target; sleep 2; done' (status=143):
# ub3wztjj32irrpyl07537on7d
# overall progress: 0 out of 1 tasks
# 1/1:  
# overall progress: 0 out of 1 tasks
# overall progress: 0 out of 1 tasks
# overall progress: 0 out of 1 tasks
...
# overall progress: 0 out of 1 tasks
# overall progress: 1 out of 1 tasks
# verify: Waiting 5 seconds to verify that tasks are stable...
...
# verify: Waiting 2 seconds to verify that tasks are stable...
# overall progress: 0 out of 1 tasks
# verify: Detected task failure
# overall progress: 0 out of 1 tasks
...
# verify: Waiting 2 seconds to verify that tasks are stable...
# verify: Waiting 2 seconds to verify that tasks are stable...
# overall progress: 0 out of 1 tasks
# verify: Detected task failure
# overall progress: 0 out of 1 tasks
...
# overall progress: 0 out of 1 tasks
not ok 4 swarm-in-docker custom net service communication test
# (from function `service_com_test' in file tests/dind/swarm.bats, line 95,
#  in test file tests/dind/swarm.bats, line 180)
#   `service_com_test test-net' failed
# docker network create test-net (status=0):
# acb2067d9b3a345491dcb735a81c3a68221784081ecda4daf412252333924dec
# docker run --runtime=sysbox-runc -d --rm --name manager --net=test-net ghcr.io/nestybox/alpine-docker-dbg:latest tail -f /dev/null (status=125):
# docker: Error response from daemon: Conflict. The container name "/manager" is already in use by container "f89d162272803432230dd87f55fc1fc60f0f471e19702b1b6a1770f0cae84418". You have to remove (or rename) that container to be able to reuse that name.
# See 'docker run --help'.
# docker ps --format {{.ID}} (status=0):
# be2e2aac46fa
# f89d16227280
# docker exec -d be2e2aac46fa sh -c dockerd > /var/log/dockerd.log 2>&1 (status=0):
# 
# docker exec be2e2aac46fa sh -c docker swarm init (status=1):
# Error response from daemon: This node is already part of a swarm. Use "docker swarm leave" to leave this swarm and join another one.
rodnymolina commented 3 years ago

@sthiriet, thanks for your detailed response and reproduction steps.

I suspect the problem is a consequence of Sysbox currently being unable to handle IPVS instructions within a system container; Swarm usually relies on these to manage access to services exposed through the "ingress" network.

What surprised me about your setup is that, at first glance, I don't see any service being exposed (e.g. via port-forwarding) that would require the use of IPVS. That's why I asked you to verify that traffic could flow across the non-ingress networks (i.e. the overlay network you created, as well as through the regular docker_gwbridge iface). But I went ahead and answered those questions myself, thanks to your repro instructions.

I'd need to do some digging to fully connect the dots, as I'm not 100% sure that IPVS is the problem here. However, if you ever need to make use of the "ingress" network to offer access to your services to external parties, you will hit this Sysbox limitation anyway. We do have the IPVS feature on our roadmap, so please stay tuned.
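
In the meantime, a quick way to sanity-check whether IPVS is usable inside the system container (a sketch; /proc/net/ip_vs only exists when the ip_vs kernel module is loaded and visible to the container):

docker exec $mgr sh -c "cat /proc/net/ip_vs 2>/dev/null || echo 'IPVS not visible in this container'"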

rodnymolina commented 3 years ago

One question: what's the scenario you have in mind, and what's its purpose? I'm just asking to better understand the scope of what you're trying to accomplish, so that we can prioritize this feature (full docker-swarm support within sysbox containers) accordingly.

sthiriet commented 3 years ago

Indeed, I hadn't presented the context.

The first scenario is testing Ansible roles that deploy and configure swarm clusters, their monitoring tools, and applications. Today, we have a GitLab instance with shared runners (a privileged Docker executor). Those runners will be removed soon for security reasons.

The second scenario I wanted to test was using these Sysbox Docker runners to deploy a swarm application and run our CI tests, instead of using a dedicated non-production swarm cluster.

I'm looking forward to these updates :)

ctalledo commented 3 years ago

Hi @rodnymolina, given that this appears to be related to IPVS not working inside Sysbox containers, I am wondering if we should mark this as a duplicate of issue #189. What do you say?

rodnymolina commented 3 years ago

Thanks @sthiriet, both use cases make perfect sense.

I fully understand that those privileged docker-executors represent a security risk, and this is precisely a natural use case for the Sysbox runtime.

Regarding the second use case, if I understood you correctly, you want to use Sysbox containers to deploy swarm services. This means there's no need to run swarm inside the Sysbox container, as the container would only run the apps that need to be tested. If that's the case, then everything should already work fine for you, since IPVS and all the docker-swarm networking magic would happen outside the Sysbox container.

@ctalledo, yes, I'll mark this one as a dup once/if I confirm that IPVS is the root cause, but I want to make sure that's the case before we do that.

dmarteau commented 2 years ago

Hi,

We may have the same problem here: we use Sysbox to mock up a production cluster, which involves running swarm on some hosts. The swarm DNS resolver works, but there is no way for swarm service containers to connect to each other.

dmarteau commented 2 years ago

To be more precise: communication between containers works, but not via the service IP, which, AFAIK, acts as a virtual IP and is responsible for forwarding the traffic to the containers.
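
For example (a sketch run against the inner Docker daemon; "target" stands in for any service name, and the field paths are from the standard docker inspect output):

# the service VIP that IPVS should answer on
docker service inspect -f '{{range .Endpoint.VirtualIPs}}{{.Addr}} {{end}}' target
# the task's actual container IP, which does respond
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' $(docker ps -qf name=target)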

ctalledo commented 2 years ago

Hi @dmarteau, yes, this is almost certainly due to IPVS not working inside Sysbox containers. It's something we've been wanting to add for a while, but it's challenging; it mainly impacts Docker swarm inside Sysbox, whereas K8s inside Sysbox works because it also supports iptables.

dmarteau commented 2 years ago

At the moment, I have found a workaround: declaring endpoint_mode as dnsrr to bypass IPVS in swarm services (see https://docs.docker.com/compose/compose-file/compose-file-v3/#deploy). This is OK as long as you don't need load-balancing for your services.
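
The docker CLI equivalent, applied to the services from the test above (a sketch; --endpoint-mode is the docker service create counterpart of compose's endpoint_mode):

docker service create --endpoint-mode dnsrr --network test-in-net --replicas 1 --name target nginx:alpine

With dnsrr, the embedded DNS returns the task IPs directly instead of a single IPVS-backed virtual IP, so no kernel IPVS support is needed (and no built-in load-balancing is provided).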

ctalledo commented 2 years ago

At the moment, I have found a workaround: declaring endpoint_mode as dnsrr to bypass IPVS in swarm services (see https://docs.docker.com/compose/compose-file/compose-file-v3/#deploy). This is OK as long as you don't need load-balancing for your services.

Good hint, thanks!

pekindenis commented 1 year ago

LXC containers have the same problem.