hectcastro / docker-riak

A Docker project to bring up a local Riak cluster.
https://registry.hub.docker.com/u/hectcastro/riak/
Apache License 2.0
167 stars 83 forks

problem setting up multi host cluster #10

Open xh3b4sd opened 10 years ago

xh3b4sd commented 10 years ago

Hey there,

I am playing around with the docker-riak repo and have some problems. I hope it is ok to address my issue here, even if it is not the correct platform (maybe the riak/coreos mailing list would be more appropriate). If so, sorry!

Anyway, I am trying to figure out how to set up a docker-riak cluster across multiple hosts. The idea is this: with two CoreOS [1] machines running, the cluster should consist of one Riak node on the first machine and one Riak node on the second. Linking should be done using the ambassador pattern [2]. I have a gist [3] describing how it should work. Everything comes up properly, but the Riak nodes don't get joined into one cluster. It is possible to reach the Riak nodes from inside the containers, but for some reason riak-admin cluster join <node> does not do what it should.

I don't understand the problem, because the Riak nodes can reach the correct IPs and are able to communicate. To me there is no obvious reason why this scenario fails. Maybe somebody could take a look and help get it to work. It would be really cool to be able to run this cluster across multiple hosts.

[1] https://coreos.com/ [2] http://coreos.com/blog/docker-dynamic-ambassador-powered-by-etcd/ [3] https://gist.github.com/zyndiecate/74e8df820ccee60f67ae

All the best, Tim

hectcastro commented 10 years ago

I need to dig into this blog post and your gist, but before I do:

xh3b4sd commented 10 years ago

Hey thanks for the quick reply! To your questions:

hectcastro commented 10 years ago
  • Content of the vm.args file is unchanged. It has the content the docker image hectcastro/docker-riak creates. They are identical for both nodes. See gist for riak01 and riak02.

There is a bit of customization to the vm.args file right before Riak starts. The -name argument for the Erlang VM needs to be a fully qualified domain name (FQDN) or an IP address that is accessible by other nodes in the cluster:

-name riak@riak1.example.com

Can you try spinning up the nodes again, but before attempting to join them into a cluster, ensure that the values for -name are accessible?
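The substitution happens when the container starts. A minimal sketch of what it does, using an illustrative address and a stand-in file in place of the image's real /etc/riak path:

```shell
# Sketch of the -name substitution performed at container start.
# NODE_IP and the file path are illustrative, not the image's exact script.
VM_ARGS=vm.args
NODE_IP=10.1.0.4                 # must be reachable by the other nodes

printf '%s\n' '-name riak@127.0.0.1' > "$VM_ARGS"   # stand-in stock file
sed -i "s|^-name .*|-name riak@${NODE_IP}|" "$VM_ARGS"
grep '^-name' "$VM_ARGS"         # -> -name riak@10.1.0.4
```

The key point is only that whatever ends up after the @ must be routable from every other node in the cluster.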

xh3b4sd commented 10 years ago

Oh yeah, I just showed the unchanged content of the vm.args file. Starting Riak using /sbin/my_init --quiet performs that substitution.

Now I see the following. The container of the first riak node has the IP 10.1.0.4, thus -name riak@10.1.0.4. The container of the second riak node has the IP 10.1.0.5, thus -name riak@10.1.0.5.

When both containers are linked natively using the docker link option, they are linked using those IPs. With the ambassador pattern, the linked IPs are not the IPs of the Riak node containers themselves. That may be the issue, but I still don't understand why it should be; maybe I'm missing some knowledge of Riak internals. Anyway, the container of the first Riak node (10.1.0.4) has SEED_PORT_8098_TCP_ADDR=10.1.0.2 for communicating with the second node, and the container of the second Riak node (10.1.0.5) has SEED_PORT_8098_TCP_ADDR=10.1.0.3 for communicating with the first node.

Should 10.1.0.4 have SEED_PORT_8098_TCP_ADDR=10.1.0.5, and should 10.1.0.5 have SEED_PORT_8098_TCP_ADDR=10.1.0.4, to get this to work? If yes, why? Communication is guaranteed, so ... :confused:

xh3b4sd commented 10 years ago

Ok, I guess I understand what the problem is. Riak node two initially communicates with Riak node one using the IP given by the env var SEED_PORT_8098_TCP_ADDR=10.1.0.3. That initial communication is used to join the cluster. Riak node two says: "Hey there, I am 10.1.0.5". Riak node one thinks: "Cool, I'll just join 10.1.0.5". But when Riak node one tries to call Riak node two back, it fails, because 10.1.0.5 is not reachable from it. Instead it would need to use the ambassador's IP 10.1.0.2, and the same the other way around. Does that make sense @hectcastro? Do you see a way to get this scenario to work?
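To spell the failure mode out (addresses taken from the gist; the riak-admin call appears only as a comment, since it is not meant to be run here):

```shell
# riak02 (10.1.0.5) joins through its ambassador for riak01:
#   riak-admin cluster join riak@10.1.0.3
# riak01 accepts, but the name riak02 advertises in the Erlang handshake is
# riak@10.1.0.5 (its own container IP), not the ambassador address 10.1.0.2,
# so the return connection from riak01 is unroutable and the join stalls.
JOIN_TARGET="riak@10.1.0.3"
ADVERTISED="riak@10.1.0.5"
echo "join target (via ambassador): $JOIN_TARGET"
echo "advertised name (unroutable): $ADVERTISED"
```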

Update: screenshot attached (2014-05-02 at 16:10).

hectcastro commented 10 years ago

I spent some time reading through a few strategies for connecting containers across multiple Docker hosts this weekend. I want to mess with a few locally to see which work best with Riak. As soon as I have something working, I'll update this issue.

Thus far, I'm not convinced that the ambassador pattern will work well because of the number of ports inter-Erlang node communication requires (inet_dist_listen_min and inet_dist_listen_max): http://docs.basho.com/riak/latest/ops/advanced/security/
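If one did want to try publishing the distribution ports anyway, the range would have to be pinned down and every port in it mapped explicitly. A sketch of generating such a command (port numbers and layout are illustrative, not this project's actual values):

```shell
# Pin the Erlang distribution port range (set via inet_dist_listen_min /
# inet_dist_listen_max in Riak's app.config) and publish each port in it.
# Port numbers here are illustrative.
MIN=6000
MAX=6010
PORTS=""
for p in $(seq "$MIN" "$MAX"); do
  PORTS="$PORTS -p $p:$p"
done
# epmd (4369) plus the HTTP and Protocol Buffers interfaces need mapping too.
echo "docker run -d -p 4369:4369 -p 8098:8098 -p 8087:8087$PORTS hectcastro/riak"
```

Even a modest range means a long list of mappings per container, which is part of why the ambassador approach gets unwieldy.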

xh3b4sd commented 10 years ago

Ok cool. I am always happy to hear your ideas.

A friend of mine talked about Erlang-based node communication today, and I agree that the ambassador pattern may not be a good choice here. Multi-host support for connected Docker containers is a broader topic in any case, and I am glad to see what you come up with. Thanks!

hectcastro commented 10 years ago

I just wanted to update this issue with some progress (or lack of progress). I created a Vagrant project to test strategies for connecting containers across Docker hosts:

https://github.com/hectcastro/vagrant-multi-docker-riak

Unfortunately, I'm stuck on an issue with an approach that leverages Open vSwitch to bridge the network between Docker hosts. Connectivity between containers across Docker hosts works, but something is preventing the riak-admin cluster join process from completing. I have yet to determine the root cause of that issue.


Below are some details around the riak-admin cluster join RPC errors:

Docker host: docker-us-east-1a

vagrant@docker-us-east-1a:~$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:0c:41:3e brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe0c:413e/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:2b:83:f0 brd ff:ff:ff:ff:ff:ff
    inet 33.33.33.100/24 brd 33.33.33.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe2b:83f0/64 scope link
       valid_lft forever preferred_lft forever
8: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default
    link/ether 72:2c:4a:53:a8:78 brd ff:ff:ff:ff:ff:ff
9: br-int: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UNKNOWN group default
    link/ether 1e:1b:4f:41:98:42 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::88be:5bff:fe0b:3007/64 scope link
       valid_lft forever preferred_lft forever
10: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 56:84:7a:fe:97:99 brd ff:ff:ff:ff:ff:ff
    inet 172.16.42.1/24 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::5484:7aff:fefe:9799/64 scope link
       valid_lft forever preferred_lft forever
12: veth4ST3MW: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master docker0 state UP group default qlen 1000
    link/ether fe:88:85:ea:bb:b8 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc88:85ff:feea:bbb8/64 scope link
       valid_lft forever preferred_lft forever

vagrant@docker-us-east-1a:~$ ip route show
default via 10.0.2.2 dev eth0
10.0.2.0/24 dev eth0  proto kernel  scope link  src 10.0.2.15
33.33.33.0/24 dev eth1  proto kernel  scope link  src 33.33.33.100
172.16.42.0/24 dev docker0  proto kernel  scope link  src 172.16.42.1

vagrant@docker-us-east-1a:~$ sudo iptables-save
# Generated by iptables-save v1.4.21 on Fri May  9 03:28:31 2014
*mangle
:PREROUTING ACCEPT [117199:218801889]
:INPUT ACCEPT [117199:218801889]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [96000:4021273]
:POSTROUTING ACCEPT [96000:4021273]
COMMIT
# Completed on Fri May  9 03:28:31 2014
# Generated by iptables-save v1.4.21 on Fri May  9 03:28:31 2014
*nat
:PREROUTING ACCEPT [17:2714]
:INPUT ACCEPT [17:2714]
:OUTPUT ACCEPT [165:12019]
:POSTROUTING ACCEPT [165:12019]
:DOCKER - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.16.42.0/24 ! -d 172.16.42.0/24 -j MASQUERADE
-A POSTROUTING -s 172.17.0.0/16 ! -d 172.17.0.0/16 -j MASQUERADE
COMMIT
# Completed on Fri May  9 03:28:31 2014
# Generated by iptables-save v1.4.21 on Fri May  9 03:28:31 2014
*filter
:INPUT ACCEPT [102937:180906520]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [82468:3409664]
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
COMMIT
# Completed on Fri May  9 03:28:31 2014

Docker container: riak-us-east-1a

root@c0cfd464a464:/# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
11: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether e6:b1:ae:97:22:3f brd ff:ff:ff:ff:ff:ff
    inet 172.16.42.21/24 brd 172.16.42.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::e4b1:aeff:fe97:223f/64 scope link
       valid_lft forever preferred_lft forever

root@c0cfd464a464:/# ip route show
default via 172.16.42.1 dev eth0
172.16.42.0/24 dev eth0  proto kernel  scope link  src 172.16.42.21

root@c0cfd464a464:/# ping -c 3 172.16.42.23
PING 172.16.42.23 (172.16.42.23) 56(84) bytes of data.
64 bytes from 172.16.42.23: icmp_req=1 ttl=64 time=2.01 ms
64 bytes from 172.16.42.23: icmp_req=2 ttl=64 time=0.483 ms
64 bytes from 172.16.42.23: icmp_req=3 ttl=64 time=0.480 ms

--- 172.16.42.23 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2007ms
rtt min/avg/max/mdev = 0.480/0.993/2.017/0.724 ms

Docker host: docker-us-east-1b

vagrant@docker-us-east-1b:~$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:0c:41:3e brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe0c:413e/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:d2:6c:c5 brd ff:ff:ff:ff:ff:ff
    inet 33.33.33.200/24 brd 33.33.33.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fed2:6cc5/64 scope link
       valid_lft forever preferred_lft forever
8: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default
    link/ether 06:7c:22:f5:c7:2c brd ff:ff:ff:ff:ff:ff
9: br-int: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UNKNOWN group default
    link/ether 96:52:8f:52:a4:40 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::b8f2:58ff:feb0:1ab9/64 scope link
       valid_lft forever preferred_lft forever
10: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 56:84:7a:fe:97:99 brd ff:ff:ff:ff:ff:ff
    inet 172.16.42.2/24 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::5484:7aff:fefe:9799/64 scope link tentative dadfailed
       valid_lft forever preferred_lft forever
12: vethAOT3LG: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master docker0 state UP group default qlen 1000
    link/ether fe:ce:a1:27:35:fd brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fcce:a1ff:fe27:35fd/64 scope link
       valid_lft forever preferred_lft forever

vagrant@docker-us-east-1b:~$ ip route show
default via 10.0.2.2 dev eth0
10.0.2.0/24 dev eth0  proto kernel  scope link  src 10.0.2.15
33.33.33.0/24 dev eth1  proto kernel  scope link  src 33.33.33.200
172.16.42.0/24 dev docker0  proto kernel  scope link  src 172.16.42.2

vagrant@docker-us-east-1b:~$ sudo iptables-save
# Generated by iptables-save v1.4.21 on Fri May  9 03:29:31 2014
*mangle
:PREROUTING ACCEPT [120311:218920423]
:INPUT ACCEPT [120311:218920423]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [99271:4147795]
:POSTROUTING ACCEPT [99271:4147795]
COMMIT
# Completed on Fri May  9 03:29:31 2014
# Generated by iptables-save v1.4.21 on Fri May  9 03:29:31 2014
*nat
:PREROUTING ACCEPT [9:1306]
:INPUT ACCEPT [9:1306]
:OUTPUT ACCEPT [165:12019]
:POSTROUTING ACCEPT [165:12019]
:DOCKER - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.16.42.0/24 ! -d 172.16.42.0/24 -j MASQUERADE
-A POSTROUTING -s 172.17.0.0/16 ! -d 172.17.0.0/16 -j MASQUERADE
COMMIT
# Completed on Fri May  9 03:29:31 2014
# Generated by iptables-save v1.4.21 on Fri May  9 03:29:31 2014
*filter
:INPUT ACCEPT [104109:180959635]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [84137:3471018]
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
COMMIT
# Completed on Fri May  9 03:29:31 2014

Docker container: riak-us-east-1b

root@79374e008b90:/# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
11: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether da:f2:ce:fc:21:a2 brd ff:ff:ff:ff:ff:ff
    inet 172.16.42.23/24 brd 172.16.42.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::d8f2:ceff:fefc:21a2/64 scope link
       valid_lft forever preferred_lft forever

root@79374e008b90:/# ip route show
default via 172.16.42.1 dev eth0
172.16.42.0/24 dev eth0  proto kernel  scope link  src 172.16.42.23

root@79374e008b90:/# ping -c 3 172.16.42.21
PING 172.16.42.21 (172.16.42.21) 56(84) bytes of data.
64 bytes from 172.16.42.21: icmp_req=1 ttl=64 time=3.71 ms
64 bytes from 172.16.42.21: icmp_req=2 ttl=64 time=0.624 ms
64 bytes from 172.16.42.21: icmp_req=3 ttl=64 time=0.468 ms

--- 172.16.42.21 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2008ms
rtt min/avg/max/mdev = 0.468/1.603/3.719/1.497 ms

Attempt to join riak-us-east-1b to riak-us-east-1a

root@79374e008b90:/# riak-admin cluster join riak@172.16.42.21
RPC to 'riak@172.16.42.23' failed: timeout
root@79374e008b90:/# riak-admin status | grep connected
connected_nodes : ['riak@172.16.42.21']
root@79374e008b90:/# riak-admin cluster plan
There are no staged changes
root@79374e008b90:/# riak-admin member-status
================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid     100.0%      --      'riak@172.16.42.23'
-------------------------------------------------------------------------------
Valid:1 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

Also, riak-admin diag does not appear to work within the Docker container. See https://github.com/basho/riaknostic/issues/82 for details.

/cc @trotter

pakfur commented 10 years ago

I have been trying to solve this same issue with no success using Docker < 0.11.1.

However, I was able to easily stand up a cluster across different hosts in EC2 using Docker 0.11.1 [1]. Docker 0.11.1 introduced a new feature called "host mode", where the container shares the host's network interface directly instead of using a bridge. Since the container shares the host interface, networking is (reportedly) faster, and the nodes are able to join the cluster.
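For reference, the host-mode invocation looks roughly like this (the flags and image name are assumptions based on the description above, not verified against a specific Docker version):

```shell
# Host networking: the container shares the host's network stack, so -name
# can use the host's IP directly and the Erlang distribution port range
# needs no explicit publishing. One Riak container per host, by construction.
CMD="docker run -d --net=host --name riak hectcastro/riak"
echo "$CMD"
```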

This comes at a cost though. For example, I am unable to create a shell in the container sandbox, so running riak-admin is out. I enabled Riak Control on all my nodes and was thus able to do the necessary admin work to cluster and manage the nodes, but I would prefer to have access to riak-admin directly.

[1] http://blog.docker.io/2014/05/docker-0-11-release-candidate-for-1-0

-- john kline

/cc @trotter

benjaminbarbe commented 10 years ago

@pakfur Yes, it's a workaround, thanks for sharing. But you are stuck with one Riak container per host, aren't you? That could be problematic for CoreOS management (fleetctl).

@hectcastro Do you have any update on this?

hectcastro commented 10 years ago

Using the following two branches (one from this project and one from vagrant-multi-docker-riak), I was able to wire up containers across two Docker hosts via ambassador containers:

Hopefully the two fig.yml files help outline what's going on, but the main idea is that inet_dist_listen_min and inet_dist_listen_max are set explicitly so that the port range can be exposed across the Docker hosts.

The main reason for leveraging Fig is that it helps lay things out a bit more elegantly than a series of docker run commands.
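A fig.yml along these lines would express the idea; this is an illustrative sketch only (service name, ports, and the pinned range are assumptions), so see the linked branches for the actual files:

```yaml
# Illustrative sketch, not the branch's real fig.yml.
riak:
  image: hectcastro/riak
  ports:
    - "8098:8098"   # HTTP
    - "8087:8087"   # Protocol Buffers
    - "6000:6000"   # first port of the pinned inet_dist range
    - "6001:6001"
    - "6002:6002"   # ...one mapping per port, up to inet_dist_listen_max
```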

grkvlt commented 8 years ago

@hectcastro I am having the exact same issue as you mention in https://github.com/hectcastro/docker-riak/issues/10#issuecomment-42630916, with Riak nodes on a libnetwork-provisioned Calico network spanning multiple hosts, using Docker 1.10.3. I am curious whether you ever worked out what was causing that RPC error?

hectcastro commented 8 years ago

Unfortunately, I did not. It is possible that someone has a better answer on the riak-users mailing list.

vschiavoni commented 8 years ago

Do you know if using docker-swarm would solve this issue? We would like to deploy a Riak cluster with as many as 200 nodes, and this clearly requires spanning multiple hosts.