troglobit / pimd

PIM-SM/SSM multicast routing for UNIX and Linux
http://troglobit.com/projects/pimd/
BSD 3-Clause "New" or "Revised" License
200 stars, 90 forks

Question: PIMD + Docker #70

Closed dpetzel closed 8 years ago

dpetzel commented 8 years ago

Sorry if this is the wrong place for this type of question; I didn't see anything more specific in the README or at http://troglobit.github.io/pimd.html.

We have an existing network of applications that leverage multicast. Our network is maintained by another team and I don't have much visibility into it, but I do know we have historically relied on IGMP snooping. I'll be the first to admit I'm out of my normal element in this space, so if it helps, assume I have a basic understanding of multicast and IGMP, and no knowledge of PIM aside from what I've read over the last couple of days trying to get this to work.

We are moving some of these multicast-dependent applications into Docker. Our Docker hosts consist of a single eth0 attached to upstream switches, and the local docker0 bridge. We are not doing anything with overlay networks at this time. As a byproduct of installing Docker, ip_forwarding is enabled (in case it matters).

We'd like to use pimd to route multicast packets from the upstream network into the local network on the Docker bridge. We've got this mostly working, but I'm hung up at the very end.

I've been using jgroups to test as outlined here: http://linuxproblems.org/wiki/How_to_check_Multicasting.

I believe it's a good sign that I see the packets on eth0, as that suggests PIMD has negotiated (peered?) with the upstream switch and packets are getting routed across the network.

I did find the note in http://troglobit.github.io/multicast-howto.html about disabling multicast_snooping on the docker0 bridge; however, that doesn't appear to have helped.
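
For reference, disabling it per that note was roughly the following (docker0 is my bridge; I believe the same knob also appears under /sys/devices/virtual/net/):

 # turn off IGMP snooping on the Docker bridge
 echo 0 > /sys/class/net/docker0/bridge/multicast_snooping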

To test some basic assumptions, I tried smcroute as well as igmpproxy and had success with both of those tools. However, since they are both limited to 20 groups, and I have a small subset of use cases which might require more groups than that, I'd really like to get PIMD working.

It feels like I may be one minor configuration tweak away from success, but I've hit a brick wall and am hoping someone may have done this already or has some suggestions.

Thanks, and if there is a better place for this type of question I'm happy to post something there.

troglobit commented 8 years ago

Hi @dpetzel, interesting use-case for pimd! I'm a bit lost in the Docker world, and in your setup as well, so an (ASCII) image might have helped. I'm trying one out here:

                                            eth0
>--- MC sender ----{ Network cloud }-------> [ Server host ]     <--- router
                                                    |
                                            ________|________
                                           /     docker0     \   <--- bridge    ______
                                          /         |         \                |      |   <--- MC receiver
                               __________/   Container ship    \_______________|______|_____
                              \                     |                            /         /
                               \                     `------------------>-------'         /
                                \________________________________________________________/

Now, there are many levels of multicast where things go wrong. You seem humble enough and knowledgeable enough about the basics, so we should be good to go :wink:

  1. The TTL of the MC sender must be big enough for the traffic to be passed along the whole path; in the case of jgroups it seems to be 32, which should be sufficient in your case, and that seems reasonable since SMCRoute works. I just always mention the TTL as the number-one issue, since most people stumble on it
  2. The Linux bridge (here docker0) does not always play nice with Layer-2 multicast (IGMP), which you've also taken note of
  3. For pimd to even consider installing a multicast route from eth0 to docker0 it needs: a) To hear the client respond to an IGMP query it sent out (layer-2), and b) To actually have the desired multicast on eth0, or know of an upstream PIM-SM router (rendezvous point) that has it, to which it can send a PIM join (layer-3)

However, multicast routing takes place not on regular interfaces but on "VIFs" ... virtual interfaces. These are unique to multicast and are enumerated when the multicast routing daemon (smcroute/pimd/mrouted) starts up. pimd is a bit picky about allowing an interface to be enumerated as a VIF; it's not just a matter of the MULTICAST interface flag, as one might believe. Check the output of cat /proc/net/ip_mr_vif to make sure a VIF has been created for docker0, as well as eth0.
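
A quick way to check (interface names are from your setup; the phyint lines below are only an example, assuming the usual pimd.conf syntax, of how to force-enable an interface that didn't get a VIF):

 # both eth0 and docker0 should show up as VIFs here
 cat /proc/net/ip_mr_vif

 # if one is missing, enable it explicitly in /etc/pimd.conf, e.g.:
 #   phyint eth0 enable
 #   phyint docker0 enable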

dpetzel commented 8 years ago

You are much more skilled in ASCII art than I am!! Your diagram is spot on. Thank you so much for taking the time to talk through this.

I apologize for not including this earlier: I am using a TTL of 64.

Here is the output of /proc/net/ip_mr_vif

# cat /proc/net/ip_mr_vif
Interface      BytesIn  PktsIn  BytesOut PktsOut Flags Local    Remote
 0 eth0        4041586   34814         0       0 00000 6840C70A 00000000
 1 docker0           0       0   1789217   14660 00000 010017AC 00000000
 2 pimreg            0       0         0       0 00004 6840C70A 00000000

Additionally, I lifted what I believe are the relevant log entries showing it setting up the VIFs. This is during startup:

14:21:18.208 Installing eth0 (10.0.64.104 on subnet 10.0.64/22) as vif #0-2 - rate 0
14:21:18.208 Installing docker0 (172.10.0.1 on subnet 172.10) as vif #1-3 - rate 0
14:21:18.208 Getting vifs from /etc/pimd.conf
14:21:18.208 Local Cand-BSR address 172.10.0.1, priority 5
14:21:18.208 Local Cand-RP address 172.10.0.1, priority 20, interval 30 sec

I see these at random times as pimd is running:

23:26:59.782 accept_group_report(): igmp_src 172.17.0.128 ssm_src 0.0.0.0 group 239.192.12.200 report_type 34
23:26:59.782 Set delete timer for group: 239.192.12.200
23:26:59.782 Adding vif 1 for group 239.192.12.200

a) To hear the client respond to an IGMP query it sent out (layer-2),

I think that is what I'm seeing in the second log snippet? If not, what's the best way for me to confirm or deny that it is happening?

b) To actually have the desired multicast on eth0, or know of an upstream PIM-SM router (rendezvous point) that has it, to which it can send a PIM join (layer-3)

I feel like I've confirmed this is happening via the observed behavior using tcpdump on eth0, since I can see the packets coming in when pimd is running, and nothing when it's not.

Since you've mentioned a lack of familiarity with Docker, I'll toss out that it sets up the following iptables rules. I haven't seen anything to suggest these are at fault, but I just wanted to get you the information in case it matters. I am no iptables wizard by any stretch, so it's entirely possible something is wrong there and I just don't see it. That said, given that smcroute and igmpproxy work, I'm inclined to think the issue is not in the iptables configuration.

# iptables -S
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-N DOCKER
-A FORWARD -o docker0 -j DOCKER 
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT 
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT 
-A FORWARD -i docker0 -o docker0 -j DROP 

# iptables -S -t nat
-P PREROUTING ACCEPT
-P POSTROUTING ACCEPT
-P OUTPUT ACCEPT
-N DOCKER
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER 
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE 
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER 

dpetzel commented 8 years ago

Side question: the main driver for choosing pimd over igmpproxy (or smcroute) was the limit of 20 groups; however, I just stumbled across http://unix.stackexchange.com/questions/23832/is-there-a-way-to-increase-the-20-multicast-group-limit-per-socket which suggests 20 is the default but that it's configurable.

From smcroute:

Only 20 mgroup lines can be configured, this is a HARD kernel maximum. If you need more, you probably need to find another way of forwarding multicast to your router.

So I'm confused as to whether this is really a hard kernel max, or if I'm simply misreading the limitation here. I also found https://groups.google.com/forum/#!topic/linux.kernel/QeiadoMEdWY which implies it may be configurable.
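
For reference, assuming the knob those threads point at is net.ipv4.igmp_max_memberships, checking and raising it would look something like:

 # per-socket join limit, defaults to 20
 sysctl net.ipv4.igmp_max_memberships
 # bump it, e.g. to 100
 sysctl -w net.ipv4.igmp_max_memberships=100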

troglobit commented 8 years ago

Hi again,

does the command pimd -r show the routing tables for you? It should show a route being set up, or at least some useful info. You can also verify which routes are actually written to the kernel in the file /proc/net/ip_mr_cache ... a more readable version can be seen using the ip mroute tool. This latter part (two steps) is also shared with SMCRoute.
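
Roughly, the three places to look are:

 pimd -r                    # pimd's own view of the (S,G)/(*,G) state
 cat /proc/net/ip_mr_cache  # raw kernel MFC entries
 ip mroute                  # the same kernel entries, human readable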

Verifying the IGMP join from your Docker client/receiver is easier to do with Wireshark or tcpdump on the docker0 interface, I guess.

I'm a bit involved in SMCRoute as well, and I can tell you that the 20-group MAX only applies to the daemon acting as a layer-2 client, sending IGMP joins on behalf of the clients on "the other side". If you can direct all multicast to your router using other means, e.g. setting a "router port" or similar on the switches or routers on the eth0 side, then you won't need the SMCRoute mgroup rows. Thanks for the heads-up on that /proc variable, I didn't know about it, so I've just updated the SMCRoute sources a bit! :smiley: :+1:
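
For comparison, an smcroute.conf for your topology would be something like the below (from memory, so double-check against the smcroute man page; 239.192.12.200 is just your test group, and the source address is a placeholder):

 # joins the group on eth0 on behalf of downstream receivers;
 # these are the lines subject to the 20-group limit
 mgroup from eth0 group 239.192.12.200

 # the actual forwarding rule, not subject to that limit
 mroute from eth0 source <sender-ip> group 239.192.12.200 to docker0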

dpetzel commented 8 years ago

I don't really know what to make of it, but I suspect you do. Here is the section for the group in question from pimd -r:

----------------------------------- (S,G) ------------------------------------
----------------------------------- (*,G) ------------------------------------
Source           Group            RP Address       Flags
---------------  ---------------  ---------------  ---------------------------
INADDR_ANY       239.192.12.200   172.17.0.1       WC RP
Joined   oifs: ...                 
Pruned   oifs: ...                 
Leaves   oifs: .l.                 
Asserted oifs: ...                 
Outgoing oifs: .o.                 
Incoming     : ..I                 

TIMERS:  Entry    JP    RS  Assert VIFS:  0  1  2
             0    20     0       0        0  0  0
----------------------------------- (S,G) 

I believe I do see the IGMP join occur in tcpdump:

# tcpdump -s0 -i docker0 -vv -XX igmp

11:11:41.527227 IP (tos 0xc0, ttl 1, id 0, offset 0, flags [DF], proto IGMP (2), length 40, options (RA))
    172.17.0.139 > igmp.mcast.net: igmp v3 report, 1 group record(s) [gaddr 239.192.12.200 to_ex { }]

ip mroute | grep 12.200 comes up dry

Here is the full table (cleaned up a little to remove internal info):

ip mroute
(REMOTE_SENDER1, GROUP_IP1)   Iif: eth0       Oifs: docker0 
(REMOTE_SENDER1, GROUP_IP2)   Iif: eth0       Oifs: docker0 
(REMOTE_SENDER2, GROUP_IP3)    Iif: eth0       Oifs: docker0 
(REMOTE_SENDER3, GROUP_IP4)   Iif: eth0       Oifs: docker0 
(REMOTE_SENDER3, GROUP_IP5)   Iif: eth0       Oifs: docker0 
(REMOTE_SENDER1, GROUP_IP6)   Iif: eth0       Oifs: docker0 
(REMOTE_SENDER4, GROUP_IP7)      Iif: unresolved 
(REMOTE_SENDER5, GROUP_IP8)       Iif: unresolved 
(REMOTE_SENDER6, GROUP_IP9)  Iif: unresolved 
(REMOTE_SENDER7, GROUP_IP10)  Iif: unresolved 
(REMOTE_SENDER8, GROUP_IP11)   Iif: unresolved 
(REMOTE_SENDER9, GROUP_IP8)       Iif: unresolved 
(REMOTE_SENDER10, GROUP_IP9)  Iif: unresolved 
(REMOTE_SENDER11, GROUP_IP7)       Iif: unresolved 
(REMOTE_SENDER3, GROUP_IP12)  Iif: unresolved 
(REMOTE_SENDER12, GROUP_IP9)   Iif: unresolved 

If it's at all helpful...

# sysctl -a | grep mc_forward
net.ipv4.conf.all.mc_forwarding = 1
net.ipv4.conf.default.mc_forwarding = 0
net.ipv4.conf.lo.mc_forwarding = 0
net.ipv4.conf.eth0.mc_forwarding = 1
net.ipv4.conf.docker0.mc_forwarding = 1
net.ipv4.conf.veth#####.mc_forwarding = 0
net.ipv4.conf.pimreg.mc_forwarding = 1
net.ipv4.conf.vethfd#####.mc_forwarding = 0

troglobit commented 8 years ago

Hmm, that's just weird ... there should be a routing rule for the 12.200 group. There's one thing that may screw things up, and that's conntrack. When you use smcroute it knows pre-runtime what rules you want, so before traffic enters the router it already has a multicast route set up. With pimd it takes a while to figure out what receivers exist before it installs a route, so the firewall may drop the incoming traffic. Try flushing conntrack after starting pimd, a few times ...
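
Flushing is just this (assuming the conntrack tool from conntrack-tools is installed):

 # drop all tracked connections so new multicast flows get re-evaluated
 conntrack -F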

It's really difficult to debug issues like this remotely, so I'm sorry that I cannot help you better. If I were in your shoes I'd double-check the TTL on the inbound multicast using tcpdump on eth0.

Maybe I should try out this new fancy docky thingy, there may be something with the docker0 interface that pimd does differently from smcroute, which needs adaptation, dunno ...

dpetzel commented 8 years ago

I totally understand how hard these things can be remotely, and I appreciate the time you have already spent.

Confirmed the incoming TTL

# tcpdump -s0 -i eth0 -vv -XX host 239.192.12.200
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
11:46:21.079540 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 31)
    REMOTE_SENDER.6000 > 239.192.12.200.6000: [udp sum ok] UDP, length 3

Flushing conntrack with conntrack -F a few times (I waited about 60 seconds between flushes) doesn't seem to do the trick.

As a test, I did a service iptables stop and then restarted pimd and my listener. Still no entry listed in ip mroute

Unsure if any of this is useful, but in case it is, I grepped for 12.200 in the output of pimd -d:

19:50:33.359 accept_group_report(): igmp_src 172.17.0.142 ssm_src 0.0.0.0 group 239.192.12.200 report_type 34
19:50:33.359 Set delete timer for group: 239.192.12.200
19:50:33.359 SM group order from  172.17.0.142 (*,239.192.12.200)
19:50:33.359 create group entry, group 239.192.12.200
19:50:46.779 accept_group_report(): igmp_src 172.17.0.142 ssm_src 0.0.0.0 group 239.192.12.200 report_type 34
19:50:46.779 Set delete timer for group: 239.192.12.200
19:50:46.779 create group entry, group 239.192.12.200
19:50:46.779 Adding vif 1 for group 239.192.12.200
INADDR_ANY       239.192.12.200   172.17.0.1       WC RP
INADDR_ANY       239.192.12.200   172.17.0.1       WC RP
INADDR_ANY       239.192.12.200   172.17.0.1       WC RP
19:51:02.225 accept_group_report(): igmp_src 172.17.0.142 ssm_src 0.0.0.0 group 239.192.12.200 report_type 34
19:51:02.225 Set delete timer for group: 239.192.12.200
19:51:02.225 Adding vif 1 for group 239.192.12.200
INADDR_ANY       239.192.12.200   172.17.0.1       WC RP
INADDR_ANY       239.192.12.200   172.17.0.1       WC RP
INADDR_ANY       239.192.12.200   172.17.0.1       WC RP
19:51:20.136 accept_group_report(): igmp_src 172.17.0.142 ssm_src 0.0.0.0 group 239.192.12.200 report_type 34
19:51:20.136 Set delete timer for group: 239.192.12.200
19:51:20.136 Adding vif 1 for group 239.192.12.200
INADDR_ANY       239.192.12.200   172.17.0.1       WC RP

Is there a particular logging statement around creation of the route?

troglobit commented 8 years ago

OK, TTL looks fine ... never hurts to double check that :smirk:

No firewall problems and no conntrack issues, I'm at a loss. I just tested a setup on my laptop, using a multicast sender in Qemu connected to my laptop host on virbr0. I used ping -I eth0 -t 5 225.1.2.3 inside my Qemu guest and then started my own mcjoin -i eth0 225.1.2.3 tool on my host's eth0 (sorry for the confusion with the same name!). When I then started pimd on my host I could see the ICMP frames with tcpdump on my host's eth0 after a short while. The log says:

 Added kernel MFC entry src 192.168.123.110 grp 225.1.2.3 from virbr0 to eth0

My mcjoin tool is a simple IP multicast receiver, https://github.com/troglobit/toolbox/tree/master/mcjoin, and the setup looks like this:

     Laptop host      Qemu sender
 __________________ _______________
|   tcpdump        |               |
|   mcjoin    pimd |               |
|      |           |               |
|      |           |               |
|      V           |               |
|    eth0   virbr0===eth0 <-- ping |
|__________________|_______________|
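
In command form the test was roughly:

 # inside the Qemu guest (sender); TTL 5 so the traffic survives the hop
 ping -I eth0 -t 5 225.1.2.3

 # on the laptop host: join the group, start the router, watch eth0
 mcjoin -i eth0 225.1.2.3
 pimd
 tcpdump -i eth0 host 225.1.2.3
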
dpetzel commented 8 years ago

Very strange indeed. It can clearly create routes, as it did for some of the groups, but not all (specifically not the one I've been testing with). It never logs the MFC entry for this group for some reason ... I've been running strace for a little bit now, hoping something might jump out, but so far nothing useful.

troglobit commented 8 years ago

There are some fixes in the pipeline for the next release, already on master. If you're a brave soul you could try building from the GIT sources:

 git clone https://github.com/troglobit/pimd.git
 cd pimd
 git submodule update --init
 ./configure && make

I'm terribly sorry I cannot be of any more help! :disappointed:

dpetzel commented 8 years ago

Well ... I was brave enough (pimd version 2.3.2-rc2 starting). Sadly it's the same behavior though: it's simply not adding the route entry. It's baffling how some groups are getting routes but this one is not.

No need to apologize, you have been extremely helpful. Even if nothing else comes of this I have learned a ton in the process.

The good news is that, having learned I can do more than 20 groups, igmpproxy works for my use cases, but pimd seems to be a much more actively maintained project.

troglobit commented 8 years ago

Very unfortunate, but thank you for giving it a go anyway! I tried 239.192.12.200 in my setup and there it works, so I really don't know what the problem could be ... unless ...

... maybe the receiver does send a join, but then quickly sends a leave? Dunno, that's a loooong shot, but analyzing the tcpdump log on the host for the IGMP traffic of that group might show something?
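
Something along these lines on the host should show it; watch for an IGMPv3 report with "to_in { }" (or a plain leave) for 239.192.12.200 arriving shortly after the join:

 tcpdump -i docker0 -vv igmp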

troglobit commented 8 years ago

You could give mrouted a try as well, if you've got any strength left, that is. I maintain that too, along with smcroute, and DVMRP (the protocol mrouted uses) is waaay simpler (it has a built-in RIP-like routing protocol) and uses a flood-and-prune approach instead. So it might be better for "world --> docker" deployments.

It's not as polished and capable as pimd, but at least you won't have to mess with static routes.

(Much the same build system as I've set up for pimd.)
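
I.e. roughly this (whether it needs the submodule step like pimd does, I don't recall offhand):

 git clone https://github.com/troglobit/mrouted.git
 cd mrouted
 ./configure && make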

dpetzel commented 8 years ago

I had similar issues when I tried mrouted as well, but I'm probably gonna give it another go with what I've learned from this discussion.

troglobit commented 8 years ago

@dpetzel Hi again! I just couldn't let this one go; it kept nagging me that we couldn't get it working ...

So I fixed up my own tool, verified it outside of Docker, and then used it as a sink for 250 groups in a container. It went without a hitch.

I don't know if you gave up or went with igmpproxy instead. Anyway, this may be too late, and I'll likely close this issue before the next release. Just wanted to let you know.

Cheers!

dpetzel commented 8 years ago

Hey @troglobit, thanks for the follow-up, and apologies for any lost sleep I have caused :(. For right now, igmpproxy is fulfilling 95% of our use cases, but I don't rule out that we'll need to circle back and revisit PIMD, so I really appreciate your write-up. It's good to know that it can work, and that we just have something off in our configuration somewhere. I don't see any reason to keep this open.

troglobit commented 8 years ago

Great to hear back from you, @dpetzel, hope you circle back one day and good luck! :-)