troglobit / mdnsd

Jeremie Miller's original mdnsd
BSD 3-Clause "New" or "Revised" License

Not responding to multicast requests while on link-local network? #14

Closed (thom-nic closed 6 years ago)

thom-nic commented 6 years ago

On a DHCP network, mdnsd is super reliable. However, if I only have a link-local address, I see mdnsd do an initial publish on startup, and after that it never appears to receive any multicast requests (nothing is logged). As a result, I can resolve the .local name for a couple of minutes until the TTL from the initial broadcast expires, then nothing.

My setup: crossover link, direct from device to ethernet port on my laptop. Can ping device from laptop.

Initial startup, everything looks good (Wireshark from my laptop on the left, device shell on the right). I started testing c854aac in this example: [screenshot, 2018-08-10 at 9:35:45 am]

After the TTL expires, I try again: queries but no responses. You can see in the terminal that I can still ping the device by IP address. [screenshot, 2018-08-10 at 9:43:37 am]

Another thing I noticed (only on 0dca088) when on link-local: without the -a option, mdnsd seems to have a hard time figuring out the IP address (for binding the listen socket?)

$ mdnsd -i eth0 -n -l info
mdnsd 0.8-dev starting.
mdnsd_out(): Send Publish PTR: Name: _ssh._tcp.local., rdlen: 0, rdata: (null), rdname: foobar._ssh._tcp.local.
mdnsd_out(): Send Publish PTR: Name: _services._dns-sd._udp.local., rdlen: 0, rdata: (null), rdname: _ssh._tcp.local.
mdnsd_out(): Send Publish PTR: Name: _http._tcp.local., rdlen: 0, rdata: (null), rdname: foobar._http._tcp.local.
mdnsd_out(): Send Publish PTR: Name: _services._dns-sd._udp.local., rdlen: 0, rdata: (null), rdname: _http._tcp.local.
_r_out(): Appending name: foobar.local., type 1 to outbound message ...
Failed writing to socket: Network is unreachable
mdnsd 0.8-dev exiting.

If I give the -p option, it restarts and then publishes the records, but then same behavior after TTL expires:

$ mdnsd -i eth0 -n -l info -p
mdnsd 0.8-dev starting.
mdnsd_out(): Send Publish PTR: Name: _ssh._tcp.local., rdlen: 0, rdata: (null), rdname: foobar._ssh._tcp.local.
mdnsd_out(): Send Publish PTR: Name: _services._dns-sd._udp.local., rdlen: 0, rdata: (null), rdname: _ssh._tcp.local.
mdnsd_out(): Send Publish PTR: Name: _http._tcp.local., rdlen: 0, rdata: (null), rdname: foobar._http._tcp.local.
mdnsd_out(): Send Publish PTR: Name: _services._dns-sd._udp.local., rdlen: 0, rdata: (null), rdname: _http._tcp.local.
_r_out(): Appending name: foobar.local., type 1 to outbound message ...
Failed writing to socket: Network is unreachable
mdnsd 0.8-dev starting.
mdnsd_out(): Send Probing: Name: foobar._ssh._tcp.local., Type: 16
mdnsd_out(): Send Probing: Name: foobar._ssh._tcp.local., Type: 33
mdnsd_out(): Send Probing: Name: foobar._http._tcp.local., Type: 16

.....

If I give -a then it doesn't complain about not binding to the socket, but the behavior is the same as before: it doesn't appear to be receiving any requests. Packets 2-10 are the initial broadcast; the second set are queries from my laptop after the TTL expired. No responses and nothing logged on the device: [screenshot, 2018-08-10 at 10:01:06 am]

On the device:

$ ip a show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq qlen 1000
    link/ether 54:4a:16:bb:00:00 brd ff:ff:ff:ff:ff:ff
    inet 169.254.42.227/16 brd 169.254.255.255 scope link eth0
       valid_lft forever preferred_lft forever
    inet6 2002:4857:5d6d:1:564a:16ff:febb:0/64 scope global dynamic 
       valid_lft 86362sec preferred_lft 86362sec
    inet6 fe80::564a:16ff:febb:0/64 scope link 
       valid_lft forever preferred_lft forever
troglobit commented 6 years ago

Interesting, I did not expect that. I'll have a look at it during the weekend, thank you for a great report! :)

troglobit commented 6 years ago

Progress report:

Is it possible the zeroconf client periodically resets your IP address or something? Or could it be a problem with the routing table? Currently mdnsd sends packets using the standard sendto() API, which lets Linux pick the outbound interface from the routing table. If you don't have a dedicated route for 224.0.0.0/24, the default route will be used, and without either of those routes I don't think it'll work.

This is also what limits mdnsd right now to only run on one interface, see #8 for more on this. I'll try to figure out how the Apple mDNSResponder operates but it's not on the top of my list right now.

Also, I haven't been able to reproduce that other issue, the trouble figuring out the IP address. But there too the routing table matters: mdnsd parses /proc/net/route to find the outbound interface for the default route and then reads that interface's address using getifaddrs().
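
For readers following along, here is a minimal Python sketch of the default-route lookup described above. It is not mdnsd's code (mdnsd is written in C), and the function names and the two sample tables are hypothetical; the tables are modeled on a DHCP setup and a link-local-only setup. It assumes the usual /proc/net/route column layout and a little-endian host for the hex fields.

```python
import socket
import struct

# Hypothetical samples in /proc/net/route format (addresses are hex, little-endian).
DHCP_TABLE = """\
Iface Destination Gateway  Flags RefCnt Use Metric Mask     MTU Window IRTT
eth0  00000000    0101A8C0 0003  0      0   0      00000000 0   0      0
eth0  0000FEA9    00000000 0001  0      0   0      0000FFFF 0   0      0
"""

LINK_LOCAL_TABLE = """\
Iface Destination Gateway  Flags RefCnt Use Metric Mask     MTU Window IRTT
eth0  0000FEA9    00000000 0001  0      0   0      0000FFFF 0   0      0
"""

RTF_UP = 0x0001  # route is usable


def hex_to_ip(field):
    """Convert a /proc/net/route hex field to dotted quad (little-endian host assumed)."""
    return socket.inet_ntoa(struct.pack("<I", int(field, 16)))


def default_route_iface(route_table):
    """Return the interface of the first usable default route, or None if there is none."""
    for line in route_table.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 8:
            continue
        iface, dest, flags, mask = fields[0], fields[1], int(fields[3], 16), fields[7]
        if dest == "00000000" and mask == "00000000" and flags & RTF_UP:
            return iface
    return None
```

With the link-local-only table the lookup comes back empty, which would explain a daemon that relies on the default route struggling to pick an interface address in that setup.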

thom-nic commented 6 years ago

It could certainly be a route issue; I'm using the standard zcip script from BusyBox examples. I'll look at the routing table.

FWIW: inbound ping and SSH work to the device, so TCP and ICMP listen works correctly. As I mentioned before, initial multicast announcements are sent correctly but I don't see any multicast requests received on the device afterwards. Not sure how the routing table might affect that.

I'm using kernel v4.9.59 and no sysctl settings, so whatever the default behavior is for ipv4.

I'll see if I can gather more info on the device side, at least I can verify the listen socket is bound using netstat. And I'll report back with routing details.

troglobit commented 6 years ago

Thanks! :)

I've reproduced something now. I fired up TroglOS, which is BusyBox based, and set up everything manually. With no default route, only the 169.254/16 net route, I get problems similar to the ones you've reported. I'm looking into improving the address and interface detection now and may have some fixes for tomorrow (CET).

Update: as soon as I add either a default route (ip route add default dev eth0) or a net route for IP multicast (ip route add 224.0.0.0/24 dev eth0), everything works. I haven't yet cracked how (or even if!) the Apple mDNSResponder can work without any of these routes.

troglobit commented 6 years ago

OK, I think I've fixed it. Found a way around the routing table that we can also use for #8 (later on).

troglobit commented 6 years ago

@thom-nic Please let me know how it works for you, I'm really curious :)

thom-nic commented 6 years ago

Hey that looks like it worked! I'm seeing packets now from the device when on a link-local net.

For the record, and I assume this won't be surprising, here's my routing table while on DHCP:

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         router.mycompan 0.0.0.0         UG    0      0        0 eth0
169.254.0.0     *               255.255.0.0     U     0      0        0 eth0
192.168.1.0     *               255.255.255.0   U     0      0        0 eth0

and on link-local:

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
169.254.0.0     *               255.255.0.0     U     0      0        0 eth0

So, you were right: there's no default route. Should there be?
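
To illustrate why the two tables above behave differently, here is a small Python sketch that does a longest-prefix lookup for the mDNS IPv4 group address 224.0.0.251 against each set of routes. The route lists are transcribed from the tables above; the function name is hypothetical, and this only models the lookup, not the kernel's full routing logic.

```python
import ipaddress

MDNS_GROUP = ipaddress.ip_address("224.0.0.251")  # mDNS IPv4 multicast group

# Routes transcribed from the two routing tables in this comment.
DHCP_ROUTES = ["0.0.0.0/0", "169.254.0.0/16", "192.168.1.0/24"]
LINK_LOCAL_ROUTES = ["169.254.0.0/16"]


def lookup(dest, routes):
    """Return the most specific route covering dest, or None if nothing matches."""
    matches = [net for net in map(ipaddress.ip_network, routes) if dest in net]
    return max(matches, key=lambda net: net.prefixlen, default=None)


if __name__ == "__main__":
    for name, routes in (("DHCP", DHCP_ROUTES), ("link-local", LINK_LOCAL_ROUTES)):
        print(name, "->", lookup(MDNS_GROUP, routes))
```

On DHCP the default route catches the multicast destination. On link-local nothing matches, which is consistent with the "Network is unreachable" error logged earlier in this issue. Adding 224.0.0.0/24 to the link-local list makes the lookup succeed again.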

troglobit commented 6 years ago

Good to hear, this is exactly what I could see as well, thanks for reporting back! :-)

Now that I've fixed ingress/egress filtering for the multicast socket in mdnsd, you don't need a default route for mdnsd to work. But you might want to have a default route (with a high metric so it acts as a fallback) in your link-local case ... the problem is what your next hop should be ... oh well. Guess it all depends on your other applications.
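
As an illustration only: one common way to make a multicast socket independent of the routing table is to name the interface explicitly, both when joining the group (ingress) and when sending (egress). The Python sketch below shows that general technique with standard socket options. It is an assumption that this resembles the fix; it is not taken from the mdnsd source, and the function names are hypothetical.

```python
import socket

MDNS_GROUP = "224.0.0.251"
MDNS_PORT = 5353


def membership_request(group, ifaddr):
    """Pack a struct ip_mreq: group address followed by the local interface address."""
    return socket.inet_aton(group) + socket.inet_aton(ifaddr)


def open_mdns_socket(ifaddr):
    """Open a UDP socket that receives and sends mDNS multicast on one interface.

    ifaddr is the IPv4 address of the interface, e.g. "169.254.42.227".
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", MDNS_PORT))
    # Ingress: join the group on the given interface rather than the one
    # the routing table would select.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(MDNS_GROUP, ifaddr))
    # Egress: send multicast out of the same interface.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF,
                    socket.inet_aton(ifaddr))
    return sock
```

Usage would be along the lines of open_mdns_socket("169.254.42.227") followed by sendto() to ("224.0.0.251", 5353). Whether this is enough without any multicast or default route depends on the platform.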

thom-nic commented 6 years ago

You know, I wanted to ask what you were seeing on that Debian VM, where it was working. I'm happy to modify the zcip script if it will potentially solve other networking issues.

Thinking about it: is there supposed to be a "next hop" in a zeroconf network? Or is it by definition just one big subnet not connected to anything else?

troglobit commented 6 years ago

They use an interface route, which might help some use-cases, like this (notice the metric):

example@stretch:~$ ip -br addr
lo               UNKNOWN        127.0.0.1/8 
ens3             UP             169.254.5.91/16 
example@stretch:~$ ip -br route
default dev ens3 scope link metric 1002 
169.254.0.0/16 dev ens3 proto kernel scope link src 169.254.5.91 

Regarding your question, I think it's just supposed to be link-local to the given LAN. At work we do industrial switches + routers and we've been toying with the idea of having our multi-homed devices run link-local with source-routing set up for each such interface.

thom-nic commented 6 years ago

So, this is working great - I've been playing with it for a few days; I just noticed two things...

I thought maybe I could run mdnsd like other network daemons (dropbear, httpd, ntpd) and just start it up before udhcpc or zcip have assigned any IP address. However, in that case mdnsd does its initial broadcast with an address of 0.0.0.0. At the very least this sort of "poisons" the mDNS resolver cache until the TTL expires. But subsequent announcements don't update the IP address even after I have valid addresses assigned to the interface: I'm still seeing 0.0.0.0 advertised after watching Wireshark for a few minutes. FWIW, in this case I'm just running mdnsd -p during init, and not restarting the daemon at any point afterwards.

Second... I have both zcip and udhcpc triggered from ifplugd events. It seems, depending on who assigns an IP last, that is the address chosen by mdnsd. (That is, assuming I'm not in the situation described above, and mdnsd is started after zcip and udhcpc have done their thing.) So it's a race; if udhcpc assigns an address first, then zcip, things look like this:

$ ip addr show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq qlen 1000
    link/ether 54:4a:16:bb:00:00 brd ff:ff:ff:ff:ff:ff
    inet 169.254.42.227/16 brd 169.254.255.255 scope link eth0
       valid_lft forever preferred_lft forever
    inet 192.168.1.178/24 brd 192.168.1.255 scope global eth0
       valid_lft forever preferred_lft forever

In this case, it appears that 169.254.42.227 is the address advertised by mdnsd. One might argue the DHCP address should be used instead (maybe the selection criteria should prefer a global over a link-local address?). One could also argue that mDNS goes hand in hand with ZeroConf, so maybe it's natural to choose the LL address.
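
A minimal Python sketch of the "prefer a global over a link-local address" policy floated above, which also skips 0.0.0.0 so an unconfigured interface is never advertised. This is a hypothetical illustration of the idea, not how mdnsd selects its address, and pick_address is an invented name.

```python
import ipaddress


def pick_address(candidates):
    """Pick the address to advertise from a list of IPv4 address strings.

    Unspecified (0.0.0.0) and loopback addresses are never advertised.
    Among the rest, a non-link-local address wins over a 169.254/16 one;
    ties keep the order in which the addresses were listed.
    """
    addrs = [ipaddress.ip_address(a) for a in candidates]
    usable = [a for a in addrs if not a.is_unspecified and not a.is_loopback]
    if not usable:
        return None
    # False sorts before True, so min() returns the first non-link-local address.
    return str(min(usable, key=lambda a: a.is_link_local))
```

With the two addresses shown on eth0 above, this would pick 192.168.1.178; with only the zcip address assigned it falls back to 169.254.42.227, and with nothing but 0.0.0.0 it returns None so the caller could hold off announcing.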

As an aside, I also did what you suggested previously, and added an ifplugd script that assigns a high-metric default route when an interface comes up.

Now: that all said, I'm happy to keep my setup where mdnsd [re]starts are triggered by zcip or udhcpc, in which case if I care about preferring the DHCP-assigned address I can use the -a flag. So, I think none of the above is really an issue that I can't work around anymore. Just wanted to point out those observations.

Overall, I'm thrilled with mdnsd and your support has been 💯 💯 💯 ! Thanks again.

troglobit commented 6 years ago

I'm so happy to hear back from you, I've been thinking about you. Thank you for the kind words! 😃

I've just opened #16 with your new observations, I hope I didn't miss any vital piece of information. If I did, please don't hesitate to add to that issue. I agree fully with your findings. The address selection criteria you propose seem obvious, i.e. prefer a global address over link-local. Possibly there is a way to announce all (IPv4) addresses, but I don't know yet.

Great to hear you're making strides with your setup! I've been meaning to use ifplugd more myself for some embedded targets of mine :)

I think it's maybe time to release v0.8, as soon as I've addressed #16, because it sounds like you can live without #15, right?

thom-nic commented 6 years ago

Hehe, actually I didn't notice #15. Since this issue has been fixed, mdnsd has done a good job of proactively announcing, so the A record stays fresh in my OS's resolver cache. (And since I start up mdnsd via ifplugd, it always does a broadcast after the link comes up!)

I could see #15 being an issue if the device was connected to a switch w/o DHCP and the user's laptop was subsequently connected to the switch: it would have missed the initial mdnsd announce. But that's not an expected usage pattern 98% of the time, so no, I don't think it's a blocker.

#16 description looks good otherwise though!

troglobit commented 6 years ago

Good, then we agree on #15, removing from v0.8 release target :)

Thanks, I'll get to work on #16 as soon as possible. Likely tomorrow.

Cheers!