keepsimple1 / mdns-sd

Rust library for mDNS based Service Discovery
Apache License 2.0

Some weird caching behavior #213

Closed hrzlgnm closed 4 months ago

hrzlgnm commented 4 months ago

I have the following setup:

Let's say we have some sort of redundancy setup with two network interfaces in different subnets. On both interfaces we publish the same instance name, but a different hostname is used on each interface to avoid collisions.
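
A minimal sketch of what such a publishing side might look like with mdns-sd (the hostnames, addresses, port, properties, and the use of two separate daemons are illustrative assumptions, not the exact code from this setup):

```rust
use mdns_sd::{ServiceDaemon, ServiceInfo};

fn main() {
    // One daemon per publisher; in a real redundancy setup these would
    // typically be two independent processes, each additionally limited
    // to its own network interface.
    let daemon_a = ServiceDaemon::new().expect("failed to create daemon A");
    let daemon_b = ServiceDaemon::new().expect("failed to create daemon B");

    let service_type = "_test._tcp.local.";
    let instance_name = "thunk-void-vm"; // same instance name on both sides

    // Hostnames and addresses are made up; each side uses a unique hostname
    // and an address from its own subnet.
    let props_a = [("iface", "a")];
    let info_a = ServiceInfo::new(
        service_type,
        instance_name,
        "thunk-void-vm-aaaaaaaa.local.",
        "192.168.100.140",
        4223,
        &props_a[..],
    )
    .expect("valid service info A");

    let props_b = [("iface", "b")];
    let info_b = ServiceInfo::new(
        service_type,
        instance_name,
        "thunk-void-vm-bbbbbbbb.local.",
        "192.168.122.79",
        4223,
        &props_b[..],
    )
    .expect("valid service info B");

    daemon_a.register(info_a).expect("failed to register on A");
    daemon_b.register(info_b).expect("failed to register on B");

    // Keep the services published for a while.
    std::thread::sleep(std::time::Duration::from_secs(60));
}
```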

When resolving those using cargo run --example=query _test._tcp, one gets both services resolved when running on the same machine. But here comes the catch: if one restarts the browse on the same mdns-daemon instance with the same service type _test._tcp, only one of the services gets resolved.

See #214, where I modified the query example to do that.
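
Roughly, the modified example does the following (a sketch only, not the exact diff from #214; the service type and the timing constants are assumptions):

```rust
use mdns_sd::{ServiceDaemon, ServiceEvent};
use std::time::{Duration, Instant};

fn main() {
    let service_type = "_test._tcp.local.";
    let daemon = ServiceDaemon::new().expect("failed to create daemon");

    // Run two browse cycles on the *same* daemon instance.
    for cycle in 1..=2 {
        println!("---receive cycle {cycle}---");
        let receiver = daemon.browse(service_type).expect("failed to browse");

        // Drain events for a few seconds.
        let deadline = Instant::now() + Duration::from_secs(5);
        loop {
            let remaining = deadline.saturating_duration_since(Instant::now());
            match receiver.recv_timeout(remaining) {
                Ok(ServiceEvent::ServiceResolved(info)) => println!(
                    "Resolved {}: host {} addresses {:?}",
                    info.get_fullname(),
                    info.get_hostname(),
                    info.get_addresses()
                ),
                Ok(other) => println!("{other:?}"),
                Err(_) => break, // timeout: end of this receive cycle
            }
        }

        // Stop this browse before starting the next cycle.
        daemon.stop_browse(service_type).expect("failed to stop browse");
    }
}
```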

hrzlgnm commented 4 months ago

Example output of the modified query example from #214

---first receive cycle---
At 722.976µs : SearchStarted("_test._tcp.local. on addrs [192.168.122.79, 192.168.100.140, 192.168.42.9, fe80::67ff:49ff:c2b5:d079, fe80::b6ec:dfe:2f30:750f, fe80::d50c:ccac:7d50:ccb2]")
At 39.257001ms : ServiceFound("_test._tcp.local.", "thunk-void-vm._test._tcp.local.")
At 39.332082ms: Resolved a new service: thunk-void-vm._test._tcp.local.
 host: thunk-void-vm-4f5eea3e-318c-4c21-a648-13ffeee510e8.local.
 port: 4223
 Address: 192.168.100.140
At 49.996074ms: Resolved a new service: thunk-void-vm._test._tcp.local.
 host: thunk-void-vm-9e6bc890-df28-444a-8515-bf785ecf4bf6.local.
 port: 4223
 Address: 192.168.122.79
---second receive cycle---
At 7.003371126s : SearchStarted("_test._tcp.local. on addrs [192.168.122.79, 192.168.100.140, 192.168.42.9, fe80::67ff:49ff:c2b5:d079, fe80::b6ec:dfe:2f30:750f, fe80::d50c:ccac:7d50:ccb2]")
At 7.0033837s : ServiceFound("_test._tcp.local.", "thunk-void-vm._test._tcp.local.")
At 7.003387887s: Resolved a new service: thunk-void-vm._test._tcp.local.
 host: thunk-void-vm-9e6bc890-df28-444a-8515-bf785ecf4bf6.local.
 port: 4223
 Address: 192.168.122.79
---
hrzlgnm commented 4 months ago

The second resolved service seems to overwrite the first one after being resolved. I've also seen cases where the order was the other way around; whichever came second won.

But I'm also not sure whether the behavior of the publishing side is correct according to the relevant standards.

hrzlgnm commented 4 months ago

I played around with https://gitlab.com/hrzlgnm/m/-/blob/master/zerodings/resolve.py?ref_type=heads and Python zeroconf also seems to update the first entry when it sees the second one:

Service thunk-void-vm._test._tcp.local. added, service info: ServiceInfo(type='_test._tcp.local.', name='thunk-void-vm._test._tcp.local.', addresses=[b'\xc0\xa8zO'], port=4223, weight=0, priority=0, server='thunk-void-vm-9e6bc890-df28-444a-8515-bf785ecf4bf6.local.', properties={}, interface_index=None)
Service thunk-void-vm._test._tcp.local. updated ServiceInfo(type='_test._tcp.local.', name='thunk-void-vm._test._tcp.local.', addresses=[b'\xc0\xa8d\x8c'], port=4223, weight=0, priority=0, server='thunk-void-vm-4f5eea3e-318c-4c21-a648-13ffeee510e8.local.', properties={}, interface_index=None)
hrzlgnm commented 4 months ago

Using avahi-browse -rp _test._tcp, one can also observe the same behavior: one of the records overwrites the other in the cache.

First resolve after a fresh restart of avahi-daemon:
----------------------------------------------------------
+;virbr0;IPv4;thunk-void-vm;_test._tcp;local
+;virbr1;IPv4;thunk-void-vm;_test._tcp;local
=;virbr0;IPv4;thunk-void-vm;_test._tcp;local;thunk-void-vm-f1715f24-d718-44ec-871b-abe05104215f.local;192.168.122.79;4223;
=;virbr1;IPv4;thunk-void-vm;_test._tcp;local;thunk-void-vm-f1715f24-d718-44ec-871b-abe05104215f.local;192.168.122.79;4223;
----------------------------------------------------------
From there on, I could only see:
----------------------------------------------------------
+;virbr1;IPv4;thunk-void-vm;_test._tcp;local
+;virbr0;IPv4;thunk-void-vm;_test._tcp;local
=;virbr1;IPv4;thunk-void-vm;_test._tcp;local;thunk-void-vm-52143363-5599-49f3-a6ca-c53b62e0e6e2.local;192.168.100.140;4223;
=;virbr0;IPv4;thunk-void-vm;_test._tcp;local;thunk-void-vm-52143363-5599-49f3-a6ca-c53b62e0e6e2.local;192.168.100.140;4223;
----------------------------------------------------------
hrzlgnm commented 4 months ago

I thought about the "publishing" and "redundancy" approach a bit more and came to the conclusion that the publishing approach is probably wrong. It is not possible to publish the same instance name on both network interfaces without causing collisions or cache overwrites.

keepsimple1 commented 4 months ago

When resolving those using cargo run --example=query _test._tcp, one gets both services resolved when running on the same machine.

Although it looks like both services are resolved, they are actually two events from the same cache. As the ServiceInfo struct only contains / supports one hostname, the 2nd resolve overwrites the first one. (In this sense it is the same as / similar to Python zeroconf or avahi.)

But here comes the catch: if one restarts the browse on the same mdns-daemon instance with the same service type _test._tcp, only one of the services gets resolved.

When searching again, the cache is still "hot" (not expired), so the service is resolved immediately from the cache, with the 2nd instance's info.

That said, internally we actually store both records in the vec, but only the first entry is used. And when a new record is received, it is inserted at the head of the vec, hence it effectively overwrites the hostname in the ServiceInfo.
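
To illustrate the effect (this is a toy model of the behavior described above, not the library's actual data structures):

```rust
// Toy model only: new records go to the head of a Vec and resolution
// consults the first matching entry, so the most recent record wins.
#[derive(Debug, Clone)]
struct SrvRecord {
    fullname: String, // e.g. "thunk-void-vm._test._tcp.local."
    host: String,     // SRV target hostname
    port: u16,
}

#[derive(Default)]
struct Cache {
    srv: Vec<SrvRecord>,
}

impl Cache {
    /// A newly received record is inserted at the head of the vec.
    fn add(&mut self, record: SrvRecord) {
        self.srv.insert(0, record);
    }

    /// Resolution only uses the first matching entry, i.e. the newest one.
    fn resolve(&self, fullname: &str) -> Option<&SrvRecord> {
        self.srv.iter().find(|r| r.fullname == fullname)
    }
}

fn main() {
    let name = "thunk-void-vm._test._tcp.local.";
    let mut cache = Cache::default();
    cache.add(SrvRecord { fullname: name.into(), host: "host-a.local.".into(), port: 4223 });
    cache.add(SrvRecord { fullname: name.into(), host: "host-b.local.".into(), port: 4223 });

    // Both records are stored, but only "host-b.local." is ever reported.
    println!("{:?}", cache.resolve(name));
}
```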

keepsimple1 commented 4 months ago

It is not possible to publish the same instance name on both network interfaces without causing collisions or cache overwrites.

I tend to agree. From what I have seen, one instance always maps to one host (not multiple hosts). Not sure if there are exceptions. (And when considering load balancing, I think most of the time multiple IP addrs are used for one host / instance.)

hrzlgnm commented 4 months ago

Thanks for your input @keepsimple1. I'm closing the issue, as it isn't really a bug. I think it makes more sense to publish the record with multiple IP addrs.
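
For reference, a sketch of that alternative: a single registration for the instance, with one hostname, letting the daemon announce the addresses of all local interfaces via enable_addr_auto() (names, properties, and port are made up):

```rust
use mdns_sd::{ServiceDaemon, ServiceInfo};

fn main() {
    let daemon = ServiceDaemon::new().expect("failed to create daemon");

    // One instance, one hostname, one registration. With addr_auto enabled
    // the daemon fills in the addresses of all local interfaces; known
    // addresses could also be passed explicitly instead of "".
    let props = [("version", "1")];
    let service = ServiceInfo::new(
        "_test._tcp.local.",
        "thunk-void-vm",
        "thunk-void-vm.local.",
        "",
        4223,
        &props[..],
    )
    .expect("valid service info")
    .enable_addr_auto();

    daemon.register(service).expect("failed to register");

    // Keep the service published for a while.
    std::thread::sleep(std::time::Duration::from_secs(60));
}
```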