keepsimple1 / mdns-sd

Rust library for mDNS based Service Discovery
Apache License 2.0

Having a browse running for a longer time, services with a ttl of 60 are sort of flapping #216

Closed hrzlgnm closed 1 month ago

hrzlgnm commented 1 month ago

I have the following publishing script, which sets all TTLs to 60.

Running the query example against it, the service seems to be sort of flapping.

Here is what I can observe using `cargo run --example=query _test._tcp`:

At 232.17µs : SearchStarted("_test._tcp.local. on addrs [192.168.30.128, 2003:e8:bf3e:3f00:9caf:6c11:f93d:47bf, fe80::4ae2:8b49:d234:7a45, fe80::f3fc:8e58:38e3:ff2b, 192.168.178.84]")
At 52.022944ms : ServiceFound("_test._tcp.local.", "void-void-vm._test._tcp.local.")
At 52.243274ms: Resolved a new service: void-void-vm._test._tcp.local.
 host: void-void-vm.local.
 port: 4223
 Address: 192.168.178.84
At 1.002199044s : SearchStarted("_test._tcp.local. on addrs [192.168.30.128, 2003:e8:bf3e:3f00:9caf:6c11:f93d:47bf, fe80::4ae2:8b49:d234:7a45, fe80::f3fc:8e58:38e3:ff2b, 192.168.178.84]")
At 3.00448952s : SearchStarted("_test._tcp.local. on addrs [192.168.30.128, 2003:e8:bf3e:3f00:9caf:6c11:f93d:47bf, fe80::4ae2:8b49:d234:7a45, fe80::f3fc:8e58:38e3:ff2b, 192.168.178.84]")
At 7.013818528s : SearchStarted("_test._tcp.local. on addrs [192.168.30.128, 2003:e8:bf3e:3f00:9caf:6c11:f93d:47bf, fe80::4ae2:8b49:d234:7a45, fe80::f3fc:8e58:38e3:ff2b, 192.168.178.84]")
At 15.018062584s : SearchStarted("_test._tcp.local. on addrs [192.168.30.128, 2003:e8:bf3e:3f00:9caf:6c11:f93d:47bf, fe80::4ae2:8b49:d234:7a45, fe80::f3fc:8e58:38e3:ff2b, 192.168.178.84]")
At 31.023905263s : SearchStarted("_test._tcp.local. on addrs [192.168.30.128, 2003:e8:bf3e:3f00:9caf:6c11:f93d:47bf, fe80::4ae2:8b49:d234:7a45, fe80::f3fc:8e58:38e3:ff2b, 192.168.178.84]")
At 63.026812518s : SearchStarted("_test._tcp.local. on addrs [192.168.30.128, 2003:e8:bf3e:3f00:9caf:6c11:f93d:47bf, fe80::4ae2:8b49:d234:7a45, fe80::f3fc:8e58:38e3:ff2b, 192.168.178.84]")
At 127.034710079s : SearchStarted("_test._tcp.local. on addrs [192.168.30.128, 2003:e8:bf3e:3f00:9caf:6c11:f93d:47bf, fe80::4ae2:8b49:d234:7a45, fe80::f3fc:8e58:38e3:ff2b, 192.168.178.84]")
At 187.103158981s : ServiceRemoved("_test._tcp.local.", "void-void-vm._test._tcp.local.")
At 255.039446937s : SearchStarted("_test._tcp.local. on addrs [192.168.30.128, 2003:e8:bf3e:3f00:9caf:6c11:f93d:47bf, fe80::4ae2:8b49:d234:7a45, fe80::f3fc:8e58:38e3:ff2b, 192.168.178.84]")
At 255.113361982s : ServiceFound("_test._tcp.local.", "void-void-vm._test._tcp.local.")
At 255.113391995s: Resolved a new service: void-void-vm._test._tcp.local.
 host: void-void-vm.local.
 port: 4223
 Address: 192.168.178.84
At 315.113645282s : ServiceRemoved("_test._tcp.local.", "void-void-vm._test._tcp.local.")

After 187 seconds, the service is removed, despite still being published. It is then found again after 255s and removed again after 315s.
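For context, browsing with mdns-sd boils down to the standard browse loop; a minimal sketch along the lines of the query example (not the exact examples/query.rs code, using the `_test._tcp` type from this report) could look like this:

```rust
use mdns_sd::{ServiceDaemon, ServiceEvent};

fn main() {
    // Create the daemon and browse for the test service type.
    let mdns = ServiceDaemon::new().expect("failed to create mdns daemon");
    let receiver = mdns.browse("_test._tcp.local.").expect("failed to browse");

    // Print every event as it arrives, similar to the log output above.
    while let Ok(event) = receiver.recv() {
        match event {
            ServiceEvent::ServiceResolved(info) => println!(
                "Resolved a new service: {} host: {} port: {}",
                info.get_fullname(),
                info.get_hostname(),
                info.get_port()
            ),
            other => println!("{:?}", other),
        }
    }
}
```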

hrzlgnm commented 1 month ago

I cannot observe the service being removed and added when I run `avahi-browse -pr _test._tcp`.

keepsimple1 commented 1 month ago

Are there any other instances of the same service type on any of the other interfaces? Particularly in the network of 192.168.30.128?

Also, could you try turning on logging with RUST_LOG=debug when running the querier example, to get the logs before the ServiceRemoved event?

keepsimple1 commented 1 month ago

I was able to reproduce this issue. I think the root cause is that we are not refreshing PTR DNS records, hence after a while the PTR records expire and we remove the services.

The reason the problem did not show up earlier is that we have a query timer with exponential backoff (1s, 2s, 4s, ...), and these repeated queries keep updating the records. But eventually the gap between queries becomes longer than the 60-second TTL, and then the PTR record expires without being refreshed.
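To make the gap argument concrete, here is a small standalone illustration (not mdns-sd code) of that repeat-query schedule versus a 60-second TTL. Once the gap between two consecutive queries exceeds the TTL, a record that is only refreshed by the responses to those queries expires in between; in the log above, the last refreshing query fires around 127s and the service is removed at ~187s, roughly one TTL later, well before the next query at ~255s.

```rust
fn main() {
    let ttl = 60u64; // seconds, as set by the publishing script
    let mut query_at = 1u64; // queries fire at roughly 1s, 3s, 7s, 15s, 31s, 63s, 127s, 255s, ...
    let mut gap = 2u64; // time until the next query, doubling each round
    while query_at <= 255 {
        let note = if gap > ttl { "  <-- gap exceeds the TTL" } else { "" };
        println!("query at {}s, next query {}s later{}", query_at, gap, note);
        query_at += gap;
        gap *= 2;
    }
}
```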

I've opened PR #217 to fix this issue. Please check if that helps. I will try to come up with some test cases later.

Thanks for reporting this issue!

keepsimple1 commented 1 month ago

With longer testing, it seems the problem is not completely gone. I will debug it further tomorrow.

hrzlgnm commented 1 month ago

> Are there any other instances of the same service type on any of the other interfaces? Particularly in the network of 192.168.30.128?

No, there is no other instance of the same service type on any of the other interfaces. And I cannot observe the issue when there are at least two services of the same type on the network.

> Also, could you try turning on logging with RUST_LOG=debug when running the querier example, to get the logs before the ServiceRemoved event?

Are the logs still needed, as you can reproduce the issue now?

hrzlgnm commented 1 month ago

> I've opened PR #217 to fix this issue. Please check if that helps. I will try to come up with some test cases later.

I've checked the fix from #217 and cannot observe the issue anymore when running the querier example, at least for as long as it took for the message `At 1023.057225766s : SearchStarted("_test._tcp.local...` to be logged.

hrzlgnm commented 1 month ago

Additionally, I also tested stopping the publishing Python script and restarting it, which has an interesting effect:

After quitting and restarting publish3.py, the service is Resolved twice in immediate succession.

At 8451.632279173s : ServiceFound("_test._tcp.local.", "void-void-vm._test._tcp.local.")
At 8451.632511101s: Resolved a new service: void-void-vm._test._tcp.local.
 host: void-void-vm.local.
 port: 4223
 Address: 192.168.178.84
At 8451.644626053s: Resolved a new service: void-void-vm._test._tcp.local.
 host: void-void-vm.local.
 port: 4223
 Address: 192.168.178.84

A log with debug enabled showing this case, where publish3.py (with TTLs reduced to 10) was restarted 3 times and stopped at the end: restarts-of-publish-log2.log

keepsimple1 commented 1 month ago

From your log, it looks like it was because an address record was received with cache_flush = false (ttl = 60), whereas the previous records had cache_flush = true (ttl = 10). Such a record is considered "new" by our current code, as the DnsEntry struct is different if cache_flush changes: https://github.com/keepsimple1/mdns-sd/blob/5eb74b508d602ba2181029c6d6cb434d8dc540a5/src/dns_parser.rs#L64-L68
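For reference, the struct looks roughly like the following simplified sketch (field types are guessed from the debug output below; see the linked dns_parser.rs lines for the real definition). Because cache_flush takes part in the derived equality, an address record that arrives with a different cache_flush flag compares as a different entry and is treated as "new":

```rust
// Simplified sketch, not the actual mdns-sd definition.
#[derive(Debug, Clone, PartialEq)]
struct DnsEntry {
    name: String,      // e.g. "void-void-vm.local."
    ty: u16,           // record type, 1 = A
    class: u16,        // 1 = IN
    cache_flush: bool, // the mDNS cache-flush bit carried in the class field
}
```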

Here are all the Resolved events:

$ grep Resolved restarts-of-publish-log2.log                      
At 8.277988737s: Resolved a new service: void-void-vm._test._tcp.local.
At 8.282031276s: Resolved a new service: void-void-vm._test._tcp.local.
At 13.989534257s: Resolved a new service: void-void-vm._test._tcp.local.
At 20.537252221s: Resolved a new service: void-void-vm._test._tcp.local.
At 20.544644141s: Resolved a new service: void-void-vm._test._tcp.local.
At 30.538414974s: Resolved a new service: void-void-vm._test._tcp.local.
At 178.756557312s: Resolved a new service: void-void-vm._test._tcp.local.
At 178.760520066s: Resolved a new service: void-void-vm._test._tcp.local.
At 182.658263946s: Resolved a new service: void-void-vm._test._tcp.local.

For every pair of immediate Resolved events, the 2nd one seems to be caused by a "new" DnsAddress record. For example:

[2024-05-12T13:43:57.456Z DEBUG mdns_sd::dns_parser] DnsAddress { record: DnsRecord { entry: DnsEntry { name: "void-void-vm.local.", ty: 1, class: 1, cache_flush: false }, ttl: 60, created: 1715521437456, expires: 1715521497456, refresh: 1715521485456 }, address: 192.168.178.84 }
At 8.282031276s: Resolved a new service: void-void-vm._test._tcp.local.

Any ideas why some of the DnsAddress records have TTL 60 and cache_flush false?

As a solution, I think we could add logic such that if cache_flush is false, we append the record to the end of the Vec for the same record name.
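A rough sketch of that idea (a hypothetical helper with made-up types, not the actual cache code; the cache-flush branch that replaces existing entries is my reading of the usual mDNS cache-flush semantics, not part of the suggestion above):

```rust
use std::collections::HashMap;
use std::net::IpAddr;

// Hypothetical stand-in for an address record in the cache.
#[derive(Debug, Clone)]
struct AddrRecord {
    name: String,
    address: IpAddr,
    ttl: u32,
    cache_flush: bool,
}

// If the record carries the cache-flush bit, replace what we have for
// that name (assumption, per usual mDNS semantics); otherwise append it
// to the end of the Vec for the same record name, as suggested above.
fn insert_address(cache: &mut HashMap<String, Vec<AddrRecord>>, record: AddrRecord) {
    let entries = cache.entry(record.name.clone()).or_default();
    if record.cache_flush {
        entries.clear();
    }
    entries.push(record);
}
```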

Btw, I think this is a separate issue from the original one, so we can address it in a separate PR.

keepsimple1 commented 1 month ago

> With longer testing, it seems the problem is not completely gone. I will debug it further tomorrow.

I debugged it a bit more. It looks like it was because I forgot to add new refresh timers when a refresh time is further updated to 85%, 90%, or 95% of the TTL. I updated the PR to fix that. My testing looks good. If you could also test the updated patch, that would be great!
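For illustration, the schedule being described is the usual mDNS refresh pattern: a cached record is re-queried at 80%, 85%, 90%, and 95% of its TTL (the debug line earlier shows refresh = created + 48s, i.e. 80% of the 60s TTL), and each time a refresh fires the next percentage point needs its own timer. A hedged sketch of just the arithmetic, not the PR's actual code:

```rust
// Refresh points at 80%, 85%, 90% and 95% of the TTL, relative to the
// record's creation time (times in milliseconds).
fn refresh_times_millis(created_millis: u64, ttl_secs: u64) -> [u64; 4] {
    let ttl_millis = ttl_secs * 1000;
    [80u64, 85, 90, 95].map(|pct| created_millis + ttl_millis * pct / 100)
}

fn main() {
    // A record created at t=0 with TTL 60s should be refreshed at
    // 48s, 51s, 54s and 57s, and expires at 60s if no refresh succeeds.
    println!("{:?}", refresh_times_millis(0, 60)); // [48000, 51000, 54000, 57000]
}
```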

hrzlgnm commented 1 month ago

> I debugged it a bit more. It looks like it was because I forgot to add new refresh timers when a refresh time is further updated to 85%, 90%, or 95% of the TTL. I updated the PR to fix that. My testing looks good. If you could also test the updated patch, that would be great!

I tested the updated patch and it looks good.

hrzlgnm commented 1 month ago

> From your log, it looks like it was because an address record was received with cache_flush = false (ttl = 60), whereas the previous records had cache_flush = true (ttl = 10). Such a record is considered "new" by our current code, as the DnsEntry struct is different if cache_flush changes: https://github.com/keepsimple1/mdns-sd/blob/5eb74b508d602ba2181029c6d6cb434d8dc540a5/src/dns_parser.rs#L64-L68

> Any ideas why some of the DnsAddress records have TTL 60 and cache_flush false?

I've redone the test, and I can indeed observe the record with ttl=60 being sent, despite using a ttl of 10. I guess there was another mDNS responder like avahi involved, which saw the same service and cached the entry for another interface with a ttl=60 before. I'll investigate further later.

> As a solution, I think we could add logic such that if cache_flush is false, we append the record to the end of the Vec for the same record name. Btw, I think this is a separate issue from the original one, so we can address it in a separate PR.

Yes, let's move it to another issue. I'll open a separate one after investigating further why this is happening in my setup.

hrzlgnm commented 1 month ago

> Yes, let's move it to another issue. I'll open a separate one after investigating further why this is happening in my setup.

I've checked this again, and it was just wrong usage of the zeroconf Python API on my side: if one wants to limit publishing to only one interface, one has to pass that info to the Zeroconf() constructor. See the updated publish4.py, which does not exhibit the "weird behavior". So there is no real issue, just wrong API usage.

keepsimple1 commented 1 month ago

PR #217 merged. Thank you for testing diligently!