philippe44 / LMS-Cast

Chromecast to LMS bridge
64 stars 10 forks source link

Mdns lifetimes #30

Closed jtbr closed 1 year ago

jtbr commented 1 year ago

I was having problems with cast devices disappearing from LMS, especially when removalTimeout was 0. In particular sometimes I would see messages like this, showing a renderer be added and removed within a few milliseconds:

[11:59:53.857775] mDNSsearchCallback:691 ----------------- round ------------------
[11:59:53.857832] mDNSsearchCallback:694 host 192.168.10.157, srv 192.168.10.157, name Chromecast-Audio-5287504092fcc733fc2769fcc96d472d._googlecast._tc
[11:59:53.857912] AddCastDevice:961 [0xb40950]: adding renderer (Kitchen speaker) with mac CCCC-32CA5D39
[11:59:53.858151] stream_thread_init:477 [0xaa9f98] streambuf size: 524288
[11:59:53.858212] output_thread_init:616 [0xaa9f98] init output media renderer
[11:59:53.858232] decode_thread_init:168 [0xaa9f98]: init decode
[11:59:53.858367] resample_init:340 [0xaa9f98]: resampling sync recipe: 0x00, flags: 0x00, scale: 0.89, precision: 0.0, passband_end: 0.00000, stopband_
[11:59:53.858407] discover_server:826 [0xaa9f98] sending discovery
[11:59:53.858432] mDNSsearchCallback:837 [0xb40950]: removing renderer (Kitchen speaker) on timeout

It turned out that the same call to mDNSsearchCallback was doing both the adding and the removing, and this because there was not an active connection to the renderer (at least not yet). After this, the devices would not re-appear, because mdns thought they were still active.

I made some modifications to the removal logic to reflect the behavior that I think was intended. Devices are removed when the mdns times out. If removalTimeout > 0 or if (removalTimeout == 0 and there is still an active connection to the device) then removal will be delayed until either the timeout expires or the connection ends, respectively. Lifetime is managed entirely inside of the mdns code. Because the mDNSSearchCallback function is only being called when mdns triggers an event (which can be rare), I added another call to periodically check whether removalTimeout has expired for some device(s). For safety, I moved the mutex around the whole function because this code will now be called from two different threads.

I've been testing the changes for the last week on my three chromecast audio devices (including one group) and it seems to be working as expected. None are disappearing, except for when they've actually been disconnected or had mdns time out.

In addition I found a mutex unlock that seemed out of place, I added a few comments that were useful for me in understanding the code and some log messages for debugging purposes. Hope these are useful, but obviously, your call on if you want to integrate them.

Many thanks.

jtbr commented 1 year ago

FYI the latest change should also fix the tab spacing I messed up

jtbr commented 1 year ago

One last thing. Might be worth suggesting to use a positive removal timeout in the documentation, since I do find that my devices often miss the mdns keep-alive deadline by sometimes a few seconds or a couple minutes, even if they're supposedly still online...

philippe44 commented 1 year ago

I had a better look at my code and there is an issue which explains why even with setting "Expired" to 0 it did not work and why it's worse "recently".

A few weeks ago (I'd need to check precisely) I changed the callback system so that it would call regularly, even when there is nothing to report, which triggers expired devices to be removed even when they were kept for a while due to use of a timeout. That bit did not work previously when the callback was not called regularly. A device would expire, the timeout would "kick in" but with no follow-up until something else happens, which as you pointed out, can take a very long time. So, device wrongly kicked-out would stay and user be (sort of) happy.

Now, the real problem is that you can see there is a detection to filter out devices that are on the same subnet but are announced by repeaters (not sure of the exact mDNS term) but we want to keep devices that are coming from other subnets.

Now, I did a endianness mistake in the netmask which means that "repeater" announces, even on the same subnet, where still taken into consideration. And these announces expires regularly / randomly.

So, net net, before I corrected the issue of no timeout kicking-in, the real netmask issue was hidden, once I fixed the timeout, the netmask issue bit...

Anyway, I'll update my code accordingly base don your PR and some mods. It's on all my apps that use mDNS

Thanks for helping me to figure out this one.