home-assistant / architecture

Repo to discuss Home Assistant architecture

Define an interface for integrations to configure stream to support expiring URLs #482

Closed allenporter closed 3 years ago

allenporter commented 3 years ago

Context

TL;DR: (1) Define an interface for integrations to configure stream and (2) revisit concatenation across streams.

  1. The nest SDM integration has URLs that expire and must be refreshed every 5 minutes as a privacy/security feature. See Device Access: Access a live stream for details on the API, and see how Home Assistant schedules an alarm to make this happen.
  2. The current interaction between camera and stream is centered on the assumption that stream_source is a stable URL, which is a very reasonable assumption for most cameras that just expose a single URL.
    1. More context: the camera also passes in a dict of stream_options, including the keepalive preference managed by the camera itself.
  3. The nest implementation can't conform to that assumption; the impact is captured in Nest SDM Camera stops streaming, and it is unable to provide a good user experience. Namely:
    1. When viewing a stream in the UI, it will break when the URL expires. The user must manually refresh.
    2. The Preload Stream option does not work (and may not be recoverable?), so the only option is to load a stream every time, which may add latency when joining a stream. (You'd think preload would help, but thumbnails generated from the stream for device triggers are quite old.)
    3. Worker threads continue to process streams and write loud error messages to the logs. In the case of Preload Stream, that may be one thread for every different URL the stream was invoked with!

Talking with @hunterjm on discord, we thought an architecture issue would be a good place to discuss next steps, and some of this proposal came from that talk (see discussion notes for the raw notes). @uvjustin and @elupus -- would love to hear more thoughts, things to watch out for, or suggestions for PR order. We can also find a time to chat on the Home Assistant discord if that would be better.

There are some opportunities to align the solution with other existing problems.

Proposal

  1. Define an interface for integrations to configure stream. Integrations need a way to push changes to stream configuration that impact the worker -- namely the stream URL.
  2. Revisit how concatenation across individual streams for display is handled.
  3. Revisit testing strategy for the stream component

We'd likely want to make some improvements on #3 while doing any larger changes. #1 would come first to make preload work; then #2, handling worker restarts better, would be ideal and can follow.

Integration Stream configuration

Introduce the notion of a stream id or cache_key, e.g. associated with a camera, to uniquely identify a combined stream independent of the url. Something like the camera's entity_id or unique_id could be a fit. We'd still want to support lookup by stream url as a performance win so that async_handle_record_service and request_stream can continue to share a single worker. Technically, some of the stream_options could factor in here, but these are unlikely to change in practice. It may also make sense to factor this out into some internal wrapper within stream.
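To make the idea concrete, here is a minimal sketch of a registry keyed by a stable cache_key with a URL index for the fast path. All names here (StreamRegistry, FakeStream, lookup_by_url) are hypothetical, not the stream component's real internals:

```python
from __future__ import annotations


class FakeStream:
    """Stand-in for the real Stream object."""

    def __init__(self, source: str) -> None:
        self.source = source


class StreamRegistry:
    """Illustrative registry: streams keyed by a stable cache_key,
    with a secondary url -> cache_key index as a performance win so
    async_handle_record_service and request_stream can share a worker."""

    def __init__(self) -> None:
        self._by_key: dict[str, FakeStream] = {}
        self._by_url: dict[str, str] = {}  # url -> cache_key

    def get_or_create(self, cache_key: str, source: str) -> FakeStream:
        stream = self._by_key.get(cache_key)
        if stream is None:
            stream = FakeStream(source)
            self._by_key[cache_key] = stream
        self._by_url[source] = cache_key
        return stream

    def lookup_by_url(self, source: str) -> FakeStream | None:
        # Fast path: find the shared stream by its current URL
        key = self._by_url.get(source)
        return self._by_key.get(key) if key else None
```

With this shape, a URL refresh re-indexes the same stream object under the new URL rather than spawning a second worker.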

Add new top-level methods in stream to communicate with the stream worker, and push in changes to the stream source by stream id or cache_key. The integration will (through the new top level methods) communicate with Stream. Stream will continue to manage start/stop of the worker when the url/options change. Additionally, we can pull out the keepalive logic from worker so that Stream manages that lifecycle. That is, when a worker exits, Stream will be responsible for restarting it. Currently the worker takes in a reference to a stream, which will be changed instead to take a source and options only.

Concatenation of Streams

The worker is currently responsible for:

  1. invoking pyav on a stream source
  2. demuxing audio/video packets
  3. peeking into the start of the stream to validate it, and ensuring packets are ordered properly
  4. writing packets to output buffers
  5. grouping packets into segments on keyframes and writing them to stream outputs, etc. The worker has 5 internal methods, hinting that we may be outgrowing the existing structure.

It may instead be a good time to separate the work into a pipeline or series of stream processors (e.g. like StreamBuffer or StreamOutput or something else) for ensuring DTS order, vs buffering, vs grouping packets into segments and tracking their duration, since these seem to be separable operations that don't rely on each other. Failures downstream still result in stopping the worker, and it is ok if a bad stream is noticed outside of the main decode loop.

This will also want to incorporate logic for preserving output sequence/pts/dts across streams that either restart from 0 or keep up an existing sequence.
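One way to handle both cases (restart from 0 vs. continuing an existing clock) is a small re-basing helper; this is a toy sketch with invented names, not the eventual implementation:

```python
class TimestampAdjuster:
    """Re-base dts/pts so output timestamps stay monotonic across streams.

    If the new stream restarts near 0, shift its timestamps past the last
    output timestamp; if it keeps an increasing clock, pass it through.
    """

    def __init__(self) -> None:
        self._offset = 0
        self._last_out = 0

    def start_new_stream(self, first_dts: int) -> None:
        if first_dts < self._last_out:
            # Restarted from (near) zero: continue after the old stream
            self._offset = self._last_out - first_dts + 1
        else:
            # Clock kept increasing: no adjustment needed
            self._offset = 0

    def adjust(self, dts: int) -> int:
        self._last_out = dts + self._offset
        return self._last_out
```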

Testing Strategy

  1. An approach in https://github.com/home-assistant/core/pull/44161 mocks out pyav, creates fake containers, simulates packets, and exercises how they turn into segments.
    1. Significantly increases coverage to >92%, though the fixture is fairly non-trivial
    2. This really is more about exercising the code than emulating real world cameras.
  2. Revisit the existing tests in test_hls and see if they can be re-enabled.
    1. Uses static files checked into the repository and exercises decoding. Very simple since they are black box.
    2. Need to understand why they were flaky. Presumably just CPU intensive?
    3. Likely do not exercise the corner cases of real cameras. Does not really exercise how streams are decoded, just that they don't throw exceptions.
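The mock-pyav approach can be sketched with stdlib mocks alone; the helper names below are invented, and `count_segments` is a toy stand-in for the worker's segmenting logic, in the spirit of the fixture in home-assistant/core#44161:

```python
from unittest.mock import MagicMock


def make_fake_packet(dts: int, keyframe: bool = False) -> MagicMock:
    """Build a fake pyav-like packet; attributes loosely mirror av.Packet."""
    packet = MagicMock()
    packet.dts = dts
    packet.pts = dts
    packet.is_keyframe = keyframe
    return packet


def make_fake_container(packets) -> MagicMock:
    """Fake container standing in for the object av.open would return,
    with demux yielding a scripted sequence of packets."""
    container = MagicMock()
    container.demux.return_value = iter(packets)
    return container


def count_segments(container) -> int:
    """Toy stand-in for the worker: a new segment starts on each keyframe."""
    segments = 0
    for packet in container.demux():
        if packet.is_keyframe:
            segments += 1
    return segments
```

A test then patches `av.open` to return the fake container and asserts on the segments produced, which keeps the tests fast and lets us script corner cases (missing dts, out-of-order packets) that static files can't easily cover.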

I think it may be interesting to work on both approaches in parallel and establish a high baseline of test coverage as refactoring starts. Given the complexity of stream, I think more test coverage is worth the downsides. Since most of the mocking is at the pyav level, this approach can survive the refactoring proposed above, if we want to change the scope of what is tested with the fixture.

Consequences & Alternatives

balloob commented 3 years ago

I think that moving StreamWorker into a class is a good approach. It will give integrations that use it the necessary power to make changes on the fly, like updating the url.

Since the camera integration already initiates the stream integration, it makes sense that they own the lifecycle of the stream. We could add a new create_stream_worker method to the CameraEntity. Nest could override it and change settings as necessary.

uvjustin commented 3 years ago

@allenporter Very thorough summary. Seems like you and @hunterjm had a good discussion.

Some quick thoughts:

Happy to support with whatever you move ahead with.

allenporter commented 3 years ago

Thanks, all very helpful feedback. I have a couple more questions.

  1. I would like to fix the existing tests that are excluded from testing due to flakiness. When running locally it seems like the main races are that the worker thread finishes and clears the output buffer before the test can consume everything. Does that sound familiar or are there other issues?
  2. Integrations have a rule that tests go through the integration APIs rather than testing the classes directly. Is that true for internal components too, or is it ok to test libraries directly?
MartinHjelmare commented 3 years ago

It's ok to add unit tests for core and helper parts, if they are isolated and used by other parts of the integration or other integrations. We'll make that call during review.

allenporter commented 3 years ago

After mulling this over for the week, I would like to propose that the first set of PRs focus on increasing test coverage, starting with repairing the existing tests.

allenporter commented 3 years ago

1) Thinking about the lifecycle and object relations, the current state looks something like this:

[diagram: Home Assistant - Stream Recon]

Stream has tight coupling between the worker and the outputs (as in both Stream and worker know about each other, and Stream and outputs know about each other). An example is that Stream starts the stream worker, then on stream end the worker invokes put with a None that causes the StreamOutputs to remove themselves from the Stream. All three classes know about keepalive. My hunch is that we'll want to decouple idle timeout checks, keepalive, and stream end a bit, though I am not sure in which order to send PRs to decouple these. An example is that Stream can notice when the worker exits and remove all stream providers (StreamOutput), rather than having the worker tell the providers to remove themselves.
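That last decoupling could look roughly like this, as a toy sketch (names and the threading shape are illustrative): Stream observes the worker exiting and tears down outputs itself, instead of the worker pushing a sentinel None through the queue.

```python
import threading


class StreamOutput:
    """Toy provider; in the decoupled design it no longer removes itself."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class Stream:
    """Illustrative: Stream watches the worker and tears down outputs."""

    def __init__(self) -> None:
        self.outputs: list[StreamOutput] = []
        self._worker: threading.Thread | None = None

    def start(self, work) -> None:
        def run() -> None:
            work()
            # Worker exit is observed here by Stream, instead of the
            # worker telling each output to remove itself.
            self._on_worker_exit()

        self._worker = threading.Thread(target=run)
        self._worker.start()

    def _on_worker_exit(self) -> None:
        for output in self.outputs:
            output.close()
        self.outputs.clear()
```

This is also where a keepalive decision would naturally live: `_on_worker_exit` either restarts the worker or cleans up.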

2) When thinking about having camera manage the lifecycle of stream: Stream may be what we want to expose rather than a StreamWorker. We'll likely just want the worker to decode packets, whereas Stream manages starting/stopping the worker and can handle stream source updates from camera.

I think we'll want to slim down the interface of Stream otherwise it may expose too many details. The decoupling mentioned above should help a lot, since a few of the public methods on Stream are called by StreamOutput.

allenporter commented 3 years ago

I believe I may have an approach that can handle updates to URL streams.

The HLS RFC has a tag called EXT-X-DISCONTINUITY. See https://tools.ietf.org/html/rfc8216#section-4.3.2.3 for more details. This tag marks a discontinuity so that the player can handle any synchronization issues caused by the stream restart. This works pretty well in my local testing!

The approach is something like:

It looks like there needs to be some additional handling of EXT-X-DISCONTINUITY-SEQUENCE when segments are garbage collected as well, if I understand the RFC correctly.
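For reference, a discontinuity in a live playlist looks roughly like this (a hand-written sketch per RFC 8216, not output from the stream component). EXT-X-DISCONTINUITY precedes the first segment of the restarted stream, and EXT-X-DISCONTINUITY-SEQUENCE must be incremented when a discontinuity tag itself is garbage-collected out of the playlist:

```
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:2
#EXT-X-MEDIA-SEQUENCE:10
#EXT-X-DISCONTINUITY-SEQUENCE:0
#EXTINF:1.5,
segment10.ts
#EXTINF:1.5,
segment11.ts
#EXT-X-DISCONTINUITY
#EXTINF:1.5,
segment12.ts
```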

I think this implies the rest of the change is fairly straightforward: we don't need to keep track of pts/dts/etc., and perhaps no more large structural changes are needed.

uvjustin commented 3 years ago

I think it makes sense to use that tag. The only concern I have is that this might affect the buffering strategies of the various players - they might want the discontinuity segment earlier than they want it now. The Apple documentation seems to recommend 6 segments of 6 seconds each (see sections 7 and 8 here ), while we are using 3 segments of 1.5 (with 1 keyframe per second this actually rounds up to 2) seconds each. I don't see these values specified in the RFC you linked, although in 6.2.2 the fourth paragraph says we should keep the segment around "for a period of time equal to the duration of the segment plus the duration of the longest Playlist file distributed by the server containing that segment". It's unclear whether that means from the time we first add the segment to the playlist or from when we remove the segment from the playlist, but if it's the latter we should probably double the size of the StreamOutput._segments deque (the only downside is using a little more memory). Of course this change is unrelated to the architecture topic but it's probably worth considering.
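The retention suggestion above amounts to roughly doubling the segment buffer; a sketch using the numbers quoted in the comment (3 playlist segments of ~1.5s), with the deque's maxlen doing the garbage collection:

```python
from collections import deque

# Values quoted in the discussion above; illustrative, not authoritative.
NUM_PLAYLIST_SEGMENTS = 3    # segments advertised in the playlist
MIN_SEGMENT_DURATION = 1.5   # seconds (often rounds up to 2 with 1 kf/s)

# Per RFC 8216 6.2.2, segments should remain fetchable after leaving the
# playlist for segment duration + longest playlist duration; keeping
# twice as many segments approximates that at the cost of a little memory.
MAX_SEGMENTS = NUM_PLAYLIST_SEGMENTS * 2
segments: deque = deque(maxlen=MAX_SEGMENTS)

for sequence in range(10):
    segments.append(sequence)
# Older segments age out automatically once maxlen is reached.
```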

hunterjm commented 3 years ago

> I think it makes sense to use that tag. The only concern I have is that this might affect the buffering strategies of the various players - they might want the discontinuity segment earlier than they want it now. The Apple documentation seems to recommend 6 segments of 6 seconds each (see sections 7 and 8 here), while we are using 3 segments of 1.5 (with 1 keyframe per second this actually rounds up to 2) seconds each. I don't see these values specified in the RFC you linked, although in 6.2.2 the fourth paragraph says we should keep the segment around "for a period of time equal to the duration of the segment plus the duration of the longest Playlist file distributed by the server containing that segment". It's unclear whether that means from the time we first add the segment to the playlist or from when we remove the segment from the playlist, but if it's the latter we should probably double the size of the StreamOutput._segments deque (the only downside is using a little more memory). Of course this change is unrelated to the architecture topic but it's probably worth considering.

The Apple docs recommendations don't need to be followed 100%. Also, we are technically using whatever the I-frame interval for the video clip is as our segment length, since we are not transcoding video. I know you added the 1.5 logic in there, but some of my feeds are actually 2.5 based on where the keyframes are. Overall, @allenporter I think that's a good find, and sounds much easier than trying to keep track and sync everything ourselves!

edit: If we use this feature, I'd also be for removing the arbitrary abstraction I added to the implementation at the beginning when I thought we could potentially support other protocols besides HLS to simplify the logic in the stream component further. It made it relatively easy to add record at a later date, but there is tight coupling between the recorder and HLS currently anyway.

uvjustin commented 3 years ago

> The Apple docs recommendations don't need to be followed 100%. Also, we are technically using whatever the I-frame interval for the video clip is as our segment length, since we are not transcoding video. I know you added the 1.5 logic in there, but some of my feeds are actually 2.5 based on where the keyframes are. Overall, @allenporter I think that's a good find, and sounds much easier than trying to keep track and sync everything ourselves!

I'm just always worried that some of the random issues we encounter are due to straying from the Apple recommendations. But you're right, looking at the RFC none of that stuff is actually in there - there doesn't seem to be a mention of segment duration, and the example playlists themselves have only 3 segments. There is, however, that bit about keeping the segments around after removing them from the playlist. I made a quick branch to add that, but I'm not sure it's worth a PR if there are currently no issues. Yes, the 1.5 was just a minimum so we wouldn't go much below that; previously I was getting 1 second segments with my 1 keyframe per second interval (I think that's probably the most common setting for cams). I'm guessing you are getting 2.5 because your keyframes are coming every 1.25 seconds? I don't think many cameras are configured to use keyframe intervals longer than a few seconds unless they are using the non-standard H264+ or H265+ codecs.

> edit: If we use this feature, I'd also be for removing the arbitrary abstraction I added to the implementation at the beginning when I thought we could potentially support other protocols besides HLS to simplify the logic in the stream component further. It made it relatively easy to add record at a later date, but there is tight coupling between the recorder and HLS currently anyway.

This means removing all the fmts?

allenporter commented 3 years ago

I had a similar thought when looking at the different ways that StreamOutput is used: some methods are only used by HLS, and recorder has custom overrides. The common part, collecting output segments from the stream worker, still seems worth sharing, so I wasn't in a hurry to break it apart. I can see that tracking view-specific logic may push it over the line.

This could be a matter of moving some logic up into the HLS output class, then making the view handlers know which output type they are using.

uvjustin commented 3 years ago

> I think this implies the rest of the change is fairly straightforward: we don't need to keep track of pts/dts/etc., and perhaps no more large structural changes are needed.

@allenporter We might have to figure out a way to deal with the timestamp discontinuities in recorder, but that shouldn't be too hard to do.

allenporter commented 3 years ago

With @uvjustin adding recorder discontinuity support, I think we can call this done! Thanks to everyone for their help in support of nest.

There may still be some additional cleanup in stream that can happen, and maybe more ideas from #46610 that are still useful. If anyone runs across anything else that can be simplified here, I'm happy to chase it down.