matrix-org / waterfall

A cascading stream forwarding unit for scalable, distributed voice and video conferencing over Matrix
Apache License 2.0

Add support for OpenTelemetry instrumentation #142

Closed. daniel-abramov closed this 1 year ago

daniel-abramov commented 1 year ago

Solves https://github.com/matrix-org/waterfall/issues/141.

daniel-abramov commented 1 year ago

> So this is one big span for the whole conference with events?

The current hierarchy is Conference -> Participant -> PublishedTrack -> Subscription. Events are logged on the conference span only if they cannot be associated with a participant. Otherwise, they are logged on the participant span, on the published track span, or on the subscription span if they relate to a subscription. In other words, the conference span is the longest one, but the majority of the events are not logged on it; they are logged on the spans of individual participants, tracks or subscriptions (each of them has its own span that I create when calling CreateChild() or Create()).
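To make the hierarchy concrete, here is a minimal, stdlib-only sketch of nested spans where each event lands on the most specific span. It is not the code from this PR and does not use the OpenTelemetry Go API: the `Span` type and the signatures of `Create` and `CreateChild` are assumptions for illustration (only the function names appear in the discussion).

```go
package main

import "fmt"

// Span is a hypothetical stand-in for a telemetry span: a name, an
// optional parent, and the events logged directly on it.
type Span struct {
	Name   string
	Parent *Span
	Events []string
}

// Create starts a root span (assumed signature), e.g. for a conference.
func Create(name string) *Span { return &Span{Name: name} }

// CreateChild starts a span nested under s (assumed signature).
func (s *Span) CreateChild(name string) *Span {
	return &Span{Name: name, Parent: s}
}

// AddEvent logs an event on this span only; ancestors are not touched.
func (s *Span) AddEvent(msg string) { s.Events = append(s.Events, msg) }

// Path returns the chain of span names from the root down to s.
func (s *Span) Path() string {
	if s.Parent == nil {
		return s.Name
	}
	return s.Parent.Path() + " -> " + s.Name
}

func main() {
	conference := Create("Conference")
	participant := conference.CreateChild("Participant")
	track := participant.CreateChild("PublishedTrack")
	subscription := track.CreateChild("Subscription")

	// Each event goes to the most specific span it can be associated with.
	conference.AddEvent("conference started")
	participant.AddEvent("participant joined")
	subscription.AddEvent("subscribed to track")

	fmt.Println(subscription.Path())
	fmt.Println(len(conference.Events), len(participant.Events), len(track.Events))
}
```

With this shape, the conference span stays mostly empty while the child spans carry the detail, which matches the behaviour described above.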

> In EC, we have used spans for the join call / signalling events, so we should probably pick one, although I don't know that spans are the right answer. If anything this could be useful to compare.

Ah, you mean that for each incoming event you create a span when processing starts and end it when processing finishes, i.e. each participant would have a child span for each event being handled? That's not a bad idea, and I think I could implement it as well! Should I? Or should we leave it as is and add the rest on demand?

Basically, I used the span concept to denote the lifetime of an entity (a conference, a participant or a published track), so that we can see visually from which point until which point a track or a participant existed. Events are then added within the span (i.e. within the lifetime of the entity). OpenTelemetry would log them relative to the span that they belong to.

Creating spans for each signaling event is not bad either. My only concern is that there will be a lot of them, and they may look visually overwhelming (especially for bigger conferences). On the other hand, that would allow us to measure the amount of time it takes to process a single event.

> Also, if there's a way to add the event content, this could be useful perhaps?

E.g. signaling events? I'm asking because I did add attributes to certain events when there was something to add (e.g. the simulcast quality on the quality-switch event). For many events, I did not add any attributes, since they sit inside a span that already has attributes. Or maybe I misunderstood your question :)

dbkr commented 1 year ago

OK, this sounds great. If it visualises sensibly, using events for things that are single points in time makes the most sense. In EC they are spans but actually we just start them and end them immediately at the moment, which feels a bit silly. Let's stick with what you've done here for now and see how it works.

Yeah, on adding attributes, I'm thinking things like adding the SDP content where it's logging about sending SDP. We'll definitely need these for debugging at some point so if they're in the telemetry this might make life easier.
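As a rough illustration of events that carry their content as attributes, consider the sketch below. It is stdlib-only and not tied to the OpenTelemetry API or to this PR's code: the `Event` type, the attribute keys and the event names are assumptions, and the SDP value is a one-line placeholder rather than a real session description.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Event is a hypothetical telemetry event with optional key/value attributes.
type Event struct {
	Name       string
	Attributes map[string]string
}

// String renders the event with its attributes in sorted key order so the
// output is stable.
func (e Event) String() string {
	if len(e.Attributes) == 0 {
		return e.Name
	}
	keys := make([]string, 0, len(e.Attributes))
	for k := range e.Attributes {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s=%q", k, e.Attributes[k]))
	}
	return e.Name + " {" + strings.Join(parts, ", ") + "}"
}

func main() {
	events := []Event{
		// An event whose meaning depends on a value, e.g. the new quality.
		{Name: "simulcast quality switched", Attributes: map[string]string{"quality": "high"}},
		// An event carrying the SDP it refers to (placeholder content).
		{Name: "sending SDP answer", Attributes: map[string]string{"sdp": "v=0"}},
		// An event that needs no attributes of its own.
		{Name: "participant left"},
	}
	for _, e := range events {
		fmt.Println(e)
	}
}
```

Full SDP bodies can be large, so whether to attach them to every event or only to selected ones is a design choice left open in this discussion.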

daniel-abramov commented 1 year ago

@dbkr, I updated this PR over the weekend. Could you please take a look and approve it if it's OK now? 🙂 I need to merge this PR before merging https://github.com/matrix-org/waterfall/pull/143