matrix-org / waterfall

A cascading stream forwarding unit for scalable, distributed voice and video conferencing over Matrix
Apache License 2.0
97 stars 5 forks source link

SFU Refactoring #52

Closed daniel-abramov closed 1 year ago

daniel-abramov commented 1 year ago

This closes https://github.com/matrix-org/waterfall/issues/39 and does add structured logging, but not the framework for https://github.com/matrix-org/waterfall/issues/50

The main idea of this refactoring was to provide us with a somewhat stable foundation on top of which we could build new features with as much compile-time safety and with some guarantees that we lower race conditions.

See https://github.com/matrix-org/waterfall/issues/39 for a description of things that were improved and fixed.

Summary: [TODO, write some better explanation and highlights of this refactoring].

A rough diagram of how components are conceptually connected with each other: waterfall-refactored

NB: This is still work-in-progress and must be tested before merging.

daniel-abramov commented 1 year ago

@SimonBrandner,

This is largely looking good to me though I sometimes struggle a bit not to get lost... Perhaps a better file naming would be a bit, some of the file names aren't very descriptive from my POV.

Is there something that could be improved? I want to understand what makes you feel lost. Or is it just because the PR is too large?

I'm definitely in favor of smaller PRs and I'm sorry that this PR is now that big: it's as big as it currently is since it's simply a complete rewrite of a previous implementation (I barely touched main and profiling things, but the rest was changed drastically).

Perhaps it would be easier if I published it as a new repository? (since 90% of things anyway changed) Then it would be easier to review it in an editor (or an IDE) as a new SFU rather than trying to comprehend how the new changes relate to the previous implementation. (which I agree must be confusing)

What I normally do for the large PRs is to split it into smaller "atomic" commits where each commit changes a single thing and explains why it was changed: that's the approach I'm in favor of and that I try to use in all projects (I've noticed that in Matrix and Element it's not always followed: many commits just alter the state of a previous one or combine unrelated things in one commit: something that I want to change!), but I think if I could present it not as one big change, but as series of well-separated commits, it would be much easier to understand (I normally rebase commits before a push and make sure they are clear, but unfortunately this time I published it earlier than I planned so that Dave can take a look at it earlier as we spoke about it).

I am also a little concerned about completely removing the subscriber/publisher model which is very useful for simulcast and possibly other things; it's also very similar to what ion-sfu and livekit does, iirc

Good point, expected you to ask about it :)

So, the subscriber and publisher model does exist in the current implementation, but it does not have a separate module or file for it since for our current implementation it does not seem to be useful.

I.e. I can create a function in a Conference and name it getSubscribers() (it's easy to create one), but then it would immediately pose questions like what exactly the subscriber structure should contain, i.e. should it be a structure that contains the ID of a subscriber and the list of IDs of other participants to which the subscriber is subscribed to? Or should it rather be the ID of a subscriber (the participant ID) and the list of tracks? Or should it be the list of tracks and the list of participants to which we're subscribed? Etc... Currently, these can't be answered simply because we don't really need an explicit entity such as Subscriber or Publisher.

We may indeed need it in the future (e.g. when implementing simulcast)! And once we need it, we definitely give it a good thought and introduce it! But adding these entities right now seemed like a "premature optimization" to me (since it would pose the questions listed above) that would complicate our current (pretty trivial and simple) use case.

As for the other SFUs, I'm only familiar with some internals of Janus and LiveKit. You mentioned LiveKit: yes, they have a concept of a subscription and publishing a track (but that concept also we have in our implementation :slightly_smiling_face:), but they, like we, don't have special files or packages for subscriber.go or publisher.go, also there are no such structures as Subscriber or Publisher in their implementation. They do have a more concrete entity though like SubscribedTrack or Downtrack.

So I would not say that we don't have a publisher-subscriber model. In fact, we do have it between peers and a conference (peers publish the messages to which the conference is subscribed). And we also know which tracks everyone is subscribed it and which tracks are being published by each participant (see participant.go). We just do it different compared to the implementation in main branch - I could not inherit the Publisher/Subscriber logic from main since it unfortunately suffers from race conditions in multiple places and the entities are not separated there, so it also has bugs like a double-subscription (e.g. check Publisher::Subscribe() in the main). So they had to be reworked anyway [unfortunately].

And I'm pretty sure that with time we'll also have a more elaborated logic for this with more structures that have a publisher or subscriber-specific logic inside.

I would also like my PR to get merged before this and for us to somehow integrate it into this one

Would you like me to integrate the logic from the RTCP PR into this PR before merging it into main? Or would you prefer to add it to this PR?

I can do it, but I was wondering if you would like to do it yourself to get a feel of what the rewritten version looks like and how such things are implemented within the new architecture.

All of the TODO are also a little concerning

Yep, but there are also TODOs and FIXMEs in main :) There is only one place (a FIXME) in this PR that talks about a feature that is implemented in main, but not here. The rest existed in the previous implementation (it's just some of them were not marked as concerns and did not have a dedicated TODO, like for instance handling the data channel being busy or closing the SFU once Sync() ends). I decided to mark these places with TODOs (if it's not critical) and FIXMEs (if it's more critical and must be discussed), so that we don't forget about them.

SimonBrandner commented 1 year ago

@SimonBrandner,

This is largely looking good to me though I sometimes struggle a bit not to get lost... Perhaps a better file naming would be a bit, some of the file names aren't very descriptive from my POV.

Is there something that could be improved? I want to understand what makes you feel lost. Or is it just because the PR is too large?

The PR being large is definitely a part of it but I feel like the names don't tell me much in some cases...

I'm definitely in favor of smaller PRs and I'm sorry that this PR is now that big: it's as big as it currently is since it's simply a complete rewrite of a previous implementation (I barely touched main and profiling things, but the rest was changed drastically).

Perhaps it would be easier if I published it as a new repository? (since 90% of things anyway changed) Then it would be easier to review it in an editor (or an IDE) as a new SFU rather than trying to comprehend how the new changes relate to the previous implementation. (which I agree must be confusing)

What I normally do for the large PRs is to split it into smaller "atomic" commits where each commit changes a single thing and explains why it was changed: that's the approach I'm in favor of and that I try to use in all projects (I've noticed that in Matrix and Element it's not always followed: many commits just alter the state of a previous one or combine unrelated things in one commit: something that I want to change!), but I think if I could present it not as one big change, but as series of well-separated commits, it would be much easier to understand (I normally rebase commits before a push and make sure they are clear, but unfortunately this time I published it earlier than I planned so that Dave can take a look at it earlier as we spoke about it).

No worries about the PR being large, no reason to create a separate repo, I realize it's a large refactor and so there aren't many alternatives

I am also a little concerned about completely removing the subscriber/publisher model which is very useful for simulcast and possibly other things; it's also very similar to what ion-sfu and livekit does, iirc

Good point, expected you to ask about it :)

So, the subscriber and publisher model does exist in the current implementation, but it does not have a separate module or file for it since for our current implementation it does not seem to be useful.

I.e. I can create a function in a Conference and name it getSubscribers() (it's easy to create one), but then it would immediately pose questions like what exactly the subscriber structure should contain, i.e. should it be a structure that contains the ID of a subscriber and the list of IDs of other participants to which the subscriber is subscribed to? Or should it rather be the ID of a subscriber (the participant ID) and the list of tracks? Or should it be the list of tracks and the list of participants to which we're subscribed? Etc... Currently, these can't be answered simply because we don't really need an explicit entity such as Subscriber or Publisher.

We may indeed need it in the future (e.g. when implementing simulcast)! And once we need it, we definitely give it a good thought and introduce it! But adding these entities right now seemed like a "premature optimization" to me (since it would pose the questions listed above) that would complicate our current (pretty trivial and simple) use case.

As for the other SFUs, I'm only familiar with some internals of Janus and LiveKit. You mentioned LiveKit: yes, they have a concept of a subscription and publishing a track (but that concept also we have in our implementation slightly_smiling_face), but they, like we, don't have special files or packages for subscriber.go or publisher.go, also there are no such structures as Subscriber or Publisher in their implementation. They do have a more concrete entity though like SubscribedTrack or Downtrack.

So I would not say that we don't have a publisher-subscriber model. In fact, we do have it between peers and a conference (peers publish the messages to which the conference is subscribed). And we also know which tracks everyone is subscribed it and which tracks are being published by each participant (see participant.go). We just do it different compared to the implementation in main branch - I could not inherit the Publisher/Subscriber logic from main since it unfortunately suffers from race conditions in multiple places and the entities are not separated there, so it also has bugs like a double-subscription (e.g. check Publisher::Subscribe() in the main). So they had to be reworked anyway [unfortunately].

And I'm pretty sure that with time we'll also have a more elaborated logic for this with more structures that have a publisher or subscriber-specific logic inside.

Alright, let's leave it out for now and re-add it with simulcast

I would also like my PR to get merged before this and for us to somehow integrate it into this one

Would you like me to integrate the logic from the RTCP PR into this PR before merging it into main? Or would you prefer to add it to this PR?

I can do it, but I was wondering if you would like to do it yourself to get a feel of what the rewritten version looks like and how such things are implemented within the new architecture.

Yeah, I actually wanted to have a go at adding it to this PR exactly for that reason :D Might take a bit of time, but I would like to have a look tomorrow

All of the TODO are also a little concerning

Yep, but there are also TODOs and FIXMEs in main :) There is only one place (a FIXME) in this PR that talks about a feature that is implemented in main, but not here. The rest existed in the previous implementation (it's just some of them were not marked as concerns and did not have a dedicated TODO, like for instance handling the data channel being busy or closing the SFU once Sync() ends). I decided to mark these places with TODOs (if it's not critical) and FIXMEs (if it's more critical and must be discussed), so that we don't forget about them.

Sure, can we just check we're not removing any functionality with this refactor (e.g. keep-alive)?