livepeer / go-livepeer

Official Go implementation of the Livepeer protocol
http://livepeer.org
MIT License
538 stars 169 forks source link

Remote T's can get removed from the `remoteTranscoders` list if any T reconnects #2706

Closed stronk-dev closed 1 year ago

stronk-dev commented 1 year ago

Describe the bug

Whenever a T reconnects, they get a new entry in the remoteTranscoders list of transcoder objects held by the RemoteTranscoderManager Their 'expired' entry in the remoteTranscoders list only gets removed when they later get selected for a session: https://github.com/livepeer/go-livepeer/blob/e2c46a1e30ca3c5aafcc1a4f11f55523235da871/core/orchestrator.go#L968-L976

Up to this point everything looks alright (although I wonder why does the Transcoder get removed from liveTranscoders immediately, but only gets removed from remoteTranscoders on selection)

Anyway, the error happens in the removeFromRemoteTranscoders function, where it splices the expired entry out of the list https://github.com/livepeer/go-livepeer/blob/e2c46a1e30ca3c5aafcc1a4f11f55523235da871/core/orchestrator.go#L914-L938

It creates the new list as

newRemoteTs = remoteTranscoders[i-1 : i] 
newRemoteTs = append(newRemoteTs, remoteTranscoders[i+1:]...) 

So if i is higher than 1, any transcoder in this list with an index lower than i-1 is going to get dropped from the list of remote transcoders!

In all regards the T is still considered a live transcoder connected to the Orchestrator, so no error pops up on the O or T side. They just won't receive any more streams

To Reproduce

Have multiple remote T's connected in a split O/T setup. Have one (or more) T's reconnect and wait for a few streams to come in. At some point you will see one T not receiving any more streams

eliteprox commented 1 year ago

Appears related to #2605

thomshutt commented 1 year ago

Thanks @eliteprox - would you be able to work with @cyberj0g please @stronk-dev to help him reproduce the issue?

stronk-dev commented 1 year ago

I haven't encountered #2605 yet so I can't speak as to reproducing that. but it does seem like that issue is caused by the bug as described in #2706.

Reproducing #2706 is as simple as having more than 2 remote transcoders and disconnecting the T with and index in rtm.remoteTranscoders higher than 1. All T's with an index lower than i-1 will get dropped when the disconnected T gets selected so I can imagine this can cause the set to get emptied under specific conditions Might just be easiest to merge #2707 which fixes #2706 first to see if that also fixes #2605 as a side effect

eliteprox commented 1 year ago

Thanks @eliteprox - would you be able to work with @cyberj0g please @stronk-dev to help him reproduce the issue?

I've opened #2747 for this issue. I did some troubleshooting with @0xb79 and @Titan-Node this evening and reproduced it in 0.5.37 under scenarios involving multiple streams. This function cleaned up quite a bit.