livepeer / go-livepeer

Official Go implementation of the Livepeer protocol
http://livepeer.org
MIT License

Disconnecting a remote T causes O to lose all transcoders #2605

Closed eliteprox closed 1 year ago

eliteprox commented 1 year ago

Describe the bug

When operating an orchestrator with multiple remote transcoders, disconnecting one remote T can sometimes cause the orchestrator to lose all transcoders and enter a "no transcoders available" state.

To Reproduce

Steps to reproduce the behavior:

  1. Run standalone orchestrator
  2. Run two transcoder processes on localhost with orchestrator
  3. Run a third transcoder process on a separate IP address connected to the same orchestrator.
  4. Stop the third transcoder process while orchestrator has streams running on all three transcoders.
  5. Orchestrator logs are attached. orchestratorlogs.txt local-transcoder2.txt remote-transcoder3.txt

Expected behavior

The stopped transcoder should disconnect without affecting the work assigned to the other transcoders.

Additional context

See attached logs. orchestratorlogs.txt shows the orchestrator logs during this time period.

Sep 21 14:39:49 blackbox livepeer[358249]: E0921 14:39:49.374196 358249 segment_rpc.go:230] manifestID=d9918feb-bb09-41eb-8668-c8d1c43cd214 seqNo=5 orchSessionID=2aaaf010 clientIP=0.0.0.193 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Could not transcode err="no transcoders available"
Sep 21 14:39:49 blackbox livepeer[358249]: E0921 14:39:49.745008 358249 orchestrator.go:559] manifestID=09ec9btof2jkzzin seqNo=57 orchSessionID=2a965663 clientIP=0.0.0.196 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Error transcoding segName=https://70.132.135.146:8935/stream/2a965663/57.tempfile err="no transcoders available"
Sep 21 14:39:49 blackbox livepeer[358249]: E0921 14:39:49.745111 358249 segment_rpc.go:230] manifestID=09ec9btof2jkzzin seqNo=57 orchSessionID=2a965663 clientIP=0.0.0.196 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Could not transcode err="no transcoders available"


- transcoder1.txt and transcoder2.txt show logs from two transcoders on localhost that stopped receiving work.

cyberj0g commented 1 year ago

Thanks for reporting. I couldn't reproduce this issue in a few initial attempts, with either v0.5.34 or master. There are only two spots in the code that could return the "no transcoders available" error. I suspect there may be a data race somewhere around the streamSessions and remoteTranscoders maps; it may happen only when transcoders are at full capacity.

@eliteprox would appreciate some input to investigate further:

  1. Is this error intermittent, or consistently reproducible?
  2. Is there a difference in capabilities between local and remote transcoders (different GPUs)?
  3. Is it reproducible with only local transcoders?
  4. Could you share the full command lines for the O and the Ts?

cyberj0g commented 1 year ago

I believe this has the same cause as #2706, which is now fixed.