livepeer / go-livepeer

Official Go implementation of the Livepeer protocol
http://livepeer.org
MIT License
546 stars 171 forks source link

spikes in number of streams every 27 minutes #2020

Open Franck-UltimaRatio opened 3 years ago

Franck-UltimaRatio commented 3 years ago

Describe the bug Every 27 minutes there is a spike in number of streams. This spike is related to the numbers of stream you have (and maybe the kind of streams). For exemple, if you have 11 streams, spike go up to 20, then back to 11). If your node is capped, or almost capped, the spike exceed the maxsessions and at the end of the spike you loose some jobs (i.e : your have 68 streams, node capped a 70, after spike you drop to 40 streams). Also confirmed by Titan-node and papa bear.

This spikes look like ghost streams (every 27 minutes) linked to your actual streams. but if you can't handle them because of your sessions limit, you loose your jobs.

I can reproduce this bug with all this configurations : T and O merged or T and O Splitted Windows 10 node and Ubuntu 21.04 node

Screenshots spike1 spike2 spike3

papabear99 commented 3 years ago

I'd like to add that these 27 minute spikes only happens when transcoding or have had a session in the prior 27 minute window. When my O is quiet (waiting for work) there are no spikes. Note: the spike @ 17:12 is a test stream and is not the issue referenced. 27minSpikesGitHub

papabear99 commented 3 years ago

Another thing I noticed this morning is that the size of the spike seems to be proportional to the number of concurrent sessions. The first spike goes from 24 to 40 (16 sessions), second 20 to 34 (14 sessions), third 16 to 26 (10 sessions), and the fourth 13 to 22 (9 sessions). The first spike hit my max sessions so was probably higher than shown because I did have a drop in sessions following the spike. SpikeProportions2

Franck-UltimaRatio commented 3 years ago

That true, there is something like a ratio around X1,66

Franck-UltimaRatio commented 3 years ago

spike4 The last ghost spike, not even capped, just kill most of my current job :( including nice 4K --> 1080

papabear99 commented 3 years ago

BunnyEarSpikes

Today I'm seeing a new bunny ear 27 min spike. Just adding it for possible clues as to what's causing these 27 min spikes.

papabear99 commented 2 years ago

I'm still having this issue and I noticed an odd possibly related issue the other day where a session was switching between GPUs on 27 min intervals. This may happen regularly but this was on a slow day so it was easy to spot the change in transcode times which I then was able to see correlated to an increase/decrease in usage on the 2 different model GPUs installed on the O/T 27minSessionGPUSwitch

eliteprox commented 2 years ago

Switched to a single-GPU setup and still saw those "27-minute spikes" during high volume. It doesn't seem to be multi-GPU related. Though I did get the impression that streams were sticking around longer.

image

papabear99 commented 1 year ago

FWIW this issue still persists. It's very easy to see with a single session. 27minSIN

leszko commented 1 year ago

Thanks for chasing on that @papabear99

I did some initial investigation and I think that it's related to the B<>O auth token refresh mechanism, which does the following:

  1. Token is valid for 30 minutes
  2. Token is refreshed within the last 10% of the token's valid period
  3. That would explain 27 minutes (30m - 10%*30m)

Now, I don't understand yet why token auth refresh causes additional segments being sent. I would expect that the auth token refresh is just an RPC call /GetOrchestrator, not actually sending any segments to transcode.

Before I try to debug it deeper, I would need some additional information from you (CC: @papabear99 @eliteprox @Franck-UltimaRatio ):

  1. Could you confirm that what you count on the graph as "session" is actually transcoding a segment (something we don't expect), not just an RPC request /GetOrchestrator (something we expect)? If this question is not clear, then could you just point me to how you calculate "session" in the chart?
  2. Could you send me some sample timestamps of the spikes and the IP/DNS address of the orchestrator (10-20 data points is enough)? I'd like to confirm on our side that it corresponds to the auth token refresh.
Franck-UltimaRatio commented 1 year ago

It s more difficult for us to see this spike now because we don't have anymore the surge of stream we had during the L1 and gas spike history (as you can see in our old screenshot) ^^ i ll try to find some clear reference of this spike, but for the moment i can show you which calculation we are using for streams number : sum(livepeer_current_sessions_total{job="livepeer"})

leszko commented 1 year ago

Thanks @Franck-UltimaRatio

Some data points will be helpful as well.

papabear99 commented 1 year ago

Here's a log from the last spike on the chart I posted. 27minSpikesSIN_1min_log.txt DNS solarfarm.multiorch.com I'm using this query sum(livepeer_current_sessions_total{node_type="orch"})

leszko commented 1 year ago

Thanks for the logs @papabear99 . It's interesting, in the logs I see that each segment is actually transcoded only once.

One more question, I see in the logs that you're running an Orchestrator with a remote Transcoder. Does the issue also occur in case of running the standalone Orchestrator (without the remote Transcoder)?

papabear99 commented 1 year ago

One more question, I see in the logs that you're running an Orchestrator with a remote Transcoder. Does the issue also occur in case of running the standalone Orchestrator (without the remote Transcoder)?

Yes the earlier graphs I posted earlier are from combined OTs