[Open] Franck-UltimaRatio opened this issue 3 years ago
I'd like to add that these 27-minute spikes only happen when I'm transcoding or have had a session in the prior 27-minute window. When my O is quiet (waiting for work) there are no spikes. Note: the spike @ 17:12 is a test stream and is not the issue referenced.
Another thing I noticed this morning is that the size of the spike seems to be proportional to the number of concurrent sessions. The first spike goes from 24 to 40 (16 sessions), the second 20 to 34 (14 sessions), the third 16 to 26 (10 sessions), and the fourth 13 to 22 (9 sessions). The first spike hit my max sessions, so it was probably higher than shown, because I did have a drop in sessions following the spike.
That's true, there is something like a ratio of around x1.66.
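Quick check of that ratio from the four spikes above (peak divided by baseline):

$$
\tfrac{40}{24} \approx 1.67,\qquad \tfrac{34}{20} = 1.70,\qquad \tfrac{26}{16} \approx 1.63,\qquad \tfrac{22}{13} \approx 1.69
$$

so the peak consistently lands around 1.6-1.7x the baseline session count.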
The last ghost spike, without my node even being capped, just killed most of my current jobs :( including a nice 4K --> 1080.
Today I'm seeing a new bunny-ear 27-minute spike. Just adding it for possible clues as to what's causing these spikes.
I'm still having this issue, and I noticed an odd, possibly related issue the other day where a session was switching between GPUs at 27-minute intervals. This may happen regularly, but this was a slow day so it was easy to spot the change in transcode times, which I was then able to correlate with an increase/decrease in usage on the two different GPU models installed on the O/T.
Switched to a single-GPU setup and still saw those "27-minute spikes" during high volume. It doesn't seem to be multi-GPU related. Though I did get the impression that streams were sticking around longer.
FWIW this issue still persists. It's very easy to see with a single session.
Thanks for chasing that down @papabear99
I did some initial investigation and I think that it's related to the B<>O auth token refresh mechanism, which refreshes the token every 27 minutes (30m - 10%*30m).
Now, I don't understand yet why the auth token refresh causes additional segments to be sent. I would expect the auth token refresh to be just an RPC call to /GetOrchestrator, not actually sending any segments to transcode.
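To make the timing concrete, here is a minimal Go sketch of a refresh loop firing at 90% of a 30-minute token TTL; the identifiers (authTokenTTL, refreshToken) are placeholders for illustration, not actual go-livepeer names, and the real refresh is presumably the /GetOrchestrator RPC mentioned above:

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative only: these names are placeholders, not go-livepeer identifiers.
const authTokenTTL = 30 * time.Minute

// refreshLoop re-authenticates at 90% of the token TTL,
// i.e. every 30m - 10%*30m = 27m, which matches the spike period.
func refreshLoop(refreshToken func() error) {
	interval := authTokenTTL - authTokenTTL/10 // 27 minutes
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for range ticker.C {
		// In the real flow this would be the /GetOrchestrator RPC;
		// it should only refresh credentials, not submit segments.
		if err := refreshToken(); err != nil {
			fmt.Println("token refresh failed:", err)
		}
	}
}

func main() {
	fmt.Println("refresh interval:", authTokenTTL-authTokenTTL/10) // prints 27m0s
}
```

The point is that nothing in a timer like this should generate segments on its own, which is why the extra sessions during the spike are surprising.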
Before I try to debug it deeper, I would need some additional information from you (CC: @papabear99 @eliteprox @Franck-UltimaRatio): is the spike just additional /GetOrchestrator calls being counted as sessions (something we expect)? If this question is not clear, then could you just point me to how you calculate "session" in the chart?

It's more difficult for us to see this spike now because we no longer have the surge of streams we had during the L1 and gas spike period (as you can see in our old screenshots) ^^ I'll try to find some clear reference of this spike, but for the moment I can show you which calculation we are using for the stream count: sum(livepeer_current_sessions_total{job="livepeer"})
Thanks @Franck-UltimaRatio
Some data points will be helpful as well.
Here's a log from the last spike on the chart I posted.
27minSpikesSIN_1min_log.txt
DNS: solarfarm.multiorch.com
I'm using this query: sum(livepeer_current_sessions_total{node_type="orch"})
Thanks for the logs @papabear99 . It's interesting, in the logs I see that each segment is actually transcoded only once.
One more question, I see in the logs that you're running an Orchestrator with a remote Transcoder. Does the issue also occur in case of running the standalone Orchestrator (without the remote Transcoder)?
Yes, the earlier graphs I posted are from combined OTs.
Describe the bug
Every 27 minutes there is a spike in the number of streams. The size of the spike is related to the number of streams you have (and maybe the kind of streams). For example, if you have 11 streams, the spike goes up to 20, then back to 11. If your node is capped, or almost capped, the spike exceeds maxSessions and at the end of the spike you lose some jobs (i.e. you have 68 streams, node capped at 70, and after the spike you drop to 40 streams). Also confirmed by Titan-node and papa bear.
These spikes look like ghost streams (every 27 minutes) linked to your actual streams, but if you can't handle them because of your session limit, you lose your jobs.
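As a rough sanity check using the ~1.66 ratio discussed in the comments above (illustrative arithmetic only, not measured from logs):

$$
68 \times 1.66 \approx 113 > 70 \;(\text{maxSessions})
$$

so on a nearly-full node the ghost sessions alone push it well past its cap during the spike, which is consistent with real jobs being dropped afterwards.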
I can reproduce this bug with all of these configurations: T and O merged or T and O split, Windows 10 node and Ubuntu 21.04 node.
Screenshots