One approach is the following:
Weigh each GPU's score against the elapsed time vs. predicted time of any concurrent segments, e.g. `gpu_score = sum(segment_score[i] * (1.0 - elapsed_time[i] / predicted_time[i]))`, in order to improve utilization on GPUs that are perhaps close to finishing up concurrent segments.
`predicted_time` is some derived measure, e.g. `segment_duration * last_transcode_duration / last_segment_duration`. The latency value here for the first segment could be estimated to be `segment_duration * 1.0`: an assumption of realtime processing.
Calculating the predicted time gets interesting when taking into account overlapping jobs, given that each additional unit of work slows down existing work.
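As a rough illustration, here is a minimal Go sketch of the scoring and prediction above. All names (`gpuScore`, `predictedTime`, and their parameters) are hypothetical, not actual go-livepeer identifiers, and the slowdown from overlapping jobs is deliberately left out:

```go
// gpuScore weighs a GPU by the remaining work of its concurrent
// segments: a segment near its predicted finish contributes little,
// so a GPU about to free up scores lower (i.e. more available).
// Assumes the three slices are parallel and the same length.
func gpuScore(segScore, elapsed, predicted []float64) float64 {
	score := 0.0
	for i := range segScore {
		score += segScore[i] * (1.0 - elapsed[i]/predicted[i])
	}
	return score
}

// predictedTime derives the expected transcode time for the next
// segment from the last segment's transcode-to-duration ratio.
// With no history, assume realtime processing: segDur * 1.0.
func predictedTime(segDur, lastTranscodeDur, lastSegDur float64) float64 {
	if lastSegDur == 0 {
		return segDur * 1.0 // realtime assumption for the first segment
	}
	return segDur * lastTranscodeDur / lastSegDur
}
```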
Efficiency of this (and any other algorithm) can be evaluated by looking at P95 / P99 latencies.

[1] Lookahead, multipass, arithcoder, etc.
The approach described at https://github.com/livepeer/go-livepeer/issues/1110#issuecomment-538066295 falls apart at high concurrency, since session initialization on the GPU can take upwards of a minute to complete. See https://gist.github.com/j0sh/ae9e5a97e794e364a6dfe513fa2591c2 for more details.
A compromise is to only do the GPU assignment once, rather than for each segment, fixing the job to a single GPU. This approach does not allow us to shed load mid-stream, but avoids most of the issues around repeated session initialization. In effect, the algorithm becomes:
```go
// getGPU returns the GPU assigned to the job's session,
// making the assignment if necessary.
func getGPU(job string, sessions map[string]int, load []int) int {
	gpu, ok := sessions[job]
	if !ok {
		gpu = leastLoadedGPU(load)
		sessions[job] = gpu
		load[gpu] += estimateSessionCost()
	}
	return gpu
}
```
```go
// leastLoadedGPU selects the currently least loaded GPU,
// where load is a list of per-GPU loads.
// It returns the GPU index with the lowest load.
func leastLoadedGPU(load []int) int {
	return minIdx(load)
}
```
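For completeness, here are hedged stubs for the two helpers the pseudocode assumes: `minIdx` is a plain argmin, while `estimateSessionCost` is a placeholder whose real value would come from measurement.

```go
// minIdx returns the index of the smallest element in load.
// Assumes load is non-empty.
func minIdx(load []int) int {
	min := 0
	for i, l := range load {
		if l < load[min] {
			min = i
		}
	}
	return min
}

// estimateSessionCost is a hypothetical stub; the actual cost of a
// new session would be derived from observed per-stream transcode load.
func estimateSessionCost() int {
	return 1 // assume one unit of load per session
}
```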
Internally, when we select the index of the least loaded GPU, we maintain a "ratcheting" index indicating the location from which to start searching the GPU load list. While the algorithm still yields the least loaded GPU at each point in time, a natural tapering effect emerges over time as lower indices are observed and selected more frequently. This isn't a problem per se, and simulation shows a slight latency improvement when the ratchet is used.
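A minimal sketch of one way such a ratchet could work, assuming the start index advances past each selection and ties are broken by scan order (the actual implementation may differ):

```go
// ratchet remembers where the next least-loaded search should begin.
type ratchet struct {
	start int
}

// leastLoadedGPU scans all GPUs starting from the ratchet position,
// so ties between equally loaded GPUs are broken by a rotating start
// rather than always favoring index 0. It still returns a GPU with
// the minimum load. Assumes load is non-empty.
func (r *ratchet) leastLoadedGPU(load []int) int {
	n := len(load)
	min := r.start % n
	for i := 1; i < n; i++ {
		idx := (r.start + i) % n
		if load[idx] < load[min] {
			min = idx
		}
	}
	r.start = min + 1 // begin the next search after this selection
	return min
}
```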
Can we do better than the above on a per-segment basis?