livepeer / go-livepeer

Official Go implementation of the Livepeer protocol
http://livepeer.org
MIT License

GPU LB : Algorithm Improvements #1110

Closed j0sh closed 4 years ago

j0sh commented 5 years ago

Can we do better than the following on a per-segment basis?

if _, ok := sessions[id]; !ok {
  sessions[id] = gpus[i%len(gpus)]; i++
}
gpu := sessions[id]
j0sh commented 5 years ago

One approach is the following (a rough code sketch follows the assumptions below):

  1. Compute a "score" for each segment: width * height * fps (WHFPS), summed over each rendition.
  2. Add a multiplier or constant factor for setting up a session on a new GPU, if necessary.
  3. Select the GPU that has the lowest score.
  4. Add the score to the GPU as a segment begins processing.
  5. Subtract the score from the GPU as the segment completes.

Some assumptions:

[1] Lookahead, multipass, arithcoder, etc.
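
For concreteness, a rough Go sketch of steps 1-5 above. The rendition and balancer types, the flat setupCost penalty, and the selectGPU / completeSegment names are all illustrative assumptions, not existing go-livepeer code:

type rendition struct {
  Width, Height, FPS int64
}

// segmentScore is the WHFPS score: width * height * fps, summed over renditions.
func segmentScore(renditions []rendition) int64 {
  var score int64
  for _, r := range renditions {
    score += r.Width * r.Height * r.FPS
  }
  return score
}

type balancer struct {
  load      []int64        // running score per GPU
  sessions  map[string]int // stream ID -> GPU index of the last assignment
  setupCost int64          // penalty for initializing a session on a new GPU
}

// selectGPU picks the GPU with the lowest effective score for this segment,
// charging the setup penalty for GPUs the session is not already on (step 2),
// then adds the segment's score to the chosen GPU (step 4).
func (b *balancer) selectGPU(id string, score int64) int {
  assigned, ok := b.sessions[id]
  best, bestCost := -1, int64(0)
  for i, l := range b.load {
    cost := l
    if !ok || assigned != i {
      cost += b.setupCost
    }
    if best < 0 || cost < bestCost {
      best, bestCost = i, cost
    }
  }
  b.sessions[id] = best
  b.load[best] += score
  return best
}

// completeSegment subtracts the score once the segment finishes (step 5).
func (b *balancer) completeSegment(gpu int, score int64) {
  b.load[gpu] -= score
}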

Later Improvements:

Weigh each GPU's score against the elapsed time vs. predicted time of any concurrent segments, e.g. gpu_score = sum(segment_score[i] * (1.0 - elapsed_time[i] / predicted_time[i])), in order to improve utilization on GPUs that may be close to finishing up their concurrent segments.

predicted_time is some derived measure, e.g. segment_duration * last_transcode_duration / last_segment_duration. For the first segment, with no history to draw on, the predicted time could be estimated as segment_duration * 1.0: an assumption of realtime processing.

Calculating the predicted time gets interesting when taking into account overlapping jobs, given that each additional unit of work slows down existing work.
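
A minimal sketch of that weighting, assuming per-segment scores, elapsed times, and predicted times are tracked side by side, and ignoring the slowdown from overlapping jobs (function names are illustrative):

// predictedTime scales the segment duration by the last observed
// transcode-time / segment-duration ratio. With no history (first segment),
// assume realtime processing: predicted time == segment duration.
func predictedTime(segDur, lastTranscodeDur, lastSegDur float64) float64 {
  if lastSegDur <= 0 {
    return segDur
  }
  return segDur * lastTranscodeDur / lastSegDur
}

// weightedGPUScore discounts each in-flight segment's score by the fraction
// of its predicted time that has already elapsed:
//   gpu_score = sum(segment_score[i] * (1.0 - elapsed_time[i] / predicted_time[i]))
func weightedGPUScore(scores, elapsed, predicted []float64) float64 {
  var total float64
  for i := range scores {
    remaining := 1.0 - elapsed[i]/predicted[i]
    if remaining < 0 {
      remaining = 0 // running past its prediction; treat as nearly done
    }
    total += scores[i] * remaining
  }
  return total
}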

The efficiency of this (and any other) algorithm can be evaluated by looking at P95 / P99 latencies.
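
For reference, a nearest-rank percentile over collected per-segment latencies is simple to compute offline (a sketch; how the latencies are collected is assumed):

import (
  "math"
  "sort"
  "time"
)

// percentile returns the p-th percentile (0 < p <= 1) of the observed
// per-segment transcode latencies, using the nearest-rank method.
func percentile(latencies []time.Duration, p float64) time.Duration {
  if len(latencies) == 0 {
    return 0
  }
  sorted := append([]time.Duration(nil), latencies...)
  sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
  idx := int(math.Ceil(p*float64(len(sorted)))) - 1
  if idx < 0 {
    idx = 0
  }
  return sorted[idx]
}

// e.g. p95 := percentile(latencies, 0.95); p99 := percentile(latencies, 0.99)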

j0sh commented 5 years ago

The approach described at https://github.com/livepeer/go-livepeer/issues/1110#issuecomment-538066295 falls apart at high concurrency, since session initialization on the GPU can take upwards of a minute to complete. See https://gist.github.com/j0sh/ae9e5a97e794e364a6dfe513fa2591c2 for more details.

A compromise is to do the GPU assignment only once rather than for each segment, fixing the job to a single GPU. This approach does not allow us to shed load mid-stream, but it avoids most of the issues around repeated session initialization. In effect, the algorithm becomes:

// getGPU returns the GPU assigned to the session,
// making the assignment if necessary.
func getGPU(job string, sessions map[string]int, gpuLoad []int) int {
  gpu, ok := sessions[job]
  if !ok {
    gpu = leastLoadedGPU(gpuLoad)
    sessions[job] = gpu
    // estimateSessionCost is assumed to return the expected load of a
    // new session, e.g. a WHFPS-style score for the job's renditions.
    gpuLoad[gpu] += estimateSessionCost()
  }
  return gpu
}

// leastLoadedGPU selects the currently least loaded GPU,
// where `load` is the list of per-GPU loads.
// Returns the GPU index with the lowest load.
func leastLoadedGPU(load []int) int {
  min := 0
  for i, l := range load {
    if l < load[min] {
      min = i
    }
  }
  return min
}

Internally, when we select the index of the least loaded GPU, we maintain a "ratcheting" index indicating the position from which to start searching the GPU load list. While the algorithm still yields the least loaded GPU at each point in time, a natural tapering effect emerges over time as lower indices are observed and selected more frequently. This isn't necessarily a problem per se, but simulation shows a slight latency improvement when the ratchet is used.
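
A minimal sketch of that ratchet, under the assumption that it simply rotates the scan's starting position after each selection (names and the exact advancement rule are illustrative; the real implementation may differ):

type ratchetLB struct {
  load    []int
  ratchet int // position from which the next search starts
}

// leastLoadedGPU scans every GPU beginning at the ratchet position, returns
// the index with the lowest load (ties broken by scan order), and advances
// the ratchet so the next search starts one position later.
func (lb *ratchetLB) leastLoadedGPU() int {
  n := len(lb.load)
  best := lb.ratchet % n
  for i := 1; i < n; i++ {
    idx := (lb.ratchet + i) % n
    if lb.load[idx] < lb.load[best] {
      best = idx
    }
  }
  lb.ratchet = (lb.ratchet + 1) % n
  return best
}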