hongzimao / pensieve

Neural Adaptive Video Streaming with Pensieve (SIGCOMM '17)
http://web.mit.edu/pensieve/
MIT License

How does the paper's input match the code state s(t)? #141

Open ahmad-hl opened 2 years ago

ahmad-hl commented 2 years ago

Dear Hongzi,

I was trying to figure out the matching between the RL agent's state s(t) in the code and the input info in the paper.

Input: After the download of each chunk t, Pensieve’s learning agent takes state inputs st = (xt, τt, nt, bt, ct, lt) to its neural networks. xt is the network throughput measurements for the past k video chunks; τt is the download time of the past k video chunks; nt is a vector of m available sizes for the next video chunk; bt is the current buffer level; ct is the number of chunks remaining in the video; and lt is the bitrate at which the last chunk was downloaded.

First of all, which code package do we need to look at, multi_video_sim or sim?

When I look at sim, I see in def agent that the input state is

0: last quality (lt?)
1: buffer_size (bt)
2: chunk_size (?)
3: delay: is it the download time (τt)?
4: next_chunk_sizes (nt)
5: remain_chunks (ct)

Could you please illustrate the matching, and the actor & critic networks (Figure 5) if possible?

hongzimao commented 2 years ago

multi_video_sim is for agents that can generalize to videos with different numbers and different levels of bitrate encodings.

From what you wrote above, it looks to me that your understanding of the code and the paper is correct.
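
For concreteness, the state construction in sim/agent.py looks roughly like the sketch below (paraphrased; the normalization constants are defined at the top of that file, so treat the exact values here as illustrative only):

# state is an (S_INFO=6, S_LEN=8) matrix; the newest observation goes in the last column
state[0, -1] = VIDEO_BIT_RATE[bit_rate] / float(np.max(VIDEO_BIT_RATE))   # l_t: bitrate of the last chunk
state[1, -1] = buffer_size / BUFFER_NORM_FACTOR                           # b_t: current buffer level
state[2, -1] = float(video_chunk_size) / float(delay) / M_IN_K            # x_t: throughput = chunk size / download time
state[3, -1] = float(delay) / M_IN_K / BUFFER_NORM_FACTOR                 # tau_t: download time of the last chunk
state[4, :A_DIM] = np.array(next_video_chunk_sizes) / M_IN_K / M_IN_K     # n_t: sizes of the next chunk at each bitrate
state[5, -1] = np.minimum(video_chunk_remain, CHUNK_TIL_VIDEO_END_CAP) / float(CHUNK_TIL_VIDEO_END_CAP)  # c_t: chunks remaining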

ahmad-hl commented 2 years ago

I have upgraded the code to work on Python 3.8 and used cooked_traces to train the multi-agent RL model in the sim dir. Given that I'm using a computer with 2 GPUs and TensorBoard to monitor, what is the time required for the model to converge? How do you know if the model has converged?

Can you also explain the main components in the objective function?

# Compute the objective (log action_vector and entropy)
self.obj = tf.reduce_sum(
    tf.multiply(
        tf.log(tf.reduce_sum(tf.multiply(self.out, self.acts),
                             axis=1, keepdims=True)),
        -self.act_grad_weights)) \
    + ENTROPY_WEIGHT * tf.reduce_sum(
        tf.multiply(self.out, tf.log(self.out + ENTROPY_EPS)))
hongzimao commented 2 years ago

Thanks again for upgrading the codebase. The training wall time really depends on your physical hardware. You can monitor the learning curve and see when the performance on the validation set has stabilized. To determine whether the model has converged, you can use a heuristic such as the relative performance not improving much over the past xxx iterations. At the time, we just eyeballed it.
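
As a rough illustration (this helper is not part of the repo, just a sketch of such a heuristic), a convergence check could look like:

import numpy as np

def has_converged(val_rewards, window=100, rel_tol=0.01):
    # Compare the mean validation reward over the most recent `window` checkpoints
    # against the previous `window`; treat a tiny relative gain as convergence.
    if len(val_rewards) < 2 * window:
        return False
    recent = np.mean(val_rewards[-window:])
    previous = np.mean(val_rewards[-2 * window:-window])
    return (recent - previous) <= rel_tol * abs(previous)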

The main objective is just the policy gradient expression (the expression after the gradient operator). It's basically log pi_t * (R_t - baseline_t) + an entropy regularizer, summed over the training batch.
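
In plain numpy, ignoring the sign flips the TF code applies so it can minimize with gradient descent, the quantity being computed is roughly this (illustrative sketch, not the repo's code):

import numpy as np

def a3c_objective(out, acts, adv, entropy_weight=0.5, eps=1e-6):
    # out:  (batch, A_DIM) action probabilities from the actor
    # acts: (batch, A_DIM) one-hot encodings of the chosen actions
    # adv:  (batch, 1)     advantages R_t - baseline_t (baseline = critic's value estimate)
    log_pi = np.log(np.sum(out * acts, axis=1, keepdims=True) + eps)  # log pi(a_t | s_t)
    pg_term = np.sum(log_pi * adv)                                    # sum_t log pi_t * (R_t - b_t)
    entropy = -np.sum(out * np.log(out + eps))                        # entropy regularizer, encourages exploration
    return pg_term + entropy_weight * entropy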

Hope these help.