hongzimao / pensieve

Neural Adaptive Video Streaming with Pensieve (SIGCOMM '17)
http://web.mit.edu/pensieve/
MIT License

Some questions about pensieve #70

Closed Suliucuc closed 4 years ago

Suliucuc commented 5 years ago

Hello, I'm sorry to bother you, and thank you very much in advance for your answers. I have several questions:

1. About training the actor and critic networks: for every 200 samples, the weight gradients are computed and saved, then the sample arrays are cleared and refilled with new samples. In other words, Pensieve does not draw random samples from a memory pool for training. Is this because samples from long ago are of little use for training the current network?
2. You apply gradients only after 15 gradients have been stored, but that update uses all 15 gradient values, which seems equivalent to updating the parameters on every training step. Why do you apply gradients only after accumulating 15 of them?
3. About convergence: I ran agent.py for nearly tens of thousands of iterations, but the loss did not converge. Could you tell me roughly how many iterations you trained before reaching convergence?
4. About generating synthetic traces with make_traces.py: I read in your paper that the generated traces follow a Markov model, and that to make the current state more likely to transition to adjacent states, the transition probabilities follow a geometric distribution. But the parameter of a geometric distribution is defined over the number of trials X, so what is the parameter of the geometric distribution here?
5. About the baseline comparisons: your paper compares against BOLA and a rate-based algorithm, but I could not find their code in this repository. Would it be convenient to share the code for these two algorithms? If not, that's fine; I can find open-source implementations online and modify them.

Sorry to bother you again, and I look forward to your reply. Thanks!

hongzimao commented 5 years ago

For your first question, note that actor-critic is an on-policy reinforcement learning algorithm, so it has to use data generated by the current policy. Off-policy methods such as Q-learning can reuse data from a replay memory (the "memory pool" you mentioned). Some modified versions of actor-critic with importance sampling can also use off-policy data, but we did not implement that here.
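To make the distinction concrete, here is a minimal, hypothetical sketch (not code from this repo; `policy`, `q_net`, and their `step`/`update` methods are made up for illustration) of how the two data regimes differ:

```python
import random
import collections

# On-policy (actor-critic): train only on the rollout that the *current*
# policy just produced, then discard it and collect fresh data.
def train_on_policy(policy, env, batch_size=200):
    rollout = [policy.step(env) for _ in range(batch_size)]  # fresh samples
    policy.update(rollout)  # gradients come only from this rollout
    # rollout is thrown away; the next batch is generated by the updated policy

# Off-policy (e.g. Q-learning): old transitions remain usable, so they are
# kept in a replay memory and sampled at random for every update.
replay = collections.deque(maxlen=100_000)

def train_off_policy(q_net, env):
    replay.append(q_net.step(env))  # store the newest transition
    batch = random.sample(replay, k=min(32, len(replay)))
    q_net.update(batch)  # reuse old and new data alike
```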

I don't understand the second question about 15 gradients.

As mentioned in the README.md, please consider running multi_agent.py, which trains the RL agent more efficiently in parallel.

The code for generating the traces is in synthetic_traces.py. We wrote the implementation details in section 5.3 "Training with a synthetic dataset" in the paper.
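For question 4, synthetic_traces.py and Section 5.3 of the paper are the authoritative references; the following is only a rough sketch of how a geometric distribution over the distance between states can favor transitions to adjacent states. Every parameter value below (the state grid, the geometric parameter, the noise scale) is an assumption made for the sketch, not the paper's setting:

```python
import numpy as np

STATES = np.linspace(0.2, 4.3, 20)  # candidate average-throughput states (Mbps), assumed grid
P_GEO = 0.3                         # assumed geometric parameter

def next_state(cur, rng):
    # The probability of jumping a distance d away falls off geometrically
    # in d, so adjacent states are the most likely destinations.
    dists = np.abs(np.arange(len(STATES)) - cur)
    probs = (1 - P_GEO) ** dists
    probs[cur] = 0.0        # force a move to a different state
    probs /= probs.sum()
    return rng.choice(len(STATES), p=probs)

def make_trace(length, seed=0):
    rng = np.random.default_rng(seed)
    state, trace = len(STATES) // 2, []
    for _ in range(length):
        state = next_state(state, rng)
        # noisy throughput sample around the state's mean (noise scale assumed)
        trace.append(max(0.1, rng.normal(STATES[state], 0.3)))
    return trace
```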

The code for BOLA and the rate-based algorithm is in https://github.com/hongzimao/pensieve/blob/master/video_server/myindex_BOLA.html and https://github.com/hongzimao/pensieve/blob/master/video_server/myindex_RB.html. Note that these are already implemented in JavaScript in dash; we just invoke the existing ABR modules.

Hope these help.

Suliucuc commented 5 years ago

Thanks for replying; your reply helps me a lot. But I am still puzzled about why you apply gradients only after `actor_gradient_batch` has stored 15 gradients, which means training on 15 * 200 samples. Since you apply every element of the gradient array anyway, why not apply the gradients each time the number of samples reaches 200? The code in question:

```python
for i in range(len(actor_gradient_batch)):
    actor.apply_gradients(actor_gradient_batch[i])
    critic.apply_gradients(critic_gradient_batch[i])
```


hongzimao commented 5 years ago

These are 16 gradients: they are computed by the parallel workers and aggregated on the central agent for the policy update.
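For readers landing here later, a simplified sketch of the pattern this answer describes (this is not the actual multi_agent.py; the queue objects, the `get_network_params` name, and the per-worker rollout are illustrative, while `apply_gradients` mirrors the method quoted in the question):

```python
NUM_AGENTS = 16  # one gradient per parallel worker per update round

def central_agent(actor, critic, exp_queues, param_queues):
    """Aggregate one gradient from each worker, apply them all,
    then broadcast the updated parameters back to the workers."""
    while True:
        # blocks until every worker has reported its (actor_grad, critic_grad)
        grads = [q.get() for q in exp_queues]
        for actor_grad, critic_grad in grads:
            actor.apply_gradients(actor_grad)    # same apply loop quoted above
            critic.apply_gradients(critic_grad)
        new_params = (actor.get_network_params(), critic.get_network_params())
        for q in param_queues:                   # workers resume with the new policy
            q.put(new_params)
```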