godka / Pensieve-PPO

The simplest implementation of Pensieve (SIGCOMM '17) via state-of-the-art RL algorithms, including PPO, DQN, and SAC, with support for both TensorFlow and PyTorch.
https://godka.github.io/Pensieve-PPO/
BSD 2-Clause "Simplified" License

Training with multiple videos with a random number of bitrates masked #3

Closed manojsoni2 closed 4 years ago

manojsoni2 commented 4 years ago

We are trying to train this model for multiple videos using PPO with different masks (a variable number of bitrates masked). For example, there are currently at most 12 bitrates, and some of them are randomly masked in different videos: sometimes only 9 of them are available, sometimes only 6, and so on.

So far we are trying to use the same dataset as used for A3C. We have modified abr.py and abrenv.py accordingly, but in our approach there seem to be issues in experience batch creation and its further processing. Can you share a reference on how to create a video dataset for PPO with different masks?
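
For reference, this is roughly how we generate the per-video masks (a minimal sketch; the constants and the helper name are our own, not taken from the repo):

import numpy as np

MAX_BITRATES = 12      # total bitrate levels across all videos
MIN_AVAILABLE = 6      # assumed lower bound on levels available per video

def random_bitrate_mask():
    # pick how many bitrates this video actually offers
    n_available = np.random.randint(MIN_AVAILABLE, MAX_BITRATES + 1)
    # choose which of the MAX_BITRATES levels are available
    available = np.random.choice(MAX_BITRATES, n_available, replace=False)
    mask = np.zeros(MAX_BITRATES)
    mask[available] = 1.0
    return mask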

godka commented 4 years ago

Hi, thanks for following our work. Regardless of the training methodology, you can use the weighted softmax method (consistent with the original Pensieve paper) to implement multi-video tasks. Unlike the original Pensieve repo (multi-video.py), the shape of the collected actions should stay the same. I'm currently working on submitting another paper to a conference, so I don't have time to update this repo right now; in that work I use the weighted softmax method for a different task. I hope the following steps help, and I'll update the code once that work is done.

i) In ppo2.py, create_network: change the output of the policy network like this:

pi_lin = tflearn.fully_connected(pi_net, self.a_dim, activation='linear')
# the third state channel is assumed to carry the 0/1 bitrate mask
mask = network_inputs[:, 2, :]
# weighted (masked) softmax: masked-out bitrates get zero probability
pi = mask * tf.exp(pi_lin) / (tf.reduce_sum(mask * tf.exp(pi_lin), axis=-1, keepdims=True) + ACTION_EPS)
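
As a quick sanity check (a plain NumPy sketch with made-up numbers, not code from the repo), the masked softmax gives zero probability to unavailable bitrates and sums to roughly 1 over the available ones:

import numpy as np

ACTION_EPS = 1e-4                           # small constant; value assumed here
pi_lin = np.array([0.5, -1.2, 2.0, 0.3])    # raw policy logits
mask = np.array([1.0, 0.0, 1.0, 1.0])       # bitrate index 1 is unavailable
pi = mask * np.exp(pi_lin) / (np.sum(mask * np.exp(pi_lin)) + ACTION_EPS)
print(pi)          # pi[1] == 0
print(pi.sum())    # ~1.0 over the available bitrates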

ii) In train_ppo.py, you need to pick actions w.r.t. the assigned mask. For example, if you use the Gumbel-max sampling trick to take actions according to the policy \pi, you can change the code like this:

# Gumbel-max sampling restricted to the available bitrates
# assuming that obs[2, :] is the mask
mask = obs[2, :]
# keep only the probabilities of the unmasked (available) bitrates
action_prob_mask = action_prob[np.where(mask > 0)]
noise_mask = np.random.gumbel(size=len(action_prob_mask))
# argmax over log-prob + Gumbel noise samples from the masked distribution
act_mask = np.argmax(np.log(action_prob_mask) + noise_mask)
# map the index back to the full action space
act = np.where(mask > 0)[0][act_mask]
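
An equivalent way to double-check the Gumbel-max trick (a sketch, not code from the repo; variable names follow the snippet above) is to renormalize the masked probabilities and sample directly:

# cross-check: direct categorical sampling over the available bitrates
avail = np.where(mask > 0)[0]
p = action_prob[avail]
p = p / p.sum()                       # renormalize over the available bitrates
act = np.random.choice(avail, p=p)    # same distribution as the Gumbel-max trick
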
manojsoni2 commented 4 years ago

Thanks for the help. Now we are facing one issue after integrating the above changes: the probability becomes NaN after some iterations (around 4000 epochs). Have you encountered such a problem?

godka commented 4 years ago

I see. It's a little bit tough. I haven't observed such problems in our work (actually, in another work, not an ABR scenario). Please check whether the action you actually picked is NOT covered by your mask. You are welcome to open a pull request if you solve this issue.
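
For example, a quick guard right after the action is picked would catch that (a sketch; variable names follow the snippet above):

# act should always land on an available bitrate (mask == 1)
assert mask[act] > 0, 'picked a masked-out bitrate: %d' % act
# and no zero probability should reach the log for an available bitrate
assert np.all(action_prob_mask > 0), 'zero probability for an available bitrate'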

manojsoni2 commented 4 years ago

If possible, can you share a link to the other work in which this method is used? We just want to use it as a reference for integrating it into Pensieve-PPO.

godka commented 4 years ago

Hi, sorry for the long delay. I have pushed the multi-video version to the branch 'gumbel'. The trick is: you should construct the weighted softmax like this:

https://github.com/godka/Pensieve-PPO/blob/10ed4e2a2cb972a183a24585afffdf0fb0e945e9/src/ppo2.py#L49