hongzimao / deeprm

Resource Management with Deep Reinforcement Learning (HotNets '16)
MIT License

loss function (In Policy Gradient section), optimizer and entropy #9

Open ahmadreza9 opened 4 years ago

ahmadreza9 commented 4 years ago

Dear Mr. Hongzi, I was interested in your resource scheduling method. Now I'm stuck in your network class. I can't understand why you used the function below:

loss = T.log(prob_act[T.arange(N), actions]).dot(values) / N

Did you design a special loss function? If not, what is the name of this loss function?

ahmadreza9 commented 4 years ago

I think that is related to Monte Carlo. [image attachment]

hongzimao commented 4 years ago

Yes, the code implements the REINFORCE algorithm. Notice that the loss is log_pi(s, a) * (value - baseline). When minimizing the loss, the underlying optimizer will differentiate the loss (which corresponds to the gradient operator in the equation) and apply a gradient step. Hope this helps!
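
In case it helps, here is a minimal NumPy sketch (made-up numbers, not the repository's code) of what that Theano expression evaluates to; `values` is assumed to already hold the return-minus-baseline term:

```python
import numpy as np

# Minimal NumPy sketch of what the Theano expression
#   loss = T.log(prob_act[T.arange(N), actions]).dot(values) / N
# computes for one batch. All numbers below are made up.
N = 4
prob_act = np.array([[0.7, 0.3],   # pi(a | s) for each of N states
                     [0.2, 0.8],
                     [0.5, 0.5],
                     [0.9, 0.1]])
actions = np.array([0, 1, 1, 0])            # actions actually taken
values = np.array([1.0, -0.5, 0.3, 2.0])    # advantages G_t - b_t

log_pi = np.log(prob_act[np.arange(N), actions])  # log pi(a_t | s_t)
loss = log_pi.dot(values) / N                     # REINFORCE surrogate loss
# Differentiating this scalar w.r.t. the policy parameters yields the
# REINFORCE gradient  E[ grad log pi(a_t | s_t) * (G_t - b_t) ].
print(loss)
```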

ahmadreza9 commented 4 years ago

So I can see that you calculate this function: [image attachment]

Now, could you tell me how you computed Gt = vt - bt? (I can't see the bt calculation anywhere, so it must be hidden in your loss function.) I only found this line in the parameters class: self.num_seq_per_batch = 10 # number of sequences to compute baseline

ahmadreza9 commented 4 years ago

> Yes, the code implements the REINFORCE algorithm. Notice that the loss is log_pi(s, a) * (value - baseline). When minimizing the loss, the underlying optimizer will differentiate the loss (which corresponds to the gradient operator in the equation) and apply a gradient step. I hope this helps!

Why did you use RMSProp? What are the problems with using the Adam optimizer? And what is the difference between the 'no_new_job' and 'all_done' values of the end variable (# termination type)?

hongzimao commented 4 years ago

Here's how we computed the advantage Gt, with a time-based baseline: https://github.com/hongzimao/deeprm/blob/master/pg_re.py#L193-L202.
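
As a rough illustration of what a time-based baseline looks like (a hedged sketch, not a copy of pg_re.py; the function name and padding scheme here are illustrative only):

```python
import numpy as np

# Hedged sketch of a time-based baseline: average the per-timestep
# returns across trajectories, then subtract that average from each
# trajectory's returns.
def time_based_advantages(all_returns):
    """all_returns: list of 1-D arrays, the discounted returns G_t of each trajectory."""
    max_len = max(len(r) for r in all_returns)
    padded = np.zeros((len(all_returns), max_len))
    alive = np.zeros((len(all_returns), max_len))
    for i, r in enumerate(all_returns):
        padded[i, :len(r)] = r
        alive[i, :len(r)] = 1.0
    # baseline b_t = mean return at timestep t over trajectories that reach t
    baseline = padded.sum(axis=0) / np.maximum(alive.sum(axis=0), 1.0)
    return [r - baseline[:len(r)] for r in all_returns]

# e.g. time_based_advantages([np.array([3.0, 1.0]), np.array([2.0])])
```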

IIRC, RMSProp was slightly more stable than Adam in our experiments. FWIW, the original A3C paper also used RMSProp (https://arxiv.org/pdf/1602.01783.pdf, see Optimizations in Section 4).

The last question was about different episode termination criteria. They have their literal meaning: 'no_new_job' ends the episode when no new jobs are coming, and 'all_done' only terminates the episode when all jobs (including the ones still unfinished when 'no_new_job' is satisfied) are completed: https://github.com/hongzimao/deeprm/blob/b42eff0ab843c83c2b1b8d44e65f99440fa2a543/environment.py#L255-L265.
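
Paraphrasing that logic as a toy sketch (the boolean arguments are illustrative, not the environment's real state variables):

```python
# Toy sketch of the two termination criteria described above.
def episode_done(end, no_more_new_jobs, all_jobs_finished):
    if end == 'no_new_job':
        return no_more_new_jobs
    if end == 'all_done':
        return no_more_new_jobs and all_jobs_finished
    raise ValueError('unknown termination type: %s' % end)
```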

ahmadreza9 commented 4 years ago

> Here's how we computed the advantage Gt, with a time-based baseline: https://github.com/hongzimao/deeprm/blob/master/pg_re.py#L193-L202.
>
> IIRC, RMSProp was slightly more stable than Adam in our experiments. FWIW, the original A3C paper also used RMSProp (https://arxiv.org/pdf/1602.01783.pdf, see Optimizations in Section 4).
>
> The last question was about different episode termination criteria. They have their literal meaning: 'no_new_job' ends the episode when no new jobs are coming, and 'all_done' only terminates the episode when all jobs (including the ones still unfinished when 'no_new_job' is satisfied) are completed: https://github.com/hongzimao/deeprm/blob/b42eff0ab843c83c2b1b8d44e65f99440fa2a543/environment.py#L255-L265

Thanks for your explanations. I wonder why you implemented the optimizers yourself. Did you have any special goal? (They're already predefined and implemented.) Does the Theano library provide these common optimizers or not?

ahmadreza9 commented 4 years ago
  1. You defined entropy and report its mean. Why didn't you use this metric in your article's evaluation section?
  2. I see you use the mem_alloc = 4 variable in pg_su.py. Why did you assign this constant?
hongzimao commented 4 years ago

We didn't reimplement the optimizers; Theano (and the more commonly used TensorFlow and PyTorch) all have built-in optimizers like RMSProp and Adam.
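
For example, a built-in optimizer can be used in one line (a modern PyTorch usage sketch, not code from this repository; the model is a placeholder):

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder policy network
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.9)
```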

The entropy is for promoting exploration at the beginning of RL training.
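
A common way to use it (a generic sketch, not necessarily how this repository wires it in; entropy_weight and all arrays are illustrative) is to add the policy entropy to the objective so the policy does not collapse too early:

```python
import numpy as np

# Generic sketch of an entropy bonus added to the policy-gradient objective.
def entropy_regularized_loss(prob_act, actions, advantages, entropy_weight=0.01):
    N = len(actions)
    log_pi = np.log(prob_act[np.arange(N), actions])
    entropy = -(prob_act * np.log(prob_act + 1e-8)).sum(axis=1).mean()
    # maximize  E[log pi * advantage] + entropy_weight * entropy,
    # i.e. minimize its negation
    return -(log_pi.dot(advantages) / N + entropy_weight * entropy)
```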

pg_su is for supervised learning. If I remember correctly, mem_alloc is just a parameter controlling the size of the generated dataset.

ahmadreza9 commented 4 years ago

Thank you for your attention, but I should clarify that I was talking about rmsprop_updates, and I would appreciate it if you renamed stepsize (the third input of the function below) to lr_rate. (It confused me :))

def rmsprop_updates(grads, params, stepsize, rho=0.9, epsilon=1e-9):
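
For reference, here is the standard RMSProp rule I understand a function with this signature to implement, in which stepsize plays the role of the learning rate (a generic NumPy sketch, not the repository's code):

```python
import numpy as np

# Generic sketch of one RMSProp step for a single parameter array;
# `stepsize` is what is usually called the learning rate.
def rmsprop_step(param, grad, acc, stepsize, rho=0.9, epsilon=1e-9):
    acc = rho * acc + (1.0 - rho) * grad ** 2          # running avg of squared grads
    param = param - stepsize * grad / np.sqrt(acc + epsilon)
    return param, acc
```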

ahmadreza9 commented 4 years ago

The special parameter in your rmsprop_updates function is grads, which holds the gradients of the loss with respect to params: grads = T.grad(loss, params). I found torch.autograd.grad(outputs, inputs) for PyTorch and tf.gradients(ys, xs) for TensorFlow. Are these equivalent to your grads? (if you have worked with PyTorch and TensorFlow)

hongzimao commented 4 years ago

You are right that rmsprop_updates is a customized function. I guess back at that time standardized libraries for those optimizers were not available :) Things are easier nowadays. And you are right about the gradient operations in TensorFlow and PyTorch.
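
As an illustration of the PyTorch equivalent (a toy sketch with made-up shapes and numbers, not code from the repository):

```python
import torch

# torch.autograd.grad plays the same role as Theano's T.grad(loss, params).
x = torch.randn(4, 3)                         # batch of 4 "states"
w = torch.randn(3, 2, requires_grad=True)     # policy parameters
prob_act = torch.softmax(x @ w, dim=1)        # pi(a | s)
actions = torch.tensor([0, 1, 1, 0])
advantages = torch.tensor([1.0, -0.5, 0.3, 2.0])

log_pi = torch.log(prob_act[torch.arange(4), actions])
loss = log_pi.dot(advantages) / 4

grads = torch.autograd.grad(loss, [w])        # gradients of loss w.r.t. params
```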

ahmadreza9 commented 3 years ago

> The entropy is for promoting exploration at the beginning of RL training.

Sir, your answer about entropy is not convincing to me. You defined a method for it, but you did not use it in your network or in the REINFORCE training (single or multiple).