ahmadreza9 opened this issue 4 years ago
I think that is related to Monte Carlo.
Yes, the code implements the REINFORCE algorithm. Notice that the loss is log_pi(s, a) * (value - baseline). When minimizing the loss, the underlying optimizer will differentiate the loss (which corresponds to the gradient operator in the equation) and apply a gradient step. Hope this helps!
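For intuition, here is a minimal sketch of that loss in PyTorch (not the repo's Theano code; names like reinforce_loss are mine), showing how autograd supplies the gradient operator once the loss is written as log_pi(s, a) weighted by the advantage:

```python
# Illustrative REINFORCE loss; the minus sign is there because most optimizers minimize,
# whereas the policy gradient ascends the objective. The repo's sign convention may differ.
import torch

def reinforce_loss(log_probs, values, baseline):
    """log_probs: log pi(a_t|s_t) of the taken actions [N]; values: returns v_t [N]; baseline: b_t [N]."""
    advantage = values - baseline
    return -(log_probs * advantage).mean()

# Toy usage with a categorical policy over 3 actions and a batch of 5 states:
logits = torch.randn(5, 3, requires_grad=True)
actions = torch.tensor([0, 2, 1, 1, 0])
log_probs = torch.log_softmax(logits, dim=1)[torch.arange(5), actions]
loss = reinforce_loss(log_probs, torch.rand(5), torch.rand(5))
loss.backward()   # autograd differentiates the loss, i.e. the gradient operator in the equation
```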
So I can see that you calculate this function. Now, could you tell me how you computed Gt = vt - bt? I can't see a bt calculation anywhere, so it must be hidden in your loss function. I just found this comment in the parameters class: self.num_seq_per_batch = 10  # number of sequences to compute baseline
Why did you use RMSProp? What are the problems with using the Adam optimizer? Also, what is the difference between the two settings of the end variable (# termination type, 'no_new_job' or 'all_done')?
Here's how we computed the advantage Gt, with a time-based baseline: https://github.com/hongzimao/deeprm/blob/master/pg_re.py#L193-L202.
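Roughly, the time-based baseline averages the returns at each timestep across the trajectories in a batch, and the advantage is the return minus that average. A NumPy sketch of the idea (illustrative only; see the linked pg_re.py lines for the actual code):

```python
# Time-based baseline: b_t = mean over trajectories of the return at timestep t.
import numpy as np

def time_based_advantages(all_returns):
    """all_returns: list of 1-D arrays, the discounted returns v_t of each trajectory."""
    max_len = max(len(r) for r in all_returns)
    padded = np.zeros((len(all_returns), max_len))   # pad shorter trajectories with zero return
    for i, r in enumerate(all_returns):
        padded[i, :len(r)] = r
    baseline = padded.mean(axis=0)                        # b_t
    return [r - baseline[:len(r)] for r in all_returns]   # G_t = v_t - b_t
```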
IIRC, RMSProp was slightly more stable than Adam in our experiments. FWIW, the original A3C paper also used RMSProp (https://arxiv.org/pdf/1602.01783.pdf, see Optimizations in Section 4).
The last comment was about the different episode termination criteria. I think they have their literal meanings: 'no_new_job' ends the episode when no new jobs are coming, and 'all_done' only terminates the episode when all jobs (including the ones still unfinished when the 'no_new_job' condition is met) are completed: https://github.com/hongzimao/deeprm/blob/b42eff0ab843c83c2b1b8d44e65f99440fa2a543/environment.py#L255-L265.
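In pseudocode, the two termination types amount to something like this (a rough sketch, not the exact logic at the link above):

```python
# Rough sketch of the 'no_new_job' vs 'all_done' termination criteria.
def episode_done(end_type, no_more_new_jobs, all_jobs_finished):
    if end_type == 'no_new_job':
        return no_more_new_jobs                        # stop once the job sequence is exhausted
    if end_type == 'all_done':
        return no_more_new_jobs and all_jobs_finished  # also wait for remaining jobs to finish
    raise ValueError('unknown termination type: %s' % end_type)
```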
Thanks for your explanations. I wonder why you implemented the optimizers yourselves; did you have any special goals? (They are predefined and already implemented elsewhere.) Does the Theano library provide these common optimizers or not?
What about the mem_alloc = 4 variable in pg_su.py? Why do you assign this constant?

We didn't reimplement the optimizers; Theano (and the more commonly used TensorFlow or PyTorch) all have built-in optimizers like RMSProp and Adam.
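For example, using a built-in optimizer in PyTorch takes only a few lines (illustrative only; the repo itself is written in Theano):

```python
# Built-in RMSProp in PyTorch; the library handles the gradient step for you.
import torch

model = torch.nn.Linear(10, 3)          # stand-in for a policy network
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.9, eps=1e-9)

loss = model(torch.randn(4, 10)).sum()  # any scalar loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```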
The entropy is for promoting exploration at the beginning of RL training.
pg_su is for supervised learning. If I remember correctly, mem_alloc is just a parameter controlling the size of the generated dataset.
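Regarding the entropy remark above, an entropy bonus usually looks like the following (an illustrative sketch, not the repo's exact usage): the policy's entropy is subtracted from the loss, so keeping the action distribution spread out is rewarded early in training.

```python
# Entropy-regularized policy-gradient loss (illustrative; names are mine).
import torch

def loss_with_entropy(log_probs, advantages, probs, entropy_weight=0.01):
    """log_probs: log pi(a|s) of taken actions [N]; probs: full action distributions [N, A]."""
    pg_loss = -(log_probs * advantages).mean()
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1).mean()  # H(pi(.|s))
    return pg_loss - entropy_weight * entropy   # higher entropy -> lower loss -> more exploration
```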
Thank you for your attention, but I should clarify that I was talking about rmsprop_updates, and I would appreciate it if you changed stepsize (the 3rd input of the function below) to lr_rate. (It confused me :))

def rmsprop_updates(grads, params, stepsize, rho=0.9, epsilon=1e-9):

The special parameter in your rmsprop_updates function is grads, which holds the gradients of the loss with respect to params: grads = T.grad(loss, params). I found torch.autograd.grad(gg, xx) for PyTorch and tf.gradients(ys, xs) for TensorFlow. Are these equivalent to your grads? (If you have worked with PyTorch and TensorFlow.)
You are right that rmsprop_updates is a customized function. I guess back at that time, standardized library implementations of those optimizers were not available :) Things are easier nowadays. And you are right about the gradient operations in TensorFlow and PyTorch.
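For reference, the generic RMSProp update that a hand-rolled rmsprop_updates typically performs looks like this (a NumPy sketch with my own variable names; the repo builds the same kind of update as a Theano updates list):

```python
# Generic RMSProp step: keep a running average of squared gradients, scale the step by its root.
import numpy as np

def rmsprop_step(param, grad, cache, lr_rate=1e-3, rho=0.9, epsilon=1e-9):
    cache = rho * cache + (1.0 - rho) * grad ** 2
    param = param - lr_rate * grad / (np.sqrt(cache) + epsilon)
    return param, cache
```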
Sir, your answer about entropy is not convincing to me. You have a method for it, but you did not use it in your network or in the REINFORCE training (single or multiple).
Dear Mr. Hongzi, I was interested in your resource scheduling method. Now I am stuck in your network class. I can't understand why you used the function below:

loss = T.log(prob_act[T.arange(N), actions]).dot(values) / N

Did you design a special loss function? If you didn't, what is the name of this loss function?
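For what it's worth, that line is the REINFORCE objective discussed earlier in this thread: it selects log pi(a_t|s_t) for the actions actually taken and weights each by its (baseline-subtracted) return. A NumPy sketch of what the expression computes (illustrative; the real code builds a Theano graph):

```python
# NumPy equivalent of T.log(prob_act[T.arange(N), actions]).dot(values) / N
import numpy as np

prob_act = np.array([[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]])  # pi(a|s) for 3 states, 2 actions
actions  = np.array([0, 1, 1])                              # actions actually taken
values   = np.array([1.5, -0.3, 0.9])                       # advantages G_t = v_t - b_t
N = len(actions)

log_pi_taken = np.log(prob_act[np.arange(N), actions])      # log pi(a_t|s_t)
loss = log_pi_taken.dot(values) / N                         # advantage-weighted average log-prob
```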