hongzimao / input_driven_rl_example

Variance Reduction for Reinforcement Learning in Input-Driven Environments (ICLR '19)
https://people.csail.mit.edu/hongzi/var-website/index.html
MIT License

Question about the input sequence (z_t) in the code #1

Closed · hongBry closed this issue 5 years ago

hongBry commented 5 years ago

Hi @hongzimao, I have recently become interested in the code and paper, and one thing confuses me: in your paper, the input-dependent baseline

b(ω_t, z_{t:∞}) takes both ω_t and z_{t:∞}, the input sequence from t onwards.

But in load_balance_actor_multi_critic_train.py, I do not see the input sequence, which is unobserved at time t.

hongzimao commented 5 years ago

Thanks for your interest! This code implements the multi-value network approach. Please refer to Section 5 and Appendix I of the paper for more details.

Also, please notice that at training time the input sequence is already known: it is available to the critic, but not to the actor.
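
To make that concrete, here is a minimal sketch of a time-based Monte Carlo variant of the input-dependent baseline: fix one input trace z, replay it across several rollouts, and average the reward-to-go at each step. `env`, `policy`, and their methods are hypothetical placeholders, not identifiers from this repo.

```python
import numpy as np

def mc_input_dependent_baseline(env, policy, z_trace, n_rollouts=16, horizon=100):
    """Monte Carlo baseline estimate conditioned on one fixed input trace."""
    returns = np.zeros((n_rollouts, horizon))
    for k in range(n_rollouts):
        obs = env.reset(input_trace=z_trace)  # replay the *same* known trace
        rewards = []
        for t in range(horizon):
            act = policy.sample(obs)          # the actor conditions on w_t only
            obs, rew, done = env.step(act)
            rewards.append(rew)
            if done:
                break
        reward_to_go = np.cumsum(rewards[::-1])[::-1]  # return from step t onward
        returns[k, :len(reward_to_go)] = reward_to_go
    return returns.mean(axis=0)  # per-step baseline, given this input sequence
```

This only works at training time, exactly because the trace can be replayed; the actor itself never conditions on the future inputs.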

hongBry commented 5 years ago

But in the code, the input to the critic network only includes ω_t (observed at time t). I don't see any representation of z_{t:∞} (the input sequence from t onwards). In the multi-value network, does the critic not need to consider the input sequence from t onwards?

hongzimao commented 5 years ago

Taking the whole sequence z_{t:∞} as input requires a sequence-processing unit, such as an LSTM or Transformer, which is not efficient to train (the first paragraph of Section 5 and Appendix G have more details). The multi-value network implementation bypasses this problem by dedicating a particular value network to each fixed input sequence. The critic does not need to take the sequence as input explicitly (and thus trains much faster). The blue section (input-dependent baselines) of our poster should help explain this point too: https://people.csail.mit.edu/hongzi/var-website/content/poster.pdf
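
For intuition, here is a minimal sketch of the multi-value idea (names like `ValueNet` and `NUM_SEQS` are illustrative assumptions, not this repo's code): one small critic per fixed input sequence, each taking only ω_t, so the conditioning on z_{t:∞} is implicit in which critic gets used.

```python
import numpy as np

NUM_SEQS = 4   # number of fixed input sequences used during training
OBS_DIM = 8    # dimension of the observable state w_t
LR = 1e-2      # critic learning rate

class ValueNet:
    """Tiny linear critic: V(w_t) = theta . w_t; one instance per input sequence."""
    def __init__(self, obs_dim):
        self.theta = np.zeros(obs_dim)

    def predict(self, obs):
        # obs: array of shape (batch, obs_dim)
        return obs @ self.theta

    def update(self, obs, returns):
        # least-squares gradient step toward the empirical returns
        grad = (self.predict(obs) - returns) @ obs / len(returns)
        self.theta -= LR * grad

# critic i only ever sees rollouts generated under input sequence i,
# so b(w_t, z_{t:inf}) is realized as V_i(w_t) with no sequence input
critics = [ValueNet(OBS_DIM) for _ in range(NUM_SEQS)]

def advantage(seq_id, obs, returns):
    """Advantage for the actor: return minus the input-dependent baseline."""
    return returns - critics[seq_id].predict(obs)
```

During training, rollouts generated under sequence i update `critics[i]`, and the actor's policy gradient uses `advantage(i, ...)`; at test time the actor runs alone, so the unobserved future input never has to be fed to the policy.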

hongBry commented 5 years ago

I understand the idea now. Thanks!