Closed · hongBry closed this issue 5 years ago
Thanks for your interest! This code is the implementation of the multi-value network approach. Please refer to section 5 and appendix I of the paper for more details.
Also, please note that at training time the input sequence is already known: it is available to the critic but not to the actor.
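For reference, this is roughly the form of the input-dependent baseline from the paper (written from memory here, so please check the paper for the exact notation): the baseline b conditions on both the state and the input sequence, while the policy only sees what is observable at time t.

```latex
\nabla_\theta \, \mathbb{E}\Big[\sum_t r_t\Big]
  = \mathbb{E}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)
      \Big(\sum_{t' \ge t} r_{t'} - b(s_t, z_{t:\infty})\Big)\Big]
```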
But in the code, the input of the critic network only includes w_t (observed at time t); I don't see any representation of z_{t:∞} (the input sequence from t onwards). In the multi-value network, does the critic not need to consider the input sequence from t onwards?
Taking the whole sequence z_{t:∞} as input requires a sequence-processing unit, such as an LSTM or a Transformer, which is not efficient to train (the first paragraph of section 5 and appendix G have more details). The multi-value network implementation bypasses this problem by dedicating a particular value network to each fixed input sequence. The critic does not need to explicitly take the sequence as input (and thus trains much faster). The blue section (input-dependent baselines) of our poster should help explain this point too: https://people.csail.mit.edu/hongzi/var-website/content/poster.pdf
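If it helps, here is a minimal sketch of that idea (hypothetical class and variable names, not the actual code in this repo): the critic is just a collection of value networks indexed by the seed of the training input sequence, and the seed of each rollout selects which one to use.

```python
# Minimal sketch (hypothetical, simplified): one value network per fixed
# training input sequence, indexed by the sequence's seed. The critic never
# takes z_{t:inf} as input; picking the right network by seed plays that role.
import numpy as np

class ValueNet:
    """Tiny linear value function V(w_t), for illustration only."""
    def __init__(self, obs_dim, lr=1e-2):
        self.w = np.zeros(obs_dim)
        self.lr = lr

    def predict(self, obs):
        return obs @ self.w

    def update(self, obs, returns):
        # one gradient step on the squared error between V(w_t) and the returns
        err = self.predict(obs) - returns
        self.w -= self.lr * (obs.T @ err) / len(obs)

num_input_seqs = 16   # number of fixed input sequences (seeds) used in training
obs_dim = 8
value_nets = [ValueNet(obs_dim) for _ in range(num_input_seqs)]

def advantages(seed, obs, returns):
    """Baseline the returns with the value net dedicated to this input sequence."""
    critic = value_nets[seed]        # select the critic for this exact z_{t:inf}
    adv = returns - critic.predict(obs)
    critic.update(obs, returns)      # each critic only ever sees its own sequence
    return adv

# usage: a rollout generated under input sequence `seed`
obs = np.random.randn(100, obs_dim)  # observations w_t (no future inputs needed)
returns = np.random.randn(100)       # empirical returns from that rollout
adv = advantages(seed=3, obs=obs, returns=returns)
```

The conditioning on z_{t:∞} happens implicitly through the index, so each individual network only needs w_t as input.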
I have understood the idea. Thanks.
Hi @hongzimao , recently I have become interested in the code and paper, and I am a bit confused: in your paper, the input-dependent baseline conditions on the input sequence, but in load_balance_actor_multi_critic_train.py I have not seen the input sequence, which is unobserved at time t.