borismilicevic opened this issue 6 years ago
Thank you for reaching out.
1) I think you mostly have it. tf.layers.dense creates a layer object behind the scenes and reuses its variables when "reuse" is set to True, as you pointed out. For clarity I probably should have instantiated a tf.layers.Dense object in the constructor and then call()'ed it in next_action(), like I do with rnn_cell (see the sketch below). Keep in mind that the Agent code only sets up the graph; it runs once (well, next_action() is called time_steps times), and training starts after the graph is built. So next_action() isn't actually "called" on each episode: each episode just feeds inputs through the already-created graph.
2) You are correct. Each batch starts with Agent.rnn_state_t zeroed out, since the episode graph starts with self.rnn_state_t equal to rnn_cell.zero_state().
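For illustration, a minimal sketch of what I mean by instantiating the layer in the constructor (TF 1.x style; names like num_actions and q_values are placeholders here, not the actual repo code):

```python
import tensorflow as tf

class Agent:
    def __init__(self, num_lstm_units, num_actions):
        # Both the cell and the dense layer are created once, here.
        self.rnn_cell = tf.nn.rnn_cell.LSTMCell(num_lstm_units)
        self.q_layer = tf.layers.Dense(num_actions, name="q_values")

    def next_action(self, features_t, rnn_state_t):
        # Called time_steps times while *building* the graph; the layer's
        # variables are created on the first call and reused on later calls.
        output_t, rnn_state_t = self.rnn_cell(features_t, rnn_state_t)
        q_t = self.q_layer(output_t)
        return q_t, rnn_state_t
```

Functionally it's the same as tf.layers.dense with a name and a reuse flag; the object form just makes the "one set of variables, built once" fact explicit.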
Best, Mark
Thank you for replying. I have some further questions! :)
1) How would you interpret a gradual decrease of the loss function while the accuracy on the validation set does not increase (it stays below 25%)? I am using my own data, which contains 3 possible labels. Does that mean the agent is eager to request the label? Maybe changing the reward parameters could change its attitude.
2) What would be the consequence of keeping the LSTM memory between batches of episodes during training? In that case, I assume, I would have to keep last_label as well.
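To make question 2 concrete, this is roughly the pattern I have in mind; the placeholder names here are mine, not from the repo:

```python
import tensorflow as tf

num_lstm_units = 200
batch_size = 16

# Hypothetical placeholders for the LSTM state carried over from the
# previous batch of episodes; the episode graph would start from this
# instead of rnn_cell.zero_state().
init_c = tf.placeholder(tf.float32, [batch_size, num_lstm_units])
init_h = tf.placeholder(tf.float32, [batch_size, num_lstm_units])
initial_state = tf.nn.rnn_cell.LSTMStateTuple(init_c, init_h)

# In the training loop, instead of resetting, the previous batch's final
# state (and, I assume, last_label) would be fed back in, e.g.:
#
#   carried_c, carried_h, _ = sess.run(
#       [final_state.c, final_state.h, train_op],
#       feed_dict={init_c: carried_c, init_h: carried_h, ...})
```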
Boris
1) Look at the "requests" plot to confirm your theory. If requests are high and not decreasing as you train, then you might need to play with the rewards, add a convnet before the LSTM (if your data is larger images), increase the capacity of the network, or switch from the LSTM to attention. If requests are decreasing, then just train longer. If requests are really low, then maybe you are penalizing requests too much. I would also simplify the problem: use the same 2 classes per episode with the same labels and see if the agent can memorize them (the LSTM is useless in this case); if it learns that, randomize the labels; if it learns that, the general setup is working and you can increase the pool of classes that the 2 classes are sampled from (a rough sketch of that progression is below). If you run into a problem along the way, you may need to add a convnet or switch away from LSTMs. A "matching network" architecture with a Q-function output might scale better to harder problems than an LSTM: if you received the label for an example, you would add the pair to your example set (https://arxiv.org/abs/1606.04080).
2) I think the RL^2 paper tried not resetting the LSTM state, but it only hurt (https://arxiv.org/abs/1611.02779). That fits with my intuition: I would want anything that is consistent across episodes to be trained into the weights of the network, in order to minimize the work done by the LSTM.
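A rough sketch of that debugging progression, just the class/label sampling with made-up helper names (not the repo's episode code):

```python
import random

def sample_episode_classes(all_classes, stage):
    """Pick the two classes and their label assignment for one episode,
    with increasing difficulty per debugging stage.

    stage 0: always the same two classes with fixed labels (pure
             memorization; the LSTM is not needed).
    stage 1: same two classes, labels shuffled per episode.
    stage 2: two classes drawn from a small pool, labels shuffled.
    stage 3: two classes drawn from the full pool, labels shuffled.
    """
    if stage == 0:
        classes, labels = all_classes[:2], [0, 1]
    elif stage == 1:
        classes, labels = all_classes[:2], random.sample([0, 1], 2)
    else:
        pool = all_classes[:10] if stage == 2 else all_classes
        classes, labels = random.sample(pool, 2), random.sample([0, 1], 2)
    return dict(zip(classes, labels))

# Only move to the next stage once the agent masters the current one, e.g.:
print(sample_episode_classes(list(range(100)), stage=0))  # {0: 0, 1: 1}
```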
Thank you for responding thus far, but I have two more questions.
1) Could you give me any advice on how to set the number of LSTM units (num_lstm_units)? What should I base this parameter on? Maybe the shape of the input feature vector? If my data has only two features, for example, I doubt I should keep the number of LSTM units at 200.
2) Any particular reason the discount factor is set to 0.5? Isn't it more common in Q-learning to set it around 0.9? I feel like 0.5 greatly decreases the importance of later steps in an episode, meaning only the first few steps have a real impact on the loss function (see the quick comparison below). Any advice on how to set this parameter?
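To illustrate my concern about the discount factor, a quick back-of-the-envelope comparison of how quickly future rewards are down-weighted under the two values:

```python
# Weight that a reward k steps in the future gets in the current step's
# discounted return, for gamma = 0.5 vs gamma = 0.9.
for gamma in (0.5, 0.9):
    print(gamma, [round(gamma ** k, 4) for k in range(10)])

# gamma = 0.5: [1.0, 0.5, 0.25, 0.125, ..., 0.002]   -> step 10 barely counts
# gamma = 0.9: [1.0, 0.9, 0.81, 0.729, ..., 0.3874]  -> still substantial
```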
Thanks in advance!
I am currently looking into your code. I've read the paper behind it, and I must say it is most impressive and really interesting. The code is pretty readable and for the most part easy to understand, but there are small details I need clarification on. I am rather new to TensorFlow's estimator mechanism, but I've done a lot of reading just to understand your code better.
1) The agent contains all the trainable parts, meaning the complete network architecture lives inside it. While the LSTM cell is stored as a private attribute, the dense layer behind it is created "just in time". So for each new batch the agent starts with reuse=False, creates a new dense layer, then changes reuse to True: at t=1 a new dense layer is created, and for t>1 (for the rest of the batch of episodes) the existing dense layer is used (roughly the pattern sketched below). This confuses me. Why do you treat the dense layer differently from the LSTM cell? Does this mean that a new "blank" dense layer is created for each new batch of episodes?
2) I assume the same LSTM cell, once created, is used at each training step. But its internal state is reset at each new batch of episodes. So the memory of the LSTM is not carried over from batch to batch, am I right?
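For reference, a tiny self-contained illustration of the reuse pattern I mean in 1); the shapes and names here are made up by me, not taken from the repo:

```python
import tensorflow as tf

features_t = tf.placeholder(tf.float32, [None, 64])  # stand-in input
num_actions = 4                                      # stand-in output size

reuse = False
for t in range(3):  # a few "time steps" within one batch of episodes
    q_t = tf.layers.dense(features_t, num_actions,
                          name="q_values", reuse=reuse)
    reuse = True  # variables are created at t=0, then reused for t > 0
```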
Would you be so kind as to answer these for me? Thanks in advance!