dreammonkey opened this issue 4 years ago
Hi @dreammonkey and sorry for my late response, had some work to do and totally missed the notification ...
I think the simplest way to tackle your issue with the actions is to consider all the different combinations, which are not too numerous. You would have an action for each possible pair: [right; forward], [right; none], [right; reverse], [straight; forward], etc. So 3 * 3 = 9 actions.
Regarding the changing goal, I do not think it is an issue: the model will learn that when (X pos, Z pos) equals the target, it maximizes its reward. You could try to define a continuous reward depending on the distance to the target at first, to assess the behavior of your model on an easier task.
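For example, something as simple as the negative distance to the target already gives the model a much denser learning signal (just a sketch, assuming your positions are plain { x, z } objects):

// Continuous reward sketch: the closer to the target, the higher the reward.
// pos and target are assumed to be plain objects like { x: 12.3, z: -4.5 }.
function distanceReward(pos, target) {
  const dx = target.x - pos.x;
  const dz = target.z - pos.z;
  // Negative distance: the maximum value (0) is reached exactly at the target.
  return -Math.sqrt(dx * dx + dz * dz);
}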
Finally, you have a point with the Cartesian coordinates. If your environment is finite, you should restrict it to [0, X_max] x [0, Z_max], and scale it down to [0, 1] x [0, 1] as inputs to your model. Neural networks do not cope well with large input values, as they tend to make the gradient vanish quickly during backpropagation (https://en.wikipedia.org/wiki/Vanishing_gradient_problem).
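For example (a sketch, assuming the (pos_x, pos_z, speed, target_x, target_z) state you described and known maxima for your world):

// Scale the raw state down to roughly [0, 1] before feeding it to the network.
// X_MAX, Z_MAX and SPEED_MAX are assumed to be the known bounds of your world.
function normalizeState(pos, speed, target, X_MAX, Z_MAX, SPEED_MAX) {
  return tf.tensor2d([[
    pos.x / X_MAX,
    pos.z / Z_MAX,
    speed / SPEED_MAX,
    target.x / X_MAX,
    target.z / Z_MAX,
  ]]);
}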
Thanks for your reply @prouhard !
Your 3*3 suggestion makes a lot of sense, but I'm still not sure if I'm implementing it correctly...
I'm trying to find the highest value of the 9 outputs and map it to one of the 'action combinations', but it looks like the agent is continuously performing the same actions as the training evolves, which feels wrong...
The snippet below is my current setup (for the outputs), do you think this is the right approach?
export class Model {
  constructor() {
    // ...
    // The 9 possible [accelerate, turn] combinations (3 x 3).
    this.actions = [
      [1, 1],   // forward - left
      [1, 0],   // forward - straight
      [1, -1],  // forward - right
      [0, 1],   // none - left
      [0, 0],   // none - straight
      [0, -1],  // none - right
      [-1, 1],  // reverse - left
      [-1, 0],  // reverse - straight
      [-1, -1], // reverse - right
    ];
    // ...
  }

  chooseAction(state) {
    return tf.tidy(() => {
      const logits = this.network.predict(state);
      // console.log('logits:')
      // logits.print();
      // example output (sums to 1, more or less):
      // 0.09678, 0.0882238, 0.1310767, 0.0744453, 0.1055766, 0.1035581, 0.1257232, 0.1936997, 0.0809164
      // index of the highest output; dataSync() returns a typed array holding that single index
      const action_idx = logits.argMax(1).dataSync();
      // console.log('action_idx:', action_idx);
      return this.actions[action_idx];
    });
  }
}
Hmm, ok I figured out the problem: I forgot to add the epsilon-greedy policy. But do you think this is the correct way to choose the action?
const action_idx = logits.argMax(1).dataSync();
return this.actions[action_idx];
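Roughly, the epsilon-greedy version would then look something like this (just a sketch of the idea, with eps as the exploration rate):

chooseAction(state, eps) {
  // Epsilon-greedy selection: with probability eps take a random action,
  // otherwise take the action with the highest predicted value.
  if (Math.random() < eps) {
    return this.actions[Math.floor(Math.random() * this.actions.length)];
  }
  return tf.tidy(() => {
    const logits = this.network.predict(state);
    // dataSync() returns a typed array, so take element 0 as the index
    const action_idx = logits.argMax(1).dataSync()[0];
    return this.actions[action_idx];
  });
}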
Another question I'm struggling with is the internal workings of the replay function (and the discounted reward).
I'm still not grasping the complete mechanics of the code, but while reading through it and logging all the variables I noticed that the discounted reward that gets fed into the currentQ table was logging NaN. That couldn't be intentional, right?
Could it be that this line:
currentQ[action] = nextState ? reward + this.discountRate * qsad[index].max() : reward;
Should actually be:
currentQ[action] = nextState ? reward + this.discountRate * qsad[index].max().dataSync() : reward;
@dreammonkey Good catch on the forgotten dataSync, I did some refactoring on the replay function before pushing it to GitHub and totally missed that, thanks!
Your way of choosing the action seems fine to me, you just have to take the one with the maximum logit.
If you want to force your agent to explore each action more deeply (and not constantly change direction when you have an epsilon > 0.1), I found that frame skipping helped a lot, it almost felt like cheating for my problem.
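To make that concrete, frame skipping just means repeating the chosen action for a few physics steps before querying the network again, something like this (a sketch, with a hypothetical env.step returning { state, reward, done } and a FRAME_SKIP value you would tune yourself):

const FRAME_SKIP = 4; // hypothetical value, tune for your environment

function stepWithFrameSkip(env, action) {
  // Repeat the same action for FRAME_SKIP steps, accumulating the reward,
  // so the agent commits to a direction instead of dithering every frame.
  let totalReward = 0;
  let state = null;
  let done = false;
  for (let i = 0; i < FRAME_SKIP && !done; i++) {
    const result = env.step(action);
    state = result.state;
    totalReward += result.reward;
    done = result.done;
  }
  return { state, reward: totalReward, done };
}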
@prouhard @dreammonkey, sorry for pinging you here, but I'm struggling to understand this DQN learning algorithm, could someone help me and explain?
I'm no TensorFlow or ML expert, and this line is confusing me:
currentQ[action] = nextState ? reward + this.discountRate * qsad[index].max().dataSync() : reward;
As I understand it, qsad[index] is a tensor with action probabilities which we get for the future (next) state. Why do we take the max probability and add it to the reward? Also, why do we store it in currentQ[action]? Because after .dataSync() we will lose it, won't we?
thanks in advance.
Hello @vtcaregorodtcev
Very brief answer,
I struggled for a very long time to really understand the concept, and even then my knowledge of the subject is limited and purely out of interest (I am not a data scientist :) ).
But qsad[index] is the probability that is calculated by the network during the training phase and stored in the replay memory for processing during the replay phase (the function you copied the code block from). This prediction is evaluated using the max or argMax function to retrieve the 'most likely option'. Depending on the state of the network this value can be spot on or way off, but that is irrelevant.
The value is then updated using the discountFactor, which is basically a tool to make the network look more into the future or closer to the present.
What's important is that the collection of currentQ values is then fed back to the network (with all the slightly altered values). By retraining (this part is basic ML, I believe, not specific to reinforcement learning), the network gradually improves its knowledge of its environment, ultimately (hopefully) evolving into a fully trained (and production-ready) network.
With all respect to the author, @prouhard, I never actually got this code fully working myself, but I did use it as a reference for my own project and succeeded at doing what is described in the comments above. There aren't many tfjs examples demonstrating DQN, most of the tutorials are in Python, but the same logic applies.
TIP: TensorFlow.js has an excellent DQN (snake) demo that elaborates even further on this algorithm. In it they use 2 networks to prevent the network from influencing itself, it's really worth checking out!
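The gist of that two-network trick, as far as I understand it (just a sketch, names are mine): one online network gets trained, a frozen copy is only used to compute the targets, and its weights are refreshed every so often:

const SYNC_EVERY = 100; // hypothetical value, tune for your problem
let trainStep = 0;

function maybeSyncTargetNetwork(onlineNetwork, targetNetwork) {
  // The target network only estimates Q(nextState, a) for the Bellman target,
  // and is refreshed from the online network every SYNC_EVERY training steps.
  trainStep += 1;
  if (trainStep % SYNC_EVERY === 0) {
    targetNetwork.setWeights(onlineNetwork.getWeights());
  }
}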
Hi @vtcaregorodtcev,
This line is in fact the most difficult, so there is no issue taking more time to figure it out.
It basically updates the predicted value of taking action when in state with the (discounted) expected value of the new state this action puts us in.
If you wish to know more about it, it is in fact the Bellman equation. You can find a lot of awesome resources on the web, which will explain it far better than I could do.
As @dreammonkey explained it very well, we now want to make the neural network learn these updated values in the different states, as a traditional supervised learning problem. The model will then be able to better estimate the value of taking a specific action in a specific state.
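In code, the update for a single remembered transition boils down to something like this sketch (the names are illustrative, not necessarily the exact variables of the repository):

// A rough sketch of the Bellman update for one transition
// (state, actionIndex, reward, nextState); names are illustrative.
async function trainOnTransition(model, stateTensor, actionIndex, reward, nextStateTensor, isTerminal, discountRate) {
  // Predicted Q-values of the current state, as a plain array of 9 numbers.
  const currentQ = model.predict(stateTensor).arraySync()[0];
  // Bellman target: the reward, plus the discounted best Q-value of the next state.
  let target = reward;
  if (!isTerminal) {
    const nextQ = model.predict(nextStateTensor).arraySync()[0];
    target += discountRate * Math.max(...nextQ);
  }
  currentQ[actionIndex] = target;
  // Plain supervised step: input = state, label = the corrected Q-vector.
  await model.fit(stateTensor, tf.tensor2d([currentQ]), { epochs: 1, verbose: 0 });
}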
@prouhard
Been staring at this for a while in the replay method:
currentQ[action] = nextState ? reward + this.discountRate * qsad[index].max().dataSync() : reward;
currentQ is a Tensor, and action is either -1, 0, or 1. Not following how you are able to use action as an index into the Tensor like that.
Hi,
I read your blog article and I'm trying to adapt your logic to train a model for a vehicle that could learn to find its way in a 3D/physics environment, given a state (pos_x, pos_z, speed, target_x, target_z) (currently disregarding the y-axis/height) and actions defined as: turn (-1: right, 0: straight, 1: left) and accelerate (1: forward, 0: none, -1: reverse)...
My current setup for the model is:
As for the output (interpretation) of the prediction, I'm struggling to implement the logic that maps the returned tensor to my desired movement actions:
I should probably be splitting the tensor into 2 parts and performing the above logic on each part, but I'm not sure A. how to do that, B. whether that would break the model's logic...
Another idea I had is to use a tensor of size 2 as the output and then apply some logic to get my actions:
Would that work as well?
What's more, I'm wondering whether this would ever work at all, considering that target_x and target_z change every time a target has been reached (in contrast to the mountain car problem, where the goal is always the same).
Furthermore, my physics world is based on a Cartesian coordinate system, wouldn't that make the model flip when the vehicle crosses either of the axes provided in the state (as it enters another quadrant)?
A lot of questions, I know, but I'd love to be pointed in the right direction... The bottom line being: could reinforcement learning be used to effectively solve pathfinding...?
Thanks