Open sshegheva opened 9 years ago
Hi, this is actually a feature.
There is a method on the ValueIteration class called _boundIter that computes an upper bound on the number of iterations the algorithm will need to converge, given a certain tolerance. This tolerance is called epsilon in the code; since you didn't pass a value for epsilon, the default of 0.01 is used. _boundIter uses a theorem from the Puterman 1994 book to calculate the bound, but the actual number of iterations needed until the value function converges may be fewer. In the forest example case it will always converge to the optimal policy in four steps, because the V-variation is 0 at iteration 4, which means the optimal policy has been found.
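To make the idea concrete, here is a simplified sketch of that kind of bound in plain Python. This is not the _boundIter implementation (the real method also applies a correction factor derived from the transition matrices, so its numbers will differ), and the span argument here is a hypothetical starting V-variation:

```python
import math

def iteration_bound(span, discount, epsilon=0.01):
    """Simplified Puterman-style bound: value iteration stops once the
    V-variation falls below epsilon * (1 - discount) / discount.
    Since the variation contracts by at least the discount factor each
    sweep, the number of sweeps needed from an initial variation of
    `span` is bounded by log(threshold / span) / log(discount)."""
    threshold = epsilon * (1 - discount) / discount
    return math.ceil(math.log(threshold / span) / math.log(discount))

# A smaller epsilon raises the bound; it never changes the fixed point.
print(iteration_bound(span=1.0, discount=0.96, epsilon=0.01))   # → 191
print(iteration_bound(span=1.0, discount=0.96, epsilon=1e-5))
```

The bound only depends on epsilon, the discount, and the initial variation, which is why the algorithm can (and often does) converge in far fewer sweeps than the bound predicts.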
The small example might help illustrate this: epsilon controls the tolerance you are willing to accept in the epsilon-optimal value function.
import mdptoolbox.example
P, R = mdptoolbox.example.small()
vi = mdptoolbox.mdp.ValueIteration(P, R, 0.96)
vi.setVerbose()
vi.run() # takes 36 iterations
vi = mdptoolbox.mdp.ValueIteration(P, R, 0.96, epsilon=0.00001)
vi.setVerbose()
vi.run() # takes 62 iterations
You will notice that the smaller epsilon causes the iterations to stop only at a smaller V-variation. So using epsilon you can control the precision of the optimal value function. This is why it is called epsilon-optimal: it depends on the value of epsilon.
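The same stopping rule can be demonstrated without the toolbox. Below is a minimal toy value iteration (not the mdptoolbox implementation) on a hypothetical 2-state, 2-action MDP, made up just for illustration; it stops when the V-variation drops below epsilon * (1 - discount) / discount, so a tighter epsilon needs more sweeps:

```python
def value_iteration(P, R, discount, epsilon=0.01):
    """Toy value iteration with the span-based stopping rule described
    above. P[a][s][sp] are transition probabilities, R[s][a] rewards."""
    n_states, n_actions = len(R), len(R[0])
    V = [0.0] * n_states
    threshold = epsilon * (1 - discount) / discount
    iterations = 0
    while True:
        iterations += 1
        # One Bellman backup: best one-step return from each state.
        V_new = [
            max(R[s][a] + discount * sum(P[a][s][sp] * V[sp]
                                         for sp in range(n_states))
                for a in range(n_actions))
            for s in range(n_states)
        ]
        diff = [vn - v for vn, v in zip(V_new, V)]
        span = max(diff) - min(diff)  # the "V-variation"
        V = V_new
        if span < threshold:
            return V, iterations

# Hypothetical MDP, not the mdptoolbox "small" example.
P = [[[0.5, 0.5], [0.8, 0.2]],
     [[0.0, 1.0], [0.1, 0.9]]]
R = [[5.0, 10.0], [-1.0, 2.0]]

_, iters_loose = value_iteration(P, R, 0.96, epsilon=0.01)
_, iters_tight = value_iteration(P, R, 0.96, epsilon=1e-5)
print(iters_loose, iters_tight)  # the tighter epsilon takes more sweeps
```

Both runs converge to (numerically) the same value function; epsilon only decides how close is close enough.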
I would suggest that max_iter is mostly useful to limit the time you are willing to wait for a solution to be found. It basically stops the algorithm from running forever without returning anything.
Hope this helps.
Thanks for the explanation. It is very useful.
The reason I wanted to specify max_iterations is so that I can observe the convergence of each algorithm, and not all of them have epsilon.
What worked for me is setting the max_iter after the object has been initialized, like
vi = mdptoolbox.mdp.ValueIteration(P, R, 0.96)
vi.max_iter = 10
In this case it won't go beyond those 10 iterations.
Great job, I liked playing with the forest example
Ah yes, that is a good solution. Another option that might work is to modify the _bellmanOperator method in the base MDP class to save the value function to disk or a database each time it is called. You can use self.iter to record which iteration it is up to, and self.__class__ to record the name of the algorithm. Lines 261 and 262 should help.
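The idea can be illustrated without touching mdptoolbox itself. Here is a hypothetical stand-alone toy (its class, MDP, and numbers are all made up, not toolbox code) whose Bellman backup appends (algorithm name, iteration, value function) to an in-memory history each time it is called, which is essentially what hooking _bellmanOperator and tagging entries with self.iter and self.__class__ would give you:

```python
class TinyValueIteration:
    """Toy value iteration that snapshots the value function at every
    Bellman backup, mirroring the _bellmanOperator hook suggested above."""

    def __init__(self, P, R, discount, max_iter=100):
        self.P, self.R, self.discount = P, R, discount
        self.max_iter = max_iter
        self.iter = 0
        self.history = []  # (algorithm name, iteration, V) per backup

    def _bellman_backup(self, V):
        n_states, n_actions = len(self.R), len(self.R[0])
        new_V = [
            max(self.R[s][a] + self.discount *
                sum(self.P[a][s][sp] * V[sp] for sp in range(n_states))
                for a in range(n_actions))
            for s in range(n_states)
        ]
        # Record which algorithm and which iteration produced this V;
        # a real hook could write this to disk or a database instead.
        self.history.append((self.__class__.__name__, self.iter, new_V))
        return new_V

    def run(self):
        V = [0.0] * len(self.R)
        while self.iter < self.max_iter:
            self.iter += 1
            V = self._bellman_backup(V)
        return V

# Hypothetical 2-state, 2-action MDP, just to exercise the recorder.
P = [[[0.5, 0.5], [0.8, 0.2]],
     [[0.0, 1.0], [0.1, 0.9]]]
R = [[5.0, 10.0], [-1.0, 2.0]]

vi = TinyValueIteration(P, R, 0.96, max_iter=10)
vi.run()
print(len(vi.history))  # one snapshot per iteration: 10
```

With the history in hand you can plot the value function per iteration for any algorithm, which covers the case where an algorithm has no epsilon parameter to tune.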
Using the forest example from the quickstart guide, modifying max_iter does not seem to work.
It looks like it always defaults to 39 and stops.
I am experimenting with the number of iterations to see how the algorithm performs, and it would be very beneficial to collect the values at each iteration.
Update:
Setting the MDP to verbose shows that only 4 iterations were executed. So what does 39 mean here?