sawcordwell / pymdptoolbox

Markov Decision Process (MDP) Toolbox for Python
BSD 3-Clause "New" or "Revised" License

max_iter parameter is not respected in the ValueIteration algorithm #13

Open sshegheva opened 9 years ago

sshegheva commented 9 years ago

Using the forest example from the quickstart guide, modifying max_iter does not seem to have any effect.

It looks like max_iter always defaults to 39 and stops there.

import mdptoolbox.example
import mdptoolbox.mdp

P, R = mdptoolbox.example.forest()
vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9, max_iter=100)
vi.run()
vi.policy  # the epsilon-optimal policy
print(vi.V)
print(vi.max_iter)

(5.051970000000001, 8.291970000000001, 12.291970000000001)
39

I am experimenting with the number of iterations to see how the algorithm performs, and it would be very beneficial to collect the value function at each iteration.

Update:

Setting the MDP to verbose shows that only 4 iterations were executed, so what does 39 mean here?


  Iteration     V-variation
    1         4.0
    2         2.43
    3         0.81
    4         0.0
Iterating stopped, epsilon-optimal policy found.
(5.051970000000001, 8.291970000000001, 12.291970000000001)
39
sawcordwell commented 9 years ago

Hi, this is actually a feature.

There is a method on the ValueIteration class called _boundIter that computes an upper bound on the number of iterations the algorithm could need to reach a given tolerance. This tolerance is called epsilon in the code, and since you didn't pass a value for epsilon, the default of 0.01 is used. _boundIter uses a theorem from Puterman's 1994 book to calculate the bound, but the actual number of iterations needed for the value function to converge may be fewer. In the forest example it always converges to the optimal policy in four steps: the V-variation is 0 at iteration 4, which means the optimal policy has been found.
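For intuition, here is a rough sketch of the kind of bound involved (iteration_bound is a hypothetical helper of mine, and the formula is a simplified form of the Puterman-style bound; the actual _boundIter in mdp.py uses a tighter variant that also factors in the transition matrices):

import math

def iteration_bound(span, discount, epsilon):
    # Smallest n such that the remaining change in V, which shrinks by at
    # least a factor of `discount` per sweep, is guaranteed below the
    # epsilon-optimality threshold. `span` is
    # max(V1 - V0) - min(V1 - V0) after one Bellman update.
    return int(math.ceil(
        math.log(epsilon * (1 - discount) / (discount * span))
        / math.log(discount)
    ))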

Using the small example might help illustrate this: epsilon controls the tolerance you are willing to accept in the epsilon-optimal value function.

import mdptoolbox.example
P, R = mdptoolbox.example.small()
vi = mdptoolbox.mdp.ValueIteration(P, R, 0.96)
vi.setVerbose()
vi.run()  # takes 36 iterations
vi = mdptoolbox.mdp.ValueIteration(P, R, 0.96, epsilon=0.00001)
vi.setVerbose()
vi.run()  # takes 62 iterations

You will notice that the smaller epsilon causes the iterations to stop at a smaller V-variation. So using epsilon you can control the precision of the optimal value function. This is why it is called epsilon-optimal: it depends on the value of epsilon.
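Concretely, the stopping test compares the span of the change in V against a threshold derived from epsilon. A minimal sketch of my understanding of that test (epsilon_optimal_stop is a hypothetical helper, not part of the library):

import numpy as np

def epsilon_optimal_stop(V_new, V_old, discount, epsilon):
    # The greedy policy w.r.t. V_new is epsilon-optimal once the span of
    # the change in V drops below epsilon * (1 - discount) / discount,
    # which is why a smaller epsilon needs more iterations.
    diff = np.asarray(V_new) - np.asarray(V_old)
    span = diff.max() - diff.min()
    return span < epsilon * (1 - discount) / discount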

I would say max_iter is mostly useful as a cap on how long you are willing to wait for a solution to be found; it stops the algorithm from running forever without returning anything.

Hope this helps.

sshegheva commented 9 years ago

Thanks for the explanation. It is very useful.

The reason I wanted to specify max_iter is so that I can observe the convergence of each algorithm, and not all of them have an epsilon parameter.

What worked for me is setting max_iter after the object has been initialized, like this:

vi = mdptoolbox.mdp.ValueIteration(P, R, 0.96)
vi.max_iter = 10

In this case it won't go beyond those 10 iterations.
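Building on that workaround, one brute-force way to collect the value function at every iteration is to re-run with increasing caps and record vi.V each time (a sketch; re-solving from scratch for each cap is wasteful but keeps the runs independent):

import mdptoolbox.example
import mdptoolbox.mdp

P, R = mdptoolbox.example.forest()

history = []
for n in range(1, 11):
    vi = mdptoolbox.mdp.ValueIteration(P, R, 0.96)
    vi.max_iter = n       # override the bound computed in __init__
    vi.run()
    history.append(vi.V)  # value function after at most n sweeps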

Great job, I liked playing with the forest example.

sawcordwell commented 9 years ago

Ah yes, that is a good solution. Another option that might work is to modify the _bellmanOperator method in the base MDP class to save the value function to disk or a database each time it is called. You can use self.iter to record which iteration it is up to, and self.__class__ to record the name of the algorithm. Lines 261 and 262 should help.
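Something along these lines might work as a subclass instead of editing the base class (an untested sketch: LoggingValueIteration is my name for it, and it assumes _bellmanOperator takes an optional V and returns a (policy, value) pair as in the current mdp.py):

import mdptoolbox.mdp

class LoggingValueIteration(mdptoolbox.mdp.ValueIteration):
    # Records (algorithm name, iteration, V) at every Bellman update.

    def __init__(self, *args, **kwargs):
        # Create the log before super().__init__, because _boundIter
        # already calls _bellmanOperator once during construction.
        self.history = []
        super(LoggingValueIteration, self).__init__(*args, **kwargs)

    def _bellmanOperator(self, V=None):
        policy, value = super(LoggingValueIteration, self)._bellmanOperator(V)
        self.history.append(
            (self.__class__.__name__, getattr(self, "iter", 0), value.copy())
        )
        return policy, value

Swapping the append for a write to disk or a database gives the persistence suggested above.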