matsim-org / matsim-code-examples

A repository containing code examples around MATSim
GNU General Public License v3.0

Algorithms for scoring? #101

Closed vishalmhjn closed 5 years ago

vishalmhjn commented 5 years ago

Hello, where can I get more details on the algorithms used by MATSim for plan selection, scoring, and replanning? Are these algorithms similar to Q-learning and other reinforcement learning algorithms? If not, do you have any plans to implement them in the future?

tduberne commented 5 years ago

The best description of the algorithm can be found in the first part of the book. It is what we call a co-evolutionary algorithm, in which agents iteratively improve their travel plans for the day, given what the other agents are doing.
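To make that loop concrete, here is a heavily simplified sketch. None of the classes or methods below are MATSim API; all names are illustrative, and the "execution" step is a toy stand-in for the full traffic simulation (the mobsim). It only mirrors the structure: replan/select, execute everybody together, score the executed plans, repeat.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

/** Illustrative co-evolutionary loop; not MATSim code. */
public class CoEvolutionSketch {

    static class Plan {
        double departureTime;                    // the only decision variable in this toy
        double score = Double.NEGATIVE_INFINITY; // unexecuted plans start with no score
    }

    static class Agent {
        List<Plan> memory = new ArrayList<>();
        Plan selected;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        List<Agent> population = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            Agent agent = new Agent();
            Plan initial = new Plan();
            initial.departureTime = 7.0 + 2.0 * rnd.nextDouble(); // between 7:00 and 9:00
            agent.memory.add(initial);
            population.add(agent);
        }

        for (int iteration = 0; iteration < 50; iteration++) {
            // (1) Replanning: a fraction of agents mutates a plan, the rest re-select a memorized one.
            for (Agent agent : population) {
                if (rnd.nextDouble() < 0.2) {
                    Plan base = agent.memory.get(rnd.nextInt(agent.memory.size()));
                    Plan mutated = new Plan();
                    mutated.departureTime = base.departureTime + rnd.nextDouble() - 0.5;
                    agent.memory.add(mutated);
                    agent.selected = mutated;
                } else {
                    agent.selected = agent.memory.stream()
                            .max(Comparator.comparingDouble(p -> p.score)).orElseThrow();
                }
            }
            // (2) "Mobsim": all selected plans are executed together, so congestion emerges from
            //     the interaction of the agents (here mimicked by a simple crowding penalty).
            // (3) Scoring: the executed plan is scored based on how the day actually went.
            //     (In MATSim the plan memory is also kept bounded by dropping bad plans.)
            for (Agent agent : population) {
                double t = agent.selected.departureTime;
                long crowd = population.stream()
                        .filter(other -> Math.abs(other.selected.departureTime - t) < 0.25)
                        .count();
                agent.selected.score = -Math.abs(t - 8.0) - 0.02 * crowd;
            }
        }
    }
}
```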

I only have basic knowledge of reinforcement learning, but I have the feeling that what we do is quite different. As far as my understanding goes, reinforcement learning is concerned with adapting the probability of certain responses to certain stimuli, such that the response is "good" in some way. In MATSim, by default, agents do not perceive anything; they just follow a plan blindly. If you were a MATSim agent, in the morning you would print out a very detailed plan ("leave home at 8:47, walk 6.3 minutes to the transit stop, board bus nr blabla...."), and just look at it the whole day and do everything that is written there, no matter what. It might be the best option, but it might also be that the bus the agent wants to board breaks down and never comes. A MATSim agent will wait for it, staring at his printed-out plan, oblivious to whatever information he might hear.

There are cases where agents are a bit more reactive (for instance in simulations involving taxi services), but their "reactions" are fixed algorithms.

We have no plans to include reinforcement learning in MATSim, but I can see a potential for it. Something in this direction was done for parking search: the agents did not select a sequence of actions, but rather strategies for searching for a parking spot, and those strategies did contain reactions to stimuli. How closely this conforms to classical reinforcement learning, I cannot say, but it might give you directions on how to do this, if that is your interest.

vishalmhjn commented 5 years ago

Thank you for the detailed response.

kainagel commented 5 years ago

Hm, I would have argued a bit differently. In fact, when we designed MATSim, we did look at reinforcement learning. Our understanding was that there is a reinforcement learning problem (RLP), and then there are solution algorithms for it, for example Q-learning. The RLP essentially says that there are (state, action) pairs, and you receive rewards that are based on the state, the action, or both, where the system may be stochastic and thus the outcome of an action may be stochastic. The goal is to maximize some (normally discounted) sum over all rewards. Q-learning, as said, is one possible solution algorithm. (Another one, in fact, is dynamic programming, at least if you have a final state.)
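For reference, since the thread keeps coming back to it (this is standard textbook material, not anything MATSim-specific): the discounted objective and the usual tabular Q-learning update look like

```latex
% Discounted return that the agent tries to maximize:
G_t \;=\; \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}, \qquad 0 \le \gamma < 1 .
% Tabular Q-learning update after observing (s_t, a_t, r_{t+1}, s_{t+1}):
Q(s_t, a_t) \;\leftarrow\; Q(s_t, a_t)
    \;+\; \alpha \Big[\, r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \,\Big] .
```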

Now the thing is that in standard MATSim the corresponding state space is quite reduced. The only state is "over night" or "before the day begins", and the only available action is to select a plan. (And the rewards, in consequence, are the scores.) However, after selecting the action (= the plan), the system is stochastic.
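To illustrate the "the only action is to select a plan" point: the default selection is score-based and logit-like. The sketch below is an illustrative multinomial-logit choice over an agent's plan memory, in the spirit of MATSim's score-based selectors but not the exact formula of any particular one.

```java
import java.util.List;
import java.util.Random;

/** Illustrative logit-style plan selection: choice probability grows exponentially with the score. */
public class LogitPlanSelectionSketch {

    static int selectPlanIndex(List<Double> planScores, double beta, Random rnd) {
        // Subtract the maximum score before exponentiating, for numerical stability.
        double max = planScores.stream().mapToDouble(Double::doubleValue).max().orElseThrow();
        double[] weights = new double[planScores.size()];
        double sum = 0.0;
        for (int i = 0; i < planScores.size(); i++) {
            weights[i] = Math.exp(beta * (planScores.get(i) - max));
            sum += weights[i];
        }
        // Draw one plan with probability proportional to its weight.
        double draw = rnd.nextDouble() * sum;
        for (int i = 0; i < weights.length; i++) {
            draw -= weights[i];
            if (draw <= 0) return i;
        }
        return weights.length - 1;
    }

    public static void main(String[] args) {
        // An agent's plan memory, represented here only by the plan scores.
        List<Double> scores = List.of(105.2, 98.7, 110.4, 101.0);
        System.out.println("selected plan index: " + selectPlanIndex(scores, 1.0, new Random(1)));
    }
}
```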

What we don't have is something like multi-stage learning. And in that sense, clearly, what Thibaut writes is consistent with what I write, except that I would have said "MATSim is Q learning with a very reduced space for state and actions" while Thibaut writes that it is not Q learning at all.

https://doi.org/10.1177/0361198105193500119 is looking at using Q learning for activity chains.

vishalmhjn commented 5 years ago

Thank you, this is very interesting. Just to be clearer, by "multi-stage learning" I understand the following: one MATSim episode (24 hours) would consist of an optimal policy search for each and every agent using an RL algorithm. Instead of having a basket of plans (as in the case of the co-evolutionary algorithm), the RL algorithm would lead to a policy function for each agent (based on state-action pairs) at the end of each episode. This optimal policy function would then determine the performance of an agent during the day.

I feel that there is a trade-off. The richer information (an optimal policy function) in the case of the RL approach comes at the cost of memory and computational overhead, due to the number of parameters and episodes needed to reach convergence, which in some cases might not even be practical. The heuristic approach of the co-evolutionary algorithm gives less information (no policy for plan selection) but seems highly efficient.

kainagel commented 5 years ago

Yes, that is what I meant by multi-stage learning.

An advantage of such a policy would be that it would remain optimal under disturbances, such as unexpected congestion.

But yes, so far this seems to be too slow. We need something like 70% of agents to use their "normal" plan, since otherwise the feedback that the system gives to the exploring agents becomes too distorted. So with 1000 iterations, only the remaining 30%, i.e. roughly 300 iterations per agent, are available as runs through the Q-learning graph.

On the other hand, what is the state space? In the above paper we used, per activity, something like (pastDurationAtActivity, currentTimeOfDay) in 15-minute time bins, since in some situations you need to consider the departure time (e.g. for an appointment), while in others you want to perform your activity for a certain duration. For a state space of that size, 300 iterations of Q-learning would not be enough.
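To see why 300 exploratory iterations fall short, a back-of-the-envelope count of such a state space (the concrete bounds below are illustrative assumptions, not the exact discretization used in the paper):

```java
/** Rough size of a (pastDurationAtActivity, currentTimeOfDay) state space in 15-minute bins. */
public class StateSpaceSizeSketch {
    public static void main(String[] args) {
        int timeOfDayBins = 24 * 4;     // 96 bins of 15 minutes over one day
        int pastDurationBins = 16 * 4;  // assumption: activity durations of up to 16 hours
        int actionsPerState = 2;        // e.g. "continue activity" vs. "depart now"
        long statesPerActivity = (long) timeOfDayBins * pastDurationBins;       // 6144
        long qEntriesPerActivity = statesPerActivity * actionsPerState;         // 12288
        System.out.println("states per activity:    " + statesPerActivity);
        System.out.println("Q-entries per activity: " + qEntriesPerActivity);
        // With only ~300 exploratory runs per agent, most (state, action) pairs of even a
        // single activity would never be visited once.
    }
}
```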

Another problem is that one can, in fact, figure out the optimal allocation of time analytically (it ends up being a one-dimensional problem), so using Q-learning here seems like total overkill.
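For context, here is one way to see the "one-dimensional problem" point, using the logarithmic activity utility that MATSim's default scoring builds on; the symbols below and the assumption that this is the whole objective are simplifications.

```latex
% Utility of performing activity i for duration t_i (logarithmic, Charypar--Nagel style):
U_i(t_i) \;=\; \beta \, t_{*,i} \, \ln\!\left( \frac{t_i}{t_{0,i}} \right)
% Maximizing \sum_i U_i(t_i) subject to the time budget \sum_i t_i = T gives,
% via a Lagrange multiplier \lambda, the first-order condition
\frac{\partial U_i}{\partial t_i} \;=\; \frac{\beta \, t_{*,i}}{t_i} \;=\; \lambda
\quad \text{for all } i,
% so every duration t_i follows from the single scalar \lambda, i.e. a one-dimensional search.
```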