hubbs5 / or-gym

Environments for OR and RL Research
MIT License

Applying simple tabular Q-learning (discretized) possible? #11

Closed mrak31 closed 2 years ago

mrak31 commented 3 years ago

New to reinforcement learning and wondering if it's possible to apply simple tabular Q-learning (maybe with some discretization) to the or-gym inventory_management.py environment?

I'm also wondering what each element of the "observation" represents and which value shows the cumulative reward that I can plot. Example observation from a random action: [58. 0. 31. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 8.] (are the first 3 values the inventory at each stage?)
hdavid16 commented 3 years ago

Hi @mrak31,

The discounted profit at each node at each period is stored in self.P, which is a dataframe. This should make it easy to plot the reward.
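For example, a minimal sketch of plotting that reward after an episode could look like the following. The environment name and the orientation of the dataframe (periods as rows, nodes as columns) are assumptions; adjust them to the environment you are actually using:

```python
import matplotlib.pyplot as plt
import or_gym  # assumes or-gym (and pandas) are installed

env = or_gym.make("InvManagement-v1")  # environment name is an assumption
env.reset()
done = False
while not done:
    # take random actions just to generate a full episode
    _, _, done, _ = env.step(env.action_space.sample())

# env.P is described above as a dataframe of discounted profit per node per
# period; summing across nodes gives the per-period reward to plot
env.P.sum(axis=1).cumsum().plot()
plt.xlabel("Period")
plt.ylabel("Cumulative discounted profit")
plt.show()
```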

Regarding the observation space, it is a vector with (m - 1) * (max_lead_time + 1) elements, where m is the number of stages in the serial supply chain and max_lead_time is the maximum lead time in the system. You have m - 1 because m stages give you m - 1 links. The first m - 1 elements in the vector are the on-hand inventory positions at each of the stages, excluding the most upstream node, which has unlimited inventory (this has been generalized in the supply network environment so that you can have limited upstream inventory). The remaining elements are the pipeline inventories on each link: the historical replenishment requests on each link that have not yet made it down the pipeline. So, say I placed a replenishment order of 2 units, then another of 1 unit, then another of 3 units on a particular link. If none of these have arrived at the node by time t, then those three values appear in the observation. There is also padding (extra zeros) because some links have shorter lead times than others and would thus need fewer slots in the observation vector.
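As a concrete sketch of that layout, the observation can be split back into its two parts with NumPy. The stage count and lead time below are illustrative values only, not the environment defaults; read them from the environment you are using:

```python
import numpy as np

# Illustrative values only; read m and max_lead_time from your environment.
m = 5              # stages in the serial supply chain
max_lead_time = 4  # maximum lead time across links
obs = np.zeros((m - 1) * (max_lead_time + 1))  # 20-element vector, as in the example above

on_hand = obs[: m - 1]                                 # on-hand inventory per stage (top stage excluded)
pipeline = obs[m - 1 :].reshape(m - 1, max_lead_time)  # in-transit orders per link, zero-padded
```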

This may not be the best/most straightforward implementation of the system state and can be revised at some point.

You may also find it useful to access the other dataframes in the environment that log the system behavior.

hdavid16 commented 3 years ago

New to reinforcement learning and wondering if it's possible to apply simple tabular Q-learning (maybe with some discretization) to the or-gym inventory_management.py environment?

I'll defer this question to @hubbs5, who is the expert in RL.

hubbs5 commented 3 years ago

Yes, it could be discretized and treated as a tabular Q-learning problem. Lots of problems in OR are solved this way or in related ways (e.g. with dynamic programming). I don't think the standard problem we have in the library would be too large to solve that way, but these problems do tend to scale very poorly as networks grow and become more complicated, because the state space grows rapidly.
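A minimal sketch of what that might look like, assuming the classic Gym reset/step signatures used by or-gym and a very coarse binning of both observations and actions. The bin width, episode count, and environment name are illustrative choices, not recommendations:

```python
import numpy as np
from collections import defaultdict
import or_gym  # assumes or-gym is installed

env = or_gym.make("InvManagement-v1")  # environment name is an assumption

BIN = 10  # bucket width used to coarsen both observations and actions

def discretize(x):
    # Collapse a continuous vector into a hashable tuple of integer buckets.
    return tuple((np.asarray(x) // BIN).astype(int))

Q = defaultdict(dict)        # Q[state][action] -> value, filled lazily
alpha, gamma = 0.1, 0.99

for episode in range(500):
    state = discretize(env.reset())
    done = False
    while not done:
        action = env.action_space.sample()       # random exploration; epsilon-greedy would go here
        next_obs, reward, done, _ = env.step(action)
        next_state = discretize(next_obs)

        # Standard Q-learning update on the discretized (state, action) pair.
        a_key = discretize(action)
        best_next = max(Q[next_state].values(), default=0.0)
        old = Q[state].get(a_key, 0.0)
        Q[state][a_key] = old + alpha * (reward + gamma * best_next - old)
        state = next_state
```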

ashwin-M-D commented 3 years ago

I have found, after working on the vehicle routing problem, that the models tend to occupy a lot of RAM if you use the conventional tabular Q-learning method. However, with this project you can modify the problems a bit to simplify the state space. Also, if you are just demonstrating the method rather than scaling it, the tabular approach usually trains faster and performs better on smaller state spaces.
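A rough back-of-the-envelope check of that RAM concern, with all numbers purely hypothetical, shows how fast a dense Q-table blows up without simplifying the state:

```python
# Back-of-the-envelope size of a dense Q-table for a 20-dimensional
# observation; all numbers are hypothetical and only illustrate the scaling.
obs_dims = 20          # length of the observation vector
bins_per_dim = 10      # coarse discretization per dimension
n_actions = 1_000      # discretized joint reorder actions

n_states = bins_per_dim ** obs_dims            # 1e20 possible states
table_bytes = n_states * n_actions * 8         # float64 entries
print(f"{table_bytes / 1e18:,.0f} exabytes")   # hopeless without simplifying the state
```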