hpi-sam / rl-4-self-repair

Reinforcement Learning Models for Online Learning of Self-Repair and Self-Optimization
MIT License

Integration of the environment into a gym env #5

Closed: 2start closed this 4 years ago

2start commented 4 years ago

We need to integrate our environment into the interface provided by gym.

Gym is an RL framework that provides many different environments behind a common interface so that different RL algorithms can be tested against them. We should implement this interface for our environment.

Here is an article about creating a gym environment: Gym example

Here is a description by @christianadriano. Note, however, that we have since decided to use gym's naming conventions.

 "        Returns four outputs\n",
        "        -------\n",
        "        observation, reward, episode_over, cumulative_reward, info : tuple\n",
        "            observation (object) :\n",
        "                It is part of the internal state made visible to the agent\n".
        "                In our case, it is an ordered set of pairs <component, failure>, which is \n",
        "                initially created as a FIFO (first in first out) list\n",
        "                of failures that happened to different compoments, i.e., the\n",
        "                position of the pairs <component, failure> depends on when\n",
        "                the failure occurred. \n",
        "                Earlier failures are placed towards the end of the list, while\n",
        "                more recent failures are at the beginning of the list. This list\n",
        "                can be re-ordered by using the swap action (see next).\n",
        "            reward (float) :\n",
        "                if action==repair, then returns a utility increase for the\n",
        "                corresponding <component, failure> pair. \n",
        "                Constrain: only the top of the list can be repaired, hence if\n",
        "                the component to be repaired is not at the top of the \n",
        "                list, then, the agent has to call for a swap action.\n",
        "\n",
        "                if action==swap, if successful, swap the two components places\n",
        "                and returns the cost of doing the swap. \n",
        "                \n",
        "                The total reward that has to me maximize will be kept up-to-date\n",
        "                by another class.\n",
        "            episode_over (bool) :\n",
        "                Tells whether it is time to reset the environment again. The \n",
        "                episodes are automacally over when we emptied the list of tuples\n",
        "                <component,failure>. Hence, TRUE indicates that the episode has \n",
        "                terminated.\n",
        "            info(dict) :\n",
        "                 Diagnostic information useful for debugging. \n",
        "                 We can report here the transition matrix used, the full table\n",
        "                 of <compoment, failure, utility_increase> tuples.\n",
        "\n",
        "                 However, note that official evaluations of our agent should not \n",
        "                 use this internal information for learning.\n",
MrBanhBao commented 4 years ago

@2start Implemented a gym.Env in the envs dir. You can also find a notebook with some examples of how to use this environment.

@brrrachel I copied your data_handler into the env directory. I added the method "get_repair_failure_probability"; it should return the probability that a repair action will fail. At the moment it is static, with a probability of 10% that a repair fails. I do not know if there is any information about repair action failures. Maybe @christianadriano has an idea of how to choose the failure rate.
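
Roughly, the method currently behaves like the sketch below (the DataHandler class name and the usage snippet are only illustrative):

```python
import random


class DataHandler:
    """Sketch of the data_handler method discussed above."""

    def get_repair_failure_probability(self):
        # Static for now: a repair action fails with 10% probability.
        return 0.1


# Possible usage inside the environment's repair step:
handler = DataHandler()
if random.random() < handler.get_repair_failure_probability():
    pass  # the repair failed, e.g. keep the <component, failure> pair in the list
```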

MrBanhBao commented 4 years ago

@brrrachel @2start @christianadriano We should also think of a strategy to "punish" the agent for repairing an already repaired component.

e.g.: Components: ["A", "B", "C"] -> repair "A" -> ["B", "C"] -> repair "A" -> ["B", "C"]

At the moment the reward for these redundant actions is zero. The agent can still learn that specific actions do not make sense in a certain state, but I believe the agent's policy would converge much faster if we made this punishment negative with respect to the steps already taken. That means the negative reward grows with the number of steps taken so far. This negative reward should be roughly in the same range as the different reward functions ('cubic', 'log10', 'srt').
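
Something along these lines, where the function name and the scale factor are placeholders and would have to be tuned so the penalty stays in the range of the reward functions:

```python
def redundant_action_penalty(steps_taken, scale=0.1):
    """Penalty for repairing an already repaired component.

    Grows with the number of steps already taken; scale is a placeholder
    that would need tuning against the 'cubic', 'log10' and 'srt' rewards.
    """
    return -scale * steps_taken
```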

2start commented 4 years ago

Thanks! Regarding the negative reward: what's the counterargument against a fixed negative reward?

brrrachel commented 4 years ago

@2start that's what @christianadriano mentioned last week. We want to change to a non-stationary environment where, e.g., the number of users/computers might influence the overall system behaviour, and the rewards might depend on or be influenced by this (e.g., with more users we get a worse reward).
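
For example (purely illustrative, the actual dependency on the number of users is still open), the reward for a repair could shrink as the load grows:

```python
def load_dependent_reward(base_reward, num_users, reference_users=100):
    # Illustrative only: the same repair yields a lower reward under higher load.
    return base_reward * reference_users / max(num_users, 1)
```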