tharmoth / MountainCart


How did you write the code for MountainCart.py? #2

montallban opened this issue 4 years ago

montallban commented 4 years ago

Have you done something like this before? I'm trying to understand it and see what it's doing. The comments are a bit sparse, so it hasn't been exactly easy. Perhaps you could walk me through it at some point? What will I have to change in order to implement eligibility traces?

Edit: Really, I think just seeing the pseudocode would help.

tharmoth commented 4 years ago

I worked through this tutorial on Q-learning, then figured out discretization on my own and solved the cart pole problem. I then adapted that for MountainCart. I believe at some point I changed the Q-learning algorithm to the one we learned in class, since the implementation from that tutorial wasn't working for mountain cart.
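
The discretization itself is just binning each continuous observation into a fixed number of buckets and using the bucket indices to look up the Q-table. Roughly something like this (the bin counts and ranges are placeholders, not necessarily what bin_data() in MountainCart.py actually uses):

    import numpy as np

    # Rough idea of the discretization -- bin counts and ranges here are
    # placeholders. MountainCar's observation is [position, velocity]; the
    # pseudocode below calls position "angle", presumably left over from the
    # cart pole version, so this sketch does too.
    POSITION_BINS = np.linspace(-1.2, 0.6, 20)    # MountainCar position range
    VELOCITY_BINS = np.linspace(-0.07, 0.07, 20)  # MountainCar velocity range

    def bin_data(state):
        """Map a continuous observation to a pair of discrete Q-table indices."""
        angle = int(np.digitize(state[0], POSITION_BINS))
        velocity = int(np.digitize(state[1], VELOCITY_BINS))
        return angle, velocity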

Sorry about the readability, I just pushed the code I was playing around with and haven't made it human readable. I'll go back through and clean the code up.

As for implementing eligibility traces, you would need to change the train() method to use eligibility traces instead of plain Q-learning, and maybe also change the .evaluate() method if eligibility traces affect more than just the Q-table.
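
If it helps, here's roughly what that change would look like using Watkins's Q(λ). This is just a sketch, not code from the repo: the bin counts, hyperparameters, and the bin_data() helper are placeholders, and it uses the same old-style gym step()/reset() API as the pseudocode below.

    import gym
    import numpy as np

    env = gym.make("MountainCar-v0")
    n_bins, n_actions = 20, env.action_space.n
    q_table = np.zeros((n_bins + 1, n_bins + 1, n_actions))
    pos_bins = np.linspace(-1.2, 0.6, n_bins)
    vel_bins = np.linspace(-0.07, 0.07, n_bins)
    alpha, gamma, lam, epsilon = 0.1, 0.99, 0.9, 0.1

    def bin_data(state):
        return int(np.digitize(state[0], pos_bins)), int(np.digitize(state[1], vel_bins))

    for episode in range(100):
        e_table = np.zeros_like(q_table)   # eligibility traces, reset each episode
        angle, velocity = bin_data(env.reset())
        done = False
        while not done:
            # epsilon-greedy action selection, same as the plain Q-learning loop
            greedy = int(np.argmax(q_table[angle][velocity]))
            action = env.action_space.sample() if np.random.rand() < epsilon else greedy

            next_state, reward, done, info = env.step(action)
            angle_new, velocity_new = bin_data(next_state)

            # TD error toward the best next action (same target as plain Q-learning)
            td_error = (reward + gamma * np.max(q_table[angle_new][velocity_new])
                        - q_table[angle][velocity][action])

            # bump the trace for the state/action just taken, then update
            # every traced entry instead of only the current one
            e_table[angle][velocity][action] += 1.0
            q_table += alpha * td_error * e_table

            # decay traces; Watkins's variant zeroes them after an exploratory action
            e_table *= gamma * lam if action == greedy else 0.0

            angle, velocity = angle_new, velocity_new

The only structural difference from the current train() is the e_table and the fact that the update touches the whole table, which is why .evaluate() probably doesn't need to change as long as it only reads the Q-table.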

tharmoth commented 4 years ago

Here's some rough pseudocode of the train() method.

    def train(self):
        streak = 0
        max_iterations = 10000
        loop until the episode count reaches max_iterations or the stop conditions below are met
            update hyperparameters (epsilon, alpha)

            reset the gym environment, bin its starting state into angle / velocity, and zero the step counter (epochs = 0)

            run the simulation until complete

                # save the old state
                angle_old, velocity_old = angle, velocity

                # either do something random or do the models best predicted action
                if random.uniform(0, 1) < self.epsilon:
                    action = self.env.action_space.sample()  # Explore action space
                else:
                    action = np.argmax(self.q_table[angle_old][velocity_old])  # Exploit learned values

                # run the simulation
                next_state, reward, done, info = self.env.step(action)

                # convert the continuous state data to discrete bin indices
                angle, velocity = self.bin_data(next_state)

                # update the Q-table with the standard Q-learning rule
                next_max = np.max(self.q_table[angle][velocity])
                old_value = self.q_table[angle_old][velocity_old][action]
                self.q_table[angle_old][velocity_old][action] += self.alpha * (reward + self.gamma * next_max - old_value)

                # get ready for next loop
                state = next_state
                epochs += 1

            # The rest of the code is a set of arbitrary conditions that signal the model is trained.
            # I was playing around with them and these conditions seem to yield good results most of the time.
            # Feel free to play around with them as much as you'd like.
            if epochs < 130:
                streak += 1
            else:
                streak = 0

            if streak > 2:
                print("Found Streak at Episode: " + str(episode))
                break

            if epochs < 100:
                print("Optimal Detected")
                # break

            # Print progress bar and then add data to graph
            if episode % (max_iterations // 10) == 0:
                # print("Training " + str(episode / max_iterations * 100) + "% Complete.")
                pass
            self.convergence_graph.append(epochs)
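
For the "update hyperparameters" step at the top of the loop, the usual trick is to decay epsilon and alpha as the episodes go by. Something along these lines works (the schedule and constants are just an example, not the exact ones in MountainCart.py):

    import math

    # One common decay schedule -- the constants here are illustrative only.
    MIN_EPSILON, MIN_ALPHA, DECAY = 0.01, 0.1, 25.0

    def decayed(episode, minimum):
        """Start near 1.0 and decay toward `minimum` as episodes accumulate."""
        return max(minimum, min(1.0, 1.0 - math.log10((episode + 1) / DECAY)))

    # at the top of the episode loop in train():
    # self.epsilon = decayed(episode, MIN_EPSILON)
    # self.alpha = decayed(episode, MIN_ALPHA)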