montallban opened this issue 4 years ago
I worked through this tutorial on Q-learning. Then I figured out discretization on my own and solved the CartPole problem, and adapted that approach for MountainCar. I believe at some point I changed the Q-learning algorithm to the version we learned in class, since the implementation from that tutorial wasn't working for MountainCar.
Sorry about the readability; I just pushed the code I was playing around with and haven't made it human-readable yet. I'll go back through and clean it up.
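For context, the discretization just maps each continuous observation dimension to a bucket index so the Q-table can be indexed like a plain array (in the train() code below the first dimension is still called angle, which is a holdover from the CartPole version). Here's a minimal sketch of what a binning helper along the lines of bin_data can look like; the bin count is arbitrary and MountainCar's observation bounds are hard-coded, so treat it as illustrative rather than the exact code in the repo:

import numpy as np

def bin_data(observation, num_bins=20):
    # MountainCar observations are (position, velocity); the bounds below come from
    # the environment's observation_space, and num_bins is an arbitrary choice
    position, velocity = observation
    position_bins = np.linspace(-1.2, 0.6, num_bins)
    velocity_bins = np.linspace(-0.07, 0.07, num_bins)
    # np.digitize returns an integer bucket index (0..num_bins) for each value,
    # which is what ends up indexing the Q-table
    return np.digitize(position, position_bins), np.digitize(velocity, velocity_bins)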
As for changing this to implement eligibility traces, you would need to change the train() method to use eligibility traces instead of plain Q-learning, and maybe also change the .evaluate() method if eligibility traces affect more than just the Q-table (there's a rough sketch of that change after the pseudocode below).
Here's some vaguely pseudo-code of the train() method.
def train(self):
    streak = 0
    max_iterations = 10000
    # loop over episodes until the max is reached or the stopping conditions below are met
    for episode in range(max_iterations):
        # update hyperparameters here (e.g. decay self.epsilon / self.alpha for this episode)
        # reset the gym environment and discretize the starting state
        state = self.env.reset()
        angle, velocity = self.bin_data(state)
        epochs = 0
        done = False
        # run the simulation until the episode is complete
        while not done:
            # save the old state
            angle_old, velocity_old = angle, velocity
            # either do something random or do the model's best predicted action
            if random.uniform(0, 1) < self.epsilon:
                action = self.env.action_space.sample()  # Explore action space
            else:
                action = np.argmax(self.q_table[angle_old][velocity_old])  # Exploit learned values
            # run the simulation one step
            next_state, reward, done, info = self.env.step(action)
            # convert the continuous state data to discrete data
            angle, velocity = self.bin_data(next_state)
            # update the q learning model
            next_max = np.max(self.q_table[angle][velocity])
            old_value = self.q_table[angle_old][velocity_old][action]
            self.q_table[angle_old][velocity_old][action] += self.alpha * (reward + self.gamma * next_max - old_value)
            # get ready for next loop
            state = next_state
            epochs += 1
        # The rest of the code is a set of arbitrary conditions that signal the model is trained.
        # I was playing around with them and these conditions seem to yield good results most of the time.
        # Feel free to play around with this as much as you'd like.
        if epochs < 130:
            streak += 1
        else:
            streak = 0
        if streak > 2:
            print("Found Streak at Episode: " + str(episode))
            break
        if epochs < 100:
            print("Optimal Detected")
            # break
        # Print progress bar and then add data to graph
        if episode % (max_iterations // 10) == 0:
            # print("Training " + str(episode / max_iterations * 100) + "% Complete.")
            pass
        self.convergence_graph.append(epochs)
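On the eligibility-traces point above, here's a rough sketch of what the episode loop could look like with Watkins's Q(lambda). This isn't code from the repo: the trace table e_table and the decay parameter self.lambda_ are names I'm introducing just for illustration, and it assumes self.q_table is a numpy array so the whole-table update works.

# inside train(), for each episode (everything outside this loop stays the same)
e_table = np.zeros_like(self.q_table)  # one trace per (angle bin, velocity bin, action), same shape as the Q-table
state = self.env.reset()
angle, velocity = self.bin_data(state)
epochs = 0
done = False
while not done:
    angle_old, velocity_old = angle, velocity
    greedy_action = np.argmax(self.q_table[angle_old][velocity_old])
    if random.uniform(0, 1) < self.epsilon:
        action = self.env.action_space.sample()  # Explore action space
    else:
        action = greedy_action  # Exploit learned values
    # Watkins's Q(lambda): a non-greedy action breaks the backup chain, so clear all traces
    if action != greedy_action:
        e_table[:] = 0.0
    next_state, reward, done, info = self.env.step(action)
    angle, velocity = self.bin_data(next_state)
    # one-step TD error toward the greedy value of the next state (same target as before)
    td_error = (reward
                + self.gamma * np.max(self.q_table[angle][velocity])
                - self.q_table[angle_old][velocity_old][action])
    # bump the trace for the state/action pair we just visited...
    e_table[angle_old][velocity_old][action] += 1.0
    # ...then spread the TD error over every entry in proportion to its trace
    self.q_table += self.alpha * td_error * e_table
    # decay the traces so older state/action pairs get exponentially less credit
    e_table *= self.gamma * self.lambda_
    epochs += 1

The net effect is that a reward no longer updates only the last state/action pair but every recently visited pair, weighted by gamma * lambda per step, which tends to help on sparse-reward problems like MountainCar. If it's done this way, .evaluate() shouldn't need to change, since the traces only affect how the Q-table gets updated during training.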
Have you done something like this before? I'm trying to understand the code and see what it's doing. The comments are a bit sparse, so it hasn't been exactly easy. Perhaps you could walk me through it at some point? What will I have to change in order to implement eligibility traces?
Edit: Really, I think just seeing the pseudocode would help.