Reinforcement learning algorithms are trained by repeated optimization of cumulative rewards. The model will learn which action (and then subsequent actions) will result in the highest cumulative reward on the way to the goal. Learning doesn’t just happen on the first go; it takes some iteration. First, the agent needs to explore and see where it can get the highest rewards, before it can exploit that knowledge.
Exploitation and convergence: with more experience, the agent gets better and eventually is able to reach the destination reliably. Depending on the exploration-exploitation strategy, the vehicle may still have a small probability of taking random actions to explore the environment.
From the wiki: the parameters passed to the reward function describe various aspects of the vehicle's state, such as its position and orientation on the track, its observed speed, its steering angle, and more. We will explore some of these parameters and how they describe the vehicle as it drives around the track:
Position on track (The parameters x and y describe the position of the vehicle in meters, measured from the lower-left corner of the environment.)
Heading (The heading parameter describes the orientation of the vehicle in degrees, measured counter-clockwise from the X-axis of the coordinate system.)
Waypoints (The waypoints parameter is an ordered list of milestones placed along the track center. Each waypoint in waypoints is a pair [x, y] of coordinates in meters, measured in the same coordinate system as the car's position.) Image: anticlockwise waypoints
Track width (The track_width parameter is the width of the track in meters.)
Distance from center line (The distance_from_center parameter measures the displacement of the vehicle from the center of the track. The is_left_of_center parameter is a boolean describing whether the vehicle is to the left of the center line of the track.)
All wheels on track (The all_wheels_on_track parameter is a Boolean indicating whether the vehicle is completely within the track boundary.)
Speed (The speed parameter measures the observed speed of the vehicle, measured in meters per second.)
Steering angle (The steering_angle parameter measures the steering angle of the vehicle, measured in degrees. This value is negative if the vehicle is steering right, and positive if the vehicle is steering left.)
Important parameters

Parameter | Description
---|---
x and y | The position of the vehicle on the track
heading | Orientation of the vehicle on the track
waypoints | List of waypoint coordinates
closest_waypoints | Index of the two closest waypoints to the vehicle
progress | Percentage of track completed
steps | Number of steps completed
track_width | Width of the track
distance_from_center | Distance from the track center line
is_left_of_center | Whether the vehicle is to the left of the center line
all_wheels_on_track | Is the vehicle completely within the track boundary?
speed | Observed speed of the vehicle
steering_angle | Steering angle of the front wheels. Range: -30:+30. A negative (-) value means steering to the right; a positive (+) value means steering to the left.
is_offtrack
Type: Boolean
Range: [True, False]
A Boolean flag to indicate whether the agent has gone off track (True) or not (False) as a termination status.

is_reversed
Type: Boolean
Range: [True, False]
A Boolean flag to indicate whether the agent is driving clockwise (True) or counter-clockwise (False).
It's used when you enable direction change for each episode.

heading
Type: float
Range: -180:+180
Heading direction, in degrees, of the agent with respect to the x-axis of the coordinate system.
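To make the shape of the input concrete, here is an illustrative sketch of the `params` dictionary a reward function receives. The field names come from the parameters documented above; the specific values are invented for the example.

```python
# Illustrative only: field names match the documented input parameters,
# values are made up for the example.
example_params = {
    "all_wheels_on_track": True,        # Boolean
    "x": 2.7, "y": 0.9,                 # position in meters
    "heading": 125.0,                   # degrees, measured from the x-axis
    "distance_from_center": 0.12,       # meters from the center line
    "is_left_of_center": True,
    "progress": 42.5,                   # percent of track completed
    "steps": 180,
    "speed": 2.0,                       # meters per second
    "steering_angle": -10.0,            # degrees; negative = steering right
    "track_width": 1.07,                # meters
    "waypoints": [[0.3, 2.9], [0.5, 2.9]],  # truncated list of [x, y] pairs
    "closest_waypoints": [17, 18],
    "is_offtrack": False,
    "is_reversed": False,
}
```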
In this example, we give a high reward when the car stays on the track, and penalize it if it deviates from the track boundaries. This example uses the all_wheels_on_track, distance_from_center and track_width parameters to determine whether the car is on the track, and gives a high reward if so. Since this function doesn't reward any specific kind of behavior besides staying on the track, an agent trained with it may take longer to converge to any particular behavior.
```python
def reward_function(params):
    '''
    Example of rewarding the agent to stay inside the two borders of the track
    '''
    # Read input parameters
    all_wheels_on_track = params['all_wheels_on_track']
    distance_from_center = params['distance_from_center']
    track_width = params['track_width']

    # Give a very low reward by default
    reward = 1e-3

    # Give a high reward if no wheels go off the track and
    # the agent is somewhere in between the track borders
    if all_wheels_on_track and (0.5 * track_width - distance_from_center) >= 0.05:
        reward = 1.0

    # Always return a float value
    return float(reward)
```
Follow Center Line: In this example we measure how far the car is from the center of the track, and give a higher reward if the car is close to the center line. This example uses the track_width and distance_from_center parameters, and returns a decreasing reward the further the car is from the center of the track. This example is more specific about what kind of driving behavior to reward, so an agent trained with this function is likely to learn to follow the track very well. However, it is unlikely to learn any other behavior, such as accelerating or braking for corners.
```python
def reward_function(params):
    '''
    Example of rewarding the agent to follow the center line
    '''
    # Read input parameters
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']

    # Calculate 3 markers that are at varying distances away from the center line
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width

    # Give higher reward if the car is closer to the center line and vice versa
    if distance_from_center <= marker_1:
        reward = 1.0
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
        reward = 1e-3  # likely crashed / close to off track

    return float(reward)
```
Prevent Zig-Zag: in this example, a steering penalty is added on top of the center-line reward to help mitigate zig-zag behavior.

```python
def reward_function(params):
    '''
    Example of penalizing steering, which helps mitigate zig-zag behaviors
    '''
    # Read input parameters
    distance_from_center = params['distance_from_center']
    track_width = params['track_width']
    abs_steering = abs(params['steering_angle'])  # Only need the absolute steering angle

    # Calculate 3 markers that are farther and farther away from the center line
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width

    # Give higher reward if the car is closer to the center line and vice versa
    if distance_from_center <= marker_1:
        reward = 1.0
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
        reward = 1e-3  # likely crashed / close to off track

    # Steering penalty threshold, change the number based on your action space setting
    ABS_STEERING_THRESHOLD = 15

    # Penalize reward if the car is steering too much
    if abs_steering > ABS_STEERING_THRESHOLD:
        reward *= 0.8

    return float(reward)
```
How to be fast - tips:
https://youtu.be/wqf-dJyU_WA?si=B2DM-7RXUoc6FDNI
https://youtu.be/KBXMan0Dafw?si=YGjixuJoc7HwibZV
From above: waypoints (counterclockwise).
A to Z Speedway It’s easier for an agent to navigate this extra wide version of re:Invent 2018. Use it to get started with object avoidance and head-to-head race training.
Length: 16.64 m (54.59') Width: 107 cm (42")
Direction: Clockwise, Counterclockwise
Image: visual representation of the heading parameter when driving anti-clockwise - heading ≈ 125° at one point on the track, ≈ 178° at another, and ≈ -77° on the way down.
Random thoughts: what could an ideal reward function be?
Image below: think in terms of percentages - a compass of track-completion percentage superimposed on the track (clockwise).
Clockwise waypoints: I downloaded the track numpy file from this site and plotted the waypoints of the clockwise track myself, as I could not find a plot online.
```python
import os

import matplotlib.pyplot as plt
import numpy as np

# Path to the downloaded track numpy file (expand '~' so np.load can open it)
tracks_path = os.path.expanduser('~/Downloads/reInvent2019_wide_cw.npy')

# Track name
track_name = "A to Z Speedway"

# Get waypoints from the numpy file
waypoints = np.load(tracks_path)

# Get number of waypoints
print("Number of waypoints = " + str(waypoints.shape[0]))

# Plot waypoints (the x/y pair stored at columns 2 and 3 of each row)
for i, point in enumerate(waypoints):
    waypoint = (point[2], point[3])
    plt.scatter(waypoint[0], waypoint[1])
    plt.text(waypoint[0], waypoint[1], str(i), fontsize=9, ha='right')
    print("Waypoint " + str(i) + ": " + str(waypoint))

# Display the plot
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.title(f'Waypoints for {track_name}')
plt.show()
```
- is_reversed - give it a negative reward
- is_offtrack - give it a negative reward
- Prevent zig-zag (copy the steering-penalty reward function above)
- is_crashed - give it a negative reward
- Give a positive reward for keeping speed above 1.3
- Positive reward: in general aim for a speed of about 1.5 on turns and 3.5 to 4 on straights
- Heading should not vary by more than 50 degrees from the previous heading; that is, the steering change should not exceed 50 degrees
- When the speed is 4, the car must be very close to the center line
- When the car drifts 15% away from the center line, reduce speed to 3
- When the car drifts 30% away from the center line, reduce speed to 2
- When the car drifts 50% away from the center line, reduce speed to 1.4
- Keep a very high reward for being on the center line (copy from the reward function above)
- All wheels should be on the track - high reward

A hedged sketch combining some of these ideas follows below.
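A minimal sketch of how a few of these ideas could be combined into a single reward function. This is not a function we actually raced with; the thresholds (speed bands, the 15-degree steering cap, the center-line markers) are illustrative assumptions taken from the notes above.

```python
def reward_function(params):
    '''
    Sketch: combine off-track/reversed/crashed penalties, a center-line reward,
    a zig-zag penalty, and a speed bonus. All thresholds are illustrative.
    '''
    # Hard penalties: termination-worthy states get (almost) no reward
    if params.get('is_offtrack') or params.get('is_crashed') or params.get('is_reversed'):
        return 1e-3

    track_width = params['track_width']
    distance_from_center = params['distance_from_center']
    speed = params['speed']
    abs_steering = abs(params['steering_angle'])

    # Center-line reward (same markers as the examples above)
    if distance_from_center <= 0.1 * track_width:
        reward = 1.0
    elif distance_from_center <= 0.25 * track_width:
        reward = 0.5
    elif distance_from_center <= 0.5 * track_width:
        reward = 0.1
    else:
        reward = 1e-3

    # Speed bonus: prefer staying above 1.3 m/s, and only allow high speed
    # when the car is close to the center line (assumed rule from the notes)
    if speed > 1.3:
        reward += 0.5
    if speed >= 4.0 and distance_from_center > 0.1 * track_width:
        reward *= 0.5  # fast but far from the center: damp the reward

    # Zig-zag penalty (steering threshold is an assumption)
    if abs_steering > 15:
        reward *= 0.8

    return float(reward)
```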
We want the vehicle to drive in the correct direction. An obvious candidate for "correct direction" is the sequence of waypoints that outline the center line of the track; a direction-alignment sketch is below.
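A hedged sketch of that idea, based on the commonly used pattern of comparing the car's heading with the bearing from the previous closest waypoint to the next one; the 10-degree tolerance is an arbitrary choice.

```python
import math

def reward_function(params):
    '''
    Sketch: reward alignment between the car's heading and the track direction,
    estimated from the two closest waypoints. The tolerance is illustrative.
    '''
    waypoints = params['waypoints']
    closest_waypoints = params['closest_waypoints']
    heading = params['heading']

    # Direction of the track segment, from the previous to the next waypoint
    prev_point = waypoints[closest_waypoints[0]]
    next_point = waypoints[closest_waypoints[1]]
    track_direction = math.degrees(
        math.atan2(next_point[1] - prev_point[1], next_point[0] - prev_point[0]))

    # Smallest absolute difference between track direction and heading
    direction_diff = abs(track_direction - heading)
    if direction_diff > 180:
        direction_diff = 360 - direction_diff

    # Penalize the reward if the difference is too large
    reward = 1.0
    if direction_diff > 10.0:
        reward *= 0.5

    return float(reward)
```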
better waypoints for clockwise
Evaluation with speed limits of 1.5 to 3 (this progress-based plan is sketched in code below).

Speed by progress:
- 75% to 100% - speed of 4
- 60% to 74% - speed of 1.5
- 40% to 59% - speed of 3
- 25% to 39% - speed of 1.5
- 10% to 24% - speed of 3
- 0% to 9% - speed of 1.5

Behavior by progress:
- 75% to 100% - follow the center line (reward function above)
- 60% to 74% - turn right
- 40% to 59% - follow the center line (reward function above)
- 25% to 39% - turn right
- 10% to 24% - speed of 3, follow the center line (reward function above), but also mild turning to the left
- 0% to 9% - speed of 1.5 - turn right

Allowed offset from the center line by progress:
- 0% to 9% - could be right of the center line by 50%
- 75% to 100% - speed of 4 - should be exactly on the center line
- 60% to 74% - speed of 1.5 - could be right of the center line by 50%
- 35% to 40% - speed of 3 - could be right of the center line by 50%
- 40% to 60% - speed of 3 - could be left of the center line by 50%
- 25% to 39% - speed of 1.5 - could be right of the center line by 50%
- 10% to 24% - speed of 3 - should be exactly on the center line

Heading by progress:
- 75% to 100% - heading could be 180 degrees
- 40% to 59% - heading could be -55 degrees
- 10% to 24% - heading could be 103 degrees
- 25% to 39% - heading could be 0 degrees
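A minimal sketch of how such a progress-segmented plan could be expressed as a reward function. The bands, target speeds, and allowed center-line offsets are copied from the plan above and are assumptions specific to this clockwise track, not general advice.

```python
# (progress_start, progress_end, target_speed, allowed_offset_as_fraction_of_half_width)
# Values are taken from the plan above; they are track-specific assumptions.
SEGMENTS = [
    (0, 9, 1.5, 0.5),
    (10, 24, 3.0, 0.0),
    (25, 39, 1.5, 0.5),
    (40, 59, 3.0, 0.5),
    (60, 74, 1.5, 0.5),
    (75, 100, 4.0, 0.0),
]

def reward_function(params):
    '''
    Sketch: reward matching a per-segment target speed and staying within a
    per-segment allowed offset from the center line.
    '''
    progress = params['progress']
    speed = params['speed']
    distance_from_center = params['distance_from_center']
    half_width = 0.5 * params['track_width']

    reward = 1e-3
    for start, end, target_speed, allowed_offset in SEGMENTS:
        if start <= progress <= end:
            # Reward being close to the target speed for this segment
            speed_score = max(0.0, 1.0 - abs(speed - target_speed) / target_speed)
            # Reward staying within the allowed lateral offset
            # (0.1 acts as a small tolerance for "exactly on the center line")
            offset_ok = distance_from_center <= max(allowed_offset, 0.1) * half_width
            reward = speed_score + (1.0 if offset_ok else 0.0)
            break

    return float(reward)
```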
Actual race day organised by my company and AWS
Below is from some other team: https://github.com/user-attachments/assets/dc700014-aa9f-4bb1-8c80-4c478a261f60
Build Artificial Intelligence (AI) agents using Deep Reinforcement Learning and PyTorch
State: the current situation the agent observes in the environment.
Action: what the agent chooses to do in a given state.
Reward: the feedback signal the environment returns after an action.
Agent: the learner / decision maker (here, the car).
Env: the environment the agent interacts with (here, the track / simulator).
Markov Decision Process: a discrete-time, stochastic control process for sequential decision making (the agent only partially influences how the future unfolds).
An action modifies the state and receives a reward.
Defined by the tuple (S, A, R, P): state space, actions, rewards received for performing actions, and the probability of passing from state to state.
The next state visited depends only on the current state; the process has no memory (the Markov property).
A Markov decision process can be viewed as many Markov chains (one induced by each fixed policy).
The decision process can be finite (like Pac-Man) or infinite (like a car).
Episodic (terminates) or continuing.
Trajectory: the elements generated as the agent moves from one state to another, e.g. τ = S0, A0, R1, S1, A1, ...
Episode: a trajectory that runs to the final (terminal) state.
Reward vs return: the goal is to maximize the sum of rewards (the return); chasing short-term reward may come at the cost of long-term return.
Unlike static curve fitting, RL has a time component: a feedback loop of observations (images) with a view of the future, not just a static function fit.
Imagine trying to solve the car race with supervised learning: given an image, what target label would you assign?
There is only a goal, no target labels.
• The agent interfaces with the game (via the API):

```python
game.start()
while not game.is_over():
    state = game.getstate()
    # do something intelligent
    location = agent.pick_move(state)
    # make the move (symbol is the agent's mark, e.g. X or O in tic-tac-toe)
    game.move(symbol, location)
```

Episode = game / round / match.
Non-episodic examples: stock trading, online ads (infinite horizons).
The agent will try to maximize its reward. E.g. -100 is better than -1 million; -100 can still mean you've solved the game.
State could be represented by 4 stacked frames instead of 1, since a single image does not convey movement - a drawback of feeding a CNN one image.
States can be discrete or continuous. Discrete example: tic-tac-toe - the state is a specific configuration of the board. Continuous example: a robot with sensors - camera, microphone, gyroscope, GPS, proximity sensor, etc.
Policy: yields an action given the current state (no past states or rewards are needed); for a finite state space it can be a dict, or more generally a probability distribution.

```python
def get_action(s):
    # epsilon-greedy: explore with probability epsilon, otherwise exploit
    if random() < epsilon:
        a = sample_from_action_space()   # pick a random action
    else:
        a = fixed_policy[s]
    return a
```

Policy parameter: W (shape D x |A|), with π(a|s) = softmax(Wᵀ s).
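A tiny numpy sketch of that parameterization; the dimensions, random values, and names are illustrative, matching the note above.

```python
import numpy as np

D = 4            # state feature dimension (illustrative)
num_actions = 3  # |A| (illustrative)

W = np.random.randn(D, num_actions)  # policy parameters, shape D x |A|

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def policy(s, W):
    # pi(a|s) = softmax(W^T s): a probability distribution over actions
    return softmax(W.T @ s)

s = np.random.randn(D)                       # a made-up state feature vector
probs = policy(s, W)
a = np.random.choice(num_actions, p=probs)   # sample an action from pi(.|s)
print(probs, a)
```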
MVP - stepping stone (state transition probability)
Builds up before Q Learning
Dynamic system relies on opponent too
Reward: Maximize the sum of future gains. Not immediate gratification.
Discounting is used for infinite horizons.
A reward right now is preferred over the same reward later.
Expected value: the mean (returns are random, so they have a mean and a standard deviation).
Returns are recursive
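In symbols, with discount factor γ, the return is defined recursively (standard definition):

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma\, G_{t+1}
$$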
Bellman equation
π is a probability (distribution over actions) and represents the agent/animal
V is the value function for policy π
p is the environment (its transition dynamics)
--
Learning happens when we find the policy π* that maximizes V(s) - the optimal policy; this is the control problem.
Bellman equation for Q - the action-value function, conditioned on the given action.
V is linear and Q is quadratic in size (V is indexed by state only; Q by state and action).
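For reference, a standard way to write the Bellman expectation equations these notes refer to, for V^π and Q^π:

$$
V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma V^\pi(s')\big]
$$

$$
Q^\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\Big[r + \gamma \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a')\Big]
$$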
The best policy as a math function: π*(s) = argmax_a Q*(s, a).
At times enumerating all policies is not possible: there are |A|^|S| deterministic policies.
Use the sample mean of returns to estimate values:

```python
# pseudocode: play one episode with the current policy
states, rewards = play_episode_using_policy()

returns = []
g = 0
returns.append(g)            # the return after the terminal state is 0
for r in reversed(rewards):
    g = r + gamma * g
    returns.append(g)

# returns are in reverse order, reverse them back
returns = list(reversed(returns))
```

Note:
len(states) = len(rewards) + 1, since the initial state has no reward.
Thus: len(states) = len(returns).
Pseudocode:

```python
# Q = random, policy = random (initialize both randomly)
for i in range(num_episodes):
    # replace policy evaluation with one episode only
    states, actions, rewards = play_one_episode(policy)
    returns = ...  # calculate as previously discussed
    for s, a, g in zip(states, actions, returns):
        Q[s, a] = Q[s, a] + learning_rate * (g - Q[s, a])  # Monte Carlo running-mean trick
    # policy improvement step
    for s in Q.states():
        policy[s] = argmax(Q[s, :])
```
Balance the explore-exploit dilemma.
Monte Carlo's drawback: we need to wait for the terminal state before we can compute returns.
video on the basics
Car provided and its features (OCR from image): 1:18-scale 4WD car, Intel Atom processor, Intel distribution of the OpenVINO toolkit, front-facing camera (4 megapixels), 4 GB system RAM, 802.11ac Wi-Fi, Ubuntu 20.04 Focal Fossa, ROS 2 Foxy Fitzroy. AWS DeepRacer Evo expansion pack: a second front-facing camera (stereo cameras) and a 260-degree, 12-meter scanning radius LiDAR sensor.
3d racing simulator
Deep racer uses Reinforcement (image shows distinction between supervised/unsupervised/RL learning)
Agent - car
Each action taken by the agent receives a positive, zero, or negative reward.
Episode: from start to finish, or until the car drives off the track.
Rewards: image shows how to incentivize center-line driving.
exploration (may go off track)
exploitation (safer track boundary adherence)
speed, steering angle - parameters
console has 15 to 20 tracks
reward functions
Image: how to edit the reward function in the AWS console - the editor is similar to AWS Lambda.
image below explains input params
heading (angle from x axis)
all wheels on track - true (could be start reward)
distance from center (0 to 1)
default params -
vehicle performs action - move from a to b - state is updated.
image below shows how AWS leverages CNN to give us input parameters
Action space. Image above: a discrete action space is tabular - no fine-grained control, but training converges faster.
A continuous action space gives more freedom, but training time is higher. (An illustrative discrete grid is sketched below.)
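As an illustration of what a discrete action space boils down to, here is a hedged sketch that enumerates (steering, speed) pairs into a table. The granularity, the speed values, and the dictionary format are assumptions for the example; only the -30..30 degree steering range comes from the parameters documented above.

```python
# Illustrative only: enumerate a discrete action grid of (steering_angle, speed) pairs.
steering_angles = [-30, -15, 0, 15, 30]   # degrees
speeds = [1.5, 2.5, 4.0]                  # meters per second

discrete_action_space = [
    {"index": i, "steering_angle": sa, "speed": sp}
    for i, (sa, sp) in enumerate(
        (sa, sp) for sa in steering_angles for sp in speeds
    )
]

for action in discrete_action_space:
    print(action)
# 5 steering angles x 3 speeds = 15 discrete actions the agent can choose from
```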
image: setup racer profile
example track: A to Z Speedway
clock wise is track direction
PPO - algo (2 NN)
Other algo is SAC
1 to 2 hours - model convergence
lap time should be minimal with car not leaving track
15 training hours per team
clone good models
at least 1 tyre (wheel) should remain on the track