kurtzace opened 5 days ago
Reinforcement learning algorithms are trained by repeated optimization of cumulative rewards. The model will learn which action (and then subsequent actions) will result in the highest cumulative reward on the way to the goal. Learning doesn’t just happen on the first go; it takes some iteration. First, the agent needs to explore and see where it can get the highest rewards, before it can exploit that knowledge.
Exploitation and Convergence
With more experience, the agent gets better and is eventually able to reach the destination reliably. Depending on the exploration-exploitation strategy, the vehicle may still take random actions with a small probability to keep exploring the environment.
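The exploration-exploitation trade-off above is commonly handled with an epsilon-greedy strategy. This is a minimal illustrative sketch, not DeepRacer's internal implementation; the function name and signature are assumptions:

```python
import random

def choose_action(q_values, epsilon=0.1):
    """Epsilon-greedy action selection: with probability epsilon take a
    random action (explore), otherwise take the best-known action (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore: random action index
    # exploit: index of the action with the highest estimated value
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With epsilon = 0 the agent always exploits; early in training a higher epsilon forces the exploration the paragraph describes.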
The parameters passed to the reward function describe various aspects of the state of the vehicle, such as its position and orientation on the track, its observed speed, steering angle and more. We will explore some of these parameters and how they describe the vehicle as it drives around the track:
Position on track (The parameters x and y describe the position of the vehicle in meters, measured from the lower-left corner of the environment.)
Heading (The heading parameter describes the orientation of the vehicle in degrees, measured counter-clockwise from the X-axis of the coordinate system.)
Waypoints (The waypoints parameter is an ordered list of milestones placed along the track center. Each waypoint in waypoints is a pair [x, y] of coordinates in meters, measured in the same coordinate system as the car's position.)
Track width (The track_width parameter is the width of the track in meters.)
Distance from center line (The distance_from_center parameter measures the displacement of the vehicle from the center of the track. The is_left_of_center parameter is a boolean describing whether the vehicle is to the left of the center line of the track.)
All wheels on track (The all_wheels_on_track parameter is a boolean that is True when the vehicle is completely within the track boundary.)
Speed (The speed parameter measures the observed speed of the vehicle, measured in meters per second.)
Steering angle (The steering_angle parameter measures the steering angle of the vehicle, measured in degrees. This value is negative if the vehicle is steering right, and positive if the vehicle is steering left.)
Parameter | Description
---|---
x and y | The position of the vehicle on the track
heading | Orientation of the vehicle on the track
waypoints | List of waypoint coordinates
closest_waypoints | Index of the two closest waypoints to the vehicle
progress | Percentage of track completed
steps | Number of steps completed
track_width | Width of the track
distance_from_center | Distance from the track center line
is_left_of_center | Whether the vehicle is to the left of the center line
all_wheels_on_track | Is the vehicle completely within the track boundary?
speed | Observed speed of the vehicle
steering_angle | Steering angle of the front wheels. Range: -30:30. The negative sign (-) means steering to the right and the positive sign (+) means steering to the left.
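The progress and steps parameters from the table can be combined into a simple efficiency reward. This is a hypothetical sketch (the ratio-based scheme and default values are assumptions, not an official example):

```python
def reward_function(params):
    '''
    Hypothetical sketch: reward making more progress in fewer steps,
    using the progress (0-100 %) and steps parameters from the table.
    '''
    progress = params['progress']  # percentage of track completed
    steps = params['steps']        # number of steps completed so far

    # Very low default reward
    reward = 1e-3
    if steps > 0:
        # Higher reward when more progress is made per step taken
        reward = progress / steps

    return float(reward)
```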
is_offtrack
Type: Boolean
Range: [True:False]
A Boolean flag to indicate whether the agent has gone off track (True) or not (False) as a termination status.
is_reversed
Type: Boolean
Range: [True:False]
A Boolean flag to indicate whether the agent is driving clockwise (True) or counterclockwise (False). It's used when you enable direction change for each episode.
heading
Type: float
Range: -180:+180
Heading direction, in degrees, of the agent with respect to the x-axis of the coordinate system.
In this example, we give a high reward for when the car stays on the track, and penalize if the car deviates from the track boundaries. This example uses the all_wheels_on_track, distance_from_center and track_width parameters to determine whether the car is on the track, and give a high reward if so. Since this function doesn't reward any specific kind of behavior besides staying on the track, an agent trained with this function may take a longer time to converge to any particular behavior.
```python
def reward_function(params):
    '''
    Example of rewarding the agent to stay inside the two borders of the track
    '''
    # Read input parameters
    all_wheels_on_track = params['all_wheels_on_track']
    distance_from_center = params['distance_from_center']
    track_width = params['track_width']

    # Give a very low reward by default
    reward = 1e-3

    # Give a high reward if no wheels go off the track and
    # the agent is somewhere in between the track borders
    if all_wheels_on_track and (0.5 * track_width - distance_from_center) >= 0.05:
        reward = 1.0

    # Always return a float value
    return float(reward)
```
Follow Center Line
In this example we measure how far away the car is from the center of the track, and give a higher reward if the car is close to the center line. This example uses the track_width and distance_from_center parameters, and returns a decreasing reward the further the car is from the center of the track. This example is more specific about what kind of driving behavior to reward, so an agent trained with this function is likely to learn to follow the track very well. However, it is unlikely to learn any other behavior such as accelerating or braking for corners.
```python
def reward_function(params):
    '''
    Example of rewarding the agent to follow the center line
    '''
    # Read input parameters
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']

    # Calculate 3 markers that are at varying distances away from the center line
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width

    # Give higher reward if the car is closer to the center line and vice versa
    if distance_from_center <= marker_1:
        reward = 1.0
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
        reward = 1e-3  # likely crashed / close to off track

    return float(reward)
```
Prevent Zig-Zag

```python
def reward_function(params):
    '''
    Example of penalizing steering, which helps mitigate zig-zag behaviors
    '''
    # Read input parameters
    distance_from_center = params['distance_from_center']
    track_width = params['track_width']
    abs_steering = abs(params['steering_angle'])  # Only need the absolute steering angle

    # Calculate 3 markers that are farther and farther away from the center line
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width

    # Give higher reward if the car is closer to the center line and vice versa
    if distance_from_center <= marker_1:
        reward = 1.0
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
        reward = 1e-3  # likely crashed / close to off track

    # Steering penalty threshold; change the number based on your action space setting
    ABS_STEERING_THRESHOLD = 15

    # Penalize reward if the car is steering too much
    if abs_steering > ABS_STEERING_THRESHOLD:
        reward *= 0.8

    return float(reward)
```
Tips on how to be fast
A to Z Speedway It’s easier for an agent to navigate this extra wide version of re:Invent 2018. Use it to get started with object avoidance and head-to-head race training.
Length: 16.64 m (54.59') Width: 107 cm (42")
Direction: Clockwise, Counterclockwise
When driving counterclockwise (anti-clockwise), observed headings around the track:
- heading ≈ 125
- heading ≈ 178
- heading ≈ -77 on the way down
Random thoughts on what an ideal reward function could be:
Think in terms of percentages
Plotting the clockwise waypoints:
```python
import os

import matplotlib.pyplot as plt
import numpy as np

tracksPath = '~/Downloads/reInvent2019_wide_cw.npy'

# Track name
track_name = "A to Z Speedway"

# Location of tracks folder
absolute_path = "."

# Get waypoints from numpy file ('~' must be expanded for np.load to open it)
waypoints = np.load(os.path.expanduser(tracksPath))

# Get number of waypoints
print("Number of waypoints = " + str(waypoints.shape[0]))

# Plot waypoints (columns 2 and 3 of each row are used as the x, y coordinates)
for i, point in enumerate(waypoints):
    waypoint = (point[2], point[3])
    plt.scatter(waypoint[0], waypoint[1])
    plt.text(waypoint[0], waypoint[1], str(i), fontsize=9, ha='right')
    print("Waypoint " + str(i) + ": " + str(waypoint))

# Display the plot
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.title(f'Waypoints for {track_name}')
plt.show()
```
- is_reversed - give it a negative reward
- is_offtrack - give it a negative reward
- Prevent zig-zag (copy the steering-penalty reward function above)
- is_crashed - give it a negative reward
- Give a positive reward for keeping speed above 1.3
- Positive reward: in general aim for a speed of 1.5 in turns and 3.5 to 4 on straight roads
- Heading should not vary by more than 50 degrees from the previous heading, i.e. the steering angle should not exceed 50 degrees
- When the speed is 4, the car must be very close to the center line
- When the car drifts from the center line by 15%, reduce speed to 3
- When the car drifts from the center line by 30%, reduce speed to 2
- When the car drifts from the center line by 50%, reduce speed to 1.4
- Keep a very high reward for being on the center line (copy from the center-line reward function above)
- All wheels should be on track - high reward
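The "slower when farther from center" idea above can be sketched as a reward function. The distance thresholds (15%, 30%, 50% of half the track width) and target speeds (4, 3, 2, 1.4) come from the notes; the reward values themselves are assumptions:

```python
def reward_function(params):
    '''
    Sketch: allow a lower top speed the farther the car drifts from
    the center line (thresholds and speeds taken from the notes above).
    '''
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']
    speed = params['speed']

    # Distance from center as a fraction of half the track width
    ratio = distance_from_center / (0.5 * track_width)

    # Allowed top speed shrinks as the car drifts off-center
    if ratio <= 0.15:
        max_speed = 4.0
    elif ratio <= 0.30:
        max_speed = 3.0
    elif ratio <= 0.50:
        max_speed = 2.0
    else:
        max_speed = 1.4

    # Full reward while under the cap, reduced reward when exceeding it
    reward = 1.0 if speed <= max_speed else 0.5
    return float(reward)
```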
We want the vehicle to drive in the correct direction. By "correct direction," one obvious candidate is the direction of the waypoints that outline the center line of the track.
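Driving toward the correct direction can be sketched by comparing the heading against the direction of the track segment between the two closest waypoints. The 10-degree threshold and 0.5 penalty here are assumptions:

```python
import math

def reward_function(params):
    '''
    Sketch: reward heading in the direction of the center-line waypoints.
    '''
    waypoints = params['waypoints']
    closest_waypoints = params['closest_waypoints']
    heading = params['heading']

    # Direction of the center line, from the previous waypoint to the next one
    next_point = waypoints[closest_waypoints[1]]
    prev_point = waypoints[closest_waypoints[0]]
    track_direction = math.degrees(
        math.atan2(next_point[1] - prev_point[1],
                   next_point[0] - prev_point[0]))

    # Smallest angular difference between track direction and heading
    direction_diff = abs(track_direction - heading)
    if direction_diff > 180:
        direction_diff = 360 - direction_diff

    # Penalize if the car is pointed too far from the track direction
    reward = 1.0
    if direction_diff > 10.0:
        reward = 0.5
    return float(reward)
```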
Better waypoint targets for the clockwise direction:
Evaluation with speed limits of 1.5 to 3:
- 75% to 100% - speed of 4
- 60% to 74% - speed of 1.5
- 40% to 59% - speed of 3
- 25% to 39% - speed of 1.5
- 10% to 24% - speed of 3
- 0% to 9% - speed of 1.5
- 75% to 100% - follow the center line (reward function above)
- 60% to 74% - turn right
- 40% to 59% - follow the center line (reward function above)
- 25% to 39% - turn right
- 10% to 24% - speed of 3, follow the center line (reward function above), but also mild turning to the left
- 0% to 9% - speed of 1.5 - turn right
- 0% to 9% - could be right of the center line by 50%
- 75% to 100% - speed of 4 - should be exactly on the center line
- 60% to 74% - speed of 1.5 - could be right of the center line by 50%
- 35% to 40% - speed of 3 - could be right of the center line by 50%
- 40% to 60% - speed of 3 - could be left of the center line by 50%
- 25% to 39% - speed of 1.5 - could be right of the center line by 50%
- 10% to 24% - speed of 3 - should be exactly on the center line
- 75% to 100% - heading could be 180 degrees
- 40% to 59% - heading could be -55 degrees
- 10% to 24% - heading could be 103 degrees
- 25% to 39% - heading could be 0 degrees
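The per-segment target speeds above could be encoded as a simple lookup on track progress. The zone boundaries and speeds are copied from the notes; the function name and fallback value are assumptions:

```python
def target_speed_for_progress(progress):
    '''
    Sketch: map track progress (0-100 %) to a target speed, using the
    zones listed in the notes above.
    '''
    zones = [
        (0, 9, 1.5),     # turn right, slow
        (10, 24, 3.0),   # follow the center line
        (25, 39, 1.5),   # turn right, slow
        (40, 59, 3.0),   # follow the center line
        (60, 74, 1.5),   # turn right, slow
        (75, 100, 4.0),  # straight: fast, stay exactly on the center line
    ]
    for lo, hi, speed in zones:
        if lo <= progress <= hi:
            return speed
    return 1.5  # fallback between zone boundaries
```

Inside a reward function, the car could then be rewarded for staying close to `target_speed_for_progress(params['progress'])`.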
Video notes
- Car provided
- 3D racing simulator
- DeepRacer uses reinforcement learning
- Agent - the car
- Action taken by the agent - rewarded with a positive, zero, or negative reward
- Episode - from start to end, or until the car drives off the track
- Rewards
- Exploration (may go off track)
- Exploitation (safer track-boundary adherence)
- Speed, steering angle - parameters
- Console has 15 to 20 tracks
- Reward functions
- Input params:
  - heading (angle from the x-axis)
  - all_wheels_on_track - True (could be the start of a reward)
  - distance_from_center (0 to 1)
- Default params
- Vehicle performs an action - moves from a to b - the state is updated
- Action space:
  - Discrete - tabular - no fine tuning, but training will converge faster
  - Continuous - gives more freedom, but training time is high
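A discrete action space is just an enumerated list of fixed (steering angle, speed) pairs the agent can pick from. This is a hypothetical sketch of such a space, not the console's actual defaults:

```python
# Hypothetical discrete action space: each action is a fixed
# (steering_angle, speed) pair; the agent picks one per step.
discrete_actions = [
    {'steering_angle': -30, 'speed': 1.5},  # hard right, slow
    {'steering_angle': -15, 'speed': 2.5},  # gentle right
    {'steering_angle': 0,   'speed': 4.0},  # straight, fast
    {'steering_angle': 15,  'speed': 2.5},  # gentle left
    {'steering_angle': 30,  'speed': 1.5},  # hard left, slow
]
```

A continuous action space would instead let the agent output any steering angle and speed within configured ranges, at the cost of longer training.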
- Set up racer profile
- Example track: A to Z Speedway
- Clockwise is the track direction
- PPO algorithm (uses 2 neural networks)
- The other algorithm is SAC
- 1 to 2 hours for model convergence
- Lap time should be minimal, with the car not leaving the track
- 15 training hours per team
- Clone good models
- At least 1 tyre should be on the track