
AWS deep racer #14

Open kurtzace opened 1 month ago

kurtzace commented 1 month ago

video on the basics

Car provided and its features image OCR of image: 1:18 scale 4WD car, Intel Atom processor, Intel distribution of OpenVINO toolkit, front-facing camera (4 megapixels), system memory: 4 GB RAM, 802.11ac Wi-Fi, Ubuntu 20.04 Focal Fossa, ROS 2 Foxy Fitzroy. AWS DeepRacer Evo Expansion Pack: second front-facing camera (stereo cameras), 260-degree, 12-meter scanning radius lidar sensor.

3d racing simulator

DeepRacer uses reinforcement learning image (image shows the distinction between supervised/unsupervised/RL learning)

Agent - car

action taken by agent - rewarded with a positive, zero, or negative reward

episode - start to end - or until the car drives off the track

rewards image Image shows how to incentivize center-line driving

exploration (may go off track)

exploitation (safer track boundary adherence)

speed, steering angle - parameters

console has 15 to 20 tracks

reward functions

image Image: How to edit lambda function in AWS console - similar to AWS lambda

image below explains input params image

heading (angle from x axis)

all wheels on track - true (could be a starting reward condition)

distance from center (0 to ~track_width/2)

default params - image

vehicle performs action - move from a to b - state is updated.

image below shows how AWS leverages a CNN to give us the input parameters image

action space image Image above: discrete action space - tabular - no fine-grained control, but training will converge faster

continuous action space - gives more freedom - but training time is higher
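
For intuition, a discrete action space is essentially a small table of (steering_angle, speed) pairs the agent chooses from, while a continuous space only specifies ranges. A minimal illustrative sketch in Python (the values and structure here are assumptions, not the exact console format):

# Hypothetical discrete action space: each entry pairs a steering angle (degrees)
# with a speed (m/s); the agent picks one index per step.
discrete_action_space = [
    {"steering_angle": -30, "speed": 1.5},
    {"steering_angle": -15, "speed": 2.5},
    {"steering_angle": 0,   "speed": 3.0},
    {"steering_angle": 15,  "speed": 2.5},
    {"steering_angle": 30,  "speed": 1.5},
]

# A continuous action space would instead declare only the ranges, e.g.
# steering_angle in [-30, 30] and speed in [0.5, 3.0], and the agent outputs
# any value within them - more freedom, slower convergence.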

image: setup racer profile image

example track: A to Z Speedway

clockwise is the track direction

PPO - algorithm (uses 2 neural networks)

Other algo is SAC

1 to 2 hours - model convergence

lap time should be minimal with car not leaving track

15 training hours per team

clone good models

at least 1 wheel (tyre) should stay on the track

kurtzace commented 1 month ago

Reinforcement learning algorithms are trained by repeated optimization of cumulative rewards. The model will learn which action (and then subsequent actions) will result in the highest cumulative reward on the way to the goal. Learning doesn’t just happen on the first go; it takes some iteration. First, the agent needs to explore and see where it can get the highest rewards, before it can exploit that knowledge.

Exploitation and Convergence: With more experience, the agent gets better and eventually is able to reach the destination reliably. Depending on the exploration-exploitation strategy, the vehicle may still have a small probability of taking random actions to explore the environment.

parameters

wiki The parameters passed to the reward function describe various aspects of the state of the vehicle, such as its position and orientation on the track, its observed speed, steering angle and more. We will explore some of these parameters and how they describe the vehicle as it drives around the track:

Important parameters

| Parameter | Description |
| --- | --- |
| x and y | The position of the vehicle on the track |
| heading | Orientation of the vehicle on the track |
| waypoints | List of waypoint coordinates |
| closest_waypoints | Index of the two closest waypoints to the vehicle |
| progress | Percentage of track completed |
| steps | Number of steps completed |
| track_width | Width of the track |
| distance_from_center | Distance from the track center line |
| is_left_of_center | Whether the vehicle is to the left of the center line |
| all_wheels_on_track | Is the vehicle completely within the track boundary? |
| speed | Observed speed of the vehicle |
| steering_angle | Steering angle of the front wheels. Range: -30:30. The negative sign (-) means steering to the right and the positive sign (+) means steering to the left. |

more parameters

is_offtrack

Type: Boolean

Range: [True:False]

A Boolean flag to indicate whether the agent has gone off track (True) or not (False) as a termination status.

is_reversed

Type: Boolean

Range: [True:False]

A Boolean flag to indicate whether the agent is driving clockwise (True) or counter-clockwise (False).

It's used when you enable direction change for each episode.

Heading

Type: float

Range: -180:+180

Heading direction, in degrees, of the agent with respect to the x-axis of the coordinate system.

Example

In this example, we give a high reward for when the car stays on the track, and penalize if the car deviates from the track boundaries. This example uses the all_wheels_on_track, distance_from_center and track_width parameters to determine whether the car is on the track, and give a high reward if so. Since this function doesn't reward any specific kind of behavior besides staying on the track, an agent trained with this function may take a longer time to converge to any particular behavior.

def reward_function(params):
    '''
    Example of rewarding the agent to stay inside the two borders of the track
    '''

    # Read input parameters
    all_wheels_on_track = params['all_wheels_on_track']
    distance_from_center = params['distance_from_center']
    track_width = params['track_width']

    # Give a very low reward by default
    reward = 1e-3

    # Give a high reward if no wheels go off the track and
    # the agent is somewhere in between the track borders
    if all_wheels_on_track and (0.5*track_width - distance_from_center) >= 0.05:
        reward = 1.0

    # Always return a float value
    return float(reward)

Follow Center Line: In this example we measure how far away the car is from the center of the track, and give a higher reward if the car is close to the center line. This example uses the track_width and distance_from_center parameters, and returns a decreasing reward the further the car is from the center of the track. This example is more specific about what kind of driving behavior to reward, so an agent trained with this function is likely to learn to follow the track very well. However, it is unlikely to learn any other behavior such as accelerating or braking for corners.

def reward_function(params):
    '''
    Example of rewarding the agent to follow center line
    '''

    # Read input parameters
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']

    # Calculate 3 markers that are at varying distances away from the center line
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width

    # Give higher reward if the car is closer to center line and vice versa
    if distance_from_center <= marker_1:
        reward = 1.0
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
       reward = 1e-3  # likely crashed/ close to off track

    return float(reward)

Prevent zig-zag: This example incentivizes the agent to follow the center line but penalizes it with a lower reward if it steers too much, which helps prevent zig-zag behavior. The agent will learn to drive smoothly in the simulator and will likely display the same behavior when deployed in the physical vehicle.

def reward_function(params):
    '''
    Example of penalize steering, which helps mitigate zig-zag behaviors
    '''
    # Read input parameters
    distance_from_center = params['distance_from_center']
    track_width = params['track_width']
    abs_steering = abs(params['steering_angle']) # Only need the absolute steering angle
    # Calculate 3 markers that are farther and farther away from the center line
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width
    # Give higher reward if the car is closer to center line and vice versa
    if distance_from_center <= marker_1:
        reward = 1.0
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
        reward = 1e-3  # likely crashed/ close to off track
    # Steering penalty threshold, change the number based on your action space setting
    ABS_STEERING_THRESHOLD = 15 
    # Penalize reward if the car is steering too much
    if abs_steering > ABS_STEERING_THRESHOLD:
        reward *= 0.8
    return float(reward)

tip on how to be fast

image

https://youtu.be/wqf-dJyU_WA?si=B2DM-7RXUoc6FDNI

https://youtu.be/KBXMan0Dafw?si=YGjixuJoc7HwibZV


More ref

image from above: waypoints counterclockwise

kurtzace commented 1 month ago

A to Z Speedway: It's easier for an agent to navigate this extra-wide version of re:Invent 2018. Use it to get started with object avoidance and head-to-head race training.

Length: 16.64 m (54.59') Width: 107 cm (42")

Direction: Clockwise, Counterclockwise

kurtzace commented 1 month ago

Image-wise representation of the parameter heading

when driving anti-clockwise

heading ≈ 125 image

heading ≈ 178 image

heading ≈ -77 on the way down
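
As a concrete use of the heading parameter, here is a small reward sketch (following the standard AWS example pattern) that estimates the track direction from the two closest waypoints and penalizes the car for pointing away from it; the 10-degree threshold is an assumption:

import math

def reward_function(params):
    '''
    Sketch: penalize the car when its heading deviates from the track direction.
    '''
    waypoints = params['waypoints']
    closest_waypoints = params['closest_waypoints']
    heading = params['heading']

    # Direction of the track segment between the two closest waypoints
    prev_point = waypoints[closest_waypoints[0]]
    next_point = waypoints[closest_waypoints[1]]
    track_direction = math.degrees(math.atan2(next_point[1] - prev_point[1],
                                              next_point[0] - prev_point[0]))

    # Difference between track direction and car heading, wrapped to [0, 180]
    direction_diff = abs(track_direction - heading)
    if direction_diff > 180:
        direction_diff = 360 - direction_diff

    # Penalize the reward if the misalignment is too large
    reward = 1.0
    if direction_diff > 10.0:
        reward *= 0.5

    return float(reward)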

kurtzace commented 1 month ago

Random thoughts on what an ideal reward function could be

kurtzace commented 1 month ago

Image below: Think in terms of percentages, superimposed compass of percentage on track - clockwise image

kurtzace commented 1 month ago

clockwise waypoints: downloaded the track numpy file from this site

and plotted the waypoints of the clockwise track, as I could not find one online.

image

import matplotlib.pyplot as plt
import numpy as np
import os

# Path to the downloaded track waypoints (expand ~ so np.load can open the file)
tracksPath = os.path.expanduser('~/Downloads/reInvent2019_wide_cw.npy')

# Track name
track_name = "A to Z Speedway"

# Get waypoints from numpy file
waypoints = np.load(tracksPath)

# Get number of waypoints
print("Number of waypoints = " + str(waypoints.shape[0]))

# Plot waypoints
for i, point in enumerate(waypoints):
    waypoint = (point[2], point[3])
    plt.scatter(waypoint[0], waypoint[1])
    plt.text(waypoint[0], waypoint[1], str(i), fontsize=9, ha='right')
    print("Waypoint " + str(i) + ": " + str(waypoint))

# Display the plot
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.title(f'Waypoints for {track_name}')
plt.show()
kurtzace commented 1 month ago

Simple reward

Eval

image image

kurtzace commented 1 month ago

clockwise waypoints

better waypoints for clockwise

Evaluation with speed limits of 1.5 to 3

image image

kurtzace commented 1 month ago

percentage reward function (a code sketch follows after this list)

60% to 74% - speed of 1.5

40% to 59% - speed of 3

25% to 39% - speed of 1.5

10% to 24% - speed of 3

0% to 9% - speed of 1.5

60% to 74% - turn right

40% to 59% - follow center line (reward function above)

25% to 39% - turn right

10% to 24% - speed of 3, follow center line (reward function above), but also mild turning to the left

0% to 9% - speed of 1.5 - turn right

60% to 74% - speed of 1.5 - could be right from center line by 50%

35% to 40% - speed of 3 - could be right from center line by 50%

40% to 60% - speed of 3 - could be left from center line by 50%

25% to 39% - speed of 1.5 - could be right from center line by 50%

10% to 24% - speed of 3 - should be exactly on the center line

40% to 59% - heading could be -55 degree

10% to 24% - heading could be 103 degree

25% to 39% - heading could be 0 degree
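
A minimal sketch of this percentage-band idea, assuming the speed bands above (1.5 on the tighter sections, 3.0 elsewhere) plus a small center-line bonus; the speed tolerance and bonus weights are assumptions, and the turn/heading terms from the notes are left out for brevity:

def reward_function(params):
    '''
    Sketch: reward a target speed per progress band, plus a center-line bonus.
    Bands and target speeds follow the notes above; tolerances are assumptions.
    '''
    progress = params['progress']                  # 0-100, % of track completed
    speed = params['speed']
    distance_from_center = params['distance_from_center']
    track_width = params['track_width']

    reward = 1e-3

    # Slow bands (0-9%, 25-39%, 60-74%); fast bands elsewhere
    if progress < 10 or 25 <= progress < 40 or 60 <= progress < 75:
        target_speed = 1.5
    else:
        target_speed = 3.0

    # Reward being close to the band's target speed
    if abs(speed - target_speed) < 0.5:
        reward += 1.0

    # Small bonus for staying reasonably close to the center line
    if distance_from_center <= 0.25 * track_width:
        reward += 0.5

    return float(reward)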

Eval

image image

kurtzace commented 1 month ago

CombinedWaypointsClockwiseAndSimple

image image

Actual race day organised by my company and AWS

Below is a recording of some other team's run: https://github.com/user-attachments/assets/dc700014-aa9f-4bb1-8c80-4c478a261f60

kurtzace commented 3 weeks ago

Reinforcement learning Basics from Udemy

Build Artificial Intelligence (AI) agents using Deep Reinforcement Learning and PyTorch

State: the current situation of the environment as observed by the agent

Action: what the agent does in a given state

Reward: the scalar feedback signal received after taking an action

Agent: the learner / decision maker

Env: the world the agent interacts with

Markov Decision Process (a discrete-time, finite, stochastic control process - the future is only partially under the decision maker's control)

image

action modifies state, receives reward.

SARP (state space, actions, rewards from performing actions, probabilities of passing from state to state)

The next state visited depends only on the current state - the process has no memory

Markov decision process: many Markov chains

Finite (like Pac-Man) or infinite (car) decision process

Episodic (terminates) or continuing

Trajectory: the elements generated when the agent moves from one state to another. τ = S0, A0, R1, S1, A1, ...

Episode: a trajectory up to the final state

Reward (maximize the sum) vs Return (a short-term return may impact the long-term reward)
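
To make the S/A/R/P pieces concrete, here is a toy three-state MDP in Python (all states, actions, probabilities, and rewards are invented for illustration), with a rollout that produces a trajectory like the one above:

import random

# Toy MDP: states {0, 1, 2}, state 2 is terminal; actions {'left', 'right'}.
# P[s][a] = list of (probability, next_state, reward) - the P and R of SARP.
P = {
    0: {'left':  [(1.0, 0, 0.0)],
        'right': [(0.9, 1, 1.0), (0.1, 0, 0.0)]},
    1: {'left':  [(1.0, 0, 0.0)],
        'right': [(1.0, 2, 10.0)]},
}

def step(state, action):
    """Sample (next_state, reward) from the transition distribution."""
    r, cumulative = random.random(), 0.0
    for prob, next_state, reward in P[state][action]:
        cumulative += prob
        if r <= cumulative:
            return next_state, reward
    return next_state, reward  # fall back to the last entry

# Roll out one episode with a random policy: trajectory = S0, A0, R1, S1, A1, ...
state, trajectory = 0, [0]
while state != 2:
    action = random.choice(['left', 'right'])
    state, reward = step(state, action)
    trajectory += [action, reward, state]

print(trajectory)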

from - Tensorflow 2.0: Deep Learning and Artificial Intelligence

Unlike curve fitting, RL has a time concept - a feedback loop of images (observations) - with a view of the future, not just a static function image

Imagine if we solved the car race using supervised learning: given an image, can you give it a target?

Only goal, no target

tic tac toe analogy

• The agent interfaces with the game (via the API)

game.start()
while not game.is_over():
    state = game.getstate()
    # do something intelligent
    location = agent.pick_move(state)
    # make the move
    game.move(symbol, location)

Episode == game/round/match

non-episodic: stock trading, online ads - infinite horizons

The agent will try to maximize its reward
• E.g. -100 is better than -1 million
• -100 can still mean you've solved the game

State could be represented by 4 frames instead of 1 frame, since a single frame does not convey movement (a drawback of a CNN on a single image).

Policy param - W (shape is D x |A|)

π(a|s) = softmax(W^T s)
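
A quick numpy sketch of this linear softmax policy (the state dimension, action count, and random values are assumptions, just to show the shapes):

import numpy as np

def softmax(z):
    z = z - np.max(z)                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

D, num_actions = 4, 3                      # assumed state dimension and |A|
W = np.random.randn(D, num_actions)        # policy parameters, shape D x |A|
s = np.random.randn(D)                     # example state vector

probs = softmax(W.T @ s)                   # π(a|s) = softmax(W^T s)
a = np.random.choice(num_actions, p=probs) # sample an action from the policy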

MDP - stepping stone (state transition probability) image

Builds up before Q Learning

Dynamic system relies on opponent too

image

Reward: Maximize the sum of future gains. Not immediate gratification.

image

Discounting is used for infinite horizon

image

a reward right now has a higher preference (is worth more than the same reward later)

expected value: the mean (a distribution is summarized by its mean and standard deviation)

image

Returns are recursive image
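
For reference, the discounted return and its recursive form (standard notation; the image is not reproduced here):

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = R_{t+1} + γ G_{t+1}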

Bellman equation image

π is the policy - a probability distribution over actions - and represents the agent/animal

V is the value function for policy π

p is the environment (transition dynamics)
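
Putting those symbols together, the usual Bellman equation for the state-value function under policy π (standard form, presumably what the image above shows):

V_π(s) = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r + γ V_π(s') ]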

--

learning happens when we find the π* that maximizes V(s) - the optimal policy - the control problem.

Bellman equation for Q - the action-value function - conditioned on the given 'action'

image

V is linear and Q is quadratic
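
For completeness, the corresponding Bellman equation for the action-value function (standard form):

Q_π(s, a) = Σ_{s',r} p(s', r | s, a) [ r + γ Σ_{a'} π(a'|s') Q_π(s', a') ]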

best policy math function image

at times enumerating all policies is not possible

|A| ^ |S|
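
For example, even a small grid world with 16 states and 4 actions already has 4^16 ≈ 4.3 billion deterministic policies, so exhaustive enumeration quickly becomes infeasible.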

use sample mean

states, rewards = play_episode_using_policy(policy)  # roll out one episode (pseudocode)

returns = []
g = 0
returns.append(g)
for r in reversed(rewards):
    g = r + gamma * g
    returns.append(g)

# returns are in reverse order, reverse them back
returns = list(reversed(returns))

Note: len(states) = len(rewards) + 1 since initial state has no reward

Thus: len(states) = len(returns)

Pseudocode

Q = random, policy = random
for i in range(num_episodes):
    # replace policy evaluation with one episode only
    states, actions, rewards = play_one_episode(policy)
    returns = ...  # calculate as previously discussed

    for s, a, g in zip(states, actions, returns):
        Q[s, a] = Q[s, a] + learning_rate * (g - Q[s, a])  # Monte Carlo trick

    for s in Q.states():  # policy improvement step
        policy[s] = argmax(Q[s, :])

Balance Explore-Exploit Dilemma

Monte carlo: problem is we need to wait for terminal state