Test the gym env:
Use the original data to test the env autoregressively
Choose 5 trips, transform the data, feed it to the model, transform the outputs back to the original space, and use them as the input for the next prediction
Compare by plotting the trajectories and computing the MSE for 5-10 different routes (a rollout sketch follows this list)
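A minimal sketch of this autoregressive test, assuming an sklearn-style scaler (transform/inverse_transform), a one-step model.predict, and trips stored as arrays with one raw state per row; all of these names are assumptions, not the project's actual interface:

```python
import numpy as np

def autoregressive_rollout(model, scaler, trip):
    """Roll the env model forward on one trip, feeding each
    inverse-transformed prediction back in as the next input."""
    preds = []
    x = trip[0]                                   # recorded initial state
    for _ in range(len(trip) - 1):
        z = scaler.transform(x.reshape(1, -1))    # raw -> model space
        z_next = model.predict(z)                 # one-step prediction
        x = scaler.inverse_transform(z_next)[0]   # model -> raw space
        preds.append(x)
    return np.array(preds)

def trip_mse(model, scaler, trip):
    """MSE between the autoregressive rollout and the recorded trip."""
    preds = autoregressive_rollout(model, scaler, trip)
    return float(np.mean((preds - trip[1:]) ** 2))

# e.g. mses = [trip_mse(model, scaler, t) for t in trips[:5]]
```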
Test the policy optimizer:
Use normal trips only
Input: X(t-torque, t) + disturbance
Rewards: R(t) vs. cumulative vs. R(t-torque, t)
Using R(t) should be enough for now
Done reward: give a large reward (3 * the normal per-step reward) at done
Termination: terminate at a fixed time step (e.g., 125)
Epoch size: include at least 50 trips per epoch; consider mini-batches to speed up training
Compute the average overall reward (discounted with gamma)
Train at least five times, then run supervised learning
Look at the overall reward to see how well the training works (see the rollout/return sketch after this list)
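A minimal sketch of one training epoch and the discounted overall reward, assuming a gym-style env with the classic 4-tuple step() API; GAMMA, the policy interface, and the "normal" reward scale for the done bonus (here the mean absolute per-step reward) are assumptions:

```python
import numpy as np

GAMMA = 0.99       # discount factor (assumed value)
MAX_STEPS = 125    # fixed termination time, per the notes
DONE_BONUS = 3.0   # done reward = 3 * the normal per-step reward

def discounted_return(rewards, gamma=GAMMA):
    """Overall reward of one trip, discounted with gamma."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def run_epoch(env, policy, n_trips=50):
    """Collect at least 50 trips and return the average overall reward."""
    returns = []
    for _ in range(n_trips):
        obs, rewards = env.reset(), []
        for _ in range(MAX_STEPS):
            obs, r, done, _ = env.step(policy(obs))  # per-step reward R(t)
            rewards.append(r)
            if done:
                # large terminal bonus: 3 * the typical per-step magnitude
                rewards[-1] += DONE_BONUS * float(np.mean(np.abs(rewards)))
                break
        returns.append(discounted_return(rewards))
    return float(np.mean(returns))
```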
Curriculum learning:
Stage 1: imitation learning
Mimic the recorded trips
About 7 epochs
Stage 2:
Compare with the best 1% of trips for the reward
About 7 epochs
Stage 3:
Conceptual rewards only (the fuel-consumption (fc) reward and the time reward); a stage-scheduling sketch follows
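A minimal sketch of the three-stage schedule and per-stage rewards; the info field names, the stage-2 shaping, and the 7-epoch stage lengths are assumptions loosely following the notes:

```python
def curriculum_reward(stage, info):
    """Stage-dependent reward; all info fields are hypothetical."""
    if stage == 1:
        # imitation: penalize deviation from the logged expert action
        return -abs(info["action"] - info["expert_action"])
    if stage == 2:
        # one possible shaping: per-step reward relative to the
        # best-1%-trip benchmark
        return info["step_reward"] - info["best_1pct_step_reward"]
    # stage 3: conceptual rewards only (fuel consumption + time)
    return -(info["fc"] + info["time_cost"])

STAGE_SCHEDULE = [(1, 7), (2, 7), (3, 7)]  # (stage, ~epochs)

def stage_for_epoch(epoch):
    """Map a 0-indexed training epoch to its curriculum stage."""
    boundary = 0
    for stage, n_epochs in STAGE_SCHEDULE:
        boundary += n_epochs
        if epoch < boundary:
            return stage
    return STAGE_SCHEDULE[-1][0]
```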
Evaluate the model:
Choose one of the top-10-percent trips and compare it step by step to see whether the optimizer works
Compute the average reward over all trips in the dataset at different epochs and compare them (see the sketch below)
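A minimal sketch of both evaluation steps; evaluate_trip stands in for whatever rollout-and-score routine the project already has, and the top-decile cut assumes logged per-trip overall rewards are available:

```python
import numpy as np

def average_reward(evaluate_trip, trips):
    """Mean overall reward over every trip in the dataset; run this
    with policies from different epochs and compare the numbers."""
    return float(np.mean([evaluate_trip(t) for t in trips]))

def top_decile_indices(trip_returns):
    """Indices of the top 10% of trips by logged overall reward;
    pick one of these for the step-by-step comparison."""
    cutoff = np.percentile(trip_returns, 90)
    return [i for i, g in enumerate(trip_returns) if g >= cutoff]
```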
Schedule:
Get the results of env testing by the end of today
Get the optimization results ready on Monday, then meet to discuss them