ldoshi / rome-wasnt-built-in-a-day

Investigate epsilon and sweep hyperparameters for DQN #213

Open · josephmaa opened this issue 5 months ago

josephmaa commented 5 months ago

Trying to debug larger-width environments (currently width 7).

Things to try:

  1. Different metric: the average Q-value from the 2013 DQN paper (https://arxiv.org/pdf/1312.5602.pdf), section 5.1 "Training and Stability" (see the sketch after this list):

     > In supervised learning, one can easily track the performance of a model during training by evaluating it on the training and validation sets. In reinforcement learning, however, accurately evaluating the progress of an agent during training can be challenging. Since our evaluation metric, as suggested by [3], is the total reward the agent collects in an episode or game averaged over a number of games, we periodically compute it during training. The average total reward metric tends to be very noisy because small changes to the weights of a policy can lead to large changes in the distribution of states the policy visits. The leftmost two plots in figure 2 show how the average total reward evolves during training on the games Seaquest and Breakout. Both averaged reward plots are indeed quite noisy, giving one the impression that the learning algorithm is not making steady progress. Another, more stable, metric is the policy's estimated action-value function Q, which provides an estimate of how much discounted reward the agent can obtain by following its policy from any given state. We collect a fixed set of states by running a random policy before training starts and track the average of the maximum predicted Q for these states.
  2. Run training for longer; they (https://arxiv.org/pdf/1312.5602.pdf) ran for 10,000 epochs.

  3. Decrease batch size?

  4. Re-read section 5 of the 2015 paper / copy configs.
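
A minimal sketch of the average-max-Q metric from item 1, assuming the old Gym step API and a PyTorch Q-network; `env`, `q_network`, and the helper names here are hypothetical stand-ins for the project's actual objects:

```python
import torch


def collect_eval_states(env, num_states=128):
    # Collect a fixed set of states once, before training, by running a random policy.
    states = []
    obs = env.reset()
    while len(states) < num_states:
        states.append(torch.as_tensor(obs, dtype=torch.float32))
        obs, reward, done, info = env.step(env.action_space.sample())
        if done:
            obs = env.reset()
    return torch.stack(states)


@torch.no_grad()
def average_max_q(q_network, eval_states):
    # Average of max_a Q(s, a) over the fixed evaluation states; compute this periodically.
    q_values = q_network(eval_states)  # shape: (num_states, num_actions)
    return q_values.max(dim=1).values.mean().item()
```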

ldoshi commented 4 months ago

what if we change epsilon across the episode? (look for papers)
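
One hypothetical way this could look, as a sketch only (names and values are made up): anneal epsilon within the episode so early moves explore more and late moves are nearly greedy.

```python
def epsilon_for_step(step_in_episode, max_episode_steps, eps_start=1.0, eps_end=0.05):
    # Linearly anneal epsilon from eps_start at the first move of the episode
    # down to eps_end at the last move.
    frac = min(step_in_episode / max_episode_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```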

ldoshi commented 4 months ago

Clean up graph logging. Finish implementing the average Q metric.

ldoshi commented 4 months ago

Overfit with very few states, e.g. only the relevant states needed to finish with width 5 or 6. No epsilon. Basically a supervised problem. Debug.

josephmaa commented 4 months ago

Add epsilon to logging so we can see its decrease over steps.
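
A minimal sketch of the logging itself, assuming plain TensorBoard (the thread later runs tensorboard against the checkpoints directory); the log directory and tag names are placeholders:

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("checkpoints/epsilon_debug")  # placeholder log dir


def log_epsilon(global_step, epsilon):
    # One scalar per training step; the decay curve then shows up next to the loss.
    writer.add_scalar("train/epsilon", epsilon, global_step=global_step)
```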

josephmaa commented 4 months ago
(Screenshot attached: "Screen Shot 2024-03-04 at 11 06 49 PM")
ldoshi commented 4 months ago

Try setting gamma pretty low.

Set G_T = 0 or something constant; see Sutton and Barto, pg. 77, under eq. (9). (Sketch below.)

If this works, try this again with PPO.
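
A sketch of that debugging step, assuming a standard one-step TD loss; `q_network` and the batch layout are hypothetical. Replacing the bootstrapped target with a constant G_T (and/or setting gamma to 0) turns the update into plain regression, which should be easy to overfit:

```python
import torch
import torch.nn.functional as F

GAMMA = 0.0            # "pretty low" gamma for the experiment
CONSTANT_RETURN = 0.0  # the constant G_T target


def td_loss(q_network, batch, use_constant_target=True):
    states, actions, rewards, next_states, dones = batch
    q_sa = q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    if use_constant_target:
        # Regress Q(s, a) toward a fixed constant; if even this fails, the bug
        # is in the optimization/plumbing, not in the RL target.
        target = torch.full_like(q_sa, CONSTANT_RETURN)
    else:
        with torch.no_grad():
            next_q = q_network(next_states).max(dim=1).values
        target = rewards + GAMMA * next_q * (1.0 - dones.float())
    return F.mse_loss(q_sa, target)
```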

josephmaa commented 4 months ago

Add episode length to environment

ldoshi commented 4 months ago
  1. Use the episode length in the env to set the done bit, OR
  2. Run until n bricks are in the actual state (not just attempted), AND set the done bit when the agent is stuck in a repeated state, giving a fixed, very negative reward. (Sketch below.)
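
A sketch combining both options (hypothetical, assuming the old Gym API and array observations); `max_steps` and the penalty value are placeholders:

```python
import gym


class EpisodeLimitWrapper(gym.Wrapper):
    def __init__(self, env, max_steps=100, repeat_penalty=-100.0):
        super().__init__(env)
        self.max_steps = max_steps
        self.repeat_penalty = repeat_penalty

    def reset(self, **kwargs):
        self.steps = 0
        self.seen_states = set()
        obs = self.env.reset(**kwargs)
        self.seen_states.add(obs.tobytes())
        return obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.steps += 1
        key = obs.tobytes()
        if key in self.seen_states:
            # Option 2: stuck in a repeated state -> fixed, very negative reward.
            reward, done = self.repeat_penalty, True
        self.seen_states.add(key)
        if self.steps >= self.max_steps:
            # Option 1: episode length limit reached.
            done = True
        return obs, reward, done, info
```
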
ldoshi commented 4 months ago

https://arxiv.org/abs/2109.15316

josephmaa commented 4 months ago

Decide whether to pursue Monte Carlo tree search; not sure if it's SOTA.

  1. https://www.andrew.cmu.edu/course/10-703/textbook/BartoSutton.pdf
  2. https://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_9_model_based_rl.pdf
  3. https://discovery.ucl.ac.uk/id/eprint/10045895/1/agz_unformatted_nature.pdf
  4. https://jonathan-hui.medium.com/monte-carlo-tree-search-mcts-in-alphago-zero-8a403588276a
  5. https://gibberblot.github.io/rl-notes/single-agent/mcts.html try reading this
  6. https://arxiv.org/abs/2109.15316
josephmaa commented 3 months ago

A very nice article on MCTS: https://joshvarty.github.io/AlphaZero/

ldoshi commented 3 months ago

Detailed MCTS discussion: https://link.springer.com/article/10.1007/s10462-022-10228-y#Sec37

ldoshi commented 3 months ago

Look up more about DQN + planning. Look into multi-head model training. Debug why changing the epsilon decay rate doesn't affect train loss.

ldoshi commented 3 months ago

It seems that train_dataloader is making a copy of the replay_buffer, so the replay buffer does not get any updates.

We observed a bifurcation: sample claims the sumtree never changes after the initial memory count (in on_train_start), but set_value is also being called and observes a different total_tree_value.

The total tree value in sample matches the number of initial memories created.

... it's possible this has been happening since always and we've never online RL-ed anything O_o. We have yet to find exactly where this happens (we're looking at the dataloader code) and to come up with a remedy. Once iter() is called on our dataloader, we do a while True, so it never refreshes from the underlying structure. We need to figure out how to refresh that iterator or something...
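
One possible remedy, sketched under assumptions: wrap the replay buffer in an IterableDataset that pulls from the live buffer on every iteration, and keep num_workers=0 so the DataLoader never copies the buffer into worker processes (with num_workers > 0, each worker gets its own copy of the dataset, which could produce exactly this kind of bifurcation). The sample(batch_size) signature and the function names here are guesses at the project's API, not the actual implementation:

```python
from torch.utils.data import DataLoader, IterableDataset


class ReplayBufferDataset(IterableDataset):
    def __init__(self, replay_buffer, batch_size):
        self.replay_buffer = replay_buffer
        self.batch_size = batch_size

    def __iter__(self):
        # Sample from the shared, live buffer on every next() call, so memories
        # added after iter() is first called are still seen.
        while True:
            yield self.replay_buffer.sample(self.batch_size)


def make_train_dataloader(replay_buffer, batch_size):
    # batch_size=None: the dataset already yields full batches.
    # num_workers=0: keep the buffer in the main process instead of copying it.
    return DataLoader(
        ReplayBufferDataset(replay_buffer, batch_size),
        batch_size=None,
        num_workers=0,
    )
```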

josephmaa commented 3 months ago

Think about MCTS replacing the policy: for a given state, expand the MCTS tree. Re-use trees after selecting an action. Q-values get used at model predict during the MCTS run. The current policy is stateless, but we may have to add state to the policy in order to spit out probabilities. MCTS back-propagation happens at a completed trajectory.
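
A tiny sketch of the tree re-use idea, assuming a hypothetical Node class with children keyed by action and a parent pointer:

```python
def reuse_subtree(root, chosen_action):
    # After acting, the child for the chosen action becomes the new root,
    # keeping its visit counts and value estimates; otherwise start fresh.
    child = root.children.get(chosen_action)
    if child is None:
        return None
    child.parent = None  # detach from the old tree
    return child
```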

josephmaa commented 2 months ago
  1. Calculate action probabilities based on the MCTS rollout as the target; the empirical rollout acts as ground truth for calculating the loss (see the sketch after this list).
  2. Consider whether the TD-error calculation can still be used, since we don't use Q(s, a) for the calculation; we only take the value for a state across all actions.
  3. Do we need to use MCTS? We might need to wrangle the code to get it working, especially with the replay buffer.
  4. Think about how the network now outputs a probability distribution, not Q-values, and how that affects the policy method get_probabilities. We might be able to just overload/overwrite that function.
  5. Do we want to learn about MCTS for fun (like PPO), or do we think it'll actually be applicable to the problem?
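
A sketch of item 1 above (AlphaZero-style targets; all names are hypothetical): turn the root visit counts from the MCTS rollout into a target action distribution and train the policy head with cross-entropy against it.

```python
import torch.nn.functional as F


def mcts_action_probabilities(root_visit_counts, temperature=1.0):
    # root_visit_counts: tensor of N(s, a) for each action at the root.
    counts = root_visit_counts.float() ** (1.0 / temperature)
    return counts / counts.sum()


def policy_loss(policy_logits, mcts_probs):
    # Cross-entropy between the network's action distribution and the MCTS target.
    log_probs = F.log_softmax(policy_logits, dim=-1)
    return -(mcts_probs * log_probs).sum(dim=-1).mean()
```
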
ldoshi commented 2 months ago

Idea: deterministic Q-learning policy; change the environment with real or 'temp' obstructions to make the model build around them, then remove the obstructions to get interesting structures/variety.

The policy seems to train with a reward_scale of 1 but not 10 (i.e. smaller rewards). What's going on?

josephmaa commented 2 months ago

Try running bridge_builder.py --inter-training-steps=0 --initialize-replay-buffer-strategy=single_column --debug --max-training-batches=7000 --initial-memories-count=0 with REWARD_SCALE = [1, 10], and modify the if-statement block = [-1, big if-statement]. REWARD_SCALE = 1 with if block = -1 is working.

ldoshi commented 2 months ago

Why are the Q-values the same but not the TD errors when reward_scale = 10?

josephmaa commented 1 month ago

To run TensorBoard on chimaera: sudo tensorboard --logdir lyric/rome-wasnt-built-in-a-day/checkpoints --bind_all --port 443

josephmaa commented 1 month ago

For a faster run: time bridge_builder.py --debug --env-width=8 --max-training-batches=20000 --lr=.002 > /dev/null (also set num-workers to 0).

ldoshi commented 1 month ago

lyric: get the experiment runner multiprocessing test fixed; rerun width 8 on chimaera.

ldoshi commented 1 month ago

Read the Go-Explore paper and figure out which of its ideas we should try first and second.

josephmaa commented 1 month ago

Read the Go-Explore paper and figure out which of its ideas we should try first and second:

https://lilianweng.github.io/posts/2020-06-07-exploration-drl/

https://arxiv.org/pdf/1901.10995

https://github.com/uber-research/go-explore

josephmaa commented 1 month ago

Metadata: the number of times a cell has been chosen, and the number of times it has been chosen since it last led to the discovery of a new cell.
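
A hedged sketch of how that metadata could feed cell selection, loosely inspired by Go-Explore's count-based weights (this is a simplification, not the paper's exact formula); all names are hypothetical:

```python
import math
import random
from dataclasses import dataclass


@dataclass
class CellStats:
    times_chosen: int = 0
    times_chosen_since_new: int = 0


def selection_weight(stats):
    # Prefer cells chosen rarely overall, and cells that recently led to new cells.
    return 1.0 / math.sqrt(stats.times_chosen + 1) + 1.0 / (stats.times_chosen_since_new + 1)


def choose_cell(archive):
    # archive: dict mapping cell -> CellStats
    cells = list(archive.keys())
    weights = [selection_weight(archive[cell]) for cell in cells]
    return random.choices(cells, weights=weights, k=1)[0]
```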

ldoshi commented 1 month ago

https://github.com/ldoshi/rome-wasnt-built-in-a-day/pull/219

Bug in gym-bridges bfs_helper (just run go_explore_phase_1.py)

ldoshi commented 6 days ago

https://github.com/opendilab/awesome-exploration-rl

ldoshi commented 3 days ago

Experimenting with width 8 (widths 4 and 6 worked well). Need to add support for multiple success entries; add jitter.

The inter-training-steps value makes a big difference (smaller helped); consider running a sweep to characterize? (See the sketch below.)

Integrate the go-explore generation and training scripts.
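
A possible sweep sketch over --inter-training-steps (the flag and script names are taken from earlier in this thread; the value grid is made up):

```python
import subprocess

# Hypothetical value grid; adjust width and other flags to match the experiment.
for steps in [0, 10, 50, 100, 500]:
    subprocess.run(
        [
            "python",
            "bridge_builder.py",
            "--debug",
            "--env-width=8",
            f"--inter-training-steps={steps}",
        ],
        check=True,
    )
```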