What if we change epsilon across the episode? (Look for papers.)
Clean up graph logging. Finish implementing the average Q metric.
Overfit with very few states, e.g. only the relevant states needed to finish with width 5 or 6. No epsilon; basically a supervised problem. Debug.
Add epsilon to logging so we can see it decrease over steps.
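A tiny sketch of both ideas (changing epsilon over time and logging it), assuming a simple exponential per-step decay; the names epsilon_start, epsilon_end, and decay_steps are placeholders, not existing config options:

```python
import math

def epsilon_at(step: int, epsilon_start: float = 1.0, epsilon_end: float = 0.05,
               decay_steps: int = 10_000) -> float:
    """Exponentially anneal epsilon from epsilon_start toward epsilon_end."""
    return epsilon_end + (epsilon_start - epsilon_end) * math.exp(-step / decay_steps)

# Log it each step so the decrease shows up in TensorBoard, e.g. with a
# Lightning-style call: self.log("epsilon", epsilon_at(self.global_step)).
# Driving the same schedule with the episode step instead of the global step
# would give the "change epsilon across the episode" variant.
```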
Try a pretty low gamma.
Set G_T = 0 or some constant. See Sutton and Barto, p. 77, under eq (9). (See the sketch below.)
If this works, try it again with PPO.
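A minimal sketch of the constant-G_T idea, assuming a plain list of per-step rewards; terminal_value is a hypothetical parameter, not something that exists in our code:

```python
from typing import List

def discounted_returns(rewards: List[float], gamma: float,
                       terminal_value: float = 0.0) -> List[float]:
    """Compute G_t = R_{t+1} + gamma * G_{t+1} backwards over an episode,
    fixing G_T to a constant (0 by default), per the Sutton & Barto return."""
    returns = [0.0] * len(rewards)
    g = terminal_value
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Example: discounted_returns([1.0, 0.0, 2.0], gamma=0.9) -> [2.62, 1.8, 2.0]
```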
Add episode length to environment
Decide whether to pursue Monte Carlo tree search; not sure if it's SOTA.
A very nice article on MCTS: https://joshvarty.github.io/AlphaZero/
A detailed discussion of MCTS: https://link.springer.com/article/10.1007/s10462-022-10228-y#Sec37
Look up more about DQN + planning.
Look into multi-head model training.
Debug why changing the epsilon decay rate doesn't affect train loss.
It seems that train_dataloader is making a copy of the replay_buffer, so the replay buffer never receives updates.
Specifically, we observed a bifurcation: sample sees a sumtree that never changes after the initial memory count (set in on_train_start), while set_value is also being called and observes a different total_tree_value.
The total tree value seen in sample matches the number of initial memories created.
It's possible this has been happening all along and we've never actually done online RL on anything. We have yet to find exactly where the copy happens (we're looking at the dataloader code) and to come up with a remedy: once iter() is called on our dataloader, we sit in a while True loop, so it never refreshes from the underlying structure. We need to figure out how to refresh that iterator.
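A minimal sketch of one possible remedy, assuming the replay buffer exposes a sample(batch_size) method; the class here is hypothetical, not our actual code. The key point is that the dataset holds a reference to the live buffer and re-samples inside the generator, rather than snapshotting the buffer when the dataloader is built:

```python
from torch.utils.data import IterableDataset, DataLoader

class ReplayBufferDataset(IterableDataset):
    """Streams batches from a live replay buffer.

    Holding a reference (not a copy) to the buffer means memories added
    during training are visible the next time the generator samples.
    """

    def __init__(self, replay_buffer, batch_size: int):
        self.replay_buffer = replay_buffer  # reference, not a snapshot
        self.batch_size = batch_size

    def __iter__(self):
        while True:
            # Re-sample from the buffer on every step so updates are seen.
            yield self.replay_buffer.sample(self.batch_size)

# Note: with num_workers > 0, each DataLoader worker process gets its own copy
# of the dataset (and therefore of the buffer), which would reproduce the
# "stale buffer" symptom; num_workers=0 keeps everything in the main process.
# loader = DataLoader(ReplayBufferDataset(buffer, 64), batch_size=None, num_workers=0)
```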
Think about MCTS replacing the policy: for a given state, expand the MCTS tree, and re-use the tree after selecting an action. Q values get used in the model predict step during the MCTS run. The current policy is stateless, but we may have to add state to the policy in order for it to spit out probabilities. MCTS back-propagation happens on a completed trajectory. (Rough sketch after these notes.)
get_probabilities: we might be able to just overload the function and overwrite it.
Idea: with a deterministic Q-learning policy, change the environment with real or 'temp' obstructions to make the model build around them, then remove the obstructions to get interesting structures/variety.
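A very rough sketch of how MCTS could replace get_probabilities, assuming a q_network(state) callable that returns per-action Q values and a simulator(state, action) step function; all names here (Node, run_mcts, simulator) are hypothetical placeholders, not our actual API. As a simplification, this backs up a bootstrapped Q estimate at the leaf rather than waiting for a completed trajectory:

```python
import math
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Node:
    state: object
    parent: Optional["Node"] = None
    children: Dict[int, "Node"] = field(default_factory=dict)
    visit_count: int = 0
    value_sum: float = 0.0

    @property
    def value(self) -> float:
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def ucb_score(parent: Node, child: Node, c: float = 1.4) -> float:
    # Standard UCT: exploit the child's mean value, explore rarely visited children.
    explore = c * math.sqrt(math.log(parent.visit_count + 1) / (child.visit_count + 1))
    return child.value + explore

def run_mcts(root_state, q_network, simulator, n_actions: int,
             num_simulations: int = 50) -> List[float]:
    """Return visit-count-based action probabilities for root_state."""
    root = Node(root_state)
    for _ in range(num_simulations):
        node = root
        # Selection: descend while the node is fully expanded.
        while len(node.children) == n_actions:
            _, node = max(node.children.items(),
                          key=lambda kv: ucb_score(kv[1].parent, kv[1]))
        # Expansion: add one unexpanded child.
        untried = [a for a in range(n_actions) if a not in node.children]
        action = random.choice(untried)
        child = Node(simulator(node.state, action), parent=node)
        node.children[action] = child
        node = child
        # Evaluation: use the max Q value as the leaf estimate.
        leaf_value = max(q_network(node.state))
        # Back-propagation up to the root.
        while node is not None:
            node.visit_count += 1
            node.value_sum += leaf_value
            node = node.parent
    visits = [root.children[a].visit_count if a in root.children else 0
              for a in range(n_actions)]
    total = sum(visits) or 1
    return [v / total for v in visits]
```

Re-using the tree after selecting an action would just mean promoting root.children[action] to be the next root.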
The policy seems to train with a reward_scale of 1 but not 10 (i.e. smaller rewards). What's going on?
Try running bridge_builder.py --inter-training-steps=0 --initialize-replay-buffer-strategy=single_column --debug --max-training-batches=7000 --initial-memories-count=0
with REWARD_SCALE in [1, 10] and the modified if-statement block in [-1, big if-statement]. REWARD_SCALE = 1 with if block = -1 is working.
Why are the Q values the same but the TD errors different when reward_scale = 10?
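A small sanity-check sketch of that question, using a standard one-step TD target; the function and the batch layout here are illustrative assumptions, not our actual code:

```python
import torch

def td_error(q_net, target_net, batch, gamma: float, reward_scale: float) -> torch.Tensor:
    """One-step TD error: r + gamma * max_a' Q_target(s', a') - Q(s, a)."""
    states, actions, rewards, next_states, dones = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
    targets = reward_scale * rewards + gamma * next_q * (1 - dones.float())
    return targets - q_sa

# If q_net and target_net are freshly initialized, q_sa and next_q are roughly
# the same whether reward_scale is 1 or 10, but the targets (and hence the TD
# errors) scale with the rewards, which would match the observation above.
```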
To run TensorBoard on chimaera:
sudo tensorboard --logdir lyric/rome-wasnt-built-in-a-day/checkpoints --bind_all --port 443
time bridge_builder.py --debug --env-width=8 --max-training-batches=20000 --lr=.002 > /dev/null
For a faster run, also set num-workers to 0.
lyric: get the experiment runner multiprocessing test fixed; rerun width 8 on chimaera.
Read the Go-Explore paper and figure out which ideas we should try from it first and second.
https://lilianweng.github.io/posts/2020-06-07-exploration-drl/
Cell metadata: the number of times the cell has been chosen, and the number of times it has been chosen since last leading to the discovery of a new cell.
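A minimal sketch of how that metadata could drive cell selection, loosely in the spirit of Go-Explore's count-based selection weights; the CellStats fields and the weight formula are assumptions for illustration, not the paper's exact scheme:

```python
import random
from dataclasses import dataclass

@dataclass
class CellStats:
    times_chosen: int = 0
    times_chosen_since_new: int = 0  # reset when choosing this cell leads to a new cell

def selection_weight(stats: CellStats, eps: float = 1e-3) -> float:
    # Prefer cells chosen rarely overall, and especially cells that haven't
    # been chosen much since they last produced a discovery.
    return 1.0 / (stats.times_chosen + eps) + 1.0 / (stats.times_chosen_since_new + eps)

def choose_cell(archive: dict):
    """archive maps cell key -> CellStats; sample a cell proportionally to its weight."""
    keys = list(archive)
    weights = [selection_weight(archive[k]) for k in keys]
    return random.choices(keys, weights=weights, k=1)[0]
```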
https://github.com/ldoshi/rome-wasnt-built-in-a-day/pull/219
Bug in gym-bridges bfs_helper (just run go_explore_phase_1.py)
Experimenting with width 8 (widths 4 and 6 worked well). Need to add support for multiple success entries and add jitter.
The inter-training-steps value makes a big difference; smaller helped. Consider running a sweep to characterize it?
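One possible shape for such a sweep, reusing the bridge_builder.py flags quoted earlier in this thread; the particular step values (and invoking the script directly) are assumptions:

```python
import subprocess

# Hypothetical sweep over inter-training-steps to characterize its effect;
# flag names match the command above, values are placeholders.
for steps in [0, 1, 4, 16, 64]:
    subprocess.run(
        [
            "bridge_builder.py",
            f"--inter-training-steps={steps}",
            "--initialize-replay-buffer-strategy=single_column",
            "--max-training-batches=7000",
            "--debug",
        ],
        check=True,
    )
```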
Integrate the scripts for go-explore generation and training usage.
Trying to debug larger width environments (7 currently).
Things to try:
Run training for longer; they (https://arxiv.org/pdf/1312.5602.pdf) ran for 10,000 epochs.
Decrease batch size?
Re-read Section 5 of the 2015 paper / copy its configs.