hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

DQN report. [QUESTION] #1180

Open smbrine opened 1 year ago

smbrine commented 1 year ago

Introduction My model acts like a compulsive masochist. Great beginning, innit? I will attach my parameters a bit further down, but don't treat them as fixed, because I keep changing them all the time for the following reason:

Describe the bug I have a very simple ping-pong env (a custom one, not gym) and I set up an agent without any issues except, probably, one. The potential problem is in the reward system, but even so the agent shouldn't act the way it does. My reward system is based on a simple

if not done:
    reward = 1
else:
    reward = 0

and presumably the agent should try to collect as many reward points as possible. It does, but only for the first 10k steps. None of the parameters changes this. Ofc the hyperparameters affect its performance, but nothing more. After 10k steps it starts to dodge the ball. Sometimes it scores about 5-10 points, but then it dodges for 100 episodes afterwards.

Code example I will put everything important (imo) in a single logical sequence, but I can invite you to the repo if needed. rew_mean looks like this. As you can see, it crashes after 10k. Btw, once learning_starts is reached it drops even lower, and I don't know how that's even possible. Here's one more graph.
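To help read the rew_mean curve, here is a minimal sketch of what the reward rule above implies for a single episode. It assumes that `done` is true only on the final step of an episode. `step_reward` and `episode_return` are hypothetical names used for illustration and are not part of the environment.

```python
def step_reward(done: bool) -> int:
    """Reward rule from the post: 1 for every non-terminal step, 0 on the terminal step."""
    return 0 if done else 1


def episode_return(num_steps: int) -> int:
    """Sum of rewards over one episode, assuming only the last step has done=True."""
    return sum(step_reward(done=(t == num_steps - 1)) for t in range(num_steps))


# Under this rule the return grows with how long the agent keeps the episode alive.
for length in (1, 10, 200):
    print(f"{length} steps -> return {episode_return(length)}")
```

Under that assumption the episode return is simply the episode length minus one, so a falling rew_mean means the episodes are getting shorter.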

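# Imports assumed to come from Stable-Baselines3, whose API these names match.
from stable_baselines3 import DQN
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack, VecTransposeImage

framebuffer = 5
learning_rate = 0.0001
total_timesteps = 10000000  # something like infinity. I have a callback every 5k steps.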
env = PingPongEnv()
env = DummyVecEnv([lambda: env])
env = VecTransposeImage(env)
env = VecFrameStack(env, n_stack=framebuffer)

model = DQN('CnnPolicy', env, verbose=1, tau=0.001, tensorboard_log=LOG_DIR,
            learning_rate=learning_rate, buffer_size=10000, learning_starts=100000,
            train_freq=1000, target_update_interval=20000, exploration_initial_eps=1,
            exploration_final_eps=0.00001, exploration_fraction=0.001)
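For reference, the timeline these arguments imply can be worked out by hand. This is a minimal sketch, assuming the Stable-Baselines3 behaviour in which epsilon is annealed linearly over the first `exploration_fraction * total_timesteps` steps and then held at `exploration_final_eps`. `epsilon_at` is a hypothetical helper written for this illustration, not a library function.

```python
# Values copied from the snippet above.
total_timesteps = 10_000_000
exploration_fraction = 0.001
exploration_initial_eps = 1.0
exploration_final_eps = 0.00001
learning_starts = 100_000
buffer_size = 10_000


def epsilon_at(step: int) -> float:
    """Hypothetical helper: linear anneal from the initial to the final epsilon (assumed schedule)."""
    progress = step / total_timesteps
    if progress >= exploration_fraction:
        return exploration_final_eps
    slope = (exploration_final_eps - exploration_initial_eps) / exploration_fraction
    return exploration_initial_eps + progress * slope


# Step at which epsilon reaches its floor.
exploration_end_step = round(exploration_fraction * total_timesteps)
# Steps spent acting almost greedily before the first gradient update.
greedy_steps_before_learning = learning_starts - exploration_end_step
# Transitions already overwritten in the replay buffer when learning begins.
transitions_overwritten = max(0, learning_starts - buffer_size)

for step in (0, 5_000, 10_000, 100_000):
    print(f"step {step:>7}: epsilon = {epsilon_at(step):.5f}")
print("exploration ends at step", exploration_end_step)
print("near-greedy steps before learning starts:", greedy_steps_before_learning)
print("transitions overwritten before learning starts:", transitions_overwritten)
```

Under these assumptions epsilon reaches its floor at step 10,000, the same step count at which the reward drops, while gradient updates do not begin until step 100,000, by which point the buffer holds only the most recent 10,000 transitions. That may be related to the behaviour described, though the sketch alone does not establish a cause.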

System Info Describe the characteristic of your environment:

Additional context The ping-pong game is written with arcade by my brother, but I'm not sure that's useful info because I'm not diving into his code; I use direct input instead. I use win32gui to grab images, and it returns about 150-200 images per second, so it's definitely not the problem.