Hey. Please fill in the issue template :). If there are no errors then everything worked out correctly. Maybe try training for longer.
@Miffyli Thank you very much for your reply.
My main problem is that stable-baselines3 is not producing any results, even though the code executes successfully.
The code I am running is:
import os
import pandas as pd
from stable_baselines3 import SAC
from stable_baselines3.common.monitor import Monitor
# ZcmesEnv4 and SaveOnBestTrainingRewardCallback are defined elsewhere in my script.

# Instantiate the custom env
df = pd.read_excel('DRLmultidata.xlsx')
env = ZcmesEnv4(df, st=0, en=8700, T_episode=24)
log_dir = "/tmp/gym/SAC1"
os.makedirs(log_dir, exist_ok=True)
env = Monitor(env, log_dir)
callback = SaveOnBestTrainingRewardCallback(check_freq=1000, log_dir=log_dir)
policy_kwargs = dict(net_arch=dict(pi=[200, 300], qf=[200, 300]))
model = SAC("MlpPolicy", env, learning_rate=0.0003, buffer_size=60000, learning_starts=1500, batch_size=256,
            tau=0.005, gamma=0.99, train_freq=1, gradient_steps=1, action_noise=None, replay_buffer_class=None,
            replay_buffer_kwargs=None, optimize_memory_usage=False, ent_coef='auto', target_update_interval=1,
            target_entropy='auto', use_sde=False, sde_sample_freq=-1, use_sde_at_warmup=False, tensorboard_log=None,
            create_eval_env=False, policy_kwargs=policy_kwargs, verbose=1, seed=None, device='auto', _init_setup_model=True)
model.learn(total_timesteps=10000, callback=callback)  # log_interval=10)
model.save("sac_ZcmesEnv11")
Output:
Using cpu device
Wrapping the env in a DummyVecEnv
No other output is produced.
System information: Python 3.8, stable-baselines3 2.10.2, tensorflow 2.5.0
Thanks for your consideration.
Best Regards, Mich
Hard to say without full access to the code, but if no exceptions or errors are thrown, then training works as expected. It is likely that your logging intervals are too sparse for your training length. Try adjusting:
- total_timesteps (increase it so more episodes complete)
- log_interval (decrease it so logging happens more often)
- check_freq (decrease it so the callback checks more often)
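For example (an illustrative sketch reusing the names from your snippet; the specific values are assumptions, not tuned recommendations):

callback = SaveOnBestTrainingRewardCallback(check_freq=100, log_dir=log_dir)  # check far more often than every 1000 steps
model.learn(total_timesteps=50_000, callback=callback, log_interval=4)        # more timesteps, log every 4 episodes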
@Miffyli Thank you very much for your reply. I have tried the suggested procedure, but I still encounter the same thing (no error and no result).
The attached file contains the script and the data for the code. I would appreciate your review and suggestions.
Normally we wouldn't have time for custom tech support like this, but I happened to have a bit of time for this :)
After checking your code: you have the return statement of the step function inside a for loop:
def step(self, action):
    Time = self.Ep  # 24hrs timesteps for each episode running
    for i in range(1, Time + 1):
        ....
        if i > 24:
            done = True
        elif i < 24 or i == 24:
            done = False
        ....
        return self.state, reward, done, info  # PRR, LA, LAT, MCM
This means done=True never happens, episodes never end, and thus nothing ends up in the logs :)
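A minimal sketch of what a fixed termination condition could look like, assuming the environment advances one timestep per step() call and keeps a self.current_step counter (the counter name and the reward helper are illustrative, not taken from the original code):

def step(self, action):
    # One environment transition per call, no inner for-loop.
    self.current_step += 1
    reward = self._compute_reward(action)  # hypothetical helper holding the reward logic
    done = self.current_step >= 24         # terminate after 24 hourly timesteps
    info = {}
    return self.state, reward, done, info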
Please double-check your code. I won't have time for further tech support. You may close this issue unless bugs specific to SB3 come up.
Edit: I just realized this issue was raised in the stable-baselines repo, not in stable-baselines3 :). Take care to put issues in the right place next time.
@Miffyli Thank you very much, I have rectified the issue; the problem was the termination condition. However, I observed that I was seeing the same mean reward for all log intervals, as shown below. What may be wrong?
Your suggestion will be highly appreciated.
callback = SaveOnBestTrainingRewardCallback(check_freq=100, log_dir=log_dir)
policy_kwargs = dict(net_arch=dict(pi=[200, 300], qf=[200, 300]))
model = SAC("MlpPolicy", env, learning_rate=0.0003, buffer_size=60000, learning_starts=1500, batch_size=32,
            tau=0.005, gamma=0.99, train_freq=1, gradient_steps=1, action_noise=None, replay_buffer_class=None,
            replay_buffer_kwargs=None, optimize_memory_usage=False, ent_coef='auto', target_update_interval=1,
            target_entropy='auto', use_sde=False, sde_sample_freq=-1, use_sde_at_warmup=False, tensorboard_log=None,
            create_eval_env=False, policy_kwargs=policy_kwargs, verbose=1, seed=None, device='auto', _init_setup_model=True)
model.learn(total_timesteps=10000, callback=callback, log_interval=100)  # log_interval=10)
model.save("sac_ZcmesEnv11")
Output:
Wrapping the env in a DummyVecEnv.
Num timesteps: 100
Best mean reward: -inf - Last mean reward per episode: -202056.16
Saving new best model to /tmp/gym/SAC1\best_model
Num timesteps: 200
Best mean reward: -202056.16 - Last mean reward per episode: -202056.16
Saving new best model to /tmp/gym/SAC1\best_model
Num timesteps: 300
Best mean reward: -202056.16 - Last mean reward per episode: -202056.16
Saving new best model to /tmp/gym/SAC1\best_model
Num timesteps: 400
Best mean reward: -202056.16 - Last mean reward per episode: -202056.16
Saving new best model to /tmp/gym/SAC1\best_model
Num timesteps: 500
Best mean reward: -202056.16 - Last mean reward per episode: -202056.16
Saving new best model to /tmp/gym/SAC1\best_model
Num timesteps: 600
Best mean reward: -202056.16 - Last mean reward per episode: -202056.16
Saving new best model to /tmp/gym/SAC1\best_model
Num timesteps: 700
.......
Saving new best model to /tmp/gym/SAC1\best_model
Num timesteps: 2100
Best mean reward: -202056.16 - Last mean reward per episode: -202056.16
Saving new best model to /tmp/gym/SAC1\best_model
Num timesteps: 2200
Best mean reward: -202056.16 - Last mean reward per episode: -202056.16
Saving new best model to /tmp/gym/SAC1\best_model
Num timesteps: 2300
Best mean reward: -202056.16 - Last mean reward per episode: -202056.16
Saving new best model to /tmp/gym/SAC1\best_model
Num timesteps: 2400
Best mean reward: -202056.16 - Last mean reward per episode: -202056.16
Saving new best model to /tmp/gym/SAC1\best_model
Glad you fixed it :). You might want to keep checking that the environment code is correct, but it will take time for the agent to learn the new behaviour. One thing I see is that your reward's magnitude is way too high: the sum of rewards over an episode should not be large, otherwise learning will be unstable because of large updates. I suggest you multiply your reward by 0.001 and see if that helps. Beyond that, you need to try out different solutions to find what works.
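One way to apply such scaling without editing the environment itself is a gym RewardWrapper; this is only a hedged sketch (the 0.001 factor is just the starting point suggested above, and the wrapper name is made up):

import gym

class ScaledReward(gym.RewardWrapper):
    """Scale every reward by a constant factor to keep episode returns small."""
    def __init__(self, env, scale=0.001):
        super().__init__(env)
        self.scale = scale

    def reward(self, reward):
        return reward * self.scale

# Usage (illustrative): wrap the custom env before Monitor so the logged rewards are scaled too.
# env = Monitor(ScaledReward(ZcmesEnv4(df, st=0, en=8700, T_episode=24)), log_dir)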
Closing as resolved and "no tech support" (we can not provide extensive tech support).
Dear all,
Greetings!
I am new to stable-baselines3, but I have watched numerous tutorials on its use and on custom environment formulation.
After developing my model using gym and the stable-baselines3 SAC algorithm, I applied the check_env function to check for possible errors and everything passed. However, whenever I run the code, the only output I see is:
"Using cpu device Wrapping the env in a DummyVecEnv."
and then the training session stops without any further output and without saving the model to the directory.
What can be wrong? I have already set verbose to 1.
Best Regards, Mich