hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

Unable to see stable-baselines output #1151

Closed: Michael-HK closed this issue 2 years ago

Michael-HK commented 2 years ago

Dear all,

Greetings!

I am new to stable-baselines3, but I have watched numerous tutorials on how to use it and how to build custom environments.

After developing my model with gym and the stable-baselines3 SAC algorithm, I ran the check_env function to look for possible errors, and everything passed. However, whenever I run the code, the only output I see is:

"Using cpu device Wrapping the env in a DummyVecEnv."

The training session then stops without producing any further output or saving the model to the directory.

What could be wrong? I have already set verbose to 1.
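
(For reference, the check_env call mentioned above comes from the SB3 env checker; a minimal sketch of how it is typically run, assuming env is the custom environment instance:)

    from stable_baselines3.common.env_checker import check_env

    # Warns or raises if the custom environment does not follow the Gym API
    check_env(env, warn=True)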

Best Regards, Mich

Miffyli commented 2 years ago

Hey. Please fill in the issue template :). If there are no errors then everything worked out correctly. Maybe try training for longer.

Michael-HK commented 2 years ago

@Miffyli Thank you very much for your reply.

My main problem is that stable-baselines3 is not producing any results, even though the code executes successfully.

The execution code is:

    import os
    import pandas as pd
    from stable_baselines3 import SAC
    from stable_baselines3.common.monitor import Monitor

    # Instantiate the custom env
    df = pd.read_excel('DRLmultidata.xlsx')
    env = ZcmesEnv4(df, st=0, en=8700, T_episode=24)

    # Create and wrap the environment
    log_dir = "/tmp/gym/SAC1"
    os.makedirs(log_dir, exist_ok=True)

    # Logs will be saved in log_dir/monitor.csv
    env = Monitor(env, log_dir)
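
(As a side note, not part of the original post: one way to check afterwards whether any episodes were actually recorded in log_dir/monitor.csv is via the SB3 helpers used in the docs callback example; a minimal sketch:)

    from stable_baselines3.common.results_plotter import load_results, ts2xy

    # load_results reads log_dir/monitor.csv into a DataFrame;
    # ts2xy turns it into arrays of (timesteps, episode rewards).
    x, y = ts2xy(load_results(log_dir), "timesteps")
    print(f"{len(y)} episodes logged so far")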

Stable-baselines3 execution code:

    callback = SaveOnBestTrainingRewardCallback(check_freq=1000, log_dir=log_dir)

    policy_kwargs = dict(net_arch=dict(pi=[200, 300], qf=[200, 300]))

    model = SAC("MlpPolicy", env, learning_rate=0.0003, buffer_size=60000,
                learning_starts=1500, batch_size=256, tau=0.005, gamma=0.99,
                train_freq=1, gradient_steps=1, action_noise=None,
                replay_buffer_class=None, replay_buffer_kwargs=None,
                optimize_memory_usage=False, ent_coef='auto',
                target_update_interval=1, target_entropy='auto', use_sde=False,
                sde_sample_freq=-1, use_sde_at_warmup=False, tensorboard_log=None,
                create_eval_env=False, policy_kwargs=policy_kwargs, verbose=1,
                seed=None, device='auto', _init_setup_model=True)

    model.learn(total_timesteps=10000, callback=callback)  # log_interval=10
    model.save("sac_ZcmesEnv11")

Output:

    Using cpu device
    Wrapping the env in a DummyVecEnv.

There is no other output after this.

System information:

- Python 3.8
- stable-baselines3 2.10.2
- tensorflow 2.5.0

Thanks for your consideration.

Best Regards, Mich

Miffyli commented 2 years ago

Hard to say without full access to the code, but if no exceptions or errors are thrown, then training is working as expected. It is likely that your logging intervals are too sparse for your training length. Try logging more often, for example:
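
(The exact snippet was not preserved in this thread; a minimal sketch of the kind of change meant here, reusing the callback and model from the post above:)

    # For off-policy algorithms such as SAC, log_interval counts episodes,
    # so a small value prints training stats much more often.
    callback = SaveOnBestTrainingRewardCallback(check_freq=100, log_dir=log_dir)
    model.learn(total_timesteps=10000, callback=callback, log_interval=1)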

Michael-HK commented 2 years ago

@Miffyli Thank you very much for your reply. I have tried the suggested procedure, but I still see the same behaviour (no error and no result).

The attached file contains the script and the data for the code. I would appreciate your review and suggestions.

trial code.zip

Miffyli commented 2 years ago

Normally we wouldn't have time for custom tech support like this, but I happened to have a bit of time :)

After checking your code, I see that the return statement in your step function is inside a for loop:

    def step(self, action):
        Time = self.Ep                     #24hrs timesteps for each episode running
        for i in range(1, Time + 1):
            ....
            if i > 24:
                done = True
            elif i < 24 or i == 24:
                done = False
            ....
            return self.state, reward, done, info # PRR, LA, LAT, MCM

Because the return is inside the loop, step returns on the very first iteration, so done=True never happens, episodes never end, and thus nothing ends up in the logs :)
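
(For illustration only, a hypothetical restructuring rather than the code from the attached script: advance one timestep per step() call and keep an episode counter on the environment, so done eventually becomes True. All names, spaces, and reward terms below are made up.)

    import gym
    import numpy as np
    from gym import spaces

    class EpisodicEnvSketch(gym.Env):
        """Hypothetical env sketch: episode terminates after T_episode steps."""

        def __init__(self, T_episode=24):
            super().__init__()
            self.T_episode = T_episode
            self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
            self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(3,), dtype=np.float32)
            self.t = 0

        def reset(self):
            self.t = 0
            self.state = np.zeros(3, dtype=np.float32)
            return self.state

        def step(self, action):
            self.t += 1
            reward = -float(np.abs(action).sum())                # placeholder reward
            self.state = np.random.randn(3).astype(np.float32)   # placeholder transition
            done = self.t >= self.T_episode                      # episode ends after T_episode steps
            return self.state, reward, done, {}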

Please double-check the rest of your code. I won't have time for further tech support. You may close this issue unless bugs specific to SB3 come up.

Edit: I just realized this issue was raised in the stable-baselines repo, not in stable-baselines3 :). Take care to put issues in the right place next time.

Michael-HK commented 2 years ago

@Miffyli Thank you very much, I have rectified the issue; the problem was the termination condition. However, I now observe that every log interval reports the same mean reward, as shown below. What may be wrong?

Your suggestion will be highly appreciated.

    callback = SaveOnBestTrainingRewardCallback(check_freq=100, log_dir=log_dir)

    policy_kwargs = dict(net_arch=dict(pi=[200, 300], qf=[200, 300]))

    model = SAC("MlpPolicy", env, learning_rate=0.0003, buffer_size=60000,
                learning_starts=1500, batch_size=32, tau=0.005, gamma=0.99,
                train_freq=1, gradient_steps=1, action_noise=None,
                replay_buffer_class=None, replay_buffer_kwargs=None,
                optimize_memory_usage=False, ent_coef='auto',
                target_update_interval=1, target_entropy='auto', use_sde=False,
                sde_sample_freq=-1, use_sde_at_warmup=False, tensorboard_log=None,
                create_eval_env=False, policy_kwargs=policy_kwargs, verbose=1,
                seed=None, device='auto', _init_setup_model=True)

    model.learn(total_timesteps=10000, callback=callback, log_interval=100)
    model.save("sac_ZcmesEnv11")

Output:

    Wrapping the env in a DummyVecEnv.
    Num timesteps: 100
    Best mean reward: -inf - Last mean reward per episode: -202056.16
    Saving new best model to /tmp/gym/SAC1\best_model
    Num timesteps: 200
    Best mean reward: -202056.16 - Last mean reward per episode: -202056.16
    Saving new best model to /tmp/gym/SAC1\best_model
    Num timesteps: 300
    Best mean reward: -202056.16 - Last mean reward per episode: -202056.16
    Saving new best model to /tmp/gym/SAC1\best_model
    Num timesteps: 400
    Best mean reward: -202056.16 - Last mean reward per episode: -202056.16
    Saving new best model to /tmp/gym/SAC1\best_model
    Num timesteps: 500
    Best mean reward: -202056.16 - Last mean reward per episode: -202056.16
    Saving new best model to /tmp/gym/SAC1\best_model
    Num timesteps: 600
    Best mean reward: -202056.16 - Last mean reward per episode: -202056.16
    Saving new best model to /tmp/gym/SAC1\best_model
    Num timesteps: 700
    ...
    Saving new best model to /tmp/gym/SAC1\best_model
    Num timesteps: 2100
    Best mean reward: -202056.16 - Last mean reward per episode: -202056.16
    Saving new best model to /tmp/gym/SAC1\best_model
    Num timesteps: 2200
    Best mean reward: -202056.16 - Last mean reward per episode: -202056.16
    Saving new best model to /tmp/gym/SAC1\best_model
    Num timesteps: 2300
    Best mean reward: -202056.16 - Last mean reward per episode: -202056.16
    Saving new best model to /tmp/gym/SAC1\best_model
    Num timesteps: 2400
    Best mean reward: -202056.16 - Last mean reward per episode: -202056.16
    Saving new best model to /tmp/gym/SAC1\best_model

Miffyli commented 2 years ago

Glad you fixed it :). You might want to keep checking that the environment code is correct, but it will take time for the agent to learn the new behaviour. One thing I see is that your reward magnitude is way too high: the sum of episode rewards should not be this large, otherwise learning will be unstable because of large updates. I suggest you multiply your reward by 0.001 and see if that helps. Beyond that, you need to try out different solutions to find what works.
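
(A minimal sketch of one way to do this, not code from the thread; ScaleReward is a hypothetical name, and gym.RewardWrapper is the standard hook for this kind of reward scaling:)

    import gym

    class ScaleReward(gym.RewardWrapper):
        """Scale every reward by a constant so episode returns stay small."""

        def __init__(self, env, scale=0.001):
            super().__init__(env)
            self.scale = scale

        def reward(self, reward):
            return self.scale * reward

    # Wrap before Monitor so the logged rewards match what the agent is trained on.
    env = ScaleReward(ZcmesEnv4(df, st=0, en=8700, T_episode=24), scale=0.001)
    env = Monitor(env, log_dir)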

Closing as resolved and as "no tech support" (we cannot provide extensive tech support).