Align max_episode_steps - agent & env

Webbah commented 3 years ago

https://github.com/upb-lea/openmodelica-microgrid-gym/blob/06eea638c305d0efdf37236ca481c11b7ea42f5e/openmodelica_microgrid_gym/env/modelica.py#L301-L303

Because the time is incremented at the end of the step function, the env is never executed at time max_episode_steps*delta_t. Example: max_episode_steps = 10; delta_t = 0.0001 s:

[Image] The graphic shows 10 steps, but the env.step() function is executed only 9 times.

This currently leads to confusion when using standard libraries like stable-baselines, e.g. model.learn(total_timesteps=10, callback=callback), because they seem to count the number of env steps. If we use total_timesteps=max_episode_steps, the episode ends after 9 env.step() executions, but the agent wants to learn until step 10, so a second episode is started. This leads, for example, to 2 episode_returns in the Monitor class, while the user would assume that with total_timesteps=max_episode_steps only one episode is executed, leading to one return.
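To make the counting concrete, here is a minimal sketch in plain Python. `ToyEnv`, `steps_in_one_episode` and `episodes_stepped` are hypothetical names made up for this illustration; they are neither the OMG env nor stable-baselines code. The sketch only reproduces the behaviour described above, and it assumes that the training loop counts every env.step() call against total_timesteps and resets the env as soon as an episode is done.

```python
class ToyEnv:
    """Hypothetical stand-in for the env; only the step counting is modelled."""

    def __init__(self, max_episode_steps):
        self.max_episode_steps = max_episode_steps
        self.k = 0  # index of the current point on the time grid

    def reset(self):
        self.k = 0  # reset() already yields the first point (k = 0)
        return self.k

    def step(self, action):
        # The time index is counted up at the end of the step, and the
        # episode ends before grid point max_episode_steps is simulated.
        self.k += 1
        done = self.k >= self.max_episode_steps - 1
        return self.k, done


def steps_in_one_episode(env):
    """Count the env.step() calls until the first episode is done."""
    env.reset()
    n_steps, done = 0, False
    while not done:
        _, done = env.step(None)
        n_steps += 1
    return n_steps


def episodes_stepped(env, total_timesteps):
    """Count the episodes in which at least one step is executed
    when the step budget is total_timesteps (assumed auto-reset on done)."""
    env.reset()
    episodes, fresh = 0, True
    for _ in range(total_timesteps):
        if fresh:  # first step after a reset opens a new episode
            episodes += 1
            fresh = False
        _, done = env.step(None)
        if done:
            env.reset()
            fresh = True
    return episodes


print(steps_in_one_episode(ToyEnv(10)))
print(episodes_stepped(ToyEnv(10), total_timesteps=10))
```

With max_episode_steps = 10, the first call reports 9 steps per episode and the second reports 2 episodes, because the tenth step of the budget already belongs to a second episode.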

Should we change the definition in our env, or how should we deal with that? Any suggestions, opinions, or other ideas?

stheid commented 3 years ago

I think it would be better to have 1 state from the reset and then 10 actions and 10 env steps. I think you can simply change the start or end time of the simulation accordingly so that the calculation is correct.
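The proposed counting (one state from reset, then one state per action) could look roughly like the sketch below. `AlignedToyEnv` and `rollout` are hypothetical names for illustration only, and the constant action 0.0 is a placeholder; how the real env should shift its start or end time is not shown here.

```python
class AlignedToyEnv:
    """Hypothetical env whose episode lasts exactly max_episode_steps steps."""

    def __init__(self, max_episode_steps):
        self.max_episode_steps = max_episode_steps
        self.k = 0  # number of steps executed in the current episode

    def reset(self):
        self.k = 0
        return self.k  # initial state, before any action is applied

    def step(self, action):
        self.k += 1
        # The episode is done only after max_episode_steps steps were executed.
        done = self.k >= self.max_episode_steps
        return self.k, done


def rollout(env):
    """Run one episode and collect all visited states and applied actions."""
    states = [env.reset()]
    actions = []
    done = False
    while not done:
        action = 0.0  # placeholder action, the policy is irrelevant here
        state, done = env.step(action)
        actions.append(action)
        states.append(state)
    return states, actions


states, actions = rollout(AlignedToyEnv(10))
print(len(states), len(actions))
```

For max_episode_steps = 10 this gives 11 states (1 from reset plus 10 from steps) and 10 actions, so a step budget of total_timesteps = 10 ends exactly at the end of the first episode.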

Webbah commented 3 years ago

Solved by increasing timesteps by 1 in env.init

https://github.com/upb-lea/openmodelica-microgrid-gym/blob/168b208a685cc1df12036b062403f3b950580862/openmodelica_microgrid_gym/env/modelica.py#L116-L117

upb-lea / openmodelica-microgrid-gym

Align max_episode_steps - agent & env #131