ugr-sail / sinergym

Gym environment for building simulation and control using reinforcement learning
https://ugr-sail.github.io/sinergym/
MIT License

[Question] [Help] #417

Closed LorenzoBianchi02 closed 2 months ago

LorenzoBianchi02 commented 3 months ago

Hello! I am working on my thesis and I am trying to include Sinergym. My objective is to compare a custom-written PPO implementation against the stable-baselines3 PPO on Sinergym environments, but I am running into some complications and I would be eternally grateful if someone could help me.

My first problem concerns the stable-baselines3 PPO: I have it running, but I am unsure whether it is actually improving. I am testing on Eplus-env-5zone-hot-continuous-stochastic-v1, and after ~1,000,000 timesteps I get the following results:

progress.csv:

episode_num,cumulative_reward,mean_reward
1,-8223.663515318243,-0.23469359347369415
2,-8134.867089571647,-0.23215944890330042
3,-8174.118337475087,-0.23327963291880957
...
28,-8200.962686520645,-0.23404573877056634

Are these plausible results, and if so, roughly how long should it take to train the model?

My second and biggest problem is with my own implementation. It works fine with Gymnasium environments such as Pendulum or LunarLanderContinuous, but when I try a Sinergym environment my model takes actions that are outside the action_space. (Printing the action_space gives Box([12. 23.25], [23.25 30. ], (2,), float32), so I suspect the problem is in my implementation.)

[ENVIRONMENT] (WARNING) : Step: The action [-8787.659    -146.22293] is not correct for the Action Space Box([12.   23.25], [23.25 30.  ], (2,), float32)

Here is some of my code:

# Imports used by these methods
import torch
from torch.optim import Adam
from torch.distributions import MultivariateNormal

def __init__(self, policy_class, env, **hyperparameters):
    self._init_hyperparameters(hyperparameters)

    # Extract environment information
    self.env = env
    self.obs_dim = env.observation_space.shape[0]
    self.act_dim = env.action_space.shape[0]

    # Initialize actor and critic networks
    self.actor = policy_class(self.obs_dim, self.act_dim)
    self.critic = policy_class(self.obs_dim, 1)

    # Initialize optimizers for actor and critic
    self.actor_optim = Adam(self.actor.parameters(), lr=self.lr)
    self.critic_optim = Adam(self.critic.parameters(), lr=self.lr)

    # Initialize the covariance matrix used to query the actor for actions
    self.cov_var = torch.full(size=(self.act_dim,), fill_value=0.5)
    self.cov_mat = torch.diag(self.cov_var)

def get_action(self, obs):
    # Query the actor network for a mean action
    mean = self.actor(obs)

    dist = MultivariateNormal(mean, self.cov_mat)

    # Sample an action from the distribution
    action = dist.sample()

    # Calculate the log probability for that action
    log_prob = dist.log_prob(action)

    # Return the sampled action and the log probability of that action in our distribution
    return action.detach().numpy(), log_prob.detach()

Do you guys have an example of an implementation somewhere?

Thank you so much in advance

Lorenzo Bianchi

kad99kev commented 3 months ago

Hi @Lorenzo69420, I do not work on this library, but I can share some of my experience working with it.

For the first question, I would highly recommend plotting the rewards you have received. If you plot the curves for your custom implementation and for the stable-baselines3 one, you can compare how the two agents improve (if they do) over time. If you do not see an improvement, that is, if the lines just move up and down randomly, there could be a problem. I use Weights & Biases to track my runs; I have attached an image showing what a run from a CleanRL agent looks like. [attached image: reward curve from a CleanRL run]
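If you prefer a quick local check instead of a tracker, a minimal sketch like the following already shows whether the curve trends upward. It assumes the progress.csv columns you posted (episode_num, cumulative_reward, mean_reward) and a hypothetical path to the run's output folder:

import pandas as pd
import matplotlib.pyplot as plt

# Path is hypothetical: point it at the progress.csv inside your run's output folder.
df = pd.read_csv("progress.csv")

# Raw episode return plus a rolling mean to make any trend easier to see.
plt.plot(df["episode_num"], df["cumulative_reward"], label="cumulative reward")
plt.plot(df["episode_num"],
         df["cumulative_reward"].rolling(window=5, min_periods=1).mean(),
         label="rolling mean (5 episodes)")
plt.xlabel("episode")
plt.ylabel("reward")
plt.legend()
plt.show()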

As for your second question, it looks like an implementation error. When you normalize actions into the range [-1, 1], Sinergym converts them back into unnormalized values within your action bounds, so the normalized values must stay inside [-1, 1]. If you take an action like -8787, it is far outside the -1 bound, and when Sinergym converts it back it throws a warning saying the action is invalid. Remember: if the normalized value is out of bounds, the unnormalized value will be too.
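To make the conversion concrete, this is roughly what such an unnormalization step does. It is a generic sketch, not Sinergym's actual wrapper code, using the bounds you printed:

import numpy as np

def unnormalize_action(action, low, high):
    """Map an action in [-1, 1] linearly onto a Box's [low, high] bounds."""
    action = np.asarray(action, dtype=np.float32)
    return low + (action + 1.0) * 0.5 * (high - low)

# With your space: -1 maps to the lower bound and +1 to the upper bound;
# an input like -8787 lands far outside the Box and triggers the warning you saw.
low = np.array([12.0, 23.25], dtype=np.float32)
high = np.array([23.25, 30.0], dtype=np.float32)
print(unnormalize_action([-1.0, 1.0], low, high))  # -> [12. 30.]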

I hope this helps!

AlejandroCN7 commented 2 months ago

Hi @Lorenzo69420, I agree with @kad99kev. Regarding the first question, you should visualize the data to detect possible trends toward convergence (or the lack of them) and decide the next step from there: for example, try another algorithm, fine-tune hyperparameters, widen the action space, modify the reward weights, etc.

As for the second question, from the code provided I can't see exactly how the Sinergym environment is configured (wrappers or other settings), nor how you are obtaining the action values from the actor. However, @kad99kev points very well to the normalization issue, since negative values appear. Is it possible that you are using the action normalization wrapper while, at the same time, sending actions from your algorithm that are not normalized? If you normalize the environment, you must pass actions with values in [-1, 1], and Sinergym will convert them to the correct values in the simulation.
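On the algorithm side, one common pattern is to squash the sampled action into [-1, 1] before stepping a normalized environment, or to clip it to the Box bounds if no normalization wrapper is used. Here is a rough sketch based on the get_action method above (not a drop-in fix; the tanh squashing would also need a log-probability correction for fully correct PPO updates):

import torch
from torch.distributions import MultivariateNormal

def get_action(self, obs):
    mean = self.actor(obs)
    dist = MultivariateNormal(mean, self.cov_mat)
    raw_action = dist.sample()
    log_prob = dist.log_prob(raw_action)

    # Option A: the environment is wrapped to expect normalized actions -> squash to [-1, 1].
    action = torch.tanh(raw_action)

    # Option B: no normalization wrapper -> clip directly to the Box bounds instead.
    # low = torch.as_tensor(self.env.action_space.low)
    # high = torch.as_tensor(self.env.action_space.high)
    # action = torch.clamp(raw_action, low, high)

    return action.detach().numpy(), log_prob.detach()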

Hope this is helpful, and sorry for the late reply. Thank you very much @kad99kev for the help you have offered (and very good help it is :smile:).