This project provides a general heating system controller based on Reinforcement Learning.
Main goals:
This is the most important part of the training, both for the predefined heat model and for data-driven model-based RL. The model is much simpler than a normal simulator, because the simulator has to be fast.
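As an illustration, here is a minimal sketch of the kind of single-zone linear heat model this refers to; the class name, parameter names, and numbers are assumptions for the example, not the project's actual model.

```python
class SimpleHeatModel:
    """Single thermal mass with linear losses: intentionally simple and fast."""

    def __init__(self, capacity=1e7, loss_coeff=300.0, dt=60.0):
        self.capacity = capacity      # thermal capacity [J/K] (illustrative)
        self.loss_coeff = loss_coeff  # heat loss coefficient [W/K] (illustrative)
        self.dt = dt                  # simulation time step [s]

    def step(self, t_inside, t_outside, heating_power):
        # Linear dynamics: heat loss to the outside plus injected heating power.
        dT = (heating_power - self.loss_coeff * (t_inside - t_outside)) / self.capacity
        return t_inside + self.dt * dT


# One-hour rollout with constant outside temperature and heating power.
model = SimpleHeatModel()
t_in = 18.0
for _ in range(60):
    t_in = model.step(t_in, t_outside=5.0, heating_power=3000.0)
```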
A collection of well-known control tools and RL tools.
Control tools:
Install dependencies:
pip install -r requirements.txt
Train the methods and save the policies:
python train.py
Evaluate the policies and plot their performance:
python evaluate.py
Even with outstanding domain knowledge, building a real-world RL application is a genuinely hard problem.
Problems:
As this is the most important part of the RL pipeline, it must be bug-free, so sufficient unit tests are needed.
The simulator is linear, so PID should solve the control problem nearly optimally. However, PID does not use the energy price information, so it may be suboptimal.
Otherwise, this is not too complicated, as the heating system model is linear.
The figure shows that the PID controls the temperature almost perfectly. The orange band marks the required inside temperature interval.
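For reference, a minimal discrete PID sketch of the kind used as the baseline; the gains and output limits are illustrative, not the tuning used in the project.

```python
class PID:
    """Discrete PID controller with output clipping to the valid power range."""

    def __init__(self, kp=500.0, ki=1.0, kd=0.0, dt=60.0, u_min=0.0, u_max=5000.0):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.u_min, self.u_max = u_min, u_max
        self.integral = 0.0
        self.prev_error = 0.0

    def __call__(self, target, measured):
        error = target - measured
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        u = self.kp * error + self.ki * self.integral + self.kd * derivative
        return min(max(u, self.u_min), self.u_max)  # heating power in [u_min, u_max]
```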
The goal of imitation learning was to mimic the PID control. It sounds like an easy supervised ML problem... well, it isn't that easy. NNs are designed to be smooth, but the PID control signal contains spikes, because the target temperature is a step function.
I tried different loss functions:
Based on the results, MAE was chosen as the starting point for model-free RL.
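A minimal behaviour-cloning sketch of this step, assuming PID-generated (state, action) pairs and TF2/Keras; the network size and variable names are illustrative.

```python
import tensorflow as tf

# states / pid_actions: arrays of PID inputs and PID heating powers collected
# from simulator rollouts (illustrative names, not the project's data loaders).
def build_bc_policy(state_dim, action_dim):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="tanh", input_shape=(state_dim,)),
        tf.keras.layers.Dense(64, activation="tanh"),
        tf.keras.layers.Dense(action_dim),
    ])

policy = build_bc_policy(state_dim=4, action_dim=1)
policy.compile(optimizer="adam", loss="mae")  # MAE worked best; MSE over-penalises the spikes
# policy.fit(states, pid_actions, epochs=50, batch_size=256)
```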
For Q-function-based methods, it is essential to learn the critic as well. Another interesting problem with pre-training an Actor-Critic architecture is that training the Critic with the "optimal" policy's values causes a discrepancy between the value function of the learned policy and that of the teacher policy. This problem can be eliminated by learning the policy first and using the learned policy's Q-values as the critic targets. My intuition: MAE performs best because the system is fully linear.
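A sketch of that target computation, with hypothetical network and tensor names: the bootstrap action comes from the already behaviour-cloned actor rather than from the PID teacher.

```python
def critic_pretrain_targets(target_critic, actor, rewards, next_states, gamma=0.99):
    # Bootstrap with the *learned* policy's action, not the PID teacher's,
    # so the pre-trained critic matches the policy it will be used with.
    next_actions = actor(next_states)
    next_q = target_critic([next_states, next_actions])
    return rewards + gamma * next_q  # TD(0) targets for the critic loss
```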
In my experience, the heating problem is far too complicated for model-free RL, so failure to converge does not necessarily mean the implementation is wrong. I used the OpenAI gym inverted pendulum and continuous cartpole tasks to check convergence. The algorithms work on these tasks, so this serves as the unit test of the implementation.
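A sketch of such a convergence check, assuming the classic gym API; `train_fn` stands in for the project's training routine, and the env id and threshold are illustrative.

```python
import gym

def check_convergence(train_fn, env_id="Pendulum-v0", episodes=10, threshold=-300.0):
    """Smoke test: the agent must reach a reasonable return on a solvable task."""
    env = gym.make(env_id)
    policy = train_fn(env)  # hypothetical: returns a callable obs -> action
    returns = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            total += reward
        returns.append(total)
    assert sum(returns) / len(returns) > threshold
```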
Without any further tricks, the model tends to predict one of the corner cases of the valid interval. For these reasons, I decided to pre-train the models. The baseline is the PID controller, and imitation learning is used to obtain the pre-trained model, as described in the previous section.
Implemented methods:
SAC: A state-of-the-art model-free RL method that also handles exploration and converges really fast on the inverted pendulum problem. Pre-training is a bit more difficult in this case because of the 6 different networks.
PPO: An on-policy algorithm with a constraint that keeps the updated policy from moving too far from the current one (see the clipped-objective sketch below). This is why it seems like a more stable method for pre-training.
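The constraint PPO uses is the clipped surrogate objective; a minimal sketch (the function name and clip value are the usual defaults, not necessarily this repo's choices):

```python
import tensorflow as tf

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    ratio = tf.exp(log_prob_new - log_prob_old)  # new policy vs behaviour policy
    clipped = tf.clip_by_value(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Taking the minimum removes the incentive to move the policy outside the clip range.
    return -tf.reduce_mean(tf.minimum(ratio * advantage, clipped * advantage))
```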
The results show that pre-trained model-free RL does not work perfectly (yet): pre-trained PPO is able to control the system more or less, but not optimally. Training the models further results in the controller raising the inside temperature even more, which is very odd. Furthermore, these methods tend to break down later and output the maximum or minimum heating power during control, which is very similar to the model-free RL from scratch case.
See the interactive plots here
The iLQR method is really slow. Its main advantage is that it converges faster than SAC if the model-learning steps, which are passive steps, are not counted. Interesting discovery: TF2.0 can calculate the required Hessian matrix (d²cost/d input²), but the network must contain non-ReLU activations as well, because the Hessian of a pure-ReLU network is a zero matrix. Anyway, model-based RL is cool, but inference-time speed-up is needed, which can be addressed with Guided Policy Search.
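A minimal TF2 sketch of that Hessian computation with nested gradient tapes; the small cost network is only for illustration (note the tanh activation instead of ReLU):

```python
import tensorflow as tf

cost_net = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="tanh", input_shape=(4,)),  # non-ReLU, so the Hessian is non-zero
    tf.keras.layers.Dense(1),
])

x = tf.Variable(tf.random.normal([1, 4]))  # state-action input (illustrative size)
with tf.GradientTape() as outer:
    with tf.GradientTape() as inner:
        cost = cost_net(x)[0, 0]
    grad = inner.gradient(cost, x)                      # d cost / d input, shape (1, 4)
hessian = tf.reshape(outer.jacobian(grad, x), (4, 4))   # d^2 cost / d input^2
```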