nrontsis / PILCO

Bayesian Reinforcement Learning in Tensorflow

Initialisation of ExponentialReward: eye(.) vs ones(.) #25

Closed: ven-kyoshiro closed this issue 5 years ago

ven-kyoshiro commented 5 years ago

The weights W of the exponential reward default to np.ones(.), as defined in the following line: https://github.com/nrontsis/PILCO/blob/6ebcc7df9a8190c542445f0c82d835c94b745c8e/pilco/rewards.py#L25
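For reference, my reading of the linked line is roughly the following (a paraphrase of the fallback used when no W is supplied, not a verbatim quote of rewards.py):

# Current default in ExponentialReward.__init__ when W is None (paraphrased):
self.W = Param(np.ones((state_dim, state_dim)), trainable=False)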

But when I calculate the reward mean at states on the unit circle centred on the target,

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

from pilco.rewards import ExponentialReward

with tf.Session(graph=tf.Graph()) as sess:
    # Reward centred at the origin, with the default-style weight matrix W = np.ones((2, 2))
    R = ExponentialReward(state_dim=2, t=np.array([0., 0.]), W=np.ones((2, 2)))
    # Mean reward for a state mean on the unit circle, with unit state covariance
    muRs = lambda th: R.compute_reward(np.array([np.sin(th), np.cos(th)]), s=np.eye(2))
    # Sweep the angle over a full revolution
    left = [i / 36 * 2 * np.pi for i in range(36)]
    height = np.array([muRs(th)[0].eval()[0][0] for th in left])

plt.xlabel('rad')
plt.ylabel('mean of reward')
plt.plot(left, height, label='ones')
plt.legend()
plt.show()

the score is not constant, as can be observed in the following plot: [plot: mean of reward vs. rad for W = ones, varying with the angle]

So I think np.eye(state_dim) is a better default, i.e. having

self.W = Param(np.eye(state_dim), trainable=False) 

which gives the following score across theta: [plot: mean of reward vs. rad for W = eye, constant across the angle]
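To make the argument concrete without going through the library, here is a small standalone check of the quadratic form (x - t)^T W (x - t) that sits in the exponent of the reward, evaluated for a deterministic state on the unit circle (my own illustration, not code from the repo):

import numpy as np

# Points x = (sin(th), cos(th)) on the unit circle around the target t = 0
thetas = np.linspace(0.0, 2.0 * np.pi, 36, endpoint=False)
X = np.stack([np.sin(thetas), np.cos(thetas)], axis=1)  # shape (36, 2)

W_ones = np.ones((2, 2))
W_eye = np.eye(2)

# Quadratic form x^T W x for every point
q_ones = np.einsum('ni,ij,nj->n', X, W_ones, X)  # equals (sin(th) + cos(th))**2
q_eye = np.einsum('ni,ij,nj->n', X, W_eye, X)    # equals sin(th)**2 + cos(th)**2 == 1

print(q_ones.min(), q_ones.max())  # varies between about 0 and 2, so the reward changes with the angle
print(q_eye.min(), q_eye.max())    # constant 1, so the reward depends only on the distance to the target

With W = ones the exponent is (sin(th) + cos(th))^2, which changes with the angle, while with W = eye it is just the squared distance to the target.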

nrontsis commented 5 years ago

Thanks @ven-kyoshiro for opening the issue.

@kyr-pol git blame shows that you added this initialisation. What was the rationale behind it? Is this something that e.g. exists in the MATLAB implementation or helps in practice?

nrontsis commented 5 years ago

Also, @ven-kyoshiro, can you elaborate on why you would want the cost to be constant across theta? It would also be helpful to make the same argument with a real, simple example (like an inverted pendulum).

ven-kyoshiro commented 5 years ago

Thanks for the kind response! I read this original paper. Equation (25) in the paper is defined using the L2 norm, so I think states that are the same distance from the target should return the same reward.
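To spell out the connection (as I understand it, and with notation possibly differing slightly from the paper): the paper's saturating cost is built around the squared Euclidean distance to the target, while this repository's ExponentialReward uses a general weighted quadratic form, roughly

$$
c(x) = 1 - \exp\left(-\frac{\lVert x - x_{\mathrm{target}} \rVert^2}{2\sigma_c^2}\right),
\qquad
r(x) = \exp\left(-\tfrac{1}{2}\,(x - t)^\top W \,(x - t)\right).
$$

With W a (scaled) identity, e.g. W = \sigma_c^{-2} I, the exponent reduces to the squared distance and all states at equal distance from the target get the same reward; a dense W such as ones(.) mixes the state dimensions and loses this property.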

kyr-pol commented 5 years ago

Hey @ven-kyoshiro, thanks, that's a good catch. I agree eye(.) would be more appropriate as the default value. In most cases we expect the user to define a reward relevant to the task at hand, but eye(.) seems more reasonable as the standard choice, and even in our examples the rewards used are diagonal. I think I initially went with ones for debugging purposes (many zero values could hide mistakes), but that's no longer relevant.