relates to ER-DQN - Githubissues

ddlau commented 3 years ago

hello? is this relates to ERDL, ER-DQN, or the deepmind paper "statistics and samples in distributional reinforcement learning"? if so, i'm quite curious, how to calculate the equation 7? what I can NOT get is, give some taus, and corresponding expectiles, the direct solver are the quantiles..? how this makes sense in practice? or, have you ever implemented these algorithms?

matomatical commented 3 years ago

Relation to EDRL

This repository is indirectly related to expectile-based distributional reinforcement learning.

Expectile statistics exist separately as a topic of mathematical study from their use in distributional reinforcement learning. This repository contains an algorithm for computing the sample expectiles of a finite sample of data points. According to my understanding, this algorithm is not used within EDRL, which can be seen as an alternative method for estimating the expectiles of a distribution online, based on incremental updates. It's ind of like how you can compute the sample mean/expected value of a distribution by using a formula (like my algorithm) or by using the TD learning update rule (like in traditional value-based RL).

Equation 7

If you mean how did I derive equation (7), please check that this is simply the right-hand side of equation (4) rewritten with the identities given in equations (5) and (6).

If you mean how can I compute the values of the function defined in equation (7), please see the steps on the next page or the Python implementation.

Your confusion

I'm sorry, I didn't quite understand your question, could you explain further what you mean by 'the direct solver are the quantiles'?

Have I implemented these algorithms

If you mean "have you ever implemented [the algorithms from the deepmind paper]", I have implemented tabular EDRL (not in this repository) but I have not implemented ER-DQN. If you mean "[the algorithms described in this repository]", yes, see the python code or the notebook.

Let me know if you have further questions, or I'm happy to provide more detailed answers.

ddlau commented 3 years ago

Thank you for your response. Sorry for my limited english.

Question 1 I believe Eq.7 and Eq.8 are both for the same goal, which is, given some samples of statistics, impute the corresponding distribution. am I right?

Question 2 Consider the situation when applying Eq.7 or Eq.8, which we have are some samples of statistics, i.e. epsilon_0, epsilon1, ..., epsilon{n-1}, which are output of network, we use these samples to calculate Eq.7 or Eq.8, the results, z0, z1, ..., z_{n-1}, are used to represent the imputed distribution. am I right?

Question 3 the inputs to Eq.7 or Eq.8 are N expectile samples (epsilon_0, epsilon_1, ...), the outputs should be N diracs (z0, z1, ...), but immediately, should not the results to be epsilon_0, epsilon_1, ...? I mean, z0=epsilon_0, z1=epsilon_1, ... would minimize these equations, if so, what are we impute for? I know I must be wrong, but don't know where and how.

Question 4 Suppose now we have imputed out a distribution, how to get the statistics values from it? can it be that, these z0, z1, ..., z_{n-1} are used?

Talk about theory, I guess I can more or less get the ideas in this paper, but in practice, I got stucked by these questions.

And, can I have a look at your tabular implementation? would you like to implement ER-DQN?

Besides, I guess this paper is fairly important, doesn't it? Do you have any idea why there's merely no people do it? I googled out nothing shows.

matomatical commented 3 years ago

Ah! Sorry, it's also my mistake---I assumed you were talking about equation 7 in my derivation document (in this repository) but now I see you are talking about the equation in the deepmind paper.

To be extra clear, I think imputing a set of sample points from a set of expectiles is an interesting problem, but this is the inverse of the problem solved by the algorithm in this repository. That is, imputation goes from expectiles to a sample. My algorithm goes from a sample to expectiles.

Anyway, I will be able to respond here with my thoughts on your further questions maybe tomorrow or over the weekend. I'll also see if I can find my tabular implementation, to show you. Sorry but I don't have time to work on an ER-DQN implementation at the moment.

Finally, regarding the popularity of this paper, I am not sure why. I thought maybe it could be because (1) it is quite recent work, so maybe few people have had time to follow up with it, or (2) it is a very expensive algorithm because of the imputation step (I have some thoughts on how to speed up the imputation step, but they are not ready to share yet).

Please let me know if anything is unclear.

ddlau commented 3 years ago

It's very nice, thank you again and looking forward to!

matomatical commented 3 years ago

I have prepared and published my implementation of EDRL (along with CDRL and QDRL) in a Jupyter Notebook in this other repository. You may find the OptBasedImputer class interesting. The method 'opt' implements equation (7) and the method 'min' implements equation (8) from the EDRL paper. The class EDRLAgent puts this together into a learning algorithm.

Sorry that my code is pretty unclear, I did not write it with the intention of showing it to others, just to understand the algorithms myself. Please feel free to raise an issue at that repository if you have any questions about my implementation.

Regarding your questions about the EDRL paper:

Question 1 I believe Eq.7 and Eq.8 are both for the same goal, which is, given some samples of statistics, impute the corresponding distribution. am I right?

That's right.

Question 2 Consider the situation when applying Eq.7 or Eq.8, which we have are some samples of statistics, i.e. epsilon_0, epsilon1, ..., epsilon{n-1}, which are output of network, we use these samples to calculate Eq.7 or Eq.8, the results, z0, z1, ..., z_{n-1}, are used to represent the imputed distribution. am I right?

That's right. The statistics are the output of a network in ER-DQN, anyway. In tabular EDRL, we just store the statistics in a table: For example one vector of statistics for each state/action combination.

Question 3 the inputs to Eq.7 or Eq.8 are N expectile samples (epsilon_0, epsilon_1, ...), the outputs should be N diracs (z0, z1, ...), but immediately, should not the results to be epsilon_0, epsilon_1, ...? I mean, z0=epsilon_0, z1=epsilon_1, ... would minimize these equations, if so, what are we impute for? I know I must be wrong, but don't know where and how.

Right, that's wrong. The results are not always z0=epsilon_0, z1=epsilon_1, ....

Actually solving equation (7) or (8) for z0, ... is tricky. I don't think you can do it analytically, that is, I don't think you can get a simple formula for the outputs. That's why the authors used numerical optimisation programs to find the solutions (and that's how I implemented it too).

If you want to convince yourself, you could try to solve for the z0 and z1 in this case:

n=3,
epsilon_0 = 0 (tau_0 = 0.1),
epsilon_1 = 1 (tau_1 = 0.5),
epsilon_2 = 2 (tau_2 = 0.9),
what is z_0, z_1, z_2?

You could start by checking if z_0 = 0, z_1 = 1, z_2 = 2 works. I don't think it will give you the same expectiles back. See also answer to next question.

Question 4 Suppose now we have imputed out a distribution, how to get the statistics values from it? can it be that, these z0, z1, ..., z_{n-1} are used?

If you want the expectiles of the sample you just got from imputation, you can use the algorithm in this repository, or compute the expectiles exactly by applying the definition again. If you did the imputation exactly correct (I don't think the numerical methods will always get the answer exactly correct) then you should get back epsilon_0, ...

In EDRL, we actually want to do an update first. So we take these sample points z_0, z_1, ... and then we scale them according to the discount factor and we add the reward until we get a sample from the previous state/action's reward distribution, then we apply one step of the expectile regression update (a step in the downward direction of the gradient of the expectile loss function). This moves the expectiles in the table for the previous state/action closer to the true expectiles. Or, if it's ER-DQN, it makes the next prediction from the network closer to the true expectiles for that state/action.

Hope that helps?

ddlau commented 3 years ago

Brave! It means a lot to me and I'll use some time to digest, later feedbacks, thank you!

A few details:

Question 1 In TabularDistRL.ipynb, you mentioned that "not using "Quantile Huber Loss" because not doing fn aprx." does "fn aprx" mean "function approximation"? if so, why such huber thing relates to function approximation? In my guess, fn aprx brings in random values which might be too large or too small, and that would hurt training. without approximation, even if values are too large or too small, they are reasonable, so no need for huber to cover. Any ideas?

Question 2 there's no so-called "huber expectile loss", is this meant to be? in my guess, we can simply replace those deltas (MAE errors) in huber quantile loss formula by corresponding MSE errors and obtain "huber expectile loss". to be exact, like follows

# Y: y_true, of shape [batch,N]
# P: y_pred, of shape [batch,N]
def make_expectile_loss( κ ):
    def expectile_loss( Y, P, τ ):
        Y = Y[ :, None, : ]
        P = P[ :, :, None ]
        u = tf.pow( Y - P, 2 )
        L = tf.where( Y > P, τ, 1 - τ ) * u
        L = tf.reduce_mean( tf.reduce_sum( tf.reduce_mean( L, axis=-1 ), axis=-1 ), axis=-1 )
        return L

    def expectile_huber_loss( Y, P, τ ):
        Y = Y[ :, None, : ]
        P = P[ :, :, None ]
        u = tf.pow( Y - P, 2 )
        L = tf.where( Y > P, τ, 1 - τ ) * tf.where( u < κ, 0.5 * tf.pow( u, 2 ), κ * (u - 0.5 * κ) ) / κ
        L = tf.reduce_mean( tf.reduce_sum( tf.reduce_mean( L, axis=-1 ), axis=-1 ), axis=-1 )
        return L

    return expectile_huber_loss if κ != 0 else expectile_loss

And it plays the same role in ER-DQN as huber quantile loss in QR-DQN. How do you think about this?

ddlau commented 3 years ago

After some plotting, I changed my mind about questoin 2 above. As stated in the QR-DQN paper, the reason they brought quantile huber loss in is, the direct quantile loss is not smooth at zero. when u->0+, the gradient is τ; when u->0-, the gradient is τ-1; but there's smply no such problem in expectile loss. when u->0+, the gradient is 2 u τ = 0; when u->0-, the gradient is 2 u (τ-1) =0; so there's no need to huber. Am I right?

ddlau commented 3 years ago

hello, after days of study I have some new questions about your tabular implementation and expect your answers, thank you!

question 1 inside method OptBasedImputer.sample, parameters uniform or bestof1000, just set different initializers to parameter x0 of scipy.optimize.root or scipy.optimize.minimize, but expectile regression loss is a convex function, so, the question is, do uniform or bestof1000 really matters?

question 2 again inside OptBasedImputer.sample, x = opt.root(self._grad_ε_ER_loss, x0=z0).x, it seems that you are calculating the gradients with respect to self.ε, instead of the arguments to self._grad_ε_ER_loss, I am pretty confused? would you please explain some?

Thank you again

ddlau commented 3 years ago

I made myself some explanation for question 2 above, but expecting your judgement.

In OptBasedImputer._grad_ε_ER_loss, self.ε are the expectiles and the argument X are samples, the result of OptBasedImputer._grad_ε_ER_loss are gradients w.r.t self.ε, the job here, is given some distribution, described by these expectiles, to draw some samples out.

matomatical commented 3 years ago

Question 1

I am not sure. It's a convex function according to the Statistical and Samples paper. Actually, perhaps it has no effect on the quality of the solution in this case, I haven't investigated this. But it's a trick I saw in DeepMind's implementation for the 2020 Nature paper "A distributional code for value in dopamine-based reinforcement learning" (there is a notebook in the supplemental information "Distribution Decoding Analysis.ipynb" which implements this trick and says it "significantly improves the minima found").

Question 2

Yes, you are right. We are not minimising the function w.r.t. its arguments with gradient descent. We want to find a point where this function is 0. The function just happens to represent a gradient, because that's the necessary condition connecting the sample and the expectiles (equation (7) from the Statistics and Samples paper). So you could say that the 'gradient' in the name is not related to the optimisation process at all.

Hope that helps.

matomatical commented 3 years ago

You had a few other questions from earlier. I'm sorry I missed those ones. I'll take another look in the next few weeks.

ddlau commented 3 years ago

thank you for your response. sorry for my limited english, so, in order to make myself clear, I have to be a little wordy.

for question 2 above, I think either root finding or minimizing, they are just methods in concrete. but the aim there, is given expectiles, we need to draw some samples out, it is the "statistic function" mentioned in deepmind paper, this is my point and where I am confused at. am I right?

if I was right, then the whole things add up. yes, centainly, expectile regression loss is convex, to emphasize, convex means that, given some samples, we can find out the one and only one set of expectiles. but OptBasedImputer.sample is not intended to find expectiles, it is used to draw samples, so there's no room for convex at all. given a set of expectiles, there might be countless sets of samples, which fit expectile regression loss function equally well, but would have different practical effects. "uniform" or "bestof1000" or any other initializers will result in different sets of samples. in this example tabular setting, "uniform" and "bestof1000" initializations have no obvious difference. but you can try initializing the x0 argument with k zeros, then after training process, the result is obvioulsly worse. and, better initializers might cost less time. so, the initializers actually matter. again, just this.

ddlau commented 3 years ago

...

matomatical commented 3 years ago

In TabularDistRL.ipynb, you mentioned that "not using "Quantile Huber Loss" because not doing fn aprx." does "fn aprx" mean "function approximation"?

Yes, function approximation. To be clear, by function approximation, I mean approximating the value function with a deep neural network. So tabular RL = no function approximation, deep RL = yes function approximation.

if so, why such huber thing relates to function approximation? In my guess, fn aprx brings in random values which might be too large or too small, and that would hurt training. without approximation, even if values are too large or too small, they are reasonable, so no need for huber to cover. Any ideas?

Great question. Actually, I think I was mistaken. Sorry for the confusion. This is my understanding:

Like you said above, in the QDRL paper, the authors introduce their Quantile Huber loss function instead of the normal Quantile Regression loss function because their Huber version is smoothly differentiable.
At time, I must have thought that the differentiability would only be important for training a network with gradient descent, and not for tabular learning.
Now I realise that also in tabular learning our update rules implement gradient descent, and there could be some instability caused by using the plain Quantile Regression loss function. Now I think I should have used the Huber loss.
Anyway, for my simple experiments, I didn't have any problems with instability.

there's no so-called "huber expectile loss", is this meant to be? ... After some plotting, I changed my mind about question 2 above. As stated in the QR-DQN paper, the reason they brought quantile huber loss in is, the direct quantile loss is not smooth at zero. ... but there's simply no such problem in expectile loss.

I completely agree.

I think either root finding or minimizing, they are just methods in concrete. but the aim there, is given expectiles, we need to draw some samples out ... am I right?

Yes, you are right.

yes, centainly, expectile regression loss is convex, to emphasize, convex means that, given some samples, we can find out the one and only one set of expectiles. but OptBasedImputer.sample is not intended to find expectiles, it is used to draw samples, so there's no room for convex at all. given a set of expectiles, there might be countless sets of samples, which fit expectile regression loss function equally well, but would have different practical effects. "uniform" or "bestof1000" or any other initializers will result in different sets of samples. in this example tabular setting, "uniform" and "bestof1000" initializations have no obvious difference. but you can try initializing the x0 argument with k zeros, then after training process, the result is obvioulsly worse. and, better initializers might cost less time. so, the initializers actually matter. again, just this.

Yes, I agree again.

ddlau commented 3 years ago

alright, i think i can call it done now, thank you again for your kind help.

matomatical / expectiles

relates to ER-DQN #1

Relation to EDRL

Equation 7

Your confusion

Have I implemented these algorithms

Question 1

Question 2