nachiket92 / conv-social-pooling

Code for the model proposed in: Nachiket Deo and Mohan M. Trivedi, "Convolutional Social Pooling for Vehicle Trajectory Prediction." CVPRW, 2018
MIT License

How do you calculate the NLL loss? #7

Open fengxin619 opened 5 years ago

fengxin619 commented 5 years ago

Could you tell me where the formula comes from?

nachiket92 commented 5 years ago

We are calculating the -log-likelihood of the ground truth under the bivariate normal distribution given by the model outputs. The model outputs the means, standard deviations and the correlation coefficient of the bivariate normal distribution.

RebornHugo commented 5 years ago

> We are calculating the -log-likelihood of the ground truth under the bivariate normal distribution given by the model outputs. The model outputs the means, standard deviations and the correlation coefficient of the bivariate normal distribution.

I found your calculation

    out = -(torch.pow(ohr, 2) * (torch.pow(sigX, 2) * torch.pow(x - muX, 2)
            + torch.pow(sigY, 2) * torch.pow(y - muY, 2)
            - 2 * rho * torch.pow(sigX, 1) * torch.pow(sigY, 1) * (x - muX) * (y - muY))
            - torch.log(sigX * sigY * ohr))

does not match the formula on Wikipedia: [image: bivariate normal density]

Could you answer my doubts?

jmercat commented 5 years ago

This is the correct computation:

    eps_rho = 1e-6
    ohr = 1/(np.maximum(1 - rho * rho, eps_rho)) #avoid infinite values
    out = 0.5*ohr * (diff_x * diff_x / (sigX * sigX) + diff_y * diff_y / (sigY * sigY) -
            2 * rho * diff_x * diff_y / (sigX * sigY)) + np.log(
            sigX * sigY) - 0.5*np.log(ohr) + np.log(np.pi*2)

The sigma values were inverted, which may be changed afterward by replacing the output activations for sigma from exp(x) to exp(-x); this does not affect the results if done consistently. There is an error though: the 0.5 factor and the constant term np.log(np.pi*2) were forgotten. The constant does not affect the gradients, so it is fine for the learning phase, but it should be fixed for evaluation.
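The corrected expression can be cross-checked against the generic multivariate-normal log-density 0.5*d'S^-1*d + 0.5*log det(2*pi*S). A NumPy sketch (the name `bivariate_nll` and the test values are illustrative, not from the repo):

```python
import numpy as np

def bivariate_nll(x, y, muX, muY, sigX, sigY, rho, eps_rho=1e-6):
    """Corrected NLL of (x, y) under a bivariate normal; sigX, sigY are std devs."""
    dx, dy = x - muX, y - muY
    ohr = 1.0 / np.maximum(1 - rho * rho, eps_rho)  # avoid division by zero as |rho| -> 1
    return (0.5 * ohr * (dx**2 / sigX**2 + dy**2 / sigY**2
                         - 2 * rho * dx * dy / (sigX * sigY))
            + np.log(sigX * sigY) - 0.5 * np.log(ohr) + np.log(2 * np.pi))

# independent check via the generic multivariate-normal form
muX, muY, sigX, sigY, rho = 1.0, -2.0, 0.5, 2.0, 0.3
S = np.array([[sigX**2, rho * sigX * sigY], [rho * sigX * sigY, sigY**2]])
d = np.array([1.3 - muX, -1.1 - muY])
ref = 0.5 * d @ np.linalg.solve(S, d) + 0.5 * np.log(np.linalg.det(2 * np.pi * S))
assert np.isclose(bivariate_nll(1.3, -1.1, muX, muY, sigX, sigY, rho), ref)
```

Both paths agree term by term: the 0.5*log det(2*pi*S) part expands to log(2*pi) + log(sigX*sigY) + 0.5*log(1 - rho^2), which is exactly the log(sigX*sigY) - 0.5*log(ohr) + log(2*pi) tail above.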

Haoran-SONG commented 4 years ago

> We are calculating the -log-likelihood of the ground truth under the bivariate normal distribution as given by the model outputs. The model outputs the means, standard deviations and the correlation coefficient of the bivariate normal distribution

Actually the formula should be corrected by adding log(2π) for the NLL loss. It may not affect the learning phase much, but it results in a deviation of log(2π) in evaluation.

Of course, it's fair for comparing different methods. But if we have a good enough prediction, then the NLL metric may end up less than zero, which doesn't seem right given the definition of "negative log-likelihood".

jmercat commented 4 years ago

NLL values can be negative because the likelihood is a density, not a probability, and can take values greater than 1. Moreover, the missing 0.5 factor is another error; it is not only the missing log(2π) term.
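A quick way to see this: a Gaussian with a small enough standard deviation has a density above 1 at its mode, so the NLL there is negative. A stdlib-only sketch (`gauss_nll` is an illustrative helper, not repo code):

```python
import math

def gauss_nll(x, mu, sigma):
    """NLL of a 1-D Gaussian: 0.5*log(2*pi) + log(sigma) + 0.5*((x - mu)/sigma)**2."""
    return 0.5 * math.log(2 * math.pi) + math.log(sigma) + 0.5 * ((x - mu) / sigma) ** 2

# a sharp prediction: the density at the mean is 1/(0.1*sqrt(2*pi)) ≈ 3.99 > 1,
# so the log-likelihood is positive and the NLL negative
print(gauss_nll(0.0, 0.0, 0.1))  # ≈ -1.3836
```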

zhanghm1819 commented 4 years ago

> We are calculating the -log-likelihood of the ground truth under the bivariate normal distribution as given by the model outputs. The model outputs the means, standard deviations and the correlation coefficient of the bivariate normal distribution

> Actually the formula should be corrected by adding log(2π) for the NLL loss. It may not affect the learning phase much, but it results in a deviation of log(2π) in evaluation.

> Of course, it's fair for comparing different methods. But if we have a good enough prediction, then the NLL metric may end up less than zero, which doesn't seem right given the definition of "negative log-likelihood".

Hi, do you get the same RMSE results as described in this paper (e.g. 4.37 m at 5s)?

Xiaoyu006 commented 4 years ago

> We are calculating the -log-likelihood of the ground truth under the bivariate normal distribution as given by the model outputs. The model outputs the means, standard deviations and the correlation coefficient of the bivariate normal distribution

> Actually the formula should be corrected by adding log(2π) for the NLL loss. It may not affect the learning phase much, but it results in a deviation of log(2π) in evaluation. Of course, it's fair for comparing different methods. But if we have a good enough prediction, then the NLL metric may end up less than zero, which doesn't seem right given the definition of "negative log-likelihood".

> Hi, do you get the same RMSE results as described in this paper (e.g. 4.37 m at 5s)?

Hi @zhanghm1819, what RMSE did you get? My evaluation MSE loss is about 57, whose square root is about 7.5, after one epoch.

Xiaoyu006 commented 4 years ago

@zhanghm1819

Haoran-SONG commented 4 years ago

> We are calculating the -log-likelihood of the ground truth under the bivariate normal distribution as given by the model outputs. The model outputs the means, standard deviations and the correlation coefficient of the bivariate normal distribution

> Actually the formula should be corrected by adding log(2π) for the NLL loss. It may not affect the learning phase much, but it results in a deviation of log(2π) in evaluation. Of course, it's fair for comparing different methods. But if we have a good enough prediction, then the NLL metric may end up less than zero, which doesn't seem right given the definition of "negative log-likelihood".

> Hi, do you get the same RMSE results as described in this paper (e.g. 4.37 m at 5s)?

Yes, the results are quite close to what is reported in the paper if you use the author's implementation. BTW, the NLL loss should be corrected in training and evaluation as @jmercat suggested. (The correction of the loss function does not affect the final RMSE much, though.)

jmercat commented 4 years ago

There is a unit problem too: the NLL is not unitless, and it should be computed in meters, as is done for the RMSE, not in feet. But indeed there is not much impact on the RMSE (which, in my view, is not a good error measure anyway).

nachiket92 commented 4 years ago

Thanks @jmercat for pointing out the bug. Yes, the NLL expression needs two updates: a constant term added and a factor of 2. The RMSE values do not change significantly, nor do the trends in the NLL values. However, the actual NLL values in the results table need an update.

nachiket92 commented 4 years ago

The units issue seems trickier in my opinion.

The likelihood in this case can be assigned a unit (say, meters^(-1) or feet^(-1)) depending on what we're using to represent det(Sigma)^(-0.5).

However assigning a unit to log-likelihood wouldn't really make sense, as log(1 meter) or log (1 foot) makes no physical sense.

Depending on the unit we use to represent det(Sigma)^(-0.5), all NLL values will get offset by some constant. I can't see a clean way to get around this other than being consistent for all models being compared. It's also another reason why I wouldn't attach too much meaning to the actual value of the negative log likelihoods, and just use the metric for comparison.
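The constant offset can be made concrete: rescaling the coordinates shifts the Gaussian NLL by exactly the log of the scale factor per dimension. A stdlib-only 1-D sketch (a hypothetical example, not repo code; in the repo's 2-D case the offset doubles):

```python
import math

def gauss_nll(x, mu, sigma):
    """NLL of a 1-D Gaussian."""
    return 0.5 * math.log(2 * math.pi) + math.log(sigma) + 0.5 * ((x - mu) / sigma) ** 2

FT = 0.3048  # meters per foot
# the same prediction and ground truth, expressed first in feet, then in meters
nll_ft = gauss_nll(10.0, 9.0, 2.0)                # everything in feet
nll_m = gauss_nll(10.0 * FT, 9.0 * FT, 2.0 * FT)  # everything in meters
# the quadratic term is scale-invariant; only log(sigma) shifts, by log(0.3048)
assert abs((nll_m - nll_ft) - math.log(FT)) < 1e-9
```

So switching a model's evaluation from feet to meters lowers every 1-D NLL by about 1.19 nats (about 2.38 in 2-D), without any change in model quality, which is why consistency across compared models matters.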

jmercat commented 4 years ago

I guess you are right that the unit is not really interpretable and it is only a constant offset, but for comparison with other datasets, which are mostly in meters, and for consistency with the RMSE, I think everything should be computed in meters.

nachiket92 commented 4 years ago

utils.py has been updated with the two changes. Thanks again @jmercat

danlouis3 commented 3 years ago

> We are calculating the -log-likelihood of the ground truth under the bivariate normal distribution as given by the model outputs. The model outputs the means, standard deviations and the correlation coefficient of the bivariate normal distribution
>
> I found your calculation
>
>     out = -(torch.pow(ohr, 2) * (torch.pow(sigX, 2) * torch.pow(x - muX, 2) + torch.pow(sigY, 2) * torch.pow(y - muY, 2) - 2 * rho * torch.pow(sigX, 1) * torch.pow(sigY, 1) * (x - muX) * (y - muY)) - torch.log(sigX * sigY * ohr))
>
> does not match the formula on Wikipedia: [image]
>
> Could you answer my doubts?

Could you please link the source for this equation? Thank you!

nachiket92 commented 3 years ago

It is using the same equation. Note that we're taking the negative of the log of f(x,y) shown above. Also sigX and sigY output by the model are the reciprocals of the standard deviations.
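The reciprocal parameterization can be sanity-checked numerically: feeding 1/std into the reciprocal form reproduces the standard bivariate NLL. A NumPy sketch with illustrative function names (not the repo's code):

```python
import numpy as np

def nll_reciprocal(x, y, muX, muY, sigX, sigY, rho):
    """Corrected NLL where sigX, sigY are *reciprocal* std devs (1/std)."""
    dx, dy = x - muX, y - muY
    ohr = 1.0 / (1 - rho ** 2)
    return (0.5 * ohr * (sigX**2 * dx**2 + sigY**2 * dy**2
                         - 2 * rho * sigX * sigY * dx * dy)
            - np.log(sigX * sigY) - 0.5 * np.log(ohr) + np.log(2 * np.pi))

def nll_standard(x, y, muX, muY, stdX, stdY, rho):
    """Textbook bivariate-normal NLL with plain standard deviations."""
    dx, dy = x - muX, y - muY
    ohr = 1.0 / (1 - rho ** 2)
    return (0.5 * ohr * (dx**2 / stdX**2 + dy**2 / stdY**2
                         - 2 * rho * dx * dy / (stdX * stdY))
            + np.log(stdX * stdY) - 0.5 * np.log(ohr) + np.log(2 * np.pi))

# substituting sig = 1/std turns divisions into multiplications and flips the log sign
stdX, stdY = 0.5, 2.0
assert np.isclose(nll_reciprocal(1.3, -1.1, 1.0, -2.0, 1/stdX, 1/stdY, 0.3),
                  nll_standard(1.3, -1.1, 1.0, -2.0, stdX, stdY, 0.3))
```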

danlouis3 commented 3 years ago

> It is using the same equation. Note that we're taking the negative of the log of f(x,y) shown above. Also sigX and sigY output by the model are the reciprocals of the standard deviations.

Thanks for the reply. I think I have most of this figured out, but I have two questions (both may stem from ignorance/a fundamental misunderstanding of the basic concepts).

1) In the output activation, why is standard_devX,Y = e^(sigX,Y)? Reading Graves, I'm not sure I understand why this is done.

2) Is the standard deviation used in the maximum likelihood supposed to come from the "real" variables (the actual future trajectories) rather than the standard deviation of the predicted variables? If so, then why use a model output for standard deviation? Wouldn't that just be based on the prediction rather than the "real" variables?

Thanks again.

stratomaster31 commented 3 years ago

Following up on this interesting topic, I have another concern:

Does the output of the decoder ensure that the predicted covariance matrix of the future trajectory time-steps is positive semi-definite?

Given the predicted sigma_x, sigma_y and rho, is it ensured that the covariance matrix

    Sigma = [ sigma_x^2             rho*sigma_x*sigma_y ]
            [ rho*sigma_x*sigma_y   sigma_y^2           ]

is positive semi-definite? Shouldn't rho be clipped to [-1, 1]? Shouldn't sigma_x be > 0? Shouldn't sigma_y be > 0?

Even if we ensure that the covariance matrix is positive semi-definite, how do we deal with singular covariances, i.e. when |rho| = 1 or sigma_x = 0 or sigma_y = 0? In such cases, won't the NLL be NaN?

jmercat commented 3 years ago

Good questions. Sigma > 0 is ensured by exponential activations, and rho in [-1, 1] by tanh activations. As for the stability issues when |rho| is close to 1, I have questioned this and found a simple solution in my PhD, which will be released soon. I will post a link here then.
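A minimal sketch of such an activation scheme (the column layout is a hypothetical assumption, not the repo's exact ordering):

```python
import numpy as np

def output_activation(raw):
    """Map raw decoder outputs to valid bivariate Gaussian parameters.
    Hypothetical layout: columns 0-1 means, 2-3 log-stds, 4 raw correlation."""
    mu = raw[..., 0:2]
    sigma = np.exp(raw[..., 2:4])   # exp keeps std devs strictly positive
    rho = np.tanh(raw[..., 4])      # tanh keeps the correlation inside (-1, 1)
    return mu, sigma, rho

# any raw output now yields a positive-definite covariance (up to |rho| -> 1 edge cases)
rng = np.random.default_rng(0)
mu, sigma, rho = output_activation(rng.normal(size=(8, 5)))
cov00 = sigma[:, 0] ** 2
cov11 = sigma[:, 1] ** 2
cov01 = rho * sigma[:, 0] * sigma[:, 1]
dets = cov00 * cov11 - cov01 ** 2   # = sx^2 * sy^2 * (1 - rho^2) > 0 while |rho| < 1
assert (sigma > 0).all() and (np.abs(rho) < 1).all() and (dets > 0).all()
```

Note tanh only reaches ±1 asymptotically, so the covariance stays non-singular in exact arithmetic; the numerical trouble as |rho| approaches 1 is the separate stability issue mentioned above.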

stratomaster31 commented 3 years ago

@jmercat I'll be very interested in your work :) Until you release your results, I'm wondering if it would be better to work with rho = 0.

@jmercat I have another conceptual question... what uncertainty is the Gaussian decoder modeling? Epistemic, aleatoric, or both?

jmercat commented 3 years ago

Thanks for your interest. It is modeling both. This is also a result that I show: with a trained model that predicts only one mode as a Gaussian, the covariance of the error computed over many sequences is almost equal to the average predicted covariance that the model estimated for its own error.

stratomaster31 commented 3 years ago

@jmercat Very interesting, I have to process this information though... I'm working with a multimodal model (a CVAE), so I will find your work very suitable.

Given your deep understanding of the field, I'd like to drop another question: at inference time, when no ground truth is available, is the likelihood of the predictions evaluated with the vector of means, so that the exponential part of the likelihood cancels out?

Thanks so much for your time. I really appreciate not only the formal contributions a PhD researcher delivers in the form of papers and theses, but also taking the time to share knowledge through informal channels like this :)

jmercat commented 3 years ago

Thanks, my pleasure. I am not sure I understand your question: you do not estimate the likelihood of the prediction; it cannot be measured. You have to reverse your perspective: you predict a distribution such that the truth should be likely (for that predicted distribution). So you estimate the likelihood of the truth for the given distribution. In the case of a Gaussian, this is done by computing the Gaussian expression, x, y being the ground truth and rho, sigma_x, sigma_y, mu_x, mu_y your prediction: [image: bivariate Gaussian density f(x, y)]

stratomaster31 commented 3 years ago

Yes, this is true, and I think it is a typical point where the concept of likelihood is misleading... What I mean is: given a multimodal Gaussian decoder, which predictions should be considered better? Those with the smaller mean covariance matrix along the T future time steps?

jmercat commented 3 years ago

Oh ok, I get what you mean. No, actually: it is not because your prediction is more or less scattered that it is more or less likely. So you also need to predict a probability score for each mode.

stratomaster31 commented 3 years ago

Then should I add another output to the network with a softmax activation and add a cross-entropy loss to the loss of the model?

Then I will be very concerned about the calibration of the model... which leads us to the fascinating field of Bayesian neural networks, hahaha :)

jmercat commented 3 years ago

Ahah, yes for the new output and the softmax, but you can use the log-sum-exp of the NLL (this is the NLL of a Gaussian mixture) instead of the cross-entropy: [image: NLL of a Gaussian mixture]
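The log-sum-exp form can be sketched for a 1-D mixture; `gmm_nll` is an illustrative helper, not the repo's code (which uses bivariate components):

```python
import numpy as np

def gmm_nll(y, log_pi, mu, sigma):
    """NLL of scalar y under a 1-D Gaussian mixture.
    log_pi: log mixture weights (e.g. a log-softmax over the new mode-score head).
    mu, sigma: per-mode means and std devs. Uses a stable log-sum-exp."""
    log_comp = (-0.5 * np.log(2 * np.pi) - np.log(sigma)
                - 0.5 * ((y - mu) / sigma) ** 2)   # per-mode log N(y | mu_k, sigma_k)
    a = log_pi + log_comp
    m = a.max()                                    # shift for numerical stability
    return -(m + np.log(np.exp(a - m).sum()))

# with a single mode (weight 1) this reduces to the plain Gaussian NLL
single = gmm_nll(0.7, np.array([0.0]), np.array([0.5]), np.array([1.2]))
plain = 0.5 * np.log(2 * np.pi) + np.log(1.2) + 0.5 * ((0.7 - 0.5) / 1.2) ** 2
assert np.isclose(single, plain)
```

Minimizing this trains the mode weights and the per-mode Gaussians jointly, so no separate cross-entropy term is needed.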

stratomaster31 commented 3 years ago

This is a fantastic piece of information! :) In this equation, is K for the mode? What is (i)? And is n_mix the number of future time steps?

I was thinking about using a cross-entropy loss, generating hard labels by assigning a probability of 1 to the prediction with the lowest MAE/MSE during training (evaluated with the means of the estimated Gaussians). Do you think this could work too?

jmercat commented 3 years ago

The k and (i) here can be forgotten: k is the time step and (i) stands for the i-th sequence in the dataset. n_mix is the number of modes (the number of mixture components in the Gaussian mixture).

Your idea could also work, and I think this is what is done in the code of Deo here. It might lose its meaning as a maximization of the likelihood... but it allows you to train only the most probable mode, which improves mode diversity in some cases.
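The hard-assignment idea could be sketched as follows (a hypothetical 1-D illustration of training only the closest mode plus a cross-entropy term with the hard label, not the repo's exact maneuver-masking scheme):

```python
import numpy as np

def best_mode_loss(y, mu, sigma, log_pi):
    """Winner-takes-all style loss: pick the mode whose mean is closest to the
    ground truth, train its Gaussian NLL, and push the mode probability toward
    that mode via cross-entropy with the hard label."""
    k = int(np.argmin((y - mu) ** 2))              # hard label from squared error
    nll_k = (0.5 * np.log(2 * np.pi) + np.log(sigma[k])
             + 0.5 * ((y - mu[k]) / sigma[k]) ** 2)
    return nll_k - log_pi[k]                       # per-mode NLL + cross-entropy

loss = best_mode_loss(1.0, np.array([0.9, -3.0]), np.array([0.5, 0.5]),
                      np.log(np.array([0.5, 0.5])))
assert np.isfinite(loss)
```

Unlike the log-sum-exp mixture NLL, only the selected mode receives gradient for its Gaussian parameters, which is what helps the remaining modes stay diverse.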

stratomaster31 commented 3 years ago

You're right, Deo's work is multimodal in the sense of the maneuver, and he's masking with the most probable predicted maneuver... it is not the same multimodality as that of a generative model, but it helps to understand the formula for the NLL of the mixture of Gaussians. It is worth mentioning that my CVAE is not very diverse... I'm training it as a deterministic decoder with an MAE loss...

@jmercat Thanks so much, I won't take more of your time at the moment... I'll process all of this information and look forward to you publishing your work! Thank you, it has been very elucidating! Congrats! Have you already published any papers?

stratomaster31 commented 3 years ago

Hi again,

> Thanks, my pleasure. I am not sure I understand your question: you do not estimate the likelihood of the prediction; it cannot be measured. You have to reverse your perspective: you predict a distribution such that the truth should be likely (for that predicted distribution). So you estimate the likelihood of the truth for the given distribution. In the case of a Gaussian, this is done by computing the Gaussian expression, x, y being the ground truth and rho, sigma_x, sigma_y, mu_x, mu_y your prediction: [image]

Just for clarification, is the likelihood that is being maximized in the training of the DNN

    L = p(Y | theta)    or    L = p(theta | Y),

with Y the ground truth and theta the output vector of the DNN?

From your comment it seems to be the first one...

jmercat commented 3 years ago

If Y is the ground truth then it is p(Y | theta), because, knowing your model and its input, the distribution defined with theta describes the future (the prediction), and the truth Y should be likely for that distribution.

stratomaster31 commented 3 years ago

Thanks again. I'm just wondering how complex it would be (provided it makes sense) to compute a posterior probability using conjugate priors... what do you think about that?

jmercat commented 3 years ago

I do not think this can be computed, and even if you did, I don't believe it would benefit you... Y is only the one future that occurred, while you would want to find the weights that describe the whole distribution of futures. Moreover, p(theta) and p(Y) are unknown, and I don't see how to evaluate them.

stratomaster31 commented 3 years ago

It is complex, yes... I was just wondering... :)

stratomaster31 commented 3 years ago

And lastly, until you release your PhD results...

> Ahah, yes for the new output and the softmax, but you can use the log-sum-exp of the NLL (this is the NLL of a Gaussian mixture) instead of the cross-entropy. [image]

In order to compute the batch-wise loss, should I sum or average over the time axis? (For the batch axis, obviously average.)

jmercat commented 3 years ago

I chose to average for my part, but this should not have any impact if you change the learning rate accordingly.

stratomaster31 commented 3 years ago

Thanks a lot! I'm just looking at your work on Google Scholar, which is indeed very interesting. I'll dig into your papers :) Thanks for your time again!

Xiejc97 commented 3 years ago

> This is the correct computation:
>
>     eps_rho = 1e-6
>     ohr = 1/(np.maximum(1 - rho * rho, eps_rho)) # avoid infinite values
>     out = 0.5*ohr * (diff_x * diff_x / (sigX * sigX) + diff_y * diff_y / (sigY * sigY) -
>             2 * rho * diff_x * diff_y / (sigX * sigY)) + np.log(
>             sigX * sigY) - 0.5*np.log(ohr) + np.log(np.pi*2)
>
> The sigma values were inverted, which may be changed afterward by replacing the output activations for sigma from exp(x) to exp(-x); this does not affect the results if done consistently. There is an error though: the 0.5 factor and the constant term np.log(np.pi*2) were forgotten. The constant does not affect the gradients, so it is fine for the learning phase, but it should be fixed for evaluation.

Hello! Thank you for offering the correct computation, but I still have the following questions:

  1. In the outputActivation function in https://github.com/nachiket92/conv-social-pooling/blob/master/utils.py#L152, why is sigX converted to the reciprocal of the standard deviation (1/sigX) by the following code? https://github.com/nachiket92/conv-social-pooling/blob/d1abe198d61fef0f0dd4a80aabf74067de9990e0/utils.py#L152

  2. Because the sigX and sigY output by the model are the reciprocals of the standard deviations, the divisions by sigma in out should be replaced with multiplications:

    out = 0.5*ohr * (diff_x * diff_x * sigX * sigX + diff_y * diff_y * sigY * sigY -
            2 * rho * diff_x * diff_y * sigX * sigY) - np.log(
            sigX * sigY) - 0.5*np.log(ohr) + np.log(np.pi*2)

  3. I modified out based on your formula. I get RMSE results close to the paper's, but the NLL value is quite different: after running the program, I get an NLL of 5.3911 (5s), while the paper reports 4.22 (5s). The NLL values at 1s, 2s, 3s and 4s are also quite different from the original paper, so I want to ask what the reason for this is. Can you get the NLL values the paper shows?

The question about the NLL has troubled me for a long time; I hope you can help me answer it. Thanks very much!

jmercat commented 3 years ago

Hi, it might be late but here is a link to my PhD manuscript where I write about this: https://jean-mercat.netlify.app/media/PhD.pdf

xingh15 commented 2 years ago

> utils.py has been updated with the two changes. Thanks again @jmercat

Hi, I think this line of code is incorrect: https://github.com/nachiket92/conv-social-pooling/blob/d1abe198d61fef0f0dd4a80aabf74067de9990e0/utils.py#L199

    0.5*torch.pow(sigY, 2)*torch.pow(y-muY, 2) - rho*torch.pow(sigX, 1)*torch.pow(sigY, 1)*(x-muX)*(y-muY)

This part should be changed to

    torch.pow(sigY, 2)*torch.pow(y-muY, 2) - 2*rho*torch.pow(sigX, 1)*torch.pow(sigY, 1)*(x-muX)*(y-muY)

Same as line 170. @nachiket92 @jmercat

ultimatedigiman commented 2 years ago

Hi, guys

Could you point out how -0.5160 in the line below is calculated?

https://github.com/nachiket92/conv-social-pooling/blob/d1abe198d61fef0f0dd4a80aabf74067de9990e0/utils.py#L227-L228

Chris-ymx commented 2 years ago

> It is using the same equation. Note that we're taking the negative of the log of f(x,y) shown above. Also sigX and sigY output by the model are the reciprocals of the standard deviations.

> Thanks for the reply. I think I have most of this figured out, but I have two questions (both may stem from ignorance/a fundamental misunderstanding of the basic concepts).
>
>   1. In the output activation, why is standard_devX,Y = e^(sigX,Y)? Reading Graves, I'm not sure I understand why this is done.
>   2. Is the standard deviation used in the maximum likelihood supposed to come from the "real" variables (the actual future trajectories) rather than the standard deviation of the predicted variables? If so, then why use a model output for standard deviation? Wouldn't that just be based on the prediction rather than the "real" variables?
>
> Thanks again.

For question 1, the reason for e^(sigX,Y) is to ensure sigX,Y > 0.

zjysteven commented 1 year ago

> Hi, it might be late but here is a link to my PhD manuscript where I write about this: https://jean-mercat.netlify.app/media/PhD.pdf

What's even more helpful (especially to someone like me who just entered this field) than this informative issue discussion is your dissertation. Thank you very much for linking it here @jmercat and impressive work.