jonbarron / robust_loss_pytorch

A pytorch port of google-research/google-research/robust_loss/
Apache License 2.0

[Question] Regarding optimizing the adaptive loss function: optimizer always increments c to reduce loss while trying to overfit. #19

Open Somdyuti2 opened 3 years ago

Somdyuti2 commented 3 years ago

Hi,

I was trying out your work and started by overfitting my model on a single example to get a feel for how it behaves and what I can expect. I might have missed something, but from the functional form of the robust loss proposed in the paper, it seems that the optimizer can always "cheat" by increasing c to reduce the overall loss, without bothering to change the actual network parameters. As c approaches infinity, the normalized residual x/c vanishes and the loss becomes independent of the residual error. While trying to overfit, this is indeed the trend I observe: alpha keeps decreasing and c keeps increasing for some time until optimization converges to an incorrect solution, while the network parameters seem to have changed minimally. Even in the example shown in the repo, both alpha and c decrease monotonically, whereas in such optimization problems we would expect these parameters to wiggle around, at least to some extent.

For the parameter alpha, I think that the paper addresses the monotonicity issue as follows:

In later experiments we will use the NLL of our general distribution − log(p(·|α, c)) as the loss for training our neural networks, not our general loss ρ (·, α, c). Critically, using the NLL allows us to treat α as a free parameter, thereby allowing optimization to automatically determine the degree of robustness that should be imposed by the loss being used during training. To understand why the NLL must be used for this, consider a training procedure in which we simply minimize ρ (·, α, c) with respect to α and our model weights. In this scenario, the monotonicity of our general loss with respect to α (Eq. 12) means that optimization can trivially minimize the cost of outliers by setting α to be as small as possible.

However, what about the monotonicity of c? Even though the adaptive loss function returns the NLL value, the monotonic behavior persists in my case. It might be that the distribution is meaningless because I am overfitting the model, but it is still not clear to me. I am simply trying to transform image A to image B, and I am reshaping the residual to (W×H, 1). Is this how we are supposed to use it for images, or does more need to be done? Please let me know, thanks.
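For reference, here is roughly what my setup looks like (a minimal sketch: the `AdaptiveLossFunction` usage is assumed from this repo's README, and the model and images are just placeholders for my real ones):

```python
import numpy as np
import torch
import robust_loss_pytorch.adaptive

# Placeholder network and images; my actual model is more complex.
model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)
img_a = torch.rand(1, 3, 64, 64)   # input image A
img_b = torch.rand(1, 1, 64, 64)   # target image B

# One loss dimension; residuals are reshaped to (W*H, 1) as described above.
adaptive = robust_loss_pytorch.adaptive.AdaptiveLossFunction(
    num_dims=1, float_dtype=np.float32, device='cpu')

# Both the network weights and the loss's latent alpha/c are optimized jointly.
params = list(model.parameters()) + list(adaptive.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

for step in range(1000):
    optimizer.zero_grad()
    residual = (model(img_a) - img_b).reshape(-1, 1)
    loss = torch.mean(adaptive.lossfun(residual))
    loss.backward()
    optimizer.step()
```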

jonbarron commented 3 years ago

I'm not entirely sure what you're describing, so if you have a colab that reproduces this issue feel free to link it.

I would strongly advise against overfitting to a single datapoint to get a feel for this code/math. The likelihood of a datapoint is going to be maximized by shrinking the loss/distribution into a delta function, which isn't representative of how things will work in practice. This is generally true for all models: if you fit a Gaussian to a single datapoint, you get unreasonably small scale parameters as output.
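To make that concrete, here's a toy sketch in plain PyTorch (nothing from this repo): fit a Gaussian's mean and scale to a single observation by minimizing its NLL, and watch the scale collapse toward zero while the NLL keeps dropping.

```python
import torch

x = torch.tensor([1.7])                          # the single "dataset"
mu = torch.zeros(1, requires_grad=True)          # model prediction (mean)
log_sigma = torch.zeros(1, requires_grad=True)   # log-scale, kept positive via exp()

optimizer = torch.optim.Adam([mu, log_sigma], lr=0.05)
for step in range(2001):
    optimizer.zero_grad()
    # Negative log-likelihood of the single datapoint under N(mu, sigma).
    nll = -torch.distributions.Normal(mu, log_sigma.exp()).log_prob(x).mean()
    nll.backward()
    optimizer.step()
    if step % 500 == 0:
        print(f'step {step:5d}  sigma={log_sigma.exp().item():.4f}  nll={nll.item():.3f}')
```

The mean snaps onto the datapoint and the scale then shrinks without bound, which is the degenerate delta-function behavior described above.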

There may be some confusion about multiple meanings of "monotonicity". In the paper, when I say something is "monotonic", I'm referring to the relationship of the loss with the shape parameter: if the shape parameter increases, the loss (at all values of x) must increase. This is entirely independent of the "monotonicity" of the total loss of your training data decreasing over time. There's no reason for the loss, or for alpha or c to increase or decrease monotonically as optimization proceeds --- sometimes they'll go down, sometimes they'll go up, sometimes they'll wobble around, depending on the problem you're solving.
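As a quick check of that first sense of "monotonic" (this sketch assumes the `robust_loss_pytorch.general.lossfun(x, alpha, scale)` signature shown in the README): at a fixed residual and scale, the loss value grows as alpha grows. That statement says nothing about how alpha, c, or the training loss should evolve over iterations.

```python
import torch
import robust_loss_pytorch.general

x = torch.full((5,), 2.0)                          # fixed residual
scale = torch.ones(5)                              # fixed scale c
alpha = torch.tensor([-2.0, 0.0, 1.0, 1.5, 2.0])   # increasing shape parameters

# Elementwise evaluation of the general loss; values increase along with alpha.
loss = robust_loss_pytorch.general.lossfun(x, alpha, scale)
print(loss)
```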