Closed: holynski closed this 4 years ago
Great contribution, thank you Aleksander! I have been using the following for a while now for this exact reason.
```python
tenNormalize = tenOutput[:, -1:, :, :]
tenNormalize[tenNormalize.abs() < 0.0000001] = 1.0
tenOutput = tenOutput[:, :-1, :, :] / tenNormalize
```
The only difference to yours is that I use `tenNormalize.abs() < 0.0000001`, whereas you use `tenNormalize == 0.0`. Do you think one is preferable over the other?
I think the inequality will still have issues for accumulated metrics smaller than your defined epsilon. If the magnitude of the accumulated metric (`tenNormalize` in your code) is in the range `[0.0000001, inf)`, your version should solve the problem. But if it is in the range `(0, 0.0000001)`, it gets set to 1, meaning you'll be dividing by something much larger than the actual accumulated metric, which prevents `tenOutput` from being scaled back to its correct value.
The edge case I'm imagining is:

- Input image values all equal `100`
- `tensorMetric.exp()` values all equal `1e-100`
- Flow is identity (all zeros)

You'd expect the identity transformation, i.e. to get an image with values `100`. With the inequality version, you'll get `(100 * 1e-100) / 1 ~= 1e-98`.
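A minimal PyTorch sketch of that failure mode (hypothetical values; I use `1e-30` instead of `1e-100` so the metric stays representable in float32, and a 1x1 "image" to keep the tensors tiny):

```python
import torch

# Accumulated output: metric-weighted color in channel 0, summed metric in the last channel.
tenOutput = torch.tensor([[[[100.0 * 1e-30]],   # 100 (color) * 1e-30 (metric)
                           [[1e-30]]]])          # accumulated metric

tenNormalize = tenOutput[:, -1:, :, :].clone()
tenNormalize[tenNormalize.abs() < 0.0000001] = 1.0  # 1e-30 < eps, so it is replaced by 1
tenResult = tenOutput[:, :-1, :, :] / tenNormalize

print(tenResult)  # ~1e-28 instead of the expected 100.0
```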
I think what we want here is just to protect against the divide-by-zero. There's a TensorFlow function for this, but I haven't seen a PyTorch equivalent: https://www.tensorflow.org/api_docs/python/tf/math/divide_no_nan
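For reference, a rough PyTorch stand-in could look like the following (my own sketch using `torch.where`, not an existing PyTorch function):

```python
import torch

def divide_no_nan(tenNumerator, tenDenominator):
    # Return 0 wherever the denominator is exactly zero, like tf.math.divide_no_nan.
    # Divide by a "safe" denominator first so no inf/nan is ever materialized.
    tenSafe = torch.where(tenDenominator == 0.0, torch.ones_like(tenDenominator), tenDenominator)
    tenQuotient = tenNumerator / tenSafe
    return torch.where(tenDenominator == 0.0, torch.zeros_like(tenQuotient), tenQuotient)
```

In the splatting case the numerator is also zero wherever the summed metric is zero (nothing was splatted there), so returning zero at those pixels matches the intended behavior, e.g. `divide_no_nan(tenOutput[:, :-1, :, :], tenOutput[:, -1:, :, :])`.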
That all makes sense to me, thanks again for such a great contribution! :+1:
When dividing the accumulated output tensor by the summed weights, an epsilon is added, presumably to prevent division by zero, which happens when no input values splat to a particular destination pixel.
However, if there are splatted values at a particular destination pixel but the metric values are small enough (say -50 as an input metric, which becomes e^-50 ≈ 2e-22), the addition of `eps=0.0000001` prior to division causes precision errors, effectively clobbering the accumulated metric. With this change, the logic for preventing divide-by-zero is only applied to pixels which have zero splatted metric.
I've tested the change, and it seems to produce the correct result for all metric values. There may be a minor performance hit on certain machines due to the conditional assignment on line 353, but in my tests (`B=32`, `H,W=512`, `C=64` on a 2080 Ti), the timings actually seem to be negligibly faster with this change, not sure why. I've also experimented with the alternative of creating a mask tensor and only evaluating the quotient at non-zero pixels, but that ends up being slightly slower.