`webgpu:shader,validation,expression,call,builtin,distance:values:type="f16";*` test failures

ben-clayton commented 2 months ago

https://github.com/gpuweb/cts/pull/3585 introduced distance() validation tests.

These tests fail for f16 with the error:

  - EXCEPTION: Error: Unexpected validation error occurred: Tint program failure: :4:18 error: '-65504.0 - 0.062469482421875' cannot be represented as 'f16'
    var<private> v = distance(f16(o0), f16(o1));
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^

    :4:18 note: when calculating distance
    var<private> v = distance(f16(o0), f16(o1));

The finite representable range for f16 is [-65504.0 .. +65504.0], so this looks like a valid failure (Tint is correct, CTS is wrong).

The distance() CTS attempts to detect the non-representable intermediate with:

https://github.com/gpuweb/cts/blob/afef918253ae6ba28bee20dfbf5b4e980103949f/src/webgpu/shader/validation/expression/call/builtin/distance.spec.ts#L65-L74

For Type.f16, quantizeFn() calls quantizeToF16(), which in turn calls hfround(), which is implemented as: https://github.com/gpuweb/cts/blob/afef918253ae6ba28bee20dfbf5b4e980103949f/src/external/petamoriken/float16/float16.js#L1219-L1226 https://github.com/gpuweb/cts/blob/afef918253ae6ba28bee20dfbf5b4e980103949f/src/external/petamoriken/float16/float16.js#L344-L349

This does not appear to convert values outside of the finite range to infinity.

Interestingly, the petamoriken/float16 library on GitHub appears substantially different, and does treat these as infinities:

https://github.com/petamoriken/float16/blob/60551c22d2eba68c829e04da8956a917786e7a92/src/_util/converter.mjs#L32-L63

@zoddicus - Should we try to update our version of petamoriken/float16 ?

toji commented 2 months ago

Well, for one I just realized that my Windows box at work doesn't have F16 support, which surprises me (and is why I missed this initially.)

Despite that, I was investigating this issue yesterday as part of the mix validation tests and found exactly what you pointed out above. I'll add to it, though, that quantizeToF16(65503) and lower returns 65472, quantizeToF16(65504) through quantizeToF16(65535) all return 65504, and quantizeToF16(65536) and higher will return Infinity. That feels like it could be a reasonable interpretation of the term "Quantize", but it obviously differs from what Tint is expecting.

toji commented 2 months ago

Interestingly, the petamoriken/float16 library on GitHub appears substantially different, and does treat these as infinities:

FYI: This linked to the implementation of roundToFloat16, which is distinct from the roundToFloat16Bits function that we use (roundToFloat16Bits is still in the float16 library and is unchanged).

That doesn't change the question of "should we update our version of the lib", but it does mean that simply pulling in the new version won't fix it. We have to update the call site too.

toji commented 2 months ago

@zoddicus, what was the process used for building the float16.js found in external/float16? I was looking at updating it but it's not a file from the releases that the author provides.

zoddicus commented 2 months ago

I agree that we should pull in an update to float16, since it has been ~2 years since it was done.

Been a while since that I did this, but I am pretty sure that float16.js is the browser/float16.js file that is generated via running yarn run build from the instructions in the f16 repo.

float16.d.ts I think is just a renamed version of index.d.ts

toji commented 2 months ago

Hm... well, updating to the newer float16 lib got us closer to Tint's behavior, but it's still not quite the same.

The updated library uses roundToFloat16 for it's hfround behavior, which I didn't realize but is theoretically what we want. However, calling hfround with up to 65519 still returns 65504, while 65520 and higher return Infinity. This is due to this portion of the function, which is called before the bounds check:

const temp = (1 + FLOAT16_EPSILON_DEVIDED_BY_EPSILON) * absolute;
const result = temp - (temp - absolute);

This means that we still fail on cases like a=-65504;b=0.062469482421875

But if I add an explicit bounds check to quantizeToF16 like so:

export function quantizeToF16(num: number): number {
  if (num > kValue.f16.positive.max) { return Infinity; }
  if (num < kValue.f16.negative.min) { return -Infinity; }
  return hfround(num);
}

The we become overly aggressive and start expecting failures on cases like a=-65504;b=5.960464477539063e-8 where Tint doesn't produce any errors.

sigh. Closer, but not exact.

zoddicus commented 2 months ago

I think you may have hit one of these may or may not succeed cases, as I discussed offline.

Specifically the infinitely precise result of -65504.0 - 0.062469482421875 is -65504.06246948242, which is within the overfloat limit, -(2^(EMAX(T)+1), where EMAX(f16) is 15 => -65536 (https://www.w3.org/TR/WGSL/#floating-point-overflow). So an implementation can choose to overflow or round into the end.

zoddicus commented 2 months ago

So I think this is specifically a case we shouldn't be testing in the CTS, because an implementation may or may not pass validation for it.

toji commented 2 months ago

Do you have a recommendation for how to detect and avoid such cases?

ben-clayton commented 2 months ago

@zoddicus 's guidance here is to hand-craft the cases so that you avoid the ambiguous zones. That would mean building a bunch of cases with hard-coded numbers with a hard coded expected pass / expected fail.

zoddicus commented 2 months ago

(per our oob conversation) You also might be able to do something like modify youe existing code to distinguish by, is if the values always stay in bounds for the specific precision if should alway pass and if the values ever go outside of +/-2^(EMAX(T) + 1), (or does something nasty like div by 0) then it must always fail. Else just skip that case, since it might be a maybe.

But I am not sure if that is going to be easier to get right then just hand calculate values.

gpuweb / cts

`webgpu:shader,validation,expression,call,builtin,distance:values:type="f16";*` test failures #3623