Request description
The spec for reduce_precision states:
Performs element-wise conversion of operand to another floating-point type
that uses exponent_bits and mantissa_bits and back to the original
floating-point type and produces an output tensor.
More formally:
The mantissa bits of the original value are updated to round the original
value to the nearest value representable with mantissa_bits using
roundToIntegralTiesToEven semantics.
Then, if mantissa_bits are smaller than the number of mantissa bits of
the original value, the mantissa bits are truncated to mantissa_bits.
Then, if the exponent bits of the intermediate result don't fit into the
range provided by exponent_bits, the intermediate result overflows to
infinity using the original sign or underflows to zero using the
original sign.
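To make the two steps concrete, here is a rough Python sketch of the "More formally" procedure as literally written, under one possible reading (rounding is applied to the operand's own mantissa field), specialized to a float32 operand. The function name and the float32 specialization are mine, NaN handling is omitted, and this is not the StableHLO reference implementation:

```python
import struct

def reduce_precision_f32(x, exponent_bits, mantissa_bits):
    """Illustrative sketch of the two steps above for a float32 operand."""
    bits = struct.unpack("<I", struct.pack("<f", float(x)))[0]
    sign, rest = bits & 0x8000_0000, bits & 0x7FFF_FFFF

    # Step 1: round the 23 mantissa bits down to mantissa_bits, ties to even.
    drop = 23 - mantissa_bits
    if drop > 0:
        half = 1 << (drop - 1)
        rest += half - 1 + ((rest >> drop) & 1)  # round-half-to-even adjustment
        rest &= ~((1 << drop) - 1)               # clear the dropped bits

    # Step 2: clamp to the exponent range implied by exponent_bits.
    exp = (rest >> 23) - 127                # unbiased exponent after rounding
    max_exp = 2 ** (exponent_bits - 1) - 1  # e.g. 15 for exponent_bits=5
    min_exp = 1 - max_exp                   # smallest normal exponent; target subnormals not modeled
    if exp > max_exp:
        rest = 0x7F80_0000                  # overflow to infinity
    elif exp < min_exp:
        rest = 0                            # underflow to zero
    return struct.unpack("<f", struct.pack("<I", sign | rest))[0]

# 70000.0 exceeds the float16-like range (max finite ~65504), so it maps to inf:
print(reduce_precision_f32(70000.0, exponent_bits=5, mantissa_bits=10))
```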
The initial description of what occurs (converting to another floating-point type and then converting back) is not equivalent to what is described in the "More formally" section (first the value is rounded based on reducing the mantissa bits, then overflow/underflow is checked based on the exponent bits). This is because rounding based on mantissa_bits can introduce an underflow to zero that would not occur if the number of exponent bits is higher than in the input type.
For example, given the f8e5m2 input 2**-16 (represented with bits 0 00000 01), if reduce_precision is called with exponent_bits=6 and mantissa_bits=1, the output should still be 2**-16, since a hypothetical IEEE-like f8e6m1 type has a high enough dynamic range to represent that number. However, the "More formally" section would first round based on the number of mantissa bits, which would round the number to 0. f8e5m2 is used here because the example is easier to follow with a narrow type, but the same issue applies to all floating-point types.
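A small, self-contained Python sketch contrasting the two readings on this example; the helper name and the per-reading arithmetic are my illustration of the text, not the reference implementation:

```python
def round_field_ties_to_even(field, src_bits, dst_bits):
    """Round an integer mantissa field from src_bits to dst_bits, ties to even.
    Assumes dst_bits < src_bits."""
    drop = src_bits - dst_bits
    kept, rest = field >> drop, field & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    if rest > half or (rest == half and kept & 1):
        kept += 1
    return kept

x = 2.0 ** -16  # f8e5m2 bits 0 00000 01: a subnormal with mantissa field 0b01

# Reading 1 ("More formally"): round the source mantissa field to 1 bit first.
# 0b01 is exactly halfway between 0b00 and 0b10, so ties-to-even keeps 0b00 and
# the value becomes 0.0; the later exponent-range check cannot bring it back.
kept = round_field_ties_to_even(0b01, src_bits=2, dst_bits=1)
reading_1 = kept * 2.0 ** -15  # the single remaining subnormal bit has weight 2**-15

# Reading 2 (round-trip through a hypothetical IEEE-like f8e6m1): with 6 exponent
# bits the smallest normal value is 2**-(2**5 - 2) == 2**-30, so 2**-16 is a
# normal number (1.0 * 2**-16) and survives the round trip exactly.
reading_2 = x

print(reading_1, reading_2)  # 0.0 vs 1.52587890625e-05: the two readings disagree
```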
CC @ghpvnist. Since the semantics of convert state that "the behavior is TBD" for float-to-float conversions, I would say the same here, as the semantics will be identical to the float-to-float case (at least when the destination float type has an Inf representation). Or simply remove the "More formally" part.
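For context, the op is exposed to users through JAX as jax.lax.reduce_precision. A brief usage sketch for the unambiguous normal-number case (the corner cases above depend on how this spec question is resolved):

```python
import jax.numpy as jnp
from jax import lax

x = jnp.asarray(1.0 + 2.0 ** -20, dtype=jnp.float32)
# Emulate float16 rounding (5 exponent bits, 10 mantissa bits) of a float32 value.
# Under round-to-nearest-even this is expected to give exactly 1.0, since 2**-20
# is below half a ULP of 1.0 at 10 mantissa bits (2**-11).
y = lax.reduce_precision(x, exponent_bits=5, mantissa_bits=10)
print(y)
```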
Thanks @reedwm for the issue.
I am planning to fix this sometime in Q2'23. As per the offline follow-up, it seems the resolution may be delayed a bit.
upd: un-assigning for now.