msgpack / msgpack-c

MessagePack implementation for C and C++ / msgpack.org[C/C++]
Other
3.01k stars 875 forks source link

Unpacking float32 does not preserve signaling NaN, unlike float64 #1087

Open fumoboy007 opened 1 year ago

fumoboy007 commented 1 year ago

Describe the bug float32/float64 NaNs have a quiet vs. signaling state. This state is encoded using a couple bits in the significand.

Unpacking a float64 signaling NaN using msgpack-c 6.0.0 preserves the quiet vs. signaling state. However, unpacking a float32 signaling NaN always results in a quiet NaN even if the packed value is a signaling NaN. The behavior is inconsistent between float32 and float64.

To Reproduce

static bool is_quiet_nan(double nan_val) {
    uint64_t bit_pattern = reinterpret_cast<uint64_t&>(nan_val);
    int is_quiet_bit_index = DBL_MANT_DIG - 2;
    return (bit_pattern >> is_quiet_bit_index) & 1;
}

TEST(MSGPACKC, simple_buffer_float_signaling_nan)
{
    if (!numeric_limits<float>::has_signaling_NaN) {
        return;
    }

    float val = numeric_limits<float>::signaling_NaN();

    msgpack_sbuffer sbuf;
    msgpack_sbuffer_init(&sbuf);
    msgpack_packer pk;
    msgpack_packer_init(&pk, &sbuf, msgpack_sbuffer_write);
    msgpack_pack_float(&pk, val);

    msgpack_zone z;
    msgpack_zone_init(&z, 2048);
    msgpack_object obj;
    msgpack_unpack_return ret =
        msgpack_unpack(sbuf.data, sbuf.size, NULL, &z, &obj);

    ASSERT_EQ(MSGPACK_UNPACK_SUCCESS, ret);
    ASSERT_EQ(MSGPACK_OBJECT_FLOAT32, obj.type);

    EXPECT_TRUE(isnan(obj.via.f64));
    EXPECT_FALSE(is_quiet_nan(obj.via.f64));
#if defined(MSGPACK_USE_LEGACY_NAME_AS_FLOAT)
    EXPECT_TRUE(isnan(obj.via.dec));
    EXPECT_FALSE(is_quiet_nan(obj.via.dec));
#endif // MSGPACK_USE_LEGACY_NAME_AS_FLOAT

    msgpack_zone_destroy(&z);
    msgpack_sbuffer_destroy(&sbuf);
}

TEST(MSGPACKC, simple_buffer_double_signaling_nan)
{
    if (!numeric_limits<double>::has_signaling_NaN) {
        return;
    }

    double val = numeric_limits<double>::signaling_NaN();

    msgpack_sbuffer sbuf;
    msgpack_sbuffer_init(&sbuf);
    msgpack_packer pk;
    msgpack_packer_init(&pk, &sbuf, msgpack_sbuffer_write);
    msgpack_pack_double(&pk, val);

    msgpack_zone z;
    msgpack_zone_init(&z, 2048);
    msgpack_object obj;
    msgpack_unpack_return ret =
        msgpack_unpack(sbuf.data, sbuf.size, NULL, &z, &obj);

    ASSERT_EQ(MSGPACK_UNPACK_SUCCESS, ret);
    ASSERT_EQ(MSGPACK_OBJECT_FLOAT64, obj.type);
    ASSERT_EQ(MSGPACK_OBJECT_FLOAT, obj.type);
#if defined(MSGPACK_USE_LEGACY_NAME_AS_FLOAT)
    ASSERT_EQ(MSGPACK_OBJECT_DOUBLE, obj.type);
#endif // MSGPACK_USE_LEGACY_NAME_AS_FLOAT

    EXPECT_TRUE(isnan(obj.via.f64));
    EXPECT_FALSE(is_quiet_nan(obj.via.f64));
#if defined(MSGPACK_USE_LEGACY_NAME_AS_FLOAT)
    EXPECT_TRUE(isnan(obj.via.dec));
    EXPECT_FALSE(is_quiet_nan(obj.via.dec));
#endif // MSGPACK_USE_LEGACY_NAME_AS_FLOAT

    msgpack_zone_destroy(&z);
    msgpack_sbuffer_destroy(&sbuf);
}

Expected behavior Either both tests should pass or both tests should fail. Currently, simple_buffer_double_signaling_nan passes but simple_buffer_float_signaling_nan fails.

Root cause The unpacker returns float32 values as double. This conversion implicitly causes the float32 signaling NaN to become a float64 quiet NaN.

Possible solution The unpacker could return float32 values as float instead of converting them to double. This might be a breaking change because callers may need to update their usage to check a different msgpack_object_union field. However, in addition to resolving this issue, it could also be more convenient for callers because they wouldn’t need to cast the double field back to float.

redboltz commented 1 year ago

Thank you for reporting the issue. I got confused. The code seems that C not C++. But your link to Root cause is C++ code. Now C and C++ are completely different branch and different implementation.

Which one has the isssue C or C++, or both? It seems that code is for C.

redboltz commented 1 year ago

I study about signaling/quiet NaN. Now I understand the issue. It is common in C and C++.

I consider fixing the issue with minimal code change.

This code demonstrates the issue (signaling NaN becomes quiet NaN when convert from float to double), and another possible fix. https://godbolt.org/z/Yehe6GMGG

Your link indicates the issue is caused by the following function. I agree.

inline void unpack_float(float d, msgpack::object& o)
{ o.type = msgpack::type::FLOAT; o.via.f64 = d; }

When I apply the minimal fix, then the code would be as follows:

inline void unpack_float(float d, msgpack::object& o)
{ 
    o.type = msgpack::type::FLOAT; 
    o.via.f64 = d;

    if (isnan(d)) { // could be `if (d != d) { ` for porablility
        if (!is_quiet_nan(d)) {
            clear_quiet(o.via.f64);
        }
    }
}

What do you think ?

Bitimage print version: https://godbolt.org/z/rqaPx4b34

redboltz commented 1 year ago

I ran my poc code on windows/msvc2019.

Here is the result:

has check
float has_quiet_NaN      :1
float has_signaling_NaN  :1
double has_quiet_NaN     :1
double has_signaling_NaN :1

float quiet/signaling check
is_quiet_nan(fqn)        :1
is_quiet_nan(fsn)        :1

double quiet/signaling check
is_quiet_nan(dqn)        :1
is_quiet_nan(dsn)        :1

float to double converted quiet/signaling check
isnan(dqn)               :1
isnan(dsn)               :1
is_quiet_nan(dqn)        :1
is_quiet_nan(dsn)        :1 invalid
bitimage fsn             :01111111110000000000000000000001
bitimage fixed_dsn before:0111111111111000000000000000000000100000000000000000000000000000
bitimage fixed_dsn after :0111111111111000000000000000000000100000000000000000000000000000
isnan(fixed_dsn)         :1
is_quiet_nan(fixed_dsn)  :1 valid

Linux result:

has check
float has_quiet_NaN      :1
float has_signaling_NaN  :1
double has_quiet_NaN     :1
double has_signaling_NaN :1

float quiet/signaling check
is_quiet_nan(fqn)        :1
is_quiet_nan(fsn)        :0

double quiet/signaling check
is_quiet_nan(dqn)        :1
is_quiet_nan(dsn)        :0

float to double converted quiet/signaling check
isnan(dqn)               :1
isnan(dsn)               :1
is_quiet_nan(dqn)        :1
is_quiet_nan(dsn)        :1 invalid
bitimage fsn             :01111111101000000000000000000000
bitimage fixed_dsn before:0111111111111100000000000000000000000000000000000000000000000000
bitimage fixed_dsn after :0111111111110100000000000000000000000000000000000000000000000000
isnan(fixed_dsn)         :1
is_quiet_nan(fixed_dsn)  :0 valid

float signaling NaN


SExp     Fraction(Significand)
01111111110000000000000000000000 Windows quiet     NaN
01111111110000000000000000000000 Linux   quiet     NaN
01111111110000000000000000000001 Windows signaling NaN
01111111101000000000000000000000 Linux   signaling NaN
if Exponent is all 1
   if Fraction is 0
      infinity (Sign is still enabled)
   else
      NaN *1
else
   normal value

*1

It seems that only Fraction != 0 is defined. So there are many variations.

I think that your solution, introducing float type to object union, is better than my approach.

However, there is still problem.

  1. Create signaling NaN on Linux
  2. pack it
  3. send it to Windows
  4. unpack it on Windows, then got quiet NaN *2

similarlly

  1. Create signaling NaN on Windows
  2. pack it
  3. send it to Linux
  4. unpack it on Linux, then got quiet NaN *2

*2

I'm still not sure how to distinguish quiet or signal.

I think that the following code is not portable. Works well on Linux but doesn't work on Windows:

inline bool is_quiet_nan(float nan_val) {
    std::uint32_t bit_pattern = reinterpret_cast<std::uint32_t&>(nan_val);
    int is_quiet_bit_index = FLT_MANT_DIG - 2;
    return (bit_pattern >> is_quiet_bit_index) & 1;
}
redboltz commented 1 year ago

Perhaps msgpack-c could have a database of double/float signaling/quiet NaN bit representation on common operating systems, like Linux, macosx, Windows. If they doesn't have conflicts, then unpack function could treat them well.

Conflict means OS-A quiet NaN bit representation is the same as OS-B signaling NaN bit representation.

Is there any information about quiet/signaling NaN bit repserentation ?

redboltz commented 1 year ago

I testes on msvc2022 too. Then got the same result as msvc2019.

I found a document: https://learn.microsoft.com/en-us/cpp/build/ieee-floating-point-representation?view=msvc-170#nan---not-a-number

Quiet NaNs have a leading one in the significand, and get propagated through an expression. They represent an indeterminate value, such as the result of dividing by infinity, or multiplying an infinity by zero. Signaling NaNs have a leading zero in the significand.

SExp     Fraction(Significand)
01111111110000000000000000000000 Windows quiet     NaN
01111111110000000000000000000001 Windows signaling NaN

It seems that the document and result are not mached.

redboltz commented 1 year ago

I wrote a PR to fix this issue on C++. Then some of CI doesn't pass. They are 32bit built using -m32 option. So I've checked the behavior using godbolt.

32bit

https://godbolt.org/z/8ETYd3Gz4

                         :SExp     Fraction(Significand)
bitimage fqn             :01111111110000000000000000000000
bitimage fsn             :01111111111000000000000000000000
                         :SExp        Fraction(Significand)
bitimage dqn             :0111111111111000000000000000000000000000000000000000000000000000
bitimage dsn             :0111111111111100000000000000000000000000000000000000000000000000

64bit

https://godbolt.org/z/aa8Wxoaao

                         :SExp     Fraction(Significand)
bitimage fqn             :01111111110000000000000000000000
bitimage fsn             :01111111101000000000000000000000
                         :SExp        Fraction(Significand)
bitimage dqn             :0111111111111000000000000000000000000000000000000000000000000000
bitimage dsn             :0111111111110100000000000000000000000000000000000000000000000000

64 bit is expected behavior. The left most bit of Fraction(Significand) indicates quiet(1) or signaling(0). But 32 bit is both 1. But the 2nd left most bit seems to indicate quiet(0) or signaling(1). It is very inconsintent for me. Am I missing something? My printing way or reinterrupt_cast would be undefined behavior ??

redboltz commented 1 year ago

https://en.wikipedia.org/wiki/IEEE_754#2019

Binary

... For NaNs, quiet NaNs and signaling NaNs are distinguished by using the most significant bit of the trailing significand field exclusively,[d] and the payload is carried in the remaining bits.

Perhaps the encoding rule for signaling/quiet NaN is very recently decided. If it is, there are various kinds of encoding... So msgpack-c pack/unpack using its platform one. So introducing f32 is good.

How do I write test for that ...

try rising signal ??

redboltz commented 1 year ago

I checked signal using fetestexcept.

signaling NaN with floating operation results:

Compiler arch float double godbolt
clang++ 64 raised raised https://godbolt.org/z/nP697jvTG
clang++ 32 not raised not raised https://godbolt.org/z/hhs56dnnz
g++ 64 raised raised https://godbolt.org/z/x18svh5Yr
g++ 32 not raised not raised https://godbolt.org/z/384YddYqn
arch type Fraction(Significand) MSB2bit
64bit signaling 01
64bit quiet 10
32bit signaling 11
32bit quiet 10
fumoboy007 commented 1 year ago

Thanks for looking into this @redboltz!

I think the Windows issue you encountered may just be an issue with the Windows C++ standard library implementation: https://developercommunity.visualstudio.com/t/stdnumeric-limitssignaling-nan-returns-quiet-nan/155064. Perhaps a floating point operation that generates an sNAN still has the correct bit pattern even if numeric_limits::signaling_NaN() does not?

Is the 32-bit CPU issue you encountered the same as the above Windows issue?

Your solution of clearing the is_quiet bit seems like it should work well. That being said, introducing an f32 field seems like the cleanest solution to me.

fumoboy007 commented 1 year ago

Actually, the Windows issue is a 32-bit CPU issue, not the other way around! More details here.

From what I understood, if a function returns an sNAN, it will always get converted to a qNAN (and raise a floating point exception) because the calling conventions on x86 specify that floating point values are returned in the x87 registers, which are inside the FPU. Hence, numeric_limits::signaling_NaN() always returns a qNAN on x86.