rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
8.38k stars 896 forks source link

[BUG] TYPES_TEST failure with gcc7.3 #1544

Closed pradghos closed 5 years ago

pradghos commented 5 years ago

During Build, I have observed build failure with commit - c81152e424871b5f00a777d6bca6ccde3301f536 Build log snippet:

ll/include -I$BUILD_PREFIX/include -I$SRC_DIR/cpp/build/include -I$SRC_DIR/cpp/include -I$SRC_DIR/cpp -I$SRC_DIR/cpp/src -I$SRC_DIR/cpp/thirdparty/cub -I$SRC_DIR/cpp/thirdparty/jitify -I$SRC_DIR/cpp/thirdparty/dlpack/include -I$SRC_DIR/cpp/build/arrow/install/include -I$SRC_DIR/cpp/build/arrow/build/flatbuffers_ep-prefix/src/flatbuffers_ep-install/include -I$PREFIX/include  -fvisibility-inlines-hidden  -fmessage-length=0 -mcpu=power8 -mtune=power8 -mpower8-fusion -mpower8-vector -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O3 -pipe -I$PREFIX/include -fdebug-prefix-map=$SRC_DIR=/usr/local/src/conda/libcudf-0.7.0.dev0 -fdebug-prefix-map=$PREFIX=/usr/local/src/conda-prefix -Werror -O3 -DNDEBUG -fPIE   -std=gnu++14 -o CMakeFiles/TYPES_TEST.dir/types/types_test.cpp.o -c $SRC_DIR/cpp/tests/types/types_test.cpp
/work/cpp/tests/types/types_test.cpp: In member function 'virtual void WrappersTestBool8_BooleanOperatorsBool8_Test::TestBody()':

/work/cpp/tests/types/types_test.cpp:243:23: error: narrowing conversion of '-42' from 'int' to 'char' inside { } [-Wnarrowing]
     cudf::bool8 w4{-42};
                       ^
/work/cpp/tests/types/types_test.cpp: In member function 'virtual void WrappersTestBool8_CastArithmeticTest_Test::TestBody()':
/opt/anaconda3/conda-bld/libcudf_1556628699139/work/cpp/tests/types/types_test.cpp:276:23: error: narrowing conversion of '-42' from 'int' to 'char' inside { } [-Wnarrowing]
     cudf::bool8 w2{-42};
                       ^

I have added -Wno-narrowing in cpp/CMakeLists.txt and able to build with the change. However I can see test failures -

      Start 20: TYPES_TEST
20/29 Test #20: TYPES_TEST .......................***Failed    2.49 sec

Detailed Log for the failures-

[       OK ] WrappersTest/0.UnwrapFundamentalTest (0 ms)
[ RUN      ] WrappersTest/0.ArithmeticOperators
[       OK ] WrappersTest/0.ArithmeticOperators (1 ms)
[ RUN      ] WrappersTest/0.CompoundAssignmentOperators
/opt/anaconda3/conda-bld/libcudf_1556630025815/work/cpp/tests/types/types_test.cpp:313: Failure
      Expected: t0
      Which is: 3
To be equal to: static_cast<UnderlyingType>(w0.value)
      Which is: 209652396
/opt/anaconda3/conda-bld/libcudf_1556630025815/work/cpp/tests/types/types_test.cpp:313: Failure
      Expected: t0
      Which is: 0
To be equal to: static_cast<UnderlyingType>(w0.value)
      Which is: 924231285
/opt/anaconda3/conda-bld/libcudf_1556630025815/work/cpp/tests/types/types_test.cpp:313: Failure
      Expected: t0
      Which is: 0
To be equal to: static_cast<UnderlyingType>(w0.value)
      Which is: 441365315
/opt/anaconda3/conda-bld/libcudf_1556630025815/work/cpp/tests/types/types_test.cpp:313: Failure
      Expected: t0
      Which is: 0
To be equal to: static_cast<UnderlyingType>(w0.value)
      Which is: 192771779
/opt/anaconda3/conda-bld/libcudf_1556630025815/work/cpp/tests/types/types_test.cpp:313: Failure
      Expected: t0
      Which is: 2
To be equal to: static_cast<UnderlyingType>(w0.value)
      Which is: -327900151
/opt/anaconda3/conda-bld/libcudf_1556630025815/work/cpp/tests/types/types_test.cpp:313: Failure
      Expected: t0
      Which is: 1
To be equal to: static_cast<UnderlyingType>(w0.value)
      Which is: 626610453

Please suggest if this is already a known issue. Thank you !

jrhemstad commented 5 years ago

This looks like a legitimate issue.

cudf::bool8 w4{-42}; is rightfully complaining about trying to initialize with a larger type (narrowing conversion). We treat warnings as errors, so I'm not sure how that warning slipped past CI.

This could be fixed a few ways: cudf::bool8 w4 = -42; // cast is implicit cudf::bool8 w4{static_cast<bool8::value_type>(-42)}; // cast is explicit

However, none of that should affect the correctness of the tests, so those failures are concerning. What OS and gcc versions are you using?

pradghos commented 5 years ago

Its is using GCC 7.3.0 conda tool chain on Ubuntu 18.04 (ppc64le)

pradghos commented 5 years ago

I can see another warning during work/cpp/src/utilities/device_atomics.cuh(955): warning: invalid narrowing conversion from "signed char" to "char" But its not stopping us.

pradghos commented 5 years ago

@jrhemstad Thanks for your suggestion - I am able to build successfully.

However, test case is still failing.

20/29 Test #20: TYPES_TEST .......................***Failed    2.42 sec

Code change:-

diff --git a/cpp/tests/types/types_test.cpp b/cpp/tests/types/types_test.cpp
index 95e793ea..59075fc2 100644
--- a/cpp/tests/types/types_test.cpp
+++ b/cpp/tests/types/types_test.cpp
@@ -240,7 +240,7 @@ TEST_F(WrappersTestBool8, BooleanOperatorsBool8)
     EXPECT_TRUE(w2 >= w3);
     EXPECT_TRUE(w2 <= w3);

-    cudf::bool8 w4{-42};
+    cudf::bool8 w4{static_cast<cudf::bool8::value_type>(-42)};
     cudf::bool8 w5{43};

     EXPECT_TRUE(w4 == w4);
@@ -273,10 +273,10 @@ TEST_F(WrappersTestBool8, BooleanOperatorsBool8)
 TEST_F(WrappersTestBool8, CastArithmeticTest)
 {
     cudf::bool8 w1{42};
-    cudf::bool8 w2{-42};
+    cudf::bool8 w2{static_cast<cudf::bool8::value_type>(-42)};

     bool t1{42};

is there any PR going to include this change or I will create a separate PR to update fix. Please let me know. Thanks!

pradghos commented 5 years ago

Summery of the failure for TYPES_TEST

[       OK ] DispatcherTest.UnsuportedTypesTest (0 ms)
[----------] 4 tests from DispatcherTest (1 ms total)

[----------] Global test environment tear-down
[==========] 58 tests from 24 test cases ran. (30400 ms total)
[  PASSED  ] 54 tests.
[  FAILED  ] 4 tests, listed below:
[  FAILED  ] WrappersTest/0.CompoundAssignmentOperators, where TypeParam = cudf::detail::wrapper<int, (gdf_dtype)11>
[  FAILED  ] WrappersTest/1.CompoundAssignmentOperators, where TypeParam = cudf::detail::wrapper<long, (gdf_dtype)10>
[  FAILED  ] WrappersTest/2.CompoundAssignmentOperators, where TypeParam = cudf::detail::wrapper<int, (gdf_dtype)8>
[  FAILED  ] WrappersTest/3.CompoundAssignmentOperators, where TypeParam = cudf::detail::wrapper<long, (gdf_dtype)9>

 4 FAILED TESTS
jrhemstad commented 5 years ago

is there any PR going to include this change or I will create a separate PR to update fix. Please let me know. Thanks!

Hopefully @harrism will be able to take a look at why these tests are failing and include the fix for the narrowing warning along with whatever is required to fix the test failures.

harrism commented 5 years ago

Hi @pradghos thanks for the report. I'm working on this, since I wrote the test recently that is failing. I need to fire up my Ubuntu 18 machine to test the failures. I'm not able to repro on Ubuntu 16 with g++ 5.4

harrism commented 5 years ago

@pradghos I'm unable to reproduce this on an Intel CPU with Ubuntu 18.04 GCC 7.4.

It compiles fine as-is, and the tests all pass. Suspect this is an issue on Power only?

pradghos commented 5 years ago

@harrism : Thanks for looking into the issue. I will try to take look further. is this feature and test case added recently. could you please point me the PR or commit id? That would be great.

jrhemstad commented 5 years ago

@harrism : Thanks for looking into the issue. I will try to take look further. is this feature and test case added recently. could you please point me the PR or commit id? That would be great.

The particular tests that are failing for you have existed for a while, but they were modified slightly in 0ba835b0b73cbe3284a24dfc3d7fca9691004e7b which was introduced by https://github.com/rapidsai/cudf/pull/1142

Specifically:

EXPECT_EQ(t0, w0.value);

was changed to:

EXPECT_EQ(t0, static_cast<UnderlyingType>(w0.value));

The purpose of these tests is verify the correct functionality of our wrapper types. See https://github.com/rapidsai/cudf/blob/branch-0.7/cpp/src/utilities/wrapper_types.hpp#L31

These wrappers are effectively a "strong" type definition for a new type that works just like a scalar, and implements the usual operators you'd expect for a scalar. Each wrapper wraps a single element of a fundamental type, e.g., an int or a double.

These particular tests are failing for you verify that operator+= works as expected. I can't think of a reason why adding these casts would cause the tests to fail.

kevingerman commented 5 years ago

The test fails for me when I try to build CUDF inside of a nightly devel image using a modified build.sh and environment to work with ASV benchmarking tests.

The simplest reproduction I could come up with...

Inside of nvidia-docker run -ti rapidsai/rapidsai-nightly:cuda10.0-devel-ubuntu18.04 bash

Clone a clean copy of cudf( the one at /rapids/cudf would probably work as well)

some shenanigans to make ci/gpu/build.sh work outside of gpuci

Then run the build

harrism commented 5 years ago

@kevingerman did the build fail for you , or just the TYPES_TEST? I was not able to reproduce the build failure in a fresh build yesterday. Will try again.

harrism commented 5 years ago

After second sed command I get

sed: -e expression #1, char 35: invalid reference \1 on `s' command's RHS

No speaky sed

harrism commented 5 years ago

I looked into this. @jrhemstad I'm stumped, can you have look at the following? On the failing test built with GCC 7.x, This is the line that fails, and only for the wrapper types other than cudf::bool8 (e.g. datetime64):

https://github.com/rapidsai/cudf/blob/b4ca36791679554d28830ab2fd15b3a57fea85fd/cpp/tests/types/types_test.cpp#L313

If I remove the static_cast<UnderlyingType>(), it passes:

EXPECT_EQ(t0, w0.value);

I don't understand this. UnderlyingType should be TypeParam::value_type, shouldn't it?

kevingerman commented 5 years ago

Apologies for the obscure commands. Those are pasted from my history. Probably some cut and paste mangling happened.

First sed command is changing the default conda environment from gdf to rapids

The second sed command is postfixing, rather than prefixing, the PATH with /conda/bin and /usr/local/cuda/bin so that the python used to build CUDF is from the intended conda env, rather than the default shipped with conda.

The third sed command is stripping the “.*” off the back of the env var CUDA_REL so that when that var is used later in the script to install NVStrings and RMM from the nightly repo it finds the intended repo.

These are all just to get the script to work in a nightly devel image. From there, and with the correct toolkit for that image, installed in the current conda env the build script exits at that unit test. It is the only unit test to fail in this scenario.

jrhemstad commented 5 years ago

I looked into this. @jrhemstad I'm stumped, can you have look at the following? On the failing test built with GCC 7.x, This is the line that fails, and only for the wrapper types other than cudf::bool8 (e.g. datetime64):

https://github.com/rapidsai/cudf/blob/b4ca36791679554d28830ab2fd15b3a57fea85fd/cpp/tests/types/types_test.cpp#L313

If I remove the static_cast<UnderlyingType>(), it passes:

EXPECT_EQ(t0, w0.value);

I don't understand this. UnderlyingType should be TypeParam::value_type, shouldn't it?

This looks totally fine to me.

I don't have the means to look into it further since I can't reproduce the issue.

kevingerman commented 5 years ago

So if it works for most build configurations (nightly, developer machines...) It might be that myself and op have an issue in our build environment that is creating the issue. Though it seems like something else should fail if its a version mismatch in libraries and or tools.

jrhemstad commented 5 years ago

Okay, thanks to help from @kevingerman I'm able to reproduce on the cuda9.2-devel-ubuntu18.04 container. I'll start diving in to see what the issue is.

@pradghos I assume since you're running on POWER that you're running on bare metal, correct?

jrhemstad commented 5 years ago

Been digging into this all day and haven't made much progress.

To get a reproducible and compact failure, I've been running:

./gtests/TYPES_TEST --gtest_break_on_failure

Which yields:

...
[ RUN      ] WrappersTest/0.CompoundAssignmentOperators
/rapids/cudf/cpp/tests/types/types_test.cpp:319: Failure
      Expected: t0
      Which is: 3
To be equal to: static_cast<UnderlyingType>(w0.value)
      Which is: 209652396

Here is the failure: https://github.com/rapidsai/cudf/blob/branch-0.7/cpp/tests/types/types_test.cpp#L313

This is the concise version of the code in question:

        UnderlyingType t0{this->rand()};
        UnderlyingType const t1{this->rand()};

        TypeParam w0{t0};
        TypeParam const w1{t1};

        if( 0 != t1) {
            t0/=t1;
            w0/=w1;
            EXPECT_EQ(t0, static_cast<UnderlyingType>(w0.value));
        }

Printing the values of t0 and t1 shows:

t0 == 1209609620
t1 == 398764591

The correct result of integer division of t0/t1 is 3. However, the result we're seeing is: 209652396

w0/=w1 is invoking:

https://github.com/rapidsai/cudf/blob/branch-0.7/cpp/src/utilities/wrapper_types.hpp#L181

I added print statements inside this function, and both lhs.value and rhs.value match what is expected, but the result is erroneous: 209652396.

At my wits end, I created an isolated version of the code that is currently failing.

See: https://wandbox.org/permlink/WHMg29N799OPxBgq

I can't reproduce no matter what compiler version I choose.

I also tried compiling that code in the cudf_dev environment on cuda9.2-devel-ubuntu18.04 with both g++7.3 and nvcc9.2.

I am as of yet unable to reproduce outside of running the TYPES_TEST directly in the cuda9.2-devel-ubuntu18.04 container.

Oh, and using a DEBUG build makes all of the TYPES_TEST tests pass.

jrhemstad commented 5 years ago

Note that @rlratzel is seeing the same behavior in the CentOS nightly build using gcc7.2.

harrism commented 5 years ago

I tried the master branch in the container and found that TYPES_TEST did not fail in the previous release. So I'm currently bisecting to find the point of failure (likely in my GDF_BOOL8 PR, but it's not clear how yet).

harrism commented 5 years ago

This is the commit where I believe I introduced the failure. https://github.com/rapidsai/cudf/pull/1142/commits/0ba835b0b73cbe3284a24dfc3d7fca9691004e7b

Here I added the "TypeToUse" template helper for the tests, and also the static_cast<UnderlyingType> in the CompoundAssignmentTest. That cast is required in order to make the test pass for the cudf::bool8 wrapper. I don't know why the cast makes it fail for other wrapper types (and only on gcc 7.x).

harrism commented 5 years ago

I fixed the failure by doing a different comparison for bool8 and the other wrapper types. https://github.com/rapidsai/cudf/pull/1631

jrhemstad commented 5 years ago

@eyalroz reports the same failure on gcc6.3 https://github.com/rapidsai/cudf/issues/1623#issuecomment-489303657

eyalroz commented 5 years ago

@jrhemstad : Actually, this may not be the same bug as mine - since I don't get warnings about narrowing conversions when building TYPES_TEST.

harrism commented 5 years ago

The test failures were not related to the narrowing conversions. Nobody has repro'd the compiler warnings for narrowing conversions. I think your test failures were the same, @eyalroz, try the latest branch-0.7 for a fix.


From: Eyal Rozenberg notifications@github.com Sent: Sunday, May 5, 2019 12:17:10 AM To: rapidsai/cudf Cc: Mark Harris; Mention Subject: Re: [rapidsai/cudf] [BUG] TYPES_TEST failure with gcc7.3 (#1544)

@jrhemstadhttps://github.com/jrhemstad : Actually, this may not be the same bug as mine - since I don't get warnings about narrowing conversions when building TYPES_TEST.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/rapidsai/cudf/issues/1544#issuecomment-489330931, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AAF7FXPIHAYSUU6O4SUVZPLPTWLGNANCNFSM4HJL5GWA.


This email message is for the sole use of the intended recipient(s) and may contain confidential information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.

pradghos commented 5 years ago

Thank you looking into failures. Now I have picked up the latest release 0.7.1 and TYPE_TEST worked fine for me. However, I did below changes to avoid error: narrowing conversion of '-42' from 'int' to 'char' inside during build which was not reproducible in other environment.

      Start 21: TYPES_TEST
21/31 Test #21: TYPES_TEST .......................   Passed    1.66 sec
-    cudf::bool8 w4{-42};
+    cudf::bool8 w4{static_cast<cudf::bool8::value_type>(-42)};
     cudf::bool8 w5{43};

     EXPECT_TRUE(w4 == w4);
@@ -273,7 +273,7 @@ TEST_F(WrappersTestBool8, BooleanOperatorsBool8)
 TEST_F(WrappersTestBool8, CastArithmeticTest)
 {
     cudf::bool8 w1{42};
-    cudf::bool8 w2{-42};
+    cudf::bool8 w2{static_cast<cudf::bool8::value_type>(-42)};

@jrhemstad @harrism : Please me know if I can create PR for this change. Thank you !

jrhemstad commented 5 years ago

However, I did below changes to avoid error: narrowing conversion of '-42' from 'int' to 'char' inside during build which was not reproducible in other environment.

I believe I know the reason for the narrowing error.

This is because GDF_BOOL8 uses a char for it's underlying storage. char resolves to a signed char on x86 whereas it's unsigned char on POWER. We should be more explicitly setting the type to one or the other to avoid these issues.

pradghos commented 5 years ago

Thanks for the response @jrhemstad @harrism . I will update any other similar warning in - https://github.com/rapidsai/cudf/issues/1751