llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.81k stars 11.91k forks source link

Clang-12 cannot compile CUB and Thrust libraries in CUDA 11.4.1 #51099

Closed srogatch closed 3 years ago

srogatch commented 3 years ago
Bugzilla Link 51757
Resolution WONTFIX
Resolved on Sep 07, 2021 19:12
Version 12.0
OS Linux
CC @zygoloid,@Artem-B

Extended Description

When #include <cub/cub.cuh> or #include <thrust/copy.h> is present in a CUDA source file compiled with clang-12 with CUDA Toolkit 11.4.1, compilation errors appear as in https://stackoverflow.com/questions/69064115/compilation-of-cub-library-with-clang-as-the-cuda-compiler and also the following error for thrust:

In file included from /usr/local/cuda-11.4/include/thrust/copy.h:512: In file included from /usr/local/cuda-11.4/include/thrust/detail/copy_if.h:74: In file included from /usr/local/cuda-11.4/include/thrust/detail/copy_if.inl:20: In file included from /usr/local/cuda-11.4/include/thrust/system/detail/generic/copy_if.h:63: In file included from /usr/local/cuda-11.4/include/thrust/system/detail/generic/copy_if.inl:31: In file included from /usr/local/cuda-11.4/include/thrust/scan.h:1563: In file included from /usr/local/cuda-11.4/include/thrust/detail/scan.inl:29: In file included from /usr/local/cuda-11.4/include/thrust/system/detail/adl/scan_by_key.h:42: In file included from /usr/local/cuda-11.4/include/thrust/system/cuda/detail/scan_by_key.h:34: In file included from /usr/local/cuda-11.4/include/thrust/system/cuda/execution_policy.h:51: In file included from /usr/local/cuda-11.4/include/thrust/system/cuda/detail/partition.h:42: In file included from /usr/local/cuda-11.4/include/thrust/partition.h:1438: In file included from /usr/local/cuda-11.4/include/thrust/detail/partition.inl:26: In file included from /usr/local/cuda-11.4/include/thrust/system/detail/generic/partition.h:169: In file included from /usr/local/cuda-11.4/include/thrust/system/detail/generic/partition.inl:26: In file included from /usr/local/cuda-11.4/include/thrust/sort.h:1361: In file included from /usr/local/cuda-11.4/include/thrust/detail/sort.inl:27: In file included from /usr/local/cuda-11.4/include/thrust/system/detail/adl/sort.h:42: In file included from /usr/local/cuda-11.4/include/thrust/system/cuda/detail/sort.h:36: In file included from /usr/local/cuda-11.4/include/cub/device/device_radix_sort.cuh:40: In file included from /usr/local/cuda-11.4/include/cub/device/dispatch/dispatch_radix_sort.cuh:40: In file included from /usr/local/cuda-11.4/include/cub/device/dispatch/../../agent/agent_radix_sort_histogram.cuh:38: /usr/local/cuda-11.4/include/cub/block/radix_rank_sort_operations.cuh:124:20: error: explicit qualification required to use member 'ProcessFloatMinusZero' from dependent base class return BFE(ProcessFloatMinusZero(key), bit_start, num_bits); ^ /usr/local/cuda-11.4/include/cub/block/block_radix_rank.cuh:413:50: note: in instantiation of member function 'cub::BFEDigitExtractor::Digit' requested here unsigned int digit = digit_extractor.Digit(keys[ITEM]); ^ /usr/local/cuda-11.4/include/cub/device/dispatch/../../block/block_radix_sort.cuh:228:72: note: in instantiation of function template specialization 'cub::BlockRadixRank<256, 6, false, true, cub::BLOCK_SCAN_WARP_SCANS, cudaSharedMemBankSizeFourByte, 1, 1, 750>::RankKeys<unsigned int, 19, cub::BFEDigitExtractor>' requested here AscendingBlockRadixRank(temp_storage.asending_ranking_storage).RankKeys( ^ /usr/local/cuda-11.4/include/cub/device/dispatch/../../block/block_radix_sort.cuh:367:13: note: in instantiation of member function 'cub::BlockRadixSort<unsigned int, 256, 19, unsigned int, 6, true, cub::BLOCK_SCAN_WARP_SCANS, cudaSharedMemBankSizeFourByte, 1, 1, 750>::RankKeys' requested here RankKeys(unsigned_keys, ranks, digit_extractor, is_descending); ^ /usr/local/cuda-11.4/include/cub/device/dispatch/dispatch_radix_sort.cuh:327:40: note: in instantiation of function template specialization 'cub::BlockRadixSort<unsigned int, 256, 19, unsigned int, 6, true, cub::BLOCK_SCAN_WARP_SCANS, cudaSharedMemBankSizeFourByte, 1, 1, 750>::SortBlockedToStriped<0, 0>' requested here BlockRadixSortT(temp_storage.sort).SortBlockedToStriped( ^ /usr/local/cuda-11.4/include/cub/device/dispatch/dispatch_radix_sort.cuh:1562:17: note: in instantiation of function template specialization 'cub::DeviceRadixSortSingleTileKernel<cub::DeviceRadixSortPolicy<unsigned int, unsigned int, int>::Policy800, false, unsigned int, unsigned int, int>' requested here DeviceRadixSortSingleTileKernel<MaxPolicyT, IS_DESCENDING, KeyT, ValueT, OffsetT>); ^ /usr/local/cuda-11.4/include/cub/device/dispatch/../../util_device.cuh:706:28: note: (skipping 12 contexts in backtrace; use -ftemplate-backtrace-limit=0 to see all) return op.template Invoke(); ^ /usr/local/cuda-11.4/include/thrust/system/cuda/detail/sort.h:1672:19: note: in instantiation of function template specialization 'thrust::cuda_cub::__smart_sort::smart_sort<thrust::detail::integral_constant<bool, true>, thrust::detail::integral_constant<bool, false>, thrust::cuda_cub::tag, thrust::detail::normal_iterator<thrust::device_ptr>, thrust::detail::normal_iterator<thrust::device_ptr>, thrust::less>' requested here smart_sort::smart_sort<thrust::detail::true_type, thrust::detail::false_type>( ^ /usr/local/cuda-11.4/include/thrust/system/cuda/detail/sort.h:1740:13: note: in instantiation of function template specialization 'thrust::cuda_cub::sort_by_key<thrust::cuda_cub::tag, thrust::detail::normal_iterator<thrust::device_ptr>, thrust::detail::normal_iterator<thrust::device_ptr>, thrust::less>' requested here cuda_cub::sort_by_key(policy, keys_first, keys_last, values, less()); ^ /usr/local/cuda-11.4/include/thrust/detail/sort.inl:98:10: note: in instantiation of function template specialization 'thrust::cuda_cub::sort_by_key<thrust::cuda_cub::tag, thrust::detail::normal_iterator<thrust::device_ptr>, thrust::detail::normal_iterator<thrust::device_ptr>>' requested here return sort_by_key(thrust::detail::derived_cast(thrust::detail::strip_const(exec)), keys_first, keys_last, values_first); ^ /usr/local/cuda-11.4/include/thrust/detail/sort.inl:285:18: note: in instantiation of function template specialization 'thrust::sort_by_key<thrust::cuda_cub::tag, thrust::detail::normal_iterator<thrust::device_ptr>, thrust::detail::normal_iterator<thrust::device_ptr>>' requested here return thrust::sort_by_key(select_system(system1,system2), keys_first, keys_last, values_first); ^ some_source.cu:2659:3: note: in instantiation of function template specialization 'thrust::sort_by_key<thrust::detail::normal_iterator<thrust::device_ptr>, thrust::detail::normal_iterator<thrust::device_ptr>>' requested here sort_by_key(key.begin(), key.begin() + size, val.begin()); ^ /usr/local/cuda-11.4/include/cub/block/radix_rank_sort_operations.cuh:98:52: note: member is declared here static device forceinline__ UnsignedBits ProcessFloatMinusZero(UnsignedBits key)

srogatch commented 3 years ago

I posted 2 issues within this bug report. For one, I provided the Stackoverflow link. For the other, I provided the compiler output. Thanks for commenting on Stackoverflow.

Artem-B commented 3 years ago

The log you've posted does not mention anything about the textures.

The SO appears to be about a different issue, AFAICT. Texture/surface lookup support is indeed not implemented in clang.

However, in this case clang appears to be complaining about a different problem:

error: explicit qualification required to use member 'ProcessFloatMinusZero' from dependent base class return BFE(ProcessFloatMinusZero(key), bit_start, num_bits);

This one does not have anything to do with the textures and it does appear to be an upstream portability issue.

srogatch commented 3 years ago

They confirm what is said on Stackoverflow (at the link in my first comment here): that texture reference from device function to a host texture variable is valid in CUDA. But Clang fails to compile that.

Artem-B commented 3 years ago

I can not access the bug you've referenced. What's their rationale?

srogatch commented 3 years ago

NVIDIA in https://developer.nvidia.com/nvidia_bug/3373556 claims it's not their bug either.

Artem-B commented 3 years ago

It appears to be a cub/thrust portability issue, not an issue with clang.

It would not be the first one -- AFAICT, "Compiles with NVCC" is their largest concern. NVCC is somewhat more lax than clang so this king of errors tends to creep in all the time.

The compilation failure is unfortunate, but there's little we can practically do in clang to help. Relaxing c++ diagnostics in order to be bug-compatible with NVCC is not likely to happen.

This should be fixed upstream in the thrust/cub repos. NVIDIA does accept patches and we did get similar issues fixed in the past. E.g: https://github.com/NVIDIA/thrust/commit/51363575f6d3edffac57d664fc3883cfa0e6d5c7

srogatch commented 3 years ago

assigned to @Artem-B