ledatelescope / bifrost

A stream processing framework for high-throughput applications.
BSD 3-Clause "New" or "Revised" License
64 stars 29 forks source link

Build fails with Thrust 2.1: pinned_allocator.h removed #202

Open torrance opened 1 year ago

torrance commented 1 year ago

pinned_allocator.h was removed as part of pull request https://github.com/NVIDIA/thrust/pull/1611, and a commit referenced from there mentions "Remove thrust::system::cuda::experimental::pinned_allocator.h, which has been deprecated for a long time."

I have no idea what it has been deprecated in favour of. An old issue suggests universal_host_pinned_allocator but this doesn't seem to actually exist anywhere.

What should it be replaced with?

benbarsdell commented 1 year ago

Apparently this is the replacement:

#include <thrust/system/cuda/memory.h>

using pinned_allocator = thrust::mr::stateless_resource_allocator<
    T, thrust::system::cuda::universal_host_pinned_memory_resource>;
torrance commented 1 year ago

At your suggestion, I've made the following change:

diff --git a/src/fft.cu b/src/fft.cu
index eeace96..26cd458 100644
--- a/src/fft.cu
+++ b/src/fft.cu
@@ -44,7 +44,7 @@
 #include "ArrayIndexer.cuh"
 #include <thrust/device_vector.h>
 #include <thrust/host_vector.h>
-#include <thrust/system/cuda/experimental/pinned_allocator.h>
+#include <thrust/system/cuda/memory.h>

 #include <cufft.h>
 #include <cufftXt.h>
@@ -63,9 +63,9 @@ class BFfft_impl {
        bool             _using_load_callback;
        thrust::device_vector<char> _dv_tmp_storage;
        thrust::device_vector<CallbackData> _dv_callback_data;
-       typedef thrust::cuda::experimental::pinned_allocator<CallbackData> pinned_allocator_type;
+       using pinned_allocator_type = thrust::mr::stateless_resource_allocator<CallbackData, thrust::universal_host_pinned_memory_resource>;
        thrust::host_vector<CallbackData, pinned_allocator_type> _hv_callback_data;

And that builds.

However, all FFT-related tests currently fail, specifically those using fftshift, which seems to be exactly where this host_vector is used (?). In these cases, the odata array is all zeros, suggesting something failed to transfer from host to device, or vice versa.

I can't be certain that's the cause, since this is my first time trying to build bifrost, but seems likely.

torrance commented 1 year ago

A little more investigation, and it turns out the entire callback that performs that fftshift isn't running.

What's incredible is that if I add an empty print statement to post_fftshift the callback works and is called:

diff --git a/src/fft_kernels.cu b/src/fft_kernels.cu
index 9aefa89..7ec352c 100644
--- a/src/fft_kernels.cu
+++ b/src/fft_kernels.cu
@@ -28,6 +28,7 @@

 #include "fft_kernels.h"
 #include "cuda.hpp"
+#include "stdio.h"

 __device__
 inline size_t pre_fftshift(size_t        offset,
@@ -56,6 +57,8 @@ inline Complex post_fftshift(size_t        offset,
        // For forward transforms with apply_fftshift=true, we cyclically shift
        //   the output data by phase-rotating the input data here.
        if( cb->do_fftshift && !cb->inverse ) {
+               if (offset == 0) printf("");
+
                for( int d=0; d<cb->ndim; ++d ) {
                        // Compute the index of this element along dimension d
                        // **TODO: 64-bit indexing support

What's more incredible, is that if I add this print statement to the parent function only, in this case callback_load_cf32, it doesn't work and nothing is printed. Only if the print is added to post_fftshift do both print statements print anything at all.

I have no idea what's going on here.

jaycedowell commented 1 year ago

I've updated the self-hosted runner to Ubuntu 20.04 and CUDA 12.0 and I'm now seeing this in the CI. I'm also getting a 'cuda/stream.hpp(85): error: namespace "cuda::std" has no member "runtime_error"' error there as well. Working through those locally, I get Bifrost to build, and I am seeing that all of the test_fft tests are failing with a lot of zero filled results.

I played around with this a little bit and ended up with fewer errors if I changed the declaration of CallbackData in fft_kernels.h to be a struct __attribute__((packed)) CallbackData. I'm not sure why this would matter but I now only get errors on the complex-to-real transform tests.

jaycedowell commented 1 year ago

I think my complex-to-real errors are from an older version of the test suite (I've been testing on "ibverb-support"). As of https://github.com/ledatelescope/bifrost/commit/abee49a98094143d90cd146427822ac893ee3d2f CI looks to be ok.

torrance commented 1 year ago

I'm also getting a 'cuda/stream.hpp(85): error: namespace "cuda::std" has no member "runtime_error"' error there as well

Yes, I got that too and had to make it an absolute import.

I played around with this a little bit and ended up with fewer errors if I changed the declaration of CallbackData in fft_kernels.h to be a struct __attribute__((packed)) CallbackData. I'm not sure why this would matter but I now only get errors on the complex-to-real transform tests.

I can confirm this works for me too, however the compiler complains:

fft_kernels.h:109:13: warning: ignoring packed 
   attribute because of unpacked non-POD field ‘int_fastdiv 
   CallbackData::istrides [3]’
  109 |  int_fastdiv istrides[3]; // Note: Elements, not bytes

...so I'm not sure why it works, especially since the compiler is telling me it's being ignored (!).

jaycedowell commented 1 year ago

I also got that compiler warning. I'm hesitant to call this "solved" since it's not clear why this change makes any difference. But it does seem to yield the correct FFT results and it doesn't appear to cause any problems with earlier versions of CUDA. Maybe this is a "works for me".