hughperkins / coriander

Build NVIDIA® CUDA™ code for OpenCL™ 1.2 devices
Apache License 2.0
842 stars 88 forks

Thrust support? #6

Open andrewcorrigan opened 8 years ago

andrewcorrigan commented 8 years ago

Do you have any plans to get Thrust compiling with cuda-on-cl?

hughperkins commented 8 years ago

Yes! It is a pre-requisite for (re-)porting Torch, which uses it a lot. I will probably look at that after porting Tensorflow across.

I provisionally plan to use Boost.Compute for this. I had a good experience with this earlier https://github.com/boostorg/compute/issues/646 @haahh was exceedingly patient and helpful.

andrewcorrigan commented 8 years ago

Thank you for your response (and this amazing project). That's great news.

Forgive me if I am being obtuse, but I'm not sure I understand what you mean by "I provisionally plan to use Boost.Compute for this". If cuda-on-cl can compile Thrust directly, how would that involve Boost.Compute?

hughperkins commented 8 years ago

Ah. When I say 'compile Thrust', I mean, 'allow programs that use Thrust to run on OpenCL'. Thrust is a proprietary library as far as I know, and no source code is available?

So, I need to re-implement the Thrust APIs somehow, eg using Boost.Compute. As far as I know?

andrewcorrigan commented 8 years ago

Thrust is open source, and code is available. If cuda-on-cl enables compilation of arbitrary CUDA code, then it should just work. But I wouldn't be surprised if Thrust uses advanced or obscure or architecture-specific techniques (especially in the internal library CUB), which might be more challenging to support?

Taking the other route, of re-implementing Thrust in terms of an existing OpenCL library, I guess the crucial issue would be supporting user-defined operators. Thrust accepts arbitrary C++ function objects; Boost.Compute, as I understand it, ultimately stringizes the operator down to OpenCL C code, and is hence much more restrictive.

hughperkins commented 8 years ago

Ah, interesting.

In that case, if Thrust is open source and the source code is available, then compiling it directly using cuda-on-cl seems plausible. I don't suppose it will be quite as fast as a native implementation, but I guess if cuda-on-cl can handle Eigen reductions and so on, it should be able to handle much the same thing in Thrust?

I guess you could try it, on a very simple Thrust test case, and see what happens?

andrewcorrigan commented 8 years ago

Will do, and I would be happy to report the results back here. I don't have an Ubuntu 16.04 system, so building cuda-on-cl didn't work right out of the box. I'm sure I can figure it out though.

hughperkins commented 8 years ago

Ok, let me know if you have any questions/comments. By the way, what system do you have, if you don't mind my asking?

andrewcorrigan commented 8 years ago

Mac laptop, and various CentOS/RHEL 7 workstation and servers

hughperkins commented 8 years ago

Cool. Thanks!

hughperkins commented 8 years ago

Note that a recent merge to master radically overhauls/refactors what is working. You may find that a bunch of things that weren't working before now work fractionally more easily? (You'll probably also find some radically new broken stuff :-P Please let me know about anything broken, along with as much output as you have, e.g. runtime output, build output, OpenCL source, IR source, etc.)

andrewcorrigan commented 8 years ago

Very exciting. It's really easy to compile now on a Mac. I just had to brew install llvm38 and set CLANG_HOME in CMake to /path/to/homebrew/Cellar/llvm38/3.8.1/lib/llvm-3.8.

I'll be testing out Thrust as soon as possible.

hughperkins commented 8 years ago

Cool :-) Great! :-)

hughperkins commented 8 years ago

(added the mac build info at 3a64332 https://github.com/hughperkins/cuda-on-cl/blob/3a643326d37d0da4d664ef07e96b7b7e4561d8d8/README.md#install-clangllvm-38 )

andrewcorrigan commented 8 years ago

I couldn't get a simple Thrust code to compile, and took a step back and tried out your sample, but that gave the error:

➜  cocl -c main.cu
Assertion failed: (!F.isDeclaration() && "Cannot verify external functions"), function verifyFunction, file /private/tmp/llvm38-20161124-51071-1if8vyu/llvm-3.8.1.src/lib/IR/Verifier.cpp, line 4190.
/cocl/bin/cocl_wrapped: line 385: 91690 Abort trap: 6           ${COCL_BIN}/patch-hostside --hostrawfile ${OUTPUTBASEPATH}-hostraw.ll --devicellfile ${OUTPUTBASEPATH}-device.ll --hostpatchedfile ${OUTPUTBASEPATH}-hostpatched.ll
hughperkins commented 8 years ago

Hmmm... I've never seen an error like that. I wonder where to start? The travis build compiles and runs this ok. What if you use clang-3.8.0, rather than 3.8.1? Does that change anything?

hughperkins commented 8 years ago

So, I'm not sure why cuda_sample.cu is not building for you... I started dabbling in building simple things with Thrust. The following gets at least 0.001 seconds into compilation before failing:

git clone git@github.com:hughperkins/thrust.git
cd thrust/examples
cocl -DCUDA_VERSION=7000 -I .. sum.cu

Note that I also had to do:

cd /usr/local/include/cocl
sudo cp cuda_runtime.h cuda_runtime_api.h
sudo cp cuda_runtime.h driver_types.h

This gets me as far as:

../thrust/system/cuda/error.h:57:40: error: use of undeclared identifier 'cudaErrorMissingConfiguration'
  missing_configuration              = cudaErrorMissingConfiguration,
                                       ^
../thrust/system/cuda/error.h:58:40: error: use of undeclared identifier 'cudaErrorMemoryAllocation'
  memory_allocation                  = cudaErrorMemoryAllocation,
...

These error symbols should probably be added, e.g. to cocl_error.h.

andrewcorrigan commented 8 years ago

The example does indeed compile and run with the 3.8.0 binary.

hughperkins commented 8 years ago

Ok, cool. Updated README instructions: https://github.com/hughperkins/cuda-on-cl/commit/b8166c109169066073b1cc9e0ee39b817748a178

hughperkins commented 8 years ago

Added a bunch of error symbols, in c47751d

This gets me as far as:

In file included from ../thrust/system/cuda/detail/bulk/detail/cuda_launcher/cuda_launch_config.hpp:24:
../thrust/detail/minmax.h:26:1: error: unknown type name '__host__'
__host__ __device__
^
../thrust/detail/minmax.h:26:10: warning: variable templates are a C++14
      extension [-Wc++14-extensions]
__host__ __device__
         ^
../thrust/detail/minmax.h:26:20: error: expected ';' at end of declaration
__host__ __device__
                   ^
                   ;
../thrust/detail/minmax.h:27:3: error: unknown type name 'T'
  T min THRUST_PREVENT_MACRO_SUBSTITUTION (const T &lhs, const T &rhs, B...
  ^
../thrust/detail/minmax.h:27:50: error: unknown type name 'T'
  T min THRUST_PREVENT_MACRO_SUBSTITUTION (const T &lhs, const T &rhs, B...
                                                 ^
../thrust/detail/minmax.h:27:64: error: unknown type name 'T'
  T min THRUST_PREVENT_MACRO_SUBSTITUTION (const T &lhs, const T &rhs, B...
                                                               ^
../thrust/detail/minmax.h:27:72: error: unknown type name 'BinaryPredicate'
  T min THRUST_PREVENT_MACRO_SUBSTITUTION (const T &lhs, const T &rhs, BinaryPre...
hughperkins commented 8 years ago

The error about __host__ is kind of weird, since it's defined right at the top of cocl.h, the first thing we do: https://github.com/hughperkins/cuda-on-cl/blob/dev/include/cocl/cocl.h#L10-L14

#if defined(__CUDACC__) || defined(__CUDA_ARCH__)
#define __device__ __attribute__((device))
#define __host__ __attribute__((host))
#else
#include <stdexcept>
#define __host__
#define __device__
#endif

__CUDA_ARCH__ means "are we building device code?": if it is defined, we are building device code; otherwise we are building host-side code. Every .cu file is built twice, once for device-side and once for host-side, and the two are then joined together.

hughperkins commented 8 years ago

Well... __CUDACC__ means something like "are we building using nvcc?", so that probably means we enter the first block, but that would still mean __host__ should appear to the compiler as __attribute__((host)), not as an unknown identifier.

hughperkins commented 8 years ago

(made the build commands appear in the output again, to facilitate analysis 4a0084a )

hughperkins commented 8 years ago

Hmmmm, maybe something to do with https://github.com/thrust/thrust/blob/master/thrust/detail/config/host_device.h#L35-L37

hughperkins commented 8 years ago

Ah: https://github.com/thrust/thrust/blob/master/thrust/system/cuda/detail/guarded_cuda_runtime_api.h#L28-L34

hughperkins commented 8 years ago

Created a thrust branch. https://github.com/hughperkins/cuda-on-cl/compare/thrust?expand=1

hughperkins commented 8 years ago

Added in a few more fixes, https://github.com/hughperkins/cuda-on-cl/compare/thrust?expand=1 . You can see the general approach:

hughperkins commented 8 years ago

(By the way, very important: please do NOT look at the CUDA toolkit, header files etc., as and when you dabble in this. As a general rule, implementation details should only be what can be obtained by looking at the client-side source code, i.e. in this case, Thrust. Specific example:

More generally, if some call doesn't tell us much about a type, just use some random type for now, e.g. size_t for return codes, which is almost certainly wrong. As and when we hit some client-code constraint that imposes additional requirements, like it being signed, or 32-bit, or something, we can refine the type at that point.

)

hughperkins commented 7 years ago

Updates:

cd third_party/thrust/examples
cocl -D__CUDACC__ -D__thrust_hd_warning_disable__ -DCUDA_VERSION=3000 -I . -I .. fill_copy_sequence.cu
./fill_copy_sequence
(screenshot of the run, 2017-06-03)
hughperkins commented 7 years ago

Update:

With full COCL_SPAM enabled:

(screenshot of the COCL_SPAM output, 2017-06-09)

Command to reproduce: from coriander repo root:

cd third_party/thrust/examples
cocl -g -D__CUDACC__ -D__thrust_hd_warning_disable__ -DCUDA_VERSION=3000 -I . -I .. fill_copy_sequence.cu
./fill_copy_sequence
hughperkins commented 7 years ago

So. It turns out that Thrust passes the function into the kernel by value, along with all of its parameters, as some kind of closure.

The kernel:

template<unsigned int block_size, typename Function>
__global__
__bulk_launch_bounds__(block_size, 0)
void launch_by_value(Function f)
{
  f();
}

Making a closure:

template<typename Function, typename Tuple>
class closure
{
  public:
    typedef Function function_type;

    typedef Tuple arguments_type;

    __host__ __device__
    closure(function_type f, const arguments_type &args)
      :f(f),
       args(args)
    {}

    __host__ __device__
    void operator()()
    {
      apply_from_tuple(f,args);
    }

    __host__ __device__
    function_type function() const
    {
      return f;
    }

    __host__ __device__
    arguments_type arguments() const
    {
      return args;
    }

  private:
    function_type   f;
    arguments_type args;
}; // end closure

...

template<typename Function, typename Arg1, typename Arg2, typename Arg3, typename Arg4>
__host__ __device__
closure<
  Function,
  thrust::tuple<Arg1,Arg2,Arg3,Arg4>
>
  make_closure(Function f, const Arg1 &a1, const Arg2 &a2, const Arg3 &a3, const Arg4 &a4)
{
  return closure<Function,thrust::tuple<Arg1,Arg2,Arg3,Arg4> >(f, thrust::make_tuple(a1,a2,a3,a4));
}

Passing the function closure into the runtime api, as just a big by-value block:

    typedef Function task_type;

    inline __host__ __device__
    void launch(unsigned int num_blocks, unsigned int block_size, size_t num_dynamic_smem_bytes, cudaStream_t stream, task_type task)
    {
      struct workaround
      {
        __host__ __device__
        static void supported_path(unsigned int num_blocks, unsigned int block_size, size_t num_dynamic_smem_bytes, cudaStream_t stream, task_type task)
        {
          cudaConfigureCall(dim3(num_blocks), dim3(block_size), num_dynamic_smem_bytes, stream);
          cudaSetupArgument(task, 0);
...

That sounds ... challenging.