hughperkins / coriander

Build NVIDIA® CUDA™ code for OpenCL™ 1.2 devices
Apache License 2.0
842 stars 88 forks

Thrust support? #6

Open andrewcorrigan opened 8 years ago

andrewcorrigan commented 8 years ago

Do you have any plans to get Thrust compiling with cuda-on-cl?

hughperkins commented 8 years ago

Yes! It is a pre-requisite for (re-)porting Torch, which uses it a lot. I will probably look at that after porting Tensorflow across.

I provisionally plan to use Boost.Compute for this. I had a good experience with this earlier https://github.com/boostorg/compute/issues/646 @haahh was exceedingly patient and helpful.

andrewcorrigan commented 8 years ago

Thank you for your response (and this amazing project). That's great news.

Forgive me if I am being obtuse, but I'm not sure I understand what you mean by "I provisionally plan to use Boost.Compute for this". If cuda-on-cl can compile Thrust directly, how would that involve Boost.Compute?

hughperkins commented 8 years ago

Ah. When I say 'compile Thrust', I mean, 'allow programs that use Thrust to run on OpenCL'. Thrust is a proprietary library as far as I know, and no source code is available?

So, I need to re-implement the Thrust APIs somehow, eg using Boost.Compute. As far as I know?

andrewcorrigan commented 8 years ago

Thrust is open source, and code is available. If cuda-on-cl enables compilation of arbitrary CUDA code, then it should just work. But I wouldn't be surprised if Thrust uses advanced or obscure or architecture-specific techniques (especially in the internal library CUB), which might be more challenging to support?

Taking the other route, of re-implementing Thrust in terms of an existing OpenCL library, I guess the crucial issue would be supporting user-defined operators. Thrust accepts arbitrary C++ function objects; Boost.Compute, as I understand it, ultimately stringizes the operator down to OpenCL C code, and is hence much more restrictive.

hughperkins commented 8 years ago

Ah, interesting.

In that case, if Thrust is open source and the source code is available, then compiling it directly using cuda-on-cl seems plausible. I don't suppose it will be quite as fast as a native implementation, but I guess if cuda-on-cl can handle Eigen reductions and so on, it should be able to handle much the same thing in Thrust?

I guess you could try it, on a very simple Thrust test case, and see what happens?

andrewcorrigan commented 8 years ago

Will do, and I would be happy to report the results back here. I don't have an Ubuntu 16.04 system, so building cuda-on-cl didn't work right out of the box. I'm sure I can figure it out though.

hughperkins commented 8 years ago

Ok, let me know if you have any questions/comments. By the way, what system do you have, if you don't mind my asking?

andrewcorrigan commented 8 years ago

Mac laptop, and various CentOS/RHEL 7 workstation and servers

hughperkins commented 8 years ago

Cool. Thanks!

hughperkins commented 8 years ago

Note that a recent merge to master radically overhauls/refactors what is working. You may find that a bunch of things that weren't working before now work fractionally more easily? (You'll probably also find some radically new broken stuff :-P Please let me know about anything broken, along with as much output as you have, e.g. runtime output, build output, OpenCL source, IR source, etc.)

andrewcorrigan commented 8 years ago

Very exciting. It's really easy to compile now on a Mac. I just had to brew install llvm38 and set CLANG_HOME in CMake to /path/to/homebrew/Cellar/llvm38/3.8.1/lib/llvm-3.8.

I'll be testing out Thrust as soon as possible.

hughperkins commented 8 years ago

Cool :-) Great! :-)

hughperkins commented 8 years ago

(added the mac build info at 3a64332 https://github.com/hughperkins/cuda-on-cl/blob/3a643326d37d0da4d664ef07e96b7b7e4561d8d8/README.md#install-clangllvm-38 )

andrewcorrigan commented 8 years ago

I couldn't get a simple Thrust code to compile, and took a step back and tried out your sample, but that gave the error:

➜  cocl -c main.cu
Assertion failed: (!F.isDeclaration() && "Cannot verify external functions"), function verifyFunction, file /private/tmp/llvm38-20161124-51071-1if8vyu/llvm-3.8.1.src/lib/IR/Verifier.cpp, line 4190.
/cocl/bin/cocl_wrapped: line 385: 91690 Abort trap: 6           ${COCL_BIN}/patch-hostside --hostrawfile ${OUTPUTBASEPATH}-hostraw.ll --devicellfile ${OUTPUTBASEPATH}-device.ll --hostpatchedfile ${OUTPUTBASEPATH}-hostpatched.ll
hughperkins commented 8 years ago

Hmmm... I've never seen an error like that. I wonder where to start? The travis build compiles and runs this ok. What if you use clang-3.8.0, rather than 3.8.1? Does that change anything?

hughperkins commented 8 years ago

So, I'm not sure why cuda_sample.cu is not building for you... I started dabbling in building simple things with Thrust. The following gets at least 0.001 seconds into compilation before failing:

git clone git@github.com:hughperkins/thrust.git
cd thrust/examples
cocl -DCUDA_VERSION=7000 -I .. sum.cu

Note that I also had to do:

cd /usr/local/include/cocl
sudo cp cuda_runtime.h cuda_runtime_api.h
sudo cp cuda_runtime.h driver_types.h

This gets me as far as:

../thrust/system/cuda/error.h:57:40: error: use of undeclared identifier 'cudaErrorMissingConfiguration'
  missing_configuration              = cudaErrorMissingConfiguration,
                                       ^
../thrust/system/cuda/error.h:58:40: error: use of undeclared identifier 'cudaErrorMemoryAllocation'
  memory_allocation                  = cudaErrorMemoryAllocation,
...

These error symbols should probably be added, e.g. to cocl_error.h.

andrewcorrigan commented 8 years ago

The example does indeed compile and run with the 3.8.0 binary.

hughperkins commented 8 years ago

Ok, cool. Updated README instructions: https://github.com/hughperkins/cuda-on-cl/commit/b8166c109169066073b1cc9e0ee39b817748a178

hughperkins commented 8 years ago

Added a bunch of error symbols, in c47751d

This gets me as far as:

In file included from ../thrust/system/cuda/detail/bulk/detail/cuda_launcher/cuda_launch_config.hpp:24:
../thrust/detail/minmax.h:26:1: error: unknown type name '__host__'
__host__ __device__
^
../thrust/detail/minmax.h:26:10: warning: variable templates are a C++14
      extension [-Wc++14-extensions]
__host__ __device__
         ^
../thrust/detail/minmax.h:26:20: error: expected ';' at end of declaration
__host__ __device__
                   ^
                   ;
../thrust/detail/minmax.h:27:3: error: unknown type name 'T'
  T min THRUST_PREVENT_MACRO_SUBSTITUTION (const T &lhs, const T &rhs, B...
  ^
../thrust/detail/minmax.h:27:50: error: unknown type name 'T'
  T min THRUST_PREVENT_MACRO_SUBSTITUTION (const T &lhs, const T &rhs, B...
                                                 ^
../thrust/detail/minmax.h:27:64: error: unknown type name 'T'
  T min THRUST_PREVENT_MACRO_SUBSTITUTION (const T &lhs, const T &rhs, B...
                                                               ^
../thrust/detail/minmax.h:27:72: error: unknown type name 'BinaryPredicate'
  T min THRUST_PREVENT_MACRO_SUBSTITUTION (const T &lhs, const T &rhs, BinaryPre...
hughperkins commented 8 years ago

The error about __host__ is kind of weird, since it's defined right at the top of cocl.h, the first thing we do: https://github.com/hughperkins/cuda-on-cl/blob/dev/include/cocl/cocl.h#L10-L14

#if defined(__CUDACC__) || defined(__CUDA_ARCH__)
#define __device__ __attribute__((device))
#define __host__ __attribute__((host))
#else
#include <stdexcept>
#define __host__
#define __device__
#endif

__CUDA_ARCH__ means "are we building device code?": if it is defined, we are building device code; otherwise we are building host-side code. Every .cu file is built twice, once for device-side and once for host-side, and the two are then joined together.

hughperkins commented 8 years ago

Well... __CUDACC__ means something like "are we building using nvcc?", so that probably means we enter the first block, but that would still mean __host__ should appear to the compiler as __attribute__((host)), not as an unknown identifier.

hughperkins commented 8 years ago

(made the build commands appear in the output again, to facilitate analysis 4a0084a )

hughperkins commented 8 years ago

Hmmmm, maybe something to do with https://github.com/thrust/thrust/blob/master/thrust/detail/config/host_device.h#L35-L37

hughperkins commented 8 years ago

Ah: https://github.com/thrust/thrust/blob/master/thrust/system/cuda/detail/guarded_cuda_runtime_api.h#L28-L34

hughperkins commented 8 years ago

Created a thrust branch. https://github.com/hughperkins/cuda-on-cl/compare/thrust?expand=1

hughperkins commented 8 years ago

Added in a few more fixes, https://github.com/hughperkins/cuda-on-cl/compare/thrust?expand=1 . You can see the general approach:

hughperkins commented 8 years ago

(By the way, very important: please do NOT look at the CUDA toolkit, header files etc., as and when you dabble in this. As a general rule, implementation details should only be what can be obtained by looking at the client-side source code, i.e. in this case, Thrust. Specific example:

More generally, if some call doesn't tell us much about a type, just use some random type for now, e.g. size_t for return codes, which is almost certainly wrong. As and when we hit some client-code constraint that imposes additional requirements, like it being signed, or 32-bit, or something, we can refine the type at that point.

)

hughperkins commented 7 years ago

Updates:

cd third_party/thrust/examples
cocl -D__CUDACC__ -D__thrust_hd_warning_disable__ -DCUDA_VERSION=3000 -I . -I .. fill_copy_sequence.cu
./fill_copy_sequence
(screenshot of the run, 2017-06-03)
hughperkins commented 7 years ago

Update:

With full COCL_SPAM enabled:

(screenshot of the COCL_SPAM output, 2017-06-09)

Command to reproduce: from coriander repo root:

cd third_party/thrust/examples
cocl -g -D__CUDACC__ -D__thrust_hd_warning_disable__ -DCUDA_VERSION=3000 -I . -I .. fill_copy_sequence.cu
./fill_copy_sequence
hughperkins commented 7 years ago

So. It turns out that Thrust passes the function into the kernel by value, along with all of its parameters, as some kind of closure.

The kernel:

template<unsigned int block_size, typename Function>
__global__
__bulk_launch_bounds__(block_size, 0)
void launch_by_value(Function f)
{
  f();
}

Making a closure:

template<typename Function, typename Tuple>
class closure
{
  public:
    typedef Function function_type;

    typedef Tuple arguments_type;

    __host__ __device__
    closure(function_type f, const arguments_type &args)
      :f(f),
       args(args)
    {}

    __host__ __device__
    void operator()()
    {
      apply_from_tuple(f,args);
    }

    __host__ __device__
    function_type function() const
    {
      return f;
    }

    __host__ __device__
    arguments_type arguments() const
    {
      return args;
    }

  private:
    function_type   f;
    arguments_type args;
}; // end closure

...

template<typename Function, typename Arg1, typename Arg2, typename Arg3, typename Arg4>
__host__ __device__
closure<
  Function,
  thrust::tuple<Arg1,Arg2,Arg3,Arg4>
>
  make_closure(Function f, const Arg1 &a1, const Arg2 &a2, const Arg3 &a3, const Arg4 &a4)
{
  return closure<Function,thrust::tuple<Arg1,Arg2,Arg3,Arg4> >(f, thrust::make_tuple(a1,a2,a3,a4));
}

Passing the function closure into the runtime api, as just a big by-value block:

    typedef Function task_type;

    inline __host__ __device__
    void launch(unsigned int num_blocks, unsigned int block_size, size_t num_dynamic_smem_bytes, cudaStream_t stream, task_type task)
    {
      struct workaround
      {
        __host__ __device__
        static void supported_path(unsigned int num_blocks, unsigned int block_size, size_t num_dynamic_smem_bytes, cudaStream_t stream, task_type task)
        {
          cudaConfigureCall(dim3(num_blocks), dim3(block_size), num_dynamic_smem_bytes, stream);
          cudaSetupArgument(task, 0);
...

That sounds ... challenging.