andrewcorrigan opened this issue 8 years ago

Do you have any plans to get Thrust compiling with cuda-on-cl?
Yes! It is a pre-requisite for (re-)porting Torch, which uses it a lot. I will probably look at that after porting Tensorflow across.
I provisionally plan to use Boost.Compute for this. I had a good experience with this earlier https://github.com/boostorg/compute/issues/646 @haahh was exceedingly patient and helpful.
Thank you for your response (and this amazing project). That's great news.
Forgive me if I am being obtuse, but I'm not sure I understand what you mean by "I provisionally plan to use Boost.Compute for this". If cuda-on-cl can compile Thrust directly, how would that involve Boost.Compute?
Ah. When I say 'compile Thrust', I mean, 'allow programs that use Thrust to run on OpenCL'. Thrust is a proprietary library as far as I know, and no source code is available?
So, I need to re-implement the Thrust APIs somehow, eg using Boost.Compute. As far as I know?
Thrust is open source, and code is available. If cuda-on-cl enables compilation of arbitrary CUDA code, then it should just work. But I wouldn't be surprised if Thrust uses advanced or obscure or architecture-specific techniques (especially in the internal library CUB), which might be more challenging to support?
Taking the other route, of re-implementing Thrust in terms of an existing OpenCL library, I guess the crucial issue would be supporting user-defined operators. Thrust accepts arbitrary C++ function objects; Boost.Compute, as I understand it, ultimately stringizes the operator down to OpenCL C code, and is hence much more restrictive.
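To make the contrast concrete, here is a minimal sketch (saxpy chosen arbitrarily as the user-defined operator):

#include <thrust/device_vector.h>
#include <thrust/transform.h>

// In Thrust, the operator is an ordinary C++ functor, with captured state (a),
// compiled for the device along with the rest of the program:
struct saxpy_functor {
    float a;
    saxpy_functor(float a_) : a(a_) {}
    __host__ __device__
    float operator()(float x, float y) const { return a * x + y; }
};

int main() {
    thrust::device_vector<float> x(16, 1.0f), y(16, 2.0f);
    // y <- a*x + y
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy_functor(2.0f));
    return 0;
}

// Boost.Compute instead declares the operator body as OpenCL C source,
// stringized by a macro and compiled at runtime, e.g.:
//   BOOST_COMPUTE_FUNCTION(float, saxpy, (float x, float y),
//   { return 2.0f * x + y; });
// so arbitrary host-side C++ (templates, captured objects, ...) cannot pass
// through it the way it passes through a Thrust functor.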
Ah, interesting.
In that case, if Thrust is open source and the code is available, then compiling it directly using cuda-on-cl seems plausible. I don't suppose it will be quite as fast as a native implementation, but I guess if cuda-on-cl can handle Eigen reductions and so on, it should be able to handle almost the same thing in Thrust?
I guess you could try it, on a very simple Thrust test case, and see what happens?
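For instance, a minimal reduction along these lines might be a reasonable first test (a suggestion only, untested here):

#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>

int main() {
    // 100 ones on the device; summing them exercises allocation, a kernel
    // launch, and a device-to-host copy of the result
    thrust::device_vector<int> v(100, 1);
    int sum = thrust::reduce(v.begin(), v.end(), 0, thrust::plus<int>());
    printf("sum = %d (expected 100)\n", sum);
    return 0;
}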
Will do, and I would be happy to report the results back here. I don't have an Ubuntu 16.04 system, so building cuda-on-cl didn't work right out of the box. I'm sure I can figure it out though.
Ok, let me know if you have any questions/comments. By the way, what system do you have, if you don't mind my asking?
Mac laptop, and various CentOS/RHEL 7 workstation and servers
Cool. Thanks!
Note that a recent merge to master radically overhauls/refactors what is working. You may find that a bunch of things that weren't working before work fractionally more easily now? (You'll probably also find some radically new broken stuff :-P Please let me know about anything broken, along with as much output as you have, e.g. program output, build output, OpenCL source, IR source, etc.)
Very exciting. It's really easy to compile now on a Mac. I just had to brew install llvm38 and set CLANG_HOME in CMake to /path/to/homebrew/Cellar/llvm38/3.8.1/lib/llvm-3.8.
I'll be testing out Thrust as soon as possible.
Cool :-) Great! :-)
(added the mac build info at 3a64332 https://github.com/hughperkins/cuda-on-cl/blob/3a643326d37d0da4d664ef07e96b7b7e4561d8d8/README.md#install-clangllvm-38 )
I couldn't get a simple Thrust code to compile, so I took a step back and tried out your sample, but that gave the error:
➜ cocl -c main.cu
Assertion failed: (!F.isDeclaration() && "Cannot verify external functions"), function verifyFunction, file /private/tmp/llvm38-20161124-51071-1if8vyu/llvm-3.8.1.src/lib/IR/Verifier.cpp, line 4190.
/cocl/bin/cocl_wrapped: line 385: 91690 Abort trap: 6 ${COCL_BIN}/patch-hostside --hostrawfile ${OUTPUTBASEPATH}-hostraw.ll --devicellfile ${OUTPUTBASEPATH}-device.ll --hostpatchedfile ${OUTPUTBASEPATH}-hostpatched.ll
Hmmm... I've never seen an error like that. I wonder where to start? The travis build compiles and runs this ok. What if you use clang-3.8.0, rather than 3.8.1? Does that change anything?
So, I'm not sure why cuda_sample.cu is not building for you... I started dabbling in building simple things in thrust. The following gets at least 0.001 seconds into compilation before failing:
git clone git@github.com:hughperkins/thrust.git
cd thrust/examples
cocl -DCUDA_VERSION=7000 -I .. sum.cu
Note that I had to also do:
cd /usr/local/include/cocl
sudo cp cuda_runtime.h cuda_runtime_api.h
sudo cp cuda_runtime.h driver_types.h
This gets me as far as:
../thrust/system/cuda/error.h:57:40: error: use of undeclared identifier 'cudaErrorMissingConfiguration'
missing_configuration = cudaErrorMissingConfiguration,
^
../thrust/system/cuda/error.h:58:40: error: use of undeclared identifier 'cudaErrorMemoryAllocation'
memory_allocation = cudaErrorMemoryAllocation,
...
These error symbols should probably be added, e.g. to cocl_error.h.
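Perhaps something along these lines (a sketch: the names come from the errors above, while the values are deliberate placeholders, not copied from the CUDA toolkit):

// cocl_error.h sketch -- enum values are arbitrary placeholders, to be
// refined if and when client code constrains them:
enum cudaError {
    cudaSuccess = 0,
    cudaErrorMissingConfiguration = 1,
    cudaErrorMemoryAllocation = 2
    // ... plus the remaining cudaError* symbols that
    // thrust/system/cuda/error.h references
};
typedef enum cudaError cudaError_t;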
The example does indeed compile and run with the 3.8.0 binary.
Ok, cool. Updated README instructions: https://github.com/hughperkins/cuda-on-cl/commit/b8166c109169066073b1cc9e0ee39b817748a178
Added a bunch of error symbols, in c47751d
This gets me as far as:
In file included from ../thrust/system/cuda/detail/bulk/detail/cuda_launcher/cuda_launch_config.hpp:24:
../thrust/detail/minmax.h:26:1: error: unknown type name '__host__'
__host__ __device__
^
../thrust/detail/minmax.h:26:10: warning: variable templates are a C++14 extension [-Wc++14-extensions]
__host__ __device__
^
../thrust/detail/minmax.h:26:20: error: expected ';' at end of declaration
__host__ __device__
^
;
../thrust/detail/minmax.h:27:3: error: unknown type name 'T'
T min THRUST_PREVENT_MACRO_SUBSTITUTION (const T &lhs, const T &rhs, B...
^
../thrust/detail/minmax.h:27:50: error: unknown type name 'T'
T min THRUST_PREVENT_MACRO_SUBSTITUTION (const T &lhs, const T &rhs, B...
^
../thrust/detail/minmax.h:27:64: error: unknown type name 'T'
T min THRUST_PREVENT_MACRO_SUBSTITUTION (const T &lhs, const T &rhs, B...
^
../thrust/detail/minmax.h:27:72: error: unknown type name 'BinaryPredicate'
T min THRUST_PREVENT_MACRO_SUBSTITUTION (const T &lhs, const T &rhs, BinaryPre...
The error about __host__ is kind of weird, since it's defined right at the top of cocl.h, first thing we do... https://github.com/hughperkins/cuda-on-cl/blob/dev/include/cocl/cocl.h#L10-L14
#if defined(__CUDACC__) || defined(__CUDA_ARCH__)
#define __device__ __attribute__((device))
#define __host__ __attribute__((host))
#else
#include <stdexcept>
#define __host__
#define __device__
#endif
__CUDA_ARCH__ means "are we building device code?" => if defined, we are building device code. Otherwise we are building host-side code. Every .cu file is built twice: once for device-side, once for host-side, and the two are then joined together.
Well... __CUDACC__ means something like "are we building using nvcc?", so that probably means we enter the first block, but that would still mean __host__ should be magically appearing to the compiler as __attribute__((host)), not __host__
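To illustrate the two-pass model (a generic CUDA idiom, not cocl-specific code):

// Each .cu file is compiled twice; __CUDA_ARCH__ tells the two passes apart.
__host__ __device__ inline const char *which_pass() {
#ifdef __CUDA_ARCH__
    return "device-side pass";  // this branch survives in the device compile
#else
    return "host-side pass";    // and this one in the ordinary host C++ compile
#endif
}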
(made the build commands appear in the output again, to facilitate analysis 4a0084a )
Hmmmm, maybe something to do with https://github.com/thrust/thrust/blob/master/thrust/detail/config/host_device.h#L35-L37
Created a thrust branch: https://github.com/hughperkins/cuda-on-cl/compare/thrust?expand=1
That fixes the __host__ errors. Next:
../thrust/system/cuda/detail/bulk/future.hpp:148:37: error: use of undeclared identifier 'cudaEventDisableTiming'
static const int create_flags = cudaEventDisableTiming;
^
../thrust/system/cuda/detail/bulk/future.hpp:151:5: error: unknown type name 'cudaEvent_t'
cudaEvent_t m_event;
^
../thrust/system/cuda/detail/bulk/future.hpp:169:17: error: unknown type name 'cudaEvent_t'
inline static cudaEvent_t event(const future<void> &f)
^
These should probably be defined in cocl_event.h and cocl_event.cc (or similar-sounding names).
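For instance (a sketch only: the handle type and the flag value are placeholder guesses, deliberately not taken from the CUDA headers):

// cocl_event.h sketch -- names from the Thrust errors above; the types and
// values are provisional until client code constrains them:
typedef void *cudaEvent_t;                    // opaque handle; real type TBD
static const int cudaEventDisableTiming = 2;  // value is a guess, not CUDA's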
Added in a few more fixes, https://github.com/hughperkins/cuda-on-cl/compare/thrust?expand=1 . You can see the general approach:
(By the way, very important: please do NOT look at the CUDA toolkit, header files, etc., as and when you dabble in this. As a general rule, implementation details should be only what can be obtained by looking at the client-side source code, i.e. in this case, thrust. cudaEventDisableTiming is a specific example, and so on. More generally, if some call doesn't tell us much about a type, just use some random type for now, e.g. size_t for return codes, which is almost certainly wrong. As and when we hit some client-code constraint that pins the type down further, like it's signed, or 32-bit, or something, we can refine it at that point.)
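A hypothetical illustration of that placeholder approach (the function and signature below are invented for illustration, not cocl's actual code):

#include <cstddef>

// Deliberately provisional: we don't yet know what cudaError_t really is,
// so pick something and refine it when client code complains.
typedef size_t cudaError_t;   // almost certainly wrong; refine later

// Hypothetical stub -- parameter types guessed from the call site alone:
cudaError_t cudaEventCreateWithFlags(void **event, unsigned int flags) {
    *event = 0;               // no-op placeholder
    return 0;                 // "success"
}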
Updates:
Added thrust to the coriander repo. fill_copy_sequence.cu compiles into bytecode now, though it doesn't run yet. You can reproduce this result by doing, from the root of the coriander repo:
cd third_party/thrust/examples
cocl -D__CUDACC__ -D__thrust_hd_warning_disable__ -DCUDA_VERSION=3000 -I . -I .. fill_copy_sequence.cu
./fill_copy_sequence
Update:
With full COCL_SPAM enabled:
Command to reproduce, from the coriander repo root:
cd third_party/thrust/examples
cocl -g -D__CUDACC__ -D__thrust_hd_warning_disable__ -DCUDA_VERSION=3000 -I . -I .. fill_copy_sequence.cu
./fill_copy_sequence
So. It turns out that Thrust passes the function into the kernel by value, along with all the parameters, as some kind of closure.
The kernel:
template<unsigned int block_size, typename Function>
__global__
__bulk_launch_bounds__(block_size, 0)
void launch_by_value(Function f)
{
  f();
}
Making a closure:
template<typename Function, typename Tuple>
class closure
{
  public:
    typedef Function function_type;
    typedef Tuple arguments_type;

    __host__ __device__
    closure(function_type f, const arguments_type &args)
      : f(f),
        args(args)
    {}

    __host__ __device__
    void operator()()
    {
      apply_from_tuple(f, args);
    }

    __host__ __device__
    function_type function() const
    {
      return f;
    }

    __host__ __device__
    arguments_type arguments() const
    {
      return args;
    }

  private:
    function_type f;
    arguments_type args;
}; // end closure
...
template<typename Function, typename Arg1, typename Arg2, typename Arg3, typename Arg4>
__host__ __device__
closure<
  Function,
  thrust::tuple<Arg1,Arg2,Arg3,Arg4>
>
make_closure(Function f, const Arg1 &a1, const Arg2 &a2, const Arg3 &a3, const Arg4 &a4)
{
  return closure<Function,thrust::tuple<Arg1,Arg2,Arg3,Arg4> >(f, thrust::make_tuple(a1,a2,a3,a4));
}
Passing the function closure into the runtime api, as just a big by-value block:
typedef Function task_type;

inline __host__ __device__
void launch(unsigned int num_blocks, unsigned int block_size, size_t num_dynamic_smem_bytes, cudaStream_t stream, task_type task)
{
  struct workaround
  {
    __host__ __device__
    static void supported_path(unsigned int num_blocks, unsigned int block_size, size_t num_dynamic_smem_bytes, cudaStream_t stream, task_type task)
    {
      cudaConfigureCall(dim3(num_blocks), dim3(block_size), num_dynamic_smem_bytes, stream);
      cudaSetupArgument(task, 0);
      ...
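Boiled down, the mechanism looks something like this standalone sketch (simplified, invented names; plain CUDA, no Thrust):

#include <cstdio>

// The functor and its captured arguments travel together in one struct:
template<typename Function, typename Arg>
struct closure {
    Function f;
    Arg arg;
    __device__ void operator()() { f(arg); }
};

struct print_functor {
    __device__ void operator()(int x) const { printf("x = %d\n", x); }
};

// The entire closure arrives as a single kernel argument, passed by value --
// an opaque block of bytes that cocl would have to marshal across to OpenCL:
template<typename Closure>
__global__ void launch_by_value(Closure c) {
    c();
}

int main() {
    closure<print_functor, int> c = { print_functor(), 42 };
    launch_by_value<<<1, 1>>>(c);
    cudaDeviceSynchronize();
    return 0;
}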
That sounds ... challenging.