headmyshoulder / odeint-v2

odeint - solving ordinary differential equations in c++ v2
http://headmyshoulder.github.com/odeint-v2/

odeint with VexCL: implicit instantiation of vector_space_reduce template #92

Closed ds283 closed 9 years ago

ds283 commented 11 years ago

Apologies if this is a dumb question.

I am attempting to use odeint together with VexCL to solve an ensemble of ODEs. To do so, I'd like to use a dense output stepper with a VexCL object as the state type.

However, I am struggling to make this work. I believe I ought to be able to write something like

auto stepper = make_dense_output(1e-06, 1e-06,
  runge_kutta_dopri5< state_type, double, state_type, double,
    boost::numeric::odeint::vector_space_algebra,
    boost::numeric::odeint::default_operations>() );

where for my application, state_type would be a vex::multivector.

However, when I do this, clang complains

/opt/local/include/boost/numeric/odeint/algebra/vector_space_algebra.hpp:145:58:
Implicit instantiation of undefined template
'boost::numeric::odeint::vector_space_reduce<vex::multivector<double, 20, true> >'

Making an instance of runge_kutta_dopri5 parametrized by state_type works perfectly well, as in the VexCL example in the documentation, except that it does not adjust the step size. However, looking at the corresponding CUDA example makes me believe it ought to be possible to wrap it as a controlled or dense-output stepper. Am I doing something wrong, or is this not possible?

headmyshoulder commented 11 years ago

Yes, vector_space_reduce is not yet implemented for VexCL, so it will not work with the controlled and dense-output steppers, since they require vector_space_reduce. We will put this on our ToDo list. Maybe you can even implement this specialization of vector_space_reduce yourself? Nevertheless, I think a dense-output stepper is not much better than a stepper like rk5 or dopri5, because the dense-output stepper wraps a controlled stepper which performs the step-size control on the whole ensemble.

ddemidov commented 11 years ago

It seems that vector_space_reduce is no longer used in the github version of odeint (c3ef77dff54dfda0c4d9933950349003a6902f0f suggests that, and I also grepped for vector_space_reduce). Apparently vector_space_norm_inf is used for the purpose now. Karsten, if that is correct, I could make a pull request with a specialization of vector_space_norm_inf for vexcl types.

As for the boost version of odeint, I am not sure if it is possible to implement vector_space_reduce efficiently. It seems that Op here takes its arguments by value, which would result in unnecessary copies of vexcl vectors.

headmyshoulder commented 11 years ago

@ds283: Can you check if your code is now working correctly?

ds283 commented 11 years ago

Thanks for your helpful responses, which are very much appreciated.

With the current versions of odeint and VexCL pulled from github, compilation still fails due to

In file included from /Users/ds283/Documents/Code/odeint-v2/boost/numeric/odeint.hpp:35:
/Users/ds283/Documents/Code/odeint-v2/boost/numeric/odeint/stepper/controlled_runge_kutta.hpp:86:24:
error: no matching member function for call to 'norm_inf'
        return algebra.norm_inf( x_err );
               ~~~~~~~~^~~~~~~~

Although from looking at the source, it isn't clear to me why that doesn't work.

ddemidov commented 11 years ago

Did you include boost/numeric/odeint/external/vexcl/vexcl_norm_inf.hpp?
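[Editor's note] For future readers, the fix would look something along these lines (a sketch: the external header path is as named above; the surrounding includes and their order are assumptions):

```cpp
#include <vexcl/vexcl.hpp>
#include <boost/numeric/odeint.hpp>
// Makes the norm_inf support for vex:: state types visible; without it,
// controlled_runge_kutta's algebra.norm_inf( x_err ) call has nothing
// to dispatch to, producing the error quoted above.
#include <boost/numeric/odeint/external/vexcl/vexcl_norm_inf.hpp>
```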

ds283 commented 11 years ago

I see. It does compile if I include that file. The integration is slow, but that is a different issue.

ddemidov commented 11 years ago

You'll probably need an ensemble of 1e5 -- 1e6 ODEs and a decent GPU to get some acceleration from OpenCL. OpenCL has high initialization overhead (see e.g. http://arxiv.org/abs/1212.6326), so large problems are needed to hide it.

ds283 commented 11 years ago

Thanks for all your helpful responses, which are very much appreciated.

I had looked at your arXiv paper. Currently, I'm simply experimenting to work out whether GPU methods can help us. It may eventually run on GPU nodes on an HPC cluster, at which point it probably would involve an ensemble of ODEs of the size you quote.

I had understood that the overhead came from compiling the OpenCL kernels at runtime. But at the moment I am finding it much slower during the course of the integration too, when I would have naively expected that all data would be available on the GPU and there would be no further overheads. Should I also have anticipated this? (The solution is initially rapidly oscillatory and the integrator probably needs to shorten its stride; I was wondering whether there is any data transfer between the GPU and CPU while adjusting the step size.)

ddemidov commented 11 years ago

There are other overheads in OpenCL (and in CUDA, for that matter) besides initialization. If you look at the plots in the paper, you'll notice that Thrust on the CPU (which is OpenMP in disguise) shows much better performance than either the OpenCL or CUDA solutions for smaller problems. You really need to saturate your GPU to get any performance benefit from it.

VexCL allows you to create a monolithic kernel for an odeint stepper (see the slides from Meeting C++ 2012, section "kernel builder": https://speakerdeck.com/ddemidov/vexcl-at-meeting-c-plus-plus-2012). This should perform about an order of magnitude faster, but the option is only available for steppers with constant time steps.