Closed Char-Aznable closed 3 years ago

I have a few projects that use Kokkos' CUDA backend, and I'm considering using an optimization library to solve nonlinear equations and to do optimization for some machine learning tasks. It seems that ROL or NOX has the functionality I want, as does TriKota/Dakota. Can someone give an overview of what distinguishes these packages from one another in terms of ease of use, computational performance, and parallelization capability? Also, are they Kokkos-compatible?

BTW, are there any videos or talk slides that summarize or highlight the different packages? I'm totally new to this amazing package collection and I hope to find more good stuff in it.

@trilinos/rol @trilinos/nox @trilinos/loca @trilinos/trikota
@trilinos/nox @trilinos/rol @rppawlo @dridzal
Dakota does black-box optimization. You can even use it without changing your code, as long as you can script input and have output in the right format. Dakota uses MPI and is looking into threads but doesn't really thread yet.
@trilinos/rol ROL is best suited for "embedded," i.e. data-structure aware optimization. It lets you work directly with the application memory space. ROL has a generic linear algebra (vector) interface, and so in theory it will run on any platform with any type of parallelism (inter-node + intra-node, for example). Some of ROL's algorithms have already been used with Kokkos; I believe that @etphipp has tried this most recently. There are a few technical issues that prevent some of ROL's algorithms from working with Kokkos. We are working on fixing this, and hope to have all algorithms Kokkos-compatible in the next three to six months. As a side note, ROL has also been used to solve nonlinear equations, i.e., not just optimization problems.
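To give a flavor of the "embedded" approach, here is a minimal, untested sketch of a ROL::Vector adapter over a plain std::vector. The method names (plus, scale, dot, norm, clone) follow ROL's vector interface, but exact signatures and the smart-pointer type vary between ROL versions, so treat this as illustration only.

#include <cmath>
#include <memory>
#include <utility>
#include <vector>
#include "ROL_Vector.hpp"

template<class Real>
class MyVector : public ROL::Vector<Real> {
public:
  explicit MyVector (std::vector<Real> data) : data_ (std::move (data)) {}

  void plus (const ROL::Vector<Real>& x) override {
    const auto& xd = static_cast<const MyVector&> (x).data_;
    for (std::size_t i = 0; i < data_.size (); ++i) data_[i] += xd[i];
  }
  void scale (const Real alpha) override {
    for (auto& v : data_) v *= alpha;
  }
  Real dot (const ROL::Vector<Real>& x) const override {
    const auto& xd = static_cast<const MyVector&> (x).data_;
    Real sum (0);
    for (std::size_t i = 0; i < data_.size (); ++i) sum += data_[i] * xd[i];
    return sum;
  }
  Real norm () const override { return std::sqrt (this->dot (*this)); }
  // Newer ROL uses ROL::Ptr; older versions use Teuchos::RCP.
  ROL::Ptr<ROL::Vector<Real>> clone () const override {
    return ROL::makePtr<MyVector> (std::vector<Real> (data_.size (), Real (0)));
  }
private:
  std::vector<Real> data_;
};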
For more info, check out the slides given here:
https://trilinos.org/packages/rol/
While I'm perfectly fine corresponding with a fictional Gundam character, it may be useful to know your identity, or at least your project's specific needs, to provide better guidance. For instance, you mention machine learning as one of the applications -- depending on the size of the problem, you may want to entertain combining on-node and MPI parallelism through Tpetra @trilinos/tpetra, which is interoperable with Kokkos.
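To make the Tpetra suggestion concrete, a minimal, untested example that distributes a vector across MPI ranks while Kokkos provides the on-node parallelism might look roughly like this:

#include <iostream>
#include <Tpetra_Core.hpp>
#include <Tpetra_Map.hpp>
#include <Tpetra_Vector.hpp>

int main (int argc, char* argv[]) {
  // Initializes Kokkos and (if enabled) MPI; finalizes on scope exit.
  Tpetra::ScopeGuard tpetraScope (&argc, &argv);
  {
    auto comm = Tpetra::getDefaultComm ();
    const Tpetra::global_size_t numGlobalEntries = 1000;
    // Distribute the entries evenly across MPI ranks.
    auto map = Teuchos::rcp (new Tpetra::Map<> (numGlobalEntries, 0, comm));
    Tpetra::Vector<double> x (map);
    x.putScalar (1.0); // fill (on device, if a device backend is enabled)
    if (comm->getRank () == 0) {
      std::cout << "norm2(x) = " << x.norm2 () << std::endl;
    }
  }
  return 0;
}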
NOX is used for Newton-based nonlinear solves. If you use the Thyra/Tpetra backend, it will run using Kokkos and gives access to most Trilinos solvers and preconditioners via Stratimikos.
@dridzal The OP is a serious Kokkos user and has been working a lot with Christian Trott. It looks like they want to branch out into the Trilinos world :-) .
@briadam could say more about the status of TriKota, which is a Trilinos interface to Dakota.
@mhoemmen @dridzal @rppawlo Thank you all for your help. I've been using Kokkos::View to represent array or matrix objects, but it sounds like I should use Tpetra down the road. @dridzal I'm mostly interested in doing Bayesian inference via Monte Carlo methods. The ROL slides mention some simulation-based optimization. Can you tell me if it's easy to represent probability distribution objects and do some simulation with them in ROL? Do you have documentation on which algorithms in ROL are or are not compatible with Kokkos, so that I can watch out for them? Thanks again for your help!
The question about probability distribution objects is best suited for @dpkouri . Many algorithms in ROL are already compatible with Kokkos and Tpetra/Kokkos. Methods requiring elementwise evaluations of certain nonlinear functions, such as logarithms (used in interior-point methods, for example) won't work until we make some changes to the vector model. What kind of optimization problems are you interested in: unconstrained, bound-constrained, equality-constrained, general inequality-constrained? I believe that the first three types of problems can be handled in a Kokkos-compatible way (maybe not with the full menu of methods, but there are definitely methods that will work).
Mostly I'll be using equality-constrained and general inequality-constrained. And I'm probably going to use a lot of element-wise evaluation of exp and log functions -- is it possible to work around this by doing the element-wise evaluation elsewhere and passing the data back into ROL by pointer/reference?
These element-wise operations are member methods of the ROL::Vector class:
virtual void applyUnary( const Elementwise::UnaryFunction<Real> &f )
virtual void applyBinary( const Elementwise::BinaryFunction<Real> &f, const Vector &x )
virtual Real reduce( const Elementwise::ReductionOp<Real> &r )
where the function object is always passed in. I've been told that this causes problems for Kokkos and on-node execution in general. Some of ROL's algorithms use these operations, others don't. Right now, this can't be bypassed, as far as I can tell. There are several ways to solve this issue. A simple way would be to offer a whole slew of specific unary and binary functions directly in the Vector interface (log, exp, min, max). This is ugly. Another option may be to use templated non-member functions and to provide specializations. Either approach would require some changes to ROL::Vector, and, correspondingly, to the algorithm implementations.
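To make the two options concrete, declaration-level sketches (all names hypothetical) might look like:

// Option 1 (the "ugly" one): bake specific elementwise operations
// into the Vector interface.
virtual void applyLog ();
virtual void applyExp ();
virtual void applyMin ( const Vector &x );
virtual void applyMax ( const Vector &x );

// Option 2: a non-member function template, overloaded or specialized
// per concrete vector type, so that the function type is known at
// compile time and can be a device-callable functor or lambda.
template<class VectorType, class UnaryFunction>
void applyUnaryFunction ( VectorType& v, UnaryFunction&& f );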
Including @mhoemmen @crtrott @etphipp on this for potential input. @gregvw and I will start working on this pretty soon.
@dridzal @gregvw If I remember right, UnaryFunction and BinaryFunction have a virtual operator(). I think there is a way to keep that architecture, (mostly) maintain backwards compatibility with existing code that creates subclasses of UnaryFunction etc., and yet get this working on device. It depends on enabling RDC and on doing virtual method dispatch on device, but that works pretty well with Aria. This will look a bit complicated, but the end result will be that users mostly don't need to fuss about the details. Note that functions should not have mutable state. (That's a general thread-parallel correctness issue, not specific to CUDA. This may or may not affect how you've implemented ReductionOp.)

First, mark UnaryFunction::operator() with the KOKKOS_FUNCTION macro. (Do not use KOKKOS_INLINE_FUNCTION.) This is the only change that could affect backwards compatibility for users who write their own subclasses of UnaryFunction, but it will only do so for CUDA builds. Ditto for BinaryFunction; I'll omit that for what follows.

Second, add a wrapper class that has an std::shared_ptr<const UnaryFunction<Real>> inside. I'll call it UnaryFunctionWrapper<Real> here. It needs a method const UnaryFunction<Real>* get().

Third, add a factory function, templated on UnaryFunctionType and the desired memory space.
template<template<class> class UnaryFunctionType, class Real, class MemorySpace>
std::shared_ptr<const UnaryFunctionWrapper<Real>>
wrapUnaryFunction ()
{
  using parent_type = UnaryFunction<Real>;
  using child_type = UnaryFunctionType<Real>;
  using range_type = Kokkos::RangePolicy<typename MemorySpace::execution_space>;

  void* rawPtr = Kokkos::kokkos_malloc<MemorySpace> (sizeof (child_type));
  // Invoke the child class constructor on device, via placement new.
  // This ensures that the vtable lives in device memory.
  Kokkos::parallel_for ("Allocate object on device", range_type (0, 1),
    KOKKOS_LAMBDA (const int /* unused; always 0 */) {
      new (rawPtr) child_type; // placement new
    });
  Kokkos::fence (); // ensure that construction completed

  // Create `shared_ptr` with an explicit deallocator.
  // Don't keep this thing around past `Kokkos::finalize()`.
  // If that's an issue, you could maintain a list of `weak_ptr`
  // and manage their deallocation with an `atexit` hook.
  std::shared_ptr<child_type> childPtr (static_cast<child_type*> (rawPtr),
    [] (child_type* p) {
      Kokkos::parallel_for ("Destroy object on device", range_type (0, 1),
        KOKKOS_LAMBDA (const int) {
          p->~child_type (); // invoke the destructor explicitly on device
        });
      Kokkos::fence (); // ensure that the destructor completed
      Kokkos::kokkos_free<MemorySpace> (p); // deallocate the memory
    });
  return std::make_shared<const UnaryFunctionWrapper<Real>> (
    std::static_pointer_cast<const parent_type> (childPtr));
}
This presumes that UnaryFunctionType's constructor takes no arguments. You can generalize this by adding a parameter pack, though it has to make sense for those argument(s) to live on device. I've also omitted some details that you should add for safety, like a try-catch block to avoid a memory leak in case the parallel kernel launches fail.
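For concreteness, a hypothetical call site, assuming a CUDA build and a unary function type Reciprocal whose operator() is device-callable:

auto f = wrapUnaryFunction<Reciprocal, double, Kokkos::CudaSpace> ();
// f now wraps a Reciprocal<double> whose vtable lives in device memory.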
Finally, add an overload of applyUnary to ROL::Vector that takes const UnaryFunctionWrapper<Real>&. Actually, you shouldn't overload virtual methods. It would have been better to make the virtual methods protected -- e.g., applyUnaryImpl -- and to have non-virtual public methods that you can overload. If you can redesign ROL::Vector like that, then please do. Then you can overload applyUnary as mentioned above, and have it dispatch to applyUnaryImpl, like this:
void applyUnary (const UnaryFunctionWrapper<Real>& ufw) {
this->applyUnaryImpl (ufw.get ());
}
The point of this redesign would be to let algorithms use ROL::Vector without changes, but accept these wrapped device things as well as the usual UnaryFunction<Real> that they currently know how to accept.
How does this sound?
btw, the applyUnaryImpl function would need to invoke the pointer on device, like this:
virtual void applyUnaryImpl (const UnaryFunction<Real>* f, const bool runOnDevice) {
  // Assuming X is the Tpetra::MultiVector.
  const LO lclNumRows = LO (X.getLocalLength ());
  const LO numCols = LO (X.getNumVectors ());
  if (runOnDevice) {
    using range_type = Kokkos::RangePolicy<execution_space, Kokkos::IndexType<LO>>;
    auto X_lcl = X.getLocalView<memory_space> ();
    Kokkos::parallel_for ("applyUnaryImpl", range_type (0, lclNumRows),
      KOKKOS_LAMBDA (const LO lclRow) {
        for (LO col = 0; col < numCols; ++col) {
          X_lcl(lclRow, col) = (*f) (X_lcl(lclRow, col));
        }
      });
  }
  else {
    using range_type = Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace, Kokkos::IndexType<LO>>;
    auto X_lcl = X.getLocalView<Kokkos::HostSpace> ();
    Kokkos::parallel_for ("applyUnaryImpl", range_type (0, lclNumRows),
      [=] (const LO lclRow) {
        for (LO col = 0; col < numCols; ++col) {
          X_lcl(lclRow, col) = (*f) (X_lcl(lclRow, col));
        }
      });
  }
}
In the case where runOnDevice is true, the input pointer is only valid on device, since it was allocated using kokkos_malloc. The case where runOnDevice is false is the backwards-compatibility case; it assumes that the input pointer lives in host memory.
@mhoemmen , thanks! This sounds very appealing (I’ll have to dig a little deeper to make sure I understand the details). What would you recommend if we were to bite the bullet and redesign the elementwise functionality (and change our algorithms accordingly)?
What is the right way of building Trilinos via cmake? I'm looking at the sampleScripts directory and it seems the scripts are very customized to the authors' original settings. Is the material at https://trilinos.org/docs/files/TrilinosBuildReference.html still up to date? I'd greatly appreciate it if you could post the cmake options for building Tpetra and ROL with CUDA and debugging enabled @mhoemmen
What is the right way of building Trilinos via cmake?
Check out this file in the Trilinos repository: doc/build_ref/TrilinosBuildReferenceTemplate.rst. You can also look in the Wiki here on GitHub, or in Kokkos' documentation, for more information.
Best practice is to start with as few CMake options as possible, then slowly add them as needed, using feedback from the build output.
@dridzal can tell you more about what options ROL wants.
There are no special options for ROL. Currently the only dependence is Teuchos (it will be enabled by default), so whatever options Teuchos requires. Having said that, I would make sure that optional dependencies are disabled (otherwise you'll get most of Trilinos):
-D Trilinos_ENABLE_ROL:BOOL=ON \
-D Trilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=OFF \
The ROL/Tpetra adapters should be activated if Tpetra is enabled.
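Option spellings vary between Trilinos versions, so the following is only a rough, untested starting point for a CUDA debug build of Tpetra and ROL (the source path is a placeholder; nvcc_wrapper ships with Kokkos):

cmake \
  -D CMAKE_BUILD_TYPE:STRING=DEBUG \
  -D Trilinos_ENABLE_DEBUG:BOOL=ON \
  -D CMAKE_CXX_COMPILER:FILEPATH=${TRILINOS_SRC}/packages/kokkos/bin/nvcc_wrapper \
  -D TPL_ENABLE_CUDA:BOOL=ON \
  -D Kokkos_ENABLE_Cuda:BOOL=ON \
  -D Kokkos_ENABLE_Cuda_Lambda:BOOL=ON \
  -D Trilinos_ENABLE_Tpetra:BOOL=ON \
  -D Trilinos_ENABLE_ROL:BOOL=ON \
  -D Trilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=OFF \
  ${TRILINOS_SRC}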
I got #3702 when building with clang + CUDA.
Do you have a Slack channel for Trilinos that I can join?
@mhoemmen, out of curiosity, what would be the argument against:
void applyUnary( const std::function<Real(Real)>& );
void applyBinary( const std::function<Real(Real,Real)>&, const Vector<Real>& );
Real reduce( const std::function<Real(Real,Real)>& ) const;
Edit: As far as I am aware, although the elementwise functions are defined with virtual inheritance, elementwise objects are only ever used within a function or class scope. The type of elementwise function being used in any given algorithm is always known at compile time.
@gregvw I like std::function and wish we could use it generally, but CUDA doesn't let you make an std::function (or whatever CUDA calls the wrapped thing, I think nvstd::function) on host and use it on device. However, if you made applyUnaryImpl the virtual method and made applyUnary nonvirtual, you could actually overload applyUnary to take an std::function. It would then run on host.
@gregvw why not just take one step further and template these functions so that they can take a lambda for the unary or binary functions? For example:
template < class UnaryFunction >
void applyUnary( UnaryFunction&& );
Speaking as someone who is not familiar with the package, I don't see why one would need virtual methods at all. Why not use CRTP instead?
@Char-Aznable Unfortunately, the virtual inheritance design of ROL::Vector precludes templated member functions. I have written CRTP-based implementations of a number of ROL's core components and would definitely prefer this approach in the long term, but it would be a fairly significant refactoring in the near term.
That being said, if anyone would like to help me implement static polymorphism in ROL, I'm certainly interested.
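For reference, a minimal sketch of the CRTP idea (all names hypothetical, not ROL's actual classes): the concrete vector type is known at compile time, so applyUnary can be a member template that accepts lambdas or functors with no virtual dispatch.

#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical CRTP base: Derived supplies applyUnaryImpl.
template<class Derived, class Real>
class VectorBase {
public:
  template<class UnaryFunction>
  void applyUnary (UnaryFunction&& f) {
    static_cast<Derived*> (this)->applyUnaryImpl (std::forward<UnaryFunction> (f));
  }
};

template<class Real>
class StdVector : public VectorBase<StdVector<Real>, Real> {
public:
  explicit StdVector (std::size_t n) : data_ (n, Real (1)) {}
  template<class UnaryFunction>
  void applyUnaryImpl (UnaryFunction&& f) {
    for (auto& x : data_) x = f (x);
  }
private:
  std::vector<Real> data_;
};

// Usage:
//   StdVector<double> v (10);
//   v.applyUnary ([] (double x) { return 1.0 / x; });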
@Char-Aznable @gregvw @mhoemmen Let's assume we are willing to change ROL's algorithms a bit. Can't we keep the core linear algebra (dot, plus, scale) as-is, i.e., using inheritance, and then provide the elementwise functionality through non-member templated functions, like @Char-Aznable suggested? If this is possible, it would make sense for ROL for three reasons: 1) Many ROL algorithms don't require elementwise operations. 2) Elementwise has been an "optional" part of the linear algebra interface from the get-go. 3) Mathematically speaking, these operations don't really belong in a vector space, and, in my opinion, should be considered an add-on.
@mhoemmen That's a bit frustrating regarding std::function. It looks like we are back where we started when we first wanted to implement these functions. Is the void* cast something that is frequently encountered in GPU computing? It's not something I'd normally choose to do. I think I speak for the rest of the ROL team when I say that we would strongly favor solutions that are as agnostic as possible at the level of ROL's core functionality, e.g., having Kokkos macros show up in the interface for the elementwise call operators is not super appealing. Package- or hardware-dependent code would ideally be relegated to ROL's adapters.
and then provide the elementwise functionality through non-member templated functions, like @Char-Aznable suggested?
To clarify, we would have to take some Vector arguments in these functions, of course. But the key idea is that these are free functions in the ROL namespace.
Something like this?
template<typename Real, typename UnaryFunction>
void applyUnary( MyVectorType<Real>& v, const UnaryFunction& f );
Yes, if it'll do the job on the device with Kokkos/CUDA, but I'm honestly not qualified to make this determination due to my lack of GPU experience. @mhoemmen could probably let us know if this is a good idea ... assuming we're willing to make the change in ROL.
If we can satisfy the requirements with free functions until we have an opportunity to consider more careful refactoring down the road, that would be my preference.
Agreed. Free functions would take us roughly one week to implement and test. There are only one or two user implementations of ROL::Vector that would be affected, and a handful of vector adapters (Tpetra, Thyra, Eigen, StdVector). @mhoemmen would something like the templated free functions illustrated above be an effective solution?
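For concreteness, such a free function might be overloaded for a Kokkos-backed vector roughly as follows; KokkosVector and its view() accessor are hypothetical, and f must be device-callable:

template<typename Real, typename UnaryFunction>
void applyUnary (KokkosVector<Real>& v, const UnaryFunction& f) {
  auto d = v.view (); // hypothetical accessor returning Kokkos::View<Real*>
  Kokkos::parallel_for ("ROL::applyUnary", d.extent (0),
    KOKKOS_LAMBDA (const int i) { d(i) = f (d(i)); });
}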
I'm all for free functions. It doesn't really solve whatever backwards compatibility issues you might have, but then again, Trilinos hasn't had a real release in three years (if you don't count the 12.14 tag), so I think you have to forge your own path there ;-) . Free functions are helpful because you can overload them, but they don't solve the "how do I do run-time polymorphism on device" problem.
@gregvw wrote:

That's a bit frustrating regarding std::function. It looks like we are back where we started when we first wanted to implement these functions.

Sort of. There are other design approaches you could take. The point I'm trying to make with applyUnaryImpl being virtual and applyUnary being nonvirtual is that you can safely overload nonvirtual functions. This would give you a backwards-compatibility path, if that's something you want.
Is the void* cast something that is frequently encountered in GPU computing?

This is a common enough pattern for using virtual methods on device that I think Kokkos has added some kind of allocate+emplace function to make it easier. Aria uses this pattern, for example.
I didn't actually cast to void* -- kokkos_malloc returns void*. That's not the point, though. The point is that if you want an object with a vtable to live on device, you have to invoke the object's constructor on device. Otherwise, the vtable's pointers are pointers to host functions, not pointers to device functions. If the object lives and dies in a device kernel, then you can just create it (invoke its constructor) in the usual way. The problem is that you want the object to persist across multiple kernels. That means you need to allocate it in device memory resulting from cudaMalloc or cudaMallocManaged. Regular new allocates (using the default allocator) and constructs all at once, so you need placement new in order to control allocation yourself.
It's not something I'd normally choose to do. I think I speak for the rest of the ROL team that we would strongly favor solutions which are as agnostic as possible at the level of ROL's core functionality....
This "kokkos_malloc
+ placement new
in a parallel_for
" pattern is what you would need to do in order to have an object with virtual methods that lives in GPU memory and persists through multiple kernel launches, but that doesn't mean I advocate exposing this to users. That's why I was proposing overloading applyUnary
to take a "UnaryFunctionWrapper" that hides this detail from users. Users should not have to write that kind of code; that was code for ROL developers. Anyway, there are other ways to get the desired effect, and free functions are also a good idea.
First, mark UnaryFunction::operator() with the KOKKOS_FUNCTION macro.
@mhoemmen many thanks for the detailed explanation and guidance above. Could you clarify this: Would the ROL::Vector class have to depend on Kokkos for this? Or is this placed in an adapter, so that the ROL::Vector class is "clean" of dependencies (and any mention thereof)? We have a milestone this FY to build without any dependencies. I think this is the major point of @gregvw 's comment
we would strongly favor solutions which are as agnostic as possible at the level of ROL's core functionality
It looks like free functions would provide that path, at the nontrivial expense of breaking backward compatibility (which means that we would have to fix this for our users).
@dridzal wrote:

Would the ROL::Vector class have to depend on Kokkos for this?

You could replace that macro with another macro that evaluates to __host__ __device__. That would be a good idea if you didn't want to depend on Kokkos.
Any design that lets you overload for different "function" types would let you support all kinds of functions, including user-provided functions that may or may not run on device. If you can break backwards compatibility to do the right thing, go for it :-) .
I spent most of yesterday trying to devise some way of implementing the free function without having to change existing code that calls the elementwise functions. Unfortunately, I do not see a way, given the constraints of runtime polymorphism, to preserve backward compatibility.
I can make an elementwise refactor branch and get started. What do we know about dependent codes using elementwise functions?
@mhoemmen are __host__ and __device__ nvcc-specific?

@gregvw __host__ and __device__ are CUDA qualifiers. clang can compile them too.
@gregvw That's why you use a macro ;-) If CUDA is not enabled, you would define the macro to the empty string.
@mhoemmen Regarding your example above, I'm not clear on where wrapUnaryFunction is defined and where it should be called, because it depends on Kokkos.

The function applyUnaryImpl takes the runOnDevice argument, but

void Vector<Real>::applyUnary( const UnaryFunctionWrapper<Real>& ufw ) {
  this->applyUnaryImpl( ufw.get() ); // Does not pass a bool
}

Where would it be deduced whether to pass true or false?
Here is a snippet of how applyUnary is currently used:
namespace ROL {
namespace Elementwise {

template<class Real>
class Reciprocal : public UnaryFunction<Real> {
public:
  Real apply( const Real& x ) const {
    return static_cast<Real>(1)/x;
  }
};

} // namespace Elementwise

template<class Real>
class LogBarrierObjective : public Objective<Real> {
public:
  // Sets g = -1/x elementwise
  void gradient( Vector<Real>& g, const Vector<Real>& x, Real& tol ) {
    g.set(x);
    Elementwise::Reciprocal<Real> reciprocal;
    g.applyUnary(reciprocal);
    g.scale(-1.0);
  }
};

} // namespace ROL
At what level would the UnaryFunction be wrapped and the resulting wrapped function be passed?
If I understand correctly, you are suggesting marking UnaryFunction like this:
#if defined( __CUDACC__ )
#define COMMON_FUNCTION __host__ __device__
#else
#define COMMON_FUNCTION /**/
#endif // defined( __CUDACC__ )
template<class Real>
class UnaryFunction {
public:
COMMON_FUNCTION
virtual Real apply( const Real &x ) const = 0;
};
Is this correct?
In response to your earlier question, none of the elementwise functions have mutable state variables.
Regarding your example above, I'm not clear on where wrapUnaryFunction is defined and where it should be called, because it depends on Kokkos.
I thought about this a bit more and realized that the idea is to simulate constrained multiple dispatch. You have multiple kinds of Vector, and you have multiple kinds of unary (and binary -- I'll omit those again for brevity) functions. Examples of the latter might include:

- UnaryFunction<Real>
- std::function<Real (Real)>
- a class with __device__ Real operator() (Real) const (or __host__ __device__ Real operator() (Real) const)

This is a constrained multiple dispatch problem, because some kinds of Vector (e.g., host-only, or sequential-only) and some kinds of UnaryFunction (e.g., device-only) don't work together. The classic way to handle that in languages without native multiple dispatch is to make both Vector and UnaryFunction abstract base classes, and to use the usual Smalltalk-like dynamic_cast-branch pattern to enumerate UnaryFunction subclasses (and error out on unsupported combinations).
This pattern also works in this case. The key is to hide the class with the actual __device__ function inside a subclass of UnaryFunction<Real>. For example:
// Make Kokkos an _optional_ dependency. Alternately, you
// could make ROL depend on CUDA directly, and define
// ROL_HOST_DEVICE_FUNCTION as __host__ __device__.
#ifdef HAVE_ROL_KOKKOS
#define ROL_HOST_DEVICE_FUNCTION KOKKOS_FUNCTION
#else
#define ROL_HOST_DEVICE_FUNCTION
#endif
// Note how this is NOT a subclass of UnaryFunction<Real>.
template<class Real>
class HostDeviceFunction {
public:
// Destructors of abstract base classes must be marked virtual,
// for memory safety of derived classes.
ROL_HOST_DEVICE_FUNCTION virtual ~HostDeviceFunction() {}
// Subclasses implement this.
virtual ROL_HOST_DEVICE_FUNCTION Real operator() (const Real&) const = 0;
};
Now you can make a subclass of UnaryFunction that holds a pointer to HostDeviceFunction inside. For example:
template<class Real>
class UnaryFunctionWithHostDeviceFunctionInside : public UnaryFunction<Real>
{
public:
  // NOT a device function. This could also take std::unique_ptr&&,
  // or any other kind of pointer. That part of the design is not so
  // important. The key is that the pointer points to device memory,
  // allocated using cudaMalloc or cudaMallocManaged.
  UnaryFunctionWithHostDeviceFunctionInside (std::shared_ptr<const HostDeviceFunction<Real>> f)
    : f_ (std::move (f)) {}

  const HostDeviceFunction<Real>* get () const { return f_.get (); }

private:
  std::shared_ptr<const HostDeviceFunction<Real>> f_;
};
Finally, your dynamic_cast-branch pattern inside Vector's "applyUnaryFunction" virtual method would need to ask the given UnaryFunction<Real> whether it is a UnaryFunctionWithHostDeviceFunctionInside<Real>:
virtual void applyUnaryFunction (const UnaryFunction<Real>& f) const {
const UnaryFunctionWithHostDeviceFunctionInside<Real>* f1 =
dynamic_cast<const UnaryFunctionWithHostDeviceFunctionInside<Real>*> (&f);
if (f1 != nullptr) {
// ... if the Vector subclass can handle a HostDeviceFunction,
// then apply f1->get() to the Vector's entries ...
}
else {
// ... do whatever it is you do now with UnaryFunction ...
}
}
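Putting the pieces together, a hypothetical call site might look like this; makeDeviceFunction and DeviceReciprocal are placeholders for the kokkos_malloc + placement-new construction sketched earlier:

// Construct a device-resident HostDeviceFunction subclass, then wrap it.
std::shared_ptr<const HostDeviceFunction<double>> devF =
  makeDeviceFunction<DeviceReciprocal<double>> (); // hypothetical factory
UnaryFunctionWithHostDeviceFunctionInside<double> wrapped (devF);
// The Vector subclass dynamic_casts its argument and, on success,
// applies devF on device.
x.applyUnaryFunction (wrapped);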
btw, the above exercise has two goals: (1) keep Kokkos (and CUDA) an optional dependency of ROL, and (2) still support run-time polymorphism for functions that execute on device.

For (1), Epetra and Tpetra do something comparable by wrapping MPI_Comm with Epetra_Comm resp. Teuchos::Comm. This ensures that they can work even if building without MPI.
I didn't talk about the "virtual 'impl' function vs. nonvirtual overloads" pattern in the discussion above. You could use that if you wanted the outward-facing user interface to take different kinds of "things that look like functions," but have a single / few implementation(s) for all of them.
@mhoemmen Would it be possible to use a Visitor pattern for some proxy for UnaryFunction that has its call method overloaded for vector types? This proxy would be called inside the DerivedVector::applyUnary method, so that no dynamic_cast would be needed to call the correct proxy method overloaded for DerivedVector. I've been playing with this idea, but am not sure about which methods can be virtual and keep nvcc happy, or if I am missing something important.
@gregvw wrote:

but am not sure about which methods can be virtual and keep nvcc happy

If the object has any virtual methods and you plan on calling them, then the vtable needs to live on device. That means constructing the object on device. That's why I showed you the "allocate some memory on device and placement-new the object in it" pattern above.

Would it be possible to use a Visitor pattern for some proxy for UnaryFunction that has its call method overloaded for vector types?

Do you mean, in the manner of std::visit, with the "overloaded operators for different types" pattern? That's kind of what I meant by "overloading applyUnaryFunction."
@dridzal @gregvw @etphipp Does ROL work with autodiff using Sacado and Kokkos now? Do you have examples of this? I struggle to find documentation regarding tools in Trilinos in general.
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.

If this issue should be kept open even with no activity beyond the time limits, you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.
This issue was closed due to inactivity for 395 days.