Support more than one solve per UFL equation

kynan commented 13 years ago

This discussion is currently specific to the CUDA backend!

The CudaAssembler currently only supports up to one solve per UFL input.

~~This is due to the hard coded use of only a single set of global variables declared for:~~

~~device pointers for local/global matrix, local/global vector and solution~~
~~the sparsity~~

Instead, a set of these would be required for each field solved for in the UFL equation. The corresponding initialisation code needs to be generated in initialise_gpu, and run_model needs to reference the correct variables for the field being solved for.

It's more complicated than that:

- we do actually only want to have up to a single local/global matrix, local/global vector and solution vector allocated on the device at any given time
  - either allocate the maximum required space on the device during initialisation
  - or (de)allocate as needed i.e. allocate as late as possible and deallocate as early as possible
- whether we need different sparsities is highly dependent on the problem
  - we certainly don't need one per solve in general
  - what is a hard and fast rule to determine which sparsity goes with which solve?
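A rough sketch of the first option (allocate the maximum required space once), assuming each solve can be described by a handful of sizes. `SolveSize`, its attributes and `shared_buffer_sizes` are hypothetical names for illustration, not MCFC data structures:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SolveSize:
    """Hypothetical size description of one solve (not an MCFC type)."""
    field: str
    n_elements: int     # number of mesh elements
    n_local_dofs: int   # degrees of freedom per element
    n_nonzeros: int     # nonzeros in the global matrix
    n_global_dofs: int  # rows of the global system


def shared_buffer_sizes(solves):
    """Number of entries each shared device buffer needs so that a single
    set of buffers can serve every solve in turn."""
    return {
        "local_matrix": max(s.n_elements * s.n_local_dofs ** 2 for s in solves),
        "local_vector": max(s.n_elements * s.n_local_dofs for s in solves),
        "global_matrix": max(s.n_nonzeros for s in solves),
        "global_vector": max(s.n_global_dofs for s in solves),
        "solution": max(s.n_global_dofs for s in solves),
    }


# Made-up sizes for two solves sharing one mesh
solves = [
    SolveSize("Tracer", n_elements=100, n_local_dofs=3,
              n_nonzeros=400, n_global_dofs=60),
    SolveSize("Velocity", n_elements=100, n_local_dofs=6,
              n_nonzeros=2500, n_global_dofs=220),
]
sizes = shared_buffer_sizes(solves)
```

The trade-off is that the largest solve dictates device memory use for the whole run, in exchange for no allocation calls inside the time loop.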

Another important issue is that currently the data flow of coefficients is not tracked across extraction from state through solves and writing back to state. Coefficients extracted from and written back to state use the field allocated in state. For all other (temporary) coefficients, a temporary field needs to be allocated in state.

Note: a better strategy could be to only associate the host memory location with a field held in state and keep the memory allocated for coefficients on the device completely separate. This would eliminate the overhead currently incurred by adding a temporary field into state: at the moment the entire mesh and sparsity are copied unnecessarily.

~~This relates to issue #15, since currently the CUDA state holder cannot retrieve sparsities for different fields.~~

A test case needs to be added once this has been implemented.

dham commented 13 years ago

It's actually even more complex than the above as it may sometimes be necessary and/or efficient to have more than one matrix on the device at once. For example the pressure projection matrix is often very expensive to assemble but it's also often a linear term so we can keep the matrix around between timesteps.

The first thing to note is that sparsities and solves are primarily an OP2 (or equivalent backend) issue. I think that the OP2 way to do this would be to declare the OP2 sparsity. It's then up to OP2 to notice that the sparsity is the same as another one which has already been declared. This is in principle a reasonably easy task for OP2 as each sparsity has a signature which is the set of maps which are declared for it at sparsity declaration time.
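To illustrate the signature idea, here is a minimal sketch of a cache keyed on the set of maps given at declaration time. `SparsityCache`, `declare` and the map names are hypothetical and are not part of the OP2 API; a dictionary stands in for actually building the sparsity pattern:

```python
class SparsityCache:
    """Hypothetical cache that reuses a sparsity when its signature
    (the set of maps it was declared with) has been seen before."""

    def __init__(self):
        self._by_signature = {}
        self.builds = 0  # how many sparsities were actually built

    def declare(self, map_pairs):
        """map_pairs: iterable of (row_map, col_map) identifiers."""
        signature = frozenset(map_pairs)
        if signature not in self._by_signature:
            self.builds += 1
            # Stand-in for the expensive construction of the pattern
            self._by_signature[signature] = {"signature": signature}
        return self._by_signature[signature]


cache = SparsityCache()
velocity = cache.declare([("cell_to_p2_node", "cell_to_p2_node")])
tracer_a = cache.declare([("cell_to_p1_node", "cell_to_p1_node")])
tracer_b = cache.declare([("cell_to_p1_node", "cell_to_p1_node")])
```

Here `tracer_a` and `tracer_b` end up as the same object, so two solves declared on the same maps share one sparsity without the form compiler having to decide that itself.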

Temporary fields should not be inserted into fluidity state. In fact they don't need to be either. Simply declare the appropriate op_dat at the start of the generated host routine and destroy it at the end. We don't yet have an OP2 destroy command but we're going to need one.

For both sparsities and fields, the copying back to state is also an OP2 issue. This was the point of the discussion about put and get at the Oxford meeting. It's essentially a cache dirtying problem. For sparsities and matrices, OP2 is welcome to decide that it doesn't have enough device memory and copy the sparsity/matrix back to the host whenever it feels like it. I think we will be able to get into a position where it will be able to notice that a matrix is linear and can therefore make decisions about whether to keep it lying around, but we'll worry about that one WAAY down the line.

For fields, clearly the only ones which ever need to be copied back are the ones which are re-inserted into state. The short term solution for that might be to simply issue an OP_GET (or whatever it's called) at the end of the host routine. The longer term solution might be to instrument fluidity so that it checks the coherency of fields whenever it touches them and triggers a copy back only when needed. Further down the line there are more possibilities: for example, we might have a non-blocking OP_GET which does the copy back in the background while the main program keeps going until it needs that value on the host and then blocks.
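The "check coherency on touch" idea can be sketched with a dirty flag. This is only an illustration: `DeviceField` and its methods are hypothetical, and plain Python lists stand in for host and device memory, so no real OP2 or CUDA call is shown:

```python
class DeviceField:
    """Hypothetical field wrapper that copies back to the host lazily."""

    def __init__(self, name, host_values):
        self.name = name
        self._host = list(host_values)
        self._device = list(host_values)
        self._host_stale = False  # True when the device copy is newer
        self.copies_back = 0      # counts device-to-host transfers

    def device_write(self, values):
        """Stand-in for a kernel or solve updating the field on the device."""
        self._device = list(values)
        self._host_stale = True

    def host_values(self):
        """Host-side access: copy back only if the host copy is stale."""
        if self._host_stale:
            self._host = list(self._device)
            self._host_stale = False
            self.copies_back += 1
        return self._host


field = DeviceField("Tracer", [0.0, 0.0])
field.device_write([1.0, 2.0])
field.device_write([3.0, 4.0])  # second update, still no transfer
first = field.host_values()     # triggers the single copy back
second = field.host_values()    # host copy is coherent, no transfer
```

Two device updates followed by two host reads cost one transfer in this sketch, whereas an unconditional copy at the end of every host routine would pay for a transfer even if the host never reads the field.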

kynan commented 13 years ago

Thanks, these are very good points. My opening post is very specific to the CUDA backend and its current implementation. Many of these issues should "magically" go away when OP2 is fitted with the necessary logic to handle fields and communicate with Fluidity state. Given that the CUDA backend will die eventually (but will still be used for our initial shallow-water work) we need it to "work" but not necessarily in the nicest / most efficient way for now.

kynan commented 13 years ago

This is tentatively fixed in e85741ea605c327222f0

Caveats:

- The corresponding sparsity is now extracted from state for each coefficient on initialisation. This is dumb since in many cases linear operators for different solves share the same sparsity which hence could be reused. A reliable determination of the correct sparsity to use for each solve needs to be found.
- Device memory allocated for local/global matrix/vector and solution simply uses the coefficient of the last solve as before. This will not work in the general case. Instead, MCFC needs to keep track of which fields are required (or efficient) to be held in GPU memory at any given point in the execution and allocate as late / deallocate as early as possible.
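One possible way to derive "allocate as late / deallocate as early as possible" is a first-use / last-use pass over the sequence of solves. The sketch below assumes the solves are available as an ordered list of `(name, fields)` pairs. That representation and the function name are hypothetical, and the sketch ignores the case where it is more efficient to keep a field or matrix resident between solves:

```python
def allocation_schedule(solves):
    """solves: list of (solve_name, [field names used]) in execution order.

    Returns a list of (step, action, field) tuples in which every field is
    allocated at its first use and freed right after its last use."""
    first_use, last_use = {}, {}
    for step, (_, fields) in enumerate(solves):
        for f in fields:
            first_use.setdefault(f, step)
            last_use[f] = step

    schedule = []
    for step, (_, fields) in enumerate(solves):
        unique = sorted(set(fields))
        # Allocate everything first needed by this solve...
        for f in unique:
            if first_use[f] == step:
                schedule.append((step, "alloc", f))
        # ...and free everything no later solve uses.
        for f in unique:
            if last_use[f] == step:
                schedule.append((step, "free", f))
    return schedule


solves = [
    ("solve_tracer", ["Tracer", "Velocity"]),
    ("solve_pressure", ["Pressure", "Velocity"]),
]
schedule = allocation_schedule(solves)
```

In this example `Velocity` stays on the device across both solves, while `Tracer` is released before `Pressure` is allocated.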

gmarkall commented 12 years ago

Since the CUDA backend has a limited lifetime, and the OP2 backend is set to become the main backend, I think that Florian's fix is good for now, and it's probably better to invest effort into other things.

gmarkall / manycore_form_compiler

Support more than one solve per UFL equation #31