gmarkall / manycore_form_compiler

MCFC is deprecated. See https://code.launchpad.net/~grm08/ffc/pyop2
GNU General Public License v3.0

Support more than one solve per UFL equation #31

Open kynan opened 13 years ago

kynan commented 13 years ago

This discussion is currently specific to the CUDA backend!

The CudaAssembler currently only supports up to one solve per UFL input.

This is due to the hard coded use of only a single set of global variables declared for:

Instead, a set of these would be required for each field solved for in the UFL equation. The corresponding code for their initialisation needs to be generated in initialise_gpu, and run_model needs to reference the correct variables for the field being solved for.

It's more complicated than that:

Another important issue is that currently the data flow of coefficients is not tracked across extraction from state through solves and writing back to state. Coefficients extracted from and written back to state use the field allocated in state. For all other (temporary) coefficients, a temporary field needs to be allocated in state.

Note: a better strategy could be to associate only the host memory location with a field held in state and keep the memory allocated for coefficients on the device completely separate. This would eliminate the overhead currently incurred by adding a temporary field into state: at the moment the entire mesh and sparsity are copied unnecessarily.

This relates to issue #15, since currently the CUDA state holder cannot retrieve sparsities for different fields.

A test case needs to be added once this has been implemented.

dham commented 13 years ago

It's actually even more complex than the above as it may sometimes be necessary and/or efficient to have more than one matrix on the device at once. For example the pressure projection matrix is often very expensive to assemble but it's also often a linear term so we can keep the matrix around between timesteps.

The first thing to note is that sparsities and solves are primarily an OP2 (or equivalent backend) issue. I think that the OP2 way to do this would be to declare the OP2 sparsity. It's then up to OP2 to notice that the sparsity is the same as another one which has already been declared. This is in principle a reasonably easy task for OP2 as each sparsity has a signature which is the set of maps which are declared for it at sparsity declaration time.
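
As a rough illustration of the signature idea, here is a minimal Python sketch. It is not OP2 code: the SparsityRegistry class, its declare method and the map names are all hypothetical, and the signature is assumed to be simply the pair of map names given at declaration time.

```python
class SparsityRegistry:
    """Hypothetical registry that reuses a sparsity when the same maps are declared again."""

    def __init__(self):
        self._by_signature = {}
        self.builds = 0  # how many sparsities were actually built

    def declare(self, row_map, col_map):
        # Assumption: the signature is the pair of map names given at declaration time.
        signature = (row_map, col_map)
        if signature not in self._by_signature:
            self.builds += 1
            self._by_signature[signature] = {"id": self.builds, "maps": signature}
        return self._by_signature[signature]


registry = SparsityRegistry()
velocity = registry.declare("cell_to_p2_node", "cell_to_p2_node")
pressure = registry.declare("cell_to_p1_node", "cell_to_p1_node")
# Declaring the same maps again returns the sparsity that already exists.
projection = registry.declare("cell_to_p1_node", "cell_to_p1_node")
```

With this shape the generated code can declare a sparsity for every solve without caring whether an identical one already exists; deduplication stays inside the backend.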

Temporary fields should not be inserted into fluidity state. In fact they don't need to be either. Simply declare the appropriate op_dat at the start of the generated host routine and destroy it at the end. We don't yet have an OP2 destroy command but we're going to need one.
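
The declare-at-start, destroy-at-end pattern could look roughly like the Python sketch below. Everything in it is a stand-in: declare_dat, destroy_dat and temporary_dat are hypothetical helpers, not OP2 calls (as noted above, OP2 has no destroy command yet), and the live_dats set only simulates device allocations.

```python
from contextlib import contextmanager

live_dats = set()  # stands in for the set of current device allocations


def declare_dat(name, size):
    # Hypothetical stand-in for declaring an op_dat.
    live_dats.add(name)
    return {"name": name, "data": [0.0] * size}


def destroy_dat(dat):
    # Hypothetical stand-in for the destroy command that does not exist yet.
    live_dats.discard(dat["name"])


@contextmanager
def temporary_dat(name, size):
    dat = declare_dat(name, size)
    try:
        yield dat
    finally:
        # Runs on normal exit and on error, so the temporary never leaks.
        destroy_dat(dat)


def host_routine():
    with temporary_dat("tmp_rhs", 4) as tmp:
        tmp["data"][0] = 1.0
        return "tmp_rhs" in live_dats  # the temporary exists only inside the routine
```

The temporary never touches state, so no mesh or sparsity is copied on its behalf.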

For both sparsities and fields, the copying back to state is also an OP2 issue. This was the point of the discussion about put and get at the Oxford meeting. It's essentially a cache dirtying problem. For sparsities and matrices, OP2 is welcome to decide that it doesn't have enough device memory and copy the sparsity/matrix back to the host whenever it feels like it. I think we will be able to get into a position where it will be able to notice that a matrix is linear and can therefore make decisions about whether to keep it lying around, but we'll worry about that one WAAY down the line.

For fields, clearly the only ones which ever need to be copied back are the ones which are re-inserted into state. The short term solution for that might be to simply issue an OP_GET (or whatever it's called) at the end of the host routine. The longer term solution might be to instrument fluidity so that it checks the coherency of fields whenever it touches them and triggers a copy back only when needed. Further down the line there are more possibilities: for example we might have a non-blocking OP_GET which does the copy back in the background while the main program keeps going until it needs that value on the host and then blocks.
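
The longer term coherency check amounts to a dirty flag on each field. A minimal Python sketch of that idea follows; the CoherentField class is hypothetical, the "device" copy is just a second Python list, and none of this reflects an actual OP2 or Fluidity interface.

```python
class CoherentField:
    """Hypothetical field wrapper that copies back from the device only when needed."""

    def __init__(self, values):
        self._host = list(values)
        self._device = list(values)  # simulated device copy
        self._device_is_newer = False
        self.copies_back = 0  # counts simulated device-to-host transfers

    def run_kernel(self, fn):
        # A kernel updates the device copy and marks the host copy as stale.
        self._device = [fn(v) for v in self._device]
        self._device_is_newer = True

    @property
    def host_values(self):
        # Touching the field on the host triggers a copy back only if it is stale.
        if self._device_is_newer:
            self._host = list(self._device)
            self._device_is_newer = False
            self.copies_back += 1
        return self._host


field = CoherentField([1.0, 2.0])
field.run_kernel(lambda v: v * 2)
field.run_kernel(lambda v: v + 1)
result = field.host_values  # first host access after the kernels: one copy back
again = field.host_values   # already coherent: no further copy
```

Two kernel launches followed by two host reads cost a single transfer here, whereas an unconditional OP_GET at the end of every host routine would copy each time.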

kynan commented 13 years ago

Thanks, these are very good points. My opening post is very specific to the CUDA backend and its current implementation. Many of these issues should "magically" go away when OP2 is fitted with the necessary logic to handle fields and communicate with Fluidity state. Given that the CUDA backend will die eventually (but will still be used for our initial shallow-water work), we need it to "work", but not necessarily in the nicest / most efficient way for now.

kynan commented 13 years ago

This is tentatively fixed in e85741ea605c327222f0

Caveats:

gmarkall commented 12 years ago

Since the CUDA backend has a limited lifetime, and the OP2 backend is set to become the main backend, I think that Florian's fix is good for now, and it's probably better to invest effort into other things.