Local vector assembly

Changes to support code generation of the local assembly when we're solving for vector fields. The changes consist of:

A vector basis is formed as the tensor product of a scalar basis, and it needs to be stored in an array at the start of the kernel. The new function _buildBasisTensors generates the code to declare this array and populate it with the correct terms from the scalar basis.
Since the vector basis is a multidimensional array, support for multidimensional array subscripts has been added to the CUDA backend. When code referencing a vector basis is generated, it uses buildMultiArraySubscript instead of buildArraySubscript so that it matches with the type of the array storing the vector basis.
Since vector bases have an additional index that is present in the UFL AST, we need to memorise what the indices are when we encounter an Indexed object. This is so that the indices can be referenced by the code that generates the array subscript for the vector basis. This memorisation is taken care of in the indexed and multi_index methods of ExpressionBuilder.
Computation of the number of basis functions for a vector or tensor basis is added, in order to get the correct upper bounds of loops over a vector basis.
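To make the points above concrete, here is a minimal Python sketch of the three ideas: counting basis functions for a vector or tensor basis, populating a vector basis array from a scalar basis, and emitting a multidimensional subscript. It is an illustration only, not the code in this pull request. The function names, the array name CG1_v, the index names, and the component-major ordering of the vector basis functions are all assumptions made for the example.

```python
def num_basis_functions(scalar_count, dim, rank):
    """Number of basis functions for a rank-`rank` basis built from a
    scalar basis with `scalar_count` functions in `dim` dimensions.
    rank 0 is scalar, rank 1 is vector, rank 2 is tensor."""
    return scalar_count * dim ** rank


def build_basis_tensor(scalar_basis, dim):
    """Build vector_basis[r][k][q] from scalar_basis[i][q].

    Assumed (hypothetical) ordering: vector basis function r = c * n + i
    has the scalar function i in component c and zero elsewhere.
    """
    n = len(scalar_basis)
    nq = len(scalar_basis[0])
    total = num_basis_functions(n, dim, 1)
    vector_basis = [[[0.0] * nq for _ in range(dim)] for _ in range(total)]
    for c in range(dim):
        for i in range(n):
            r = c * n + i
            for q in range(nq):
                vector_basis[r][c][q] = scalar_basis[i][q]
    return vector_basis


def multi_array_subscript(name, indices):
    """Return the source text for a multidimensional subscript,
    e.g. name[i][j][k], one bracket pair per index."""
    return name + "".join("[%s]" % index for index in indices)


# Three scalar basis functions evaluated at two quadrature points.
scalar_basis = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
vector_basis = build_basis_tensor(scalar_basis, 2)
subscript = multi_array_subscript("CG1_v", ["i_r_0", "i_d_0", "i_g"])
```

The count returned by num_basis_functions is what would serve as the upper bound of a loop over the vector basis, and the extra component index in the subscript is the one that has to be remembered when an Indexed object is visited.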

Caveats:

Some of the changes modify the output of the op2 identity-vector test case. How do we feel about this? I imagine the rest of the functionality will need porting into op2 as well when we start using it as a backend instead of the CUDA one.
The code only works for forms with arguments, not for derivatives of arguments. The generated code for initialising the tensor product of the derivatives of the basis functions was very messy, and it had to go inside the element loop, which would have made its cost non-trivial. Once we can generate the transformation in the form kernel, this will be far easier to implement: we will only need the tensor product of the derivatives of the basis functions on the reference element, which can be computed once outside the element loop with simpler code.

gmarkall / manycore_form_compiler