ginkgo-project / ginkgo

Numerical linear algebra software package
https://ginkgo-project.github.io/
BSD 3-Clause "New" or "Revised" License

Creating view of array in Column-major format #756

Open vijaysm opened 3 years ago

vijaysm commented 3 years ago

I have a piece of code that receives as input a large vector (representing n vectors of size N) stored in column-major format, so size = n*N, rows = N, and cols = n. The vectors are packed and contiguous in memory. In order to work with the given input data, I create a view and use it in matvec computations. However, since the input is in column-major format and the Dense matrix assumes row-major storage, the following idea fails for n > 1.

std::vector<double> inputData(N * n);
using val_array = gko::Array<double>;  // shorthand for the Ginkgo value array type
// note: the view's executor must match where inputData actually lives (the host),
// otherwise the data is copied rather than viewed
auto viewInput = gko::matrix::Dense<double>::create(
    host_executor, gko::dim<2>{N, n},
    val_array::view(host_executor, inputData.size(), inputData.data()), n);

Is there a better way to create a Dense matrix view without actually changing the underlying data? That is, can I impose column-major storage for Dense instead of the default row-major? I didn't see an option for this in the documentation.
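For illustration, here is a minimal sketch of the layout mismatch (reusing inputData, N, and n from the snippet above; the lambdas are just for exposition):

// The column-major input stores entry (i, j) at offset j * N + i, while a
// row-major Dense view with stride n reads entry (i, j) from offset i * n + j.
auto col_major_entry = [&](std::size_t i, std::size_t j) { return inputData[j * N + i]; };
auto row_major_entry = [&](std::size_t i, std::size_t j) { return inputData[i * n + j]; };
// The two agree for all (i, j) only when n == 1, which is why the view works
// for a single vector but not for n > 1.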

pratikvn commented 3 years ago

@vijaysm, thanks for creating an issue and trying out Ginkgo. Unfortunately, we currently do not have a way to view contiguous data as column-major. You can always create multiple views for the different columns, if necessary. Do you have a specific application where you are using column-major storage?
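Since each column of a column-major buffer is itself contiguous, the per-column views mentioned above can be built without copying. A minimal sketch, assuming the executor and buffer names from the snippet in the issue:

// Column j occupies the contiguous range [j * N, (j + 1) * N) of inputData,
// so it can be wrapped as an N x 1 Dense view with stride 1.
using vec = gko::matrix::Dense<double>;
std::vector<std::unique_ptr<vec>> columns;
for (std::size_t j = 0; j < n; ++j) {
    columns.push_back(vec::create(
        host_executor, gko::dim<2>{N, 1},
        gko::Array<double>::view(host_executor, N, inputData.data() + j * N), 1));
}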

vijaysm commented 3 years ago

Thanks for the response. The input vector for which I need to compute SpMV operations is essentially a 2-D dense matrix provided by a Fortran subroutine. The SpMV operations are invoked many times during the course of the simulation, so using multiple views would lose any gains from the vectorization that can happen in sparse(A)*dense(vec) operations.

Copying the data into row-major format every time this is needed is not ideal. The ideal solution here would be a Ginkgo view for Dense that understands column-major format as well. If there is a better approach, do let me know.
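If a one-time copy does turn out to be acceptable, one way to obtain it is to exploit the fact that a column-major N x n matrix is bit-identical to a row-major n x N matrix: view the buffer with swapped dimensions and let Ginkgo transpose it. A sketch, with the same assumed names as above:

// View the column-major buffer as its row-major transpose (n x N, stride N) ...
auto t_view = gko::matrix::Dense<double>::create(
    host_executor, gko::dim<2>{n, N},
    gko::Array<double>::view(host_executor, N * n, inputData.data()), N);
// ... then materialize the row-major N x n copy.
auto x = gko::as<gko::matrix::Dense<double>>(t_view->transpose());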

upsj commented 3 years ago

We are thinking about adding general indexing (row stride + column stride) to our Dense matrix format, so that is not entirely out of reach. What I can't follow, though, is how the performance of A*x can profit from vectorization when x is stored column-major. In my understanding, that means entries of the same row are scattered across memory, as opposed to being stored next to each other as in row-major, so you need vector gather instructions instead of normal vector loads, which usually come with a significant performance penalty.
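For concreteness, the proposed general indexing would address entries as in this sketch (data, row_stride, and col_stride are hypothetical names, not current Ginkgo API):

// Entry (i, j) would live at data[i * row_stride + j * col_stride];
// row-major N x n corresponds to (row_stride, col_stride) = (n, 1),
// column-major N x n to (1, N).
auto entry_at = [&](std::size_t i, std::size_t j) {
    return data[i * row_stride + j * col_stride];
};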

vijaysm commented 3 years ago

> What I can't follow, though, is how the performance of A*x can profit from vectorization when x is stored column-major. In my understanding, that means entries of the same row are scattered across memory, as opposed to being stored next to each other as in row-major, so you need vector gather instructions instead of normal vector loads, which usually come with a significant performance penalty.

Fair point. If the input to Ginkgo were controllable, that would be a fair modification to make, i.e., use row-major instead of column-major. However, the input vector x is provided as-is, and re-copying it into a different format to perform the A*x operation would probably add a significant constant penalty depending on the size of x. I will try to write a small test to measure the performance difference and see whether it's worth changing the ordering of the view. If there already exists a test that can benchmark the performance degradation with a different dense-matrix format, that would be helpful.

upsj commented 3 years ago

How large an n are we talking about here? And do you want to use it in an SpMV, a solver, or a preconditioner? As long as the vector is of sufficient size and n is sufficiently small, it might be feasible, with almost no overhead, to assume the memory just maps to "stacked" row-major vectors and treat each one individually using Dense::get_submatrix.
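A sketch of that per-vector approach, using plain views instead of the submatrix API since each stacked vector is contiguous anyway (A, outputData, and the other names are assumptions, not from the issue):

// Apply the sparse matrix A to each packed vector individually; slice j is
// the contiguous range [j * N, (j + 1) * N), so no data movement is needed.
using vec = gko::matrix::Dense<double>;
for (std::size_t j = 0; j < n; ++j) {
    auto x_j = vec::create(host_executor, gko::dim<2>{N, 1},
        gko::Array<double>::view(host_executor, N, inputData.data() + j * N), 1);
    auto y_j = vec::create(host_executor, gko::dim<2>{N, 1},
        gko::Array<double>::view(host_executor, N, outputData.data() + j * N), 1);
    A->apply(x_j.get(), y_j.get());  // one SpMV per column
}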

vijaysm commented 3 years ago

For the current application use case of interest, the row length N is typically above 100K, and the column count n is typically between 10 and 40. At the moment, we are looking at optimizing the SpMV operations only, as part of applying a linear map for projecting fields.

> it might be feasible, with almost no overhead, to assume the memory just maps to "stacked" row-major vectors and treat each one individually using Dense::get_submatrix.

Are you suggesting launching multiple A*x operations with OpenMP to operate on slices of the original column-major vector, to achieve better throughput through hierarchical parallelism? At the moment, I have a for-loop over these slices but no work splitting. I may be able to implement some nested parallelism to see if it provides better speedup.
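For reference, the nested-parallelism idea might look like the following sketch (x_views and y_views are hypothetical per-column views as built above; note the thread-safety caveat in the reply below):

// Launch one SpMV per column from independent OpenMP threads. Ginkgo gives no
// thread-safety guarantees for executors, so treat this as experimental.
#pragma omp parallel for
for (std::size_t j = 0; j < n; ++j) {
    A->apply(x_views[j].get(), y_views[j].get());
}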

upsj commented 3 years ago

I see. Yes, I meant launching a single SpMV per column instead of one large SpMM for multiple columns. Whether you can get some benefit/concurrent execution out of it depends on the executor. On OpenMP, we basically defer everything to the runtime environment, so if you launch kernels on independent objects in parallel, they might provide nested parallelism (using Ginkgo from multiple threads may be a bit unstable, though, since we don't provide any thread-safety guarantees for executors). On CUDA, we currently don't support multi-stream execution, so it is unlikely that kernels would actually execute concurrently rather than one after the other, and with only 100k rows it may or may not be possible to utilize the GPU to its fullest extent.

So thanks for your question; this definitely shows some places where we can provide more generality. We might use this issue to keep track of our column-major support :)