richard-evans / vampire

Atomistic simulator for magnetic materials
GNU General Public License v2.0
124 stars 94 forks

CUDA Debug/Performance TODOS #12

Open odarbelaeze opened 8 years ago

odarbelaeze commented 8 years ago

That's one thing, let's see what else comes up and register it here.

richard-evans commented 8 years ago
odarbelaeze commented 8 years ago

@richard-evans according to this:

Jxx_vals_h.push_back( -::atoms::v_exchange_list[iid].Jij[0]);
Jyy_vals_h.push_back( -::atoms::v_exchange_list[iid].Jij[1]);
Jzz_vals_h.push_back( -::atoms::v_exchange_list[iid].Jij[2]);

I can guess the xx, yy, and zz values of the exchange for the vector case, but, what would be the indices for xy, xz, ...etc for the tensor case?

mattoaellis commented 8 years ago

IIRC there are separate exchange lists for the isotropic, vector and tensor cases. So for the xy, xz, ... components you probably want something like:

Jxy_vals_h.push_back( -::atoms::t_exchange_list[iid].Jij[0][1]);

I would need to check the src files though.
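Spelling that out with a toy example: assuming the tensor list stores a full 3x3 Jij per pair (the struct below is a hypothetical stand-in, not the actual vampire type), the nine components map to row/column indices like this:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an entry of ::atoms::t_exchange_list:
// a full 3x3 exchange tensor for one atom pair.
struct tensor_exchange_t { double Jij[3][3]; };

// Extract one (row, col) component of every tensor in the list, negated
// the same way the vector-exchange code negates Jij before upload.
std::vector<double> component_vals(const std::vector<tensor_exchange_t>& list,
                                   int row, int col) {
    std::vector<double> vals;
    for (std::size_t iid = 0; iid < list.size(); ++iid)
        vals.push_back(-list[iid].Jij[row][col]);
    return vals;
}
```

So Jxy comes from Jij[0][1], Jyz from Jij[1][2], and so on, with the diagonal Jij[0][0], Jij[1][1], Jij[2][2] matching the vector case.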

odarbelaeze commented 8 years ago

@mattoaellis since you're here, did you try CUSP for the matrix multiplication? I'm looking at this at the moment: http://cusplibrary.github.io/classcusp_1_1csr__matrix__view.html#a2a5a8f5d37b23c657002ad4c31509688

mattoaellis commented 8 years ago

Yeah, I've used CUSP in the past for matrix multiplication. It's pretty neat for setting things up and for simple calculations. It has both the diagonal (DIA) format and ELLPACK, which are both useful for the exchange calculation. I thought we decided against it since it isn't part of the main CUDA toolkit and would require people to download the library themselves? Another downside is that (unless they have updated it) they don't have any symmetric solvers, since they claim the transpose would not be efficient enough.

odarbelaeze commented 8 years ago

Hmm, it seems to be on the clusters, so no extra downloads required. Furthermore, it looks like DIA is the way to go, and it's very easy to do format conversions with this.

mattoaellis commented 8 years ago

I think that was because we were using it before. Out of interest, what version is it? I would certainly recommend it, and downloading an extra library is not that hard, but I don't think we can count on it being pre-installed on every cluster. DIA is likely going to be the best, but not always, as it heavily depends on the numerical ordering of the spins. ELL is a bit more versatile in that respect, as it only depends on having a constant number of values per row, which we are likely to have unless it is some strange exchange. If we go for CUSP, I suggest setting it up as a typedef within a macro so that we can easily switch between them:

#ifdef __DIAEXCH
typedef cusp::dia_matrix<int, double, cusp::device_memory> exch_matrix_t;
#else
typedef cusp::ell_matrix<int, double, cusp::device_memory> exch_matrix_t;
#endif

In CUSP it is super easy to set up the matrix on the host in a simple format, i.e. CSR, which I think the serial version uses anyway, and then just convert to our chosen format:

cusp::csr_matrix<int, double, cusp::host_memory> exchange_h(N, N, NNbrs);
// fill matrix
exch_matrix_t exchange_d(exchange_h); // should do the conversion and host-to-device copy for us
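To illustrate why DIA depends so heavily on spin ordering, here is a CUSP-free sketch: for consecutively numbered nearest neighbours the exchange matrix collapses onto just two diagonals, while an arbitrary renumbering would scatter the same entries over many offsets (the function names here are made up for illustration):

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Collect the distinct diagonal offsets (col - row) of a neighbour list.
// DIA storage needs one stored diagonal per distinct offset, so fewer is better.
std::set<int> dia_offsets(const std::vector<std::pair<int, int>>& pairs) {
    std::set<int> offsets;
    for (const auto& p : pairs) offsets.insert(p.second - p.first);
    return offsets;
}

// Neighbour pairs for a 1D nearest-neighbour chain of N consecutively
// numbered spins (both (i, i+1) and (i+1, i) since the matrix is square).
std::vector<std::pair<int, int>> chain_pairs(int N) {
    std::vector<std::pair<int, int>> pairs;
    for (int i = 0; i + 1 < N; ++i) {
        pairs.push_back({i, i + 1});
        pairs.push_back({i + 1, i});
    }
    return pairs;
}
```

For a chain of any length this gives exactly two offsets, {-1, +1}, so DIA stores two dense diagonals; a random renumbering of the same spins would spread the entries over O(N) offsets and DIA would waste most of its storage, which is when ELL's constant-entries-per-row layout wins.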
odarbelaeze commented 8 years ago

Hehe, I'm on it. I'll just amalgamate the library now, as we did with CUB; we can clean up the mess before we merge to development 💯. Is it safe to do this:

cusp::array1d_view<...> spin_view(some_thrust_device_vector.begin(), some_thrust_device_vector.end());
mattoaellis commented 8 years ago

Yeah, the cusp arrays are just slightly expanded thrust vectors, so they are all compatible. Annoyingly, thrust is (mostly) compatible with cusp arrays but not vice versa. Why the array view in this case? Since the view spans begin() to end(), why not array1d?

mattoaellis commented 8 years ago

Ah, I think I get it. Is it because the view does not copy the data but just references the underlying data?

odarbelaeze commented 8 years ago

Yeah, just to keep the spin arrays intact so that the changes do not conflict with other pieces of the code 😸
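That is the key property of a view: it holds a pointer into existing storage rather than its own copy, so writes go straight through to the original array. A minimal CUSP-free analogue (the `view` struct below is purely illustrative; cusp::array1d_view plays this role for thrust/cusp vectors):

```cpp
#include <cstddef>
#include <vector>

// Minimal non-owning view: stores a pointer and a length, copies nothing.
struct view {
    double* first;
    std::size_t n;
    double& operator[](std::size_t i) { return first[i]; }
};

// Build a view over an existing vector's storage.
view make_view(std::vector<double>& v) { return {v.data(), v.size()}; }
```

Because the view aliases the vector's memory, code that writes fields through the view updates the original spin array in place, with no copy and no conflict with code that still uses the vector directly.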

mattoaellis commented 8 years ago

I've just tested that, and it seems to work fine creating an array view from a thrust device vector and passing that to cusp::multiply.

odarbelaeze commented 8 years ago

I'm integrating it, but it seems like including the cusp headers conflicts with cuRAND. Did you run into this?

/usr/local/cuda/bin/..//include/curand_mtgp32_kernel.h(315): error: calling a __device__ function("__syncthreads") from a __host__ __device__ function("curand_mtgp32_single") is not allowed

/usr/local/cuda/bin/..//include/curand_mtgp32_kernel.h(373): error: calling a __device__ function("__syncthreads") from a __host__ __device__ function("curand_mtgp32_single_specific") is not allowed

/usr/local/cuda/bin/..//include/curand_kernel.h(392): warning: missing return statement at end of non-void function "__curand_uint32_as_float"


Those errors are not there if I omit the cusp headers 👎

odarbelaeze commented 8 years ago

It seems like this could be the issue; however, we are working with CUDA 6.5, and reversing the include order here and there does not help 😢 http://stackoverflow.com/questions/23352122/including-thrust-sort-h-before-curand-kernel-h-gives-compilation-error

https://github.com/cusplibrary/cusplibrary/blob/a9a642d0a29537124b1d05b3378fa97ecf7fb09f/cusp/system/detail/generic/sort.h#L26

mattoaellis commented 8 years ago

I've never had any problems with curand, cusp and thrust all together. As the post says, it is a version issue; I remember hearing some worrying reports about bugs like that. Can you drop back to CUDA 5 to test? Otherwise it may require changing the order the headers are included in.
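For what it's worth, the workaround usually suggested for this class of conflict is to pull in the curand headers before any thrust/CUSP ones, since the reported failure mode involves the thrust headers being seen first. A sketch of the include order (an assumption, not a guaranteed fix; whether it helps depends on the CUDA version):

```cpp
// Include curand first, before anything that drags in thrust.
#include <curand_kernel.h>

// Then the thrust / CUSP headers.
#include <thrust/device_vector.h>
#include <cusp/csr_matrix.h>
#include <cusp/multiply.h>
```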

mattoaellis commented 8 years ago

Ah, I do remember having some issues with CUSP at one point (not the same as yours) when I moved to a newer version, but they were just warnings rather than errors and it ran just fine. Check whether the version of CUSP you are including is the latest one; if not, you could download the library locally.

odarbelaeze commented 8 years ago

I downloaded version 0.4.0, which I think is the latest. I'll have to keep changing the order of the include statements.

mattoaellis commented 8 years ago

I've tested out CUDA 5.5 and 6.5 using the latest version of CUSP, v0.5.1, and I don't seem to have any problems. Which CUSP and cuRAND headers are you using exactly?

odarbelaeze commented 8 years ago

@mattoaellis is there a nice way in CUSP to do the C = a*A*B + b*B operation we do with cuSPARSE? Also see https://github.com/odarbelaeze/vampire/pull/2/commits/98cafe24855c80cf596287974ec39361ac25ba7c#diff-ab8be042c5e3fb066cb0ea4e24a10990R15, which compiles on Wohlfarth.

mattoaellis commented 8 years ago

@odarbelaeze cuRAND or cuSPARSE?

There is a more generalised multiply routine which allows you to specify details about the operation http://cusplibrary.github.io/group__matrix__algorithms.html#gafc7e7d80ae5f5f3fcd696f89074aa0b2

That is in CUSP v0.5.1, but I don't know if it is in v0.4. Mind you, for the operation you specify you can just fold b*B into the diagonals of A. I have done that in the past for computing the anisotropy in addition to the exchange interaction.
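The folding works because a*A*x + b*x = (a*A + b*I)*x, so adding b to each diagonal entry lets a single SpMV replace the two-term operation. A small dense sketch of the algebra (plain C++, not the CUSP API; names are illustrative):

```cpp
#include <cstddef>
#include <vector>

using mat = std::vector<std::vector<double>>;
using vec = std::vector<double>;

// Plain dense matrix-vector product y = A * x.
vec matvec(const mat& A, const vec& x) {
    vec y(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j)
            y[i] += A[i][j] * x[j];
    return y;
}

// Build A' = a*A + b*I so that A' * x == a*A*x + b*x in one multiply.
mat fold(const mat& A, double a, double b) {
    mat Ap = A;
    for (std::size_t i = 0; i < Ap.size(); ++i) {
        for (std::size_t j = 0; j < Ap.size(); ++j) Ap[i][j] *= a;
        Ap[i][i] += b;  // fold the b*x term onto the diagonal
    }
    return Ap;
}
```

For the exchange matrix this means baking the b coefficient (e.g. a uniaxial anisotropy contribution) into the stored diagonal once at setup, instead of paying for a second vector operation every field evaluation.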

odarbelaeze commented 8 years ago

Now it runs and yields essentially the same results; however, it's considerably slower. Although all the data is already on the GPU, CUSP decides that it needs to move stuff to host memory and back 👎 while performing the field calculation. That is: step one is to create array views for the fields and spins; step two is to perform the multiplication. generalized_spgemm does not seem to get added to the cusp namespace for some reason, so I ended up using generalized_spmv, which is the same for this purpose.

richard-evans commented 8 years ago

Yuck, that's not what we wanted. So is the solution to build the DIA matrix by hand and then pass it to cuSPARSE? Or can cusp do the conversion once, store the result, and then just pass that along as a DIA matrix?


mattoaellis commented 8 years ago

That is rather strange. Indeed, CUSP can do the conversion, and then you can access the data however you want to perform the spmv elsewhere. I've never had this issue, but I used the cusp arrays rather than the thrust ones, so I never needed the array views. Would it be worth swapping to the cusp arrays, since they can be used the same as thrust ones elsewhere?

odarbelaeze commented 8 years ago

That could be possible, since cusp handles the transformations easily. However, it seems the DIA format is not available in cuSPARSE (http://docs.nvidia.com/cuda/cusparse/#matrix-formats). We could experiment with the ELL format, or try to set up the arrays as cusp::array1d instead of thrust::device_vector, but that may require major changes to the whole vcuda package.

odarbelaeze commented 8 years ago

@mattoaellis is the thrust::raw_pointer_cast signature the same with cusp::array1d types?

mattoaellis commented 8 years ago

Yep, that has worked in the past for me.

odarbelaeze commented 8 years ago

@mattoaellis @richard-evans Good news: changing the types of the vectors to cusp::array1d and changing from C = alpha * A * B + beta * B to C = A * B sped up the code by a large amount. The only downside is that the exchange fields need to be computed first now, but it was the same for the non-exchange fields before. I'll test for correctness and have Andrea check for speed 👍

richard-evans commented 8 years ago

Good news! Let's see how things turn out...


odarbelaeze commented 8 years ago

I did a real quick Curie temperature benchmark:

# With CuSPARSE CSR
real    3m21.703s
user    2m16.917s
sys     1m3.840s

# With CUSP DIA
real    2m5.137s
user    1m30.466s
sys     0m34.178s

# With 1 serial core in Wohlfarth
real    19m29.640s
user    19m25.569s
sys     0m0.468s

I'll edit this with the CPU serial results when they're available. Furthermore, we are putting a lot of nice flags in the NVCC_FLAGS variable, and then ignoring that variable in favour of CUDA_FLAGS 😅

richard-evans commented 8 years ago

Ok, nice! So ~5x faster for the exchange calculation. Yes, I deleted the NVCC variable in my PR. So maybe the next thing is to re-profile and see where the new hotspots are.

R


odarbelaeze commented 8 years ago

I need better CPUs in Colombia 😞

# With CuSPARSE CSR
real    3m21.703s
user    2m16.917s
sys     1m3.840s

# With CUSP DIA
real    2m5.137s
user    1m30.466s
sys     0m34.178s

# With 1 serial core in Wohlfarth
real    19m29.640s
user    19m25.569s
sys     0m0.468s