odarbelaeze opened this issue 8 years ago
@richard-evans according to this:
Jxx_vals_h.push_back( -::atoms::v_exchange_list[iid].Jij[0]);
Jyy_vals_h.push_back( -::atoms::v_exchange_list[iid].Jij[1]);
Jzz_vals_h.push_back( -::atoms::v_exchange_list[iid].Jij[2]);
I can guess the xx, yy, and zz values of the exchange for the vector case, but what would be the indices for xy, xz, etc. in the tensor case?
IIRC there are separate exchange lists for the isotropic, vector, and tensor cases. So for the xy, xz, ... components you probably want something like:
Jxy_vals_h.push_back( -::atoms::t_exchange_list[iid].Jij[0][1] );
I would need to check the src files though.
@mattoaellis since you're here, did you try CUSP to do the matrix multiplication? I'm looking at this at the moment: http://cusplibrary.github.io/classcusp_1_1csr__matrix__view.html#a2a5a8f5d37b23c657002ad4c31509688
Yeah, I've used CUSP in the past for matrix multiplication. It's pretty neat for easy set-up and calculations. It has both the diagonal format and Ellpack, which are both useful for the exchange calculation. I thought we decided against it as it wasn't part of the main CUDA toolkit and would require people to download the library themselves? Another downside is that (unless they have updated it) they don't have any symmetric solvers, since they claim the transpose would not be efficient enough.
Hmm, it seems to be on the clusters, so no extra downloads are required; furthermore, it seems like DIA is the way to go, and it is very easy to do format conversions with this.
I think that was because we were using it before. Out of interest, what version is it? I would certainly recommend it, and downloading an extra library is not that hard, but I don't think we can count on it being pre-installed on every cluster. DIA is likely going to be the best, but not always, as it heavily depends on the numerical ordering of the spins. ELL is a bit more versatile in that respect, as it only depends on having a constant number of values per row, which we are likely to have unless it is some strange exchange. I suggest setting it up as a typedef within a macro so that we can easily switch between them if we go for CUSP:
#ifdef __DIAEXCH
typedef cusp::dia_matrix<int, double, cusp::device_memory> exch_matrix_t;
#else
typedef cusp::ell_matrix<int, double, cusp::device_memory> exch_matrix_t;
#endif
In CUSP it is super easy to set up the matrix on the host in a simple format, i.e. CSR, which I think the serial version uses anyway, and then just convert to our chosen format.
cusp::csr_matrix<int, double, cusp::host_memory> exchange_h(N, N, NNbrs);
// fill matrix
exch_matrix_t exchange_d(exchange_h); // Should do the conversion and host-to-device copy for us
Haha, I'm on it. I'll just amalgamate the library now, as we did with CUB; we can clean up the mess before we merge to development. Is it safe to do this:
cusp::array1d_view <...> spin_view (some_thrust_device_vector.begin(), some_thrust_device_vector.end());
Yeah, the cusp arrays are just slightly expanded thrust vectors, so they are all compatible. Annoyingly, thrust is (mostly) compatible with cusp arrays but not vice versa. Why the array view in this case? Since the view goes from begin() to end(), why not array1d?
Ah, I think I get it. Is it because the view does not copy the data but just references the underlying storage?
Yeah, just to keep the spin arrays intact so that the changes do not conflict with other pieces of the code.
I've just tested that, and creating an array view from a thrust device vector and passing it to cusp::multiply seems to work fine.
I'm integrating it, but it seems like including the cusp headers conflicts with cuRAND; did you run into this?
/usr/local/cuda/bin/..//include/curand_mtgp32_kernel.h(315): error: calling a __device__ function("__syncthreads") from a __host__ __device__ function("curand_mtgp32_single") is not allowed
/usr/local/cuda/bin/..//include/curand_mtgp32_kernel.h(373): error: calling a __device__ function("__syncthreads") from a __host__ __device__ function("curand_mtgp32_single_specific") is not allowed
/usr/local/cuda/bin/..//include/curand_kernel.h(392): warning: missing return statement at end of non-void function "__curand_uint32_as_float"
Those errors are not there if I omit the cusp headers. It seems like this could be the issue; however, we are working with CUDA 6.5, and reversing the include order here and there does not help: http://stackoverflow.com/questions/23352122/including-thrust-sort-h-before-curand-kernel-h-gives-compilation-error
I've never had any problem with cuRAND, CUSP and thrust all together. As the post says, it is a version issue; I remember hearing some worrying reports about bugs like that. Can you drop back to CUDA 5 to test? Otherwise it may require changing the order in which the headers are included.
Ah, I do remember having some issues with CUSP at one point (not the same as yours) when I moved to a newer version, but they were just warnings rather than errors and it ran just fine. Check whether the version of CUSP you are including is the latest one. If not, you could download the library locally.
I downloaded the 0.4.0 version, which I think is the latest. I'll have to keep changing the order of the include statements.
I've tested out CUDA 5.5 and 6.5 using the latest version of CUSP, v0.5.1, and I don't seem to have any problems. Which CUSP and cuRAND headers are you using exactly?
@mattoaellis is there a nice way in CUSP to do the C = a*A*B + b*B thing we do with cuSPARSE? Also see https://github.com/odarbelaeze/vampire/pull/2/commits/98cafe24855c80cf596287974ec39361ac25ba7c#diff-ab8be042c5e3fb066cb0ea4e24a10990R15, which compiles in Wohlfarth.
@odarbelaeze CuRAND or CuSparse?
There is a more generalised multiply routine which allows you to specify details about the operation: http://cusplibrary.github.io/group__matrix__algorithms.html#gafc7e7d80ae5f5f3fcd696f89074aa0b2
That is in CUSP v0.5.1, but I don't know if it is in v0.4. Mind you, for the operation you specify, you can just fold b*B into the diagonals of A. I have done that in the past to compute the anisotropy in addition to the exchange interaction.
Now it runs and yields essentially the same results; however, it's considerably slower. Although all the data is already on the GPU, CUSP decides that it needs to move stuff to host memory and back while performing the field calculation. That is: step one is to create array views for the fields and spins; step two is to perform the multiplication. generalized_spgemm does not seem to be added to the cusp namespace for some reason, so I ended up using generalized_spmv, which is the same for this purpose.
Yuck - that's not what we wanted. So is the solution to do DIA by hand and then pass this to cuSPARSE? Or can cusp do the conversion once and store that, then just pass this as a DIA matrix?
That is rather strange. Indeed, CUSP can do the conversion, and then you can access the data however you want to perform the SpMV elsewhere. I've never had this issue, but I used the cusp arrays rather than the thrust ones, so I never needed the array views. Would it be worth swapping to the cusp arrays, since they can be used the same as thrust ones elsewhere?
That could be possible, since cusp handles the transformations easily; however, it seems the DIA format is not available in cuSPARSE (http://docs.nvidia.com/cuda/cusparse/#matrix-formats). We could experiment with the ELL format, or try to set up the arrays as cusp::array1d instead of thrust::device_vector, but that may require major changes in the whole vcuda package.
@mattoaellis is the thrust::raw_pointer_cast signature the same with cusp::array1d types?
Yep, that has worked in the past for me.
@mattoaellis @richard-evans Good news: changing the types of the vectors to cusp::array1d and changing from C = alpha * A * B + beta * B to C = A * B sped up the code by a large amount. The only downside is that the exchange fields need to be computed first now, but it was the same for the non-exchange fields before. I'll test for correctness and have Andrea check the speed.
Good news! Let's see how things turn out...
I did a real quick Curie temperature benchmark:
# With CuSPARSE CSR
real 3m21.703s
user 2m16.917s
sys 1m3.840s
# With CUSP DIA
real 2m5.137s
user 1m30.466s
sys 0m34.178s
# With 1 serial core in Wohlfarth
real 19m29.640s
user 19m25.569s
sys 0m0.468s
I'll edit this with the CPU serial results when they're available. Furthermore, we are putting a lot of nice flags in the NVCC_FLAGS variable and then ignoring that variable in favour of CUDA_FLAGS.
Ok, nice! So ~5x faster for the exchange calculation. Yes, I deleted the NVCC variable in my PR. So maybe the next thing is to reprofile and see where the new hotspots are.
I need better CPUs in Colombia.
# With CuSPARSE CSR: real 3m21.703s, user 2m16.917s, sys 1m3.840s
# With CUSP DIA: real 2m5.137s, user 1m30.466s, sys 0m34.178s
# With 1 serial core in Wohlfarth: real 19m29.640s, user 19m25.569s, sys 0m0.468s
cu::grid_size variable at initialisation (according to the plan). That's one thing; let's see what else comes up and register it here.