jkn93 opened this issue 8 years ago (status: Open)
That's correct. A 30000x30000 double-precision matrix goes beyond 6 GB. I don't recommend testing double precision on a TITAN; we normally use a K40. You can try single precision, and it should deliver very good performance.
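(For reference, a back-of-the-envelope check not stated explicitly in the thread: one 30000 x 30000 double-precision matrix is 30000 x 30000 x 8 B ≈ 7.2 GB, and dgemm involves three such operands, so even a single full operand already exceeds the 6 GB of a GTX Titan.)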
Thank you for the quick reply. I'm using the older Titan cards with 1/3 DP performance on my dev server. I found the performance of dgemm with matrices of dimension 20000x20000 rather promising, so I was wondering whether your library is intended for production use. For that, it should be able to handle larger matrices, check the available device memory, and employ some sort of batching.
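A minimal sketch of the kind of device-memory check meant here, using the plain CUDA runtime API (cudaMemGetInfo is a standard call; the scheduling comment is purely illustrative and is not BLASX code):

```c
#include <stdio.h>
#include <cuda_runtime.h>

/* Query free/total memory on each visible device before scheduling work. */
int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int d = 0; d < ndev; ++d) {
        size_t free_b = 0, total_b = 0;
        cudaSetDevice(d);
        cudaMemGetInfo(&free_b, &total_b);
        printf("device %d: %.0f MiB free of %.0f MiB\n",
               d, free_b / 1048576.0, total_b / 1048576.0);
        /* A host-side scheduler could cap per-device tile buffers to a
           fraction of free_b and stream tiles of A/B/C through them. */
    }
    return 0;
}
```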
It can handle any matrix size as long as you have enough host RAM. (The current memory issue is due to the way I mark finished tasks; fixing it would require fairly complex logic for little benefit.) I have handed out a copy of my paper to folks at MATLAB and NVIDIA; I will leave production-ready code to them if they are interested in taking on BLASX, as it is too much work for an individual. Besides memory checking, there is a lot of additional work, considering the many variations of level-3 BLAS such as Trans, Uplo, and Diag.
Can you tell me why you need a matrix multiplication of 10^4 size? Thanks.
Good call, let's hope they make the best of it. I'm developing ab initio electronic structure methods for large-scale systems.
I've modified the gemm example to use dgemm only, with matrices of dimension 30000x30000. On a server with four GTX Titan cards the program segfaults. It seems there aren't any checks on available device memory. (A reproduction sketch follows the nvidia-smi output below.)
nvidia-smi:
| 0 30252 C ./gemm 6067MiB |
| 1 30252 C ./gemm 6067MiB |
| 2 30252 C ./gemm 6067MiB |
| 3 30252 C ./gemm 6067MiB |
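A minimal sketch of the failing case, assuming BLASX is linked in place of a CPU BLAS behind the standard cblas_dgemm interface, as in the shipped gemm example (the actual test harness in the repository may differ):

```c
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

/* Illustrative reproduction: three 30000x30000 double matrices,
   roughly 3 x 7.2 GB of host RAM, multiplied once. */
int main(void)
{
    const int n = 30000;
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = malloc((size_t)n * n * sizeof(double));
    if (!A || !B || !C) { fprintf(stderr, "host allocation failed\n"); return 1; }

    for (size_t i = 0; i < (size_t)n * n; ++i) { A[i] = 1.0; B[i] = 1.0; C[i] = 0.0; }

    /* Routed to the GPUs when linked against BLASX instead of a CPU BLAS. */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    printf("C[0] = %f\n", C[0]);
    free(A); free(B); free(C);
    return 0;
}
```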