pseudotensor opened this issue 7 years ago
What's your matrix size?
Please note the profile starts right away when the program executes. That means it has to conduct host registration and other one-time initializations before actually doing the computation.
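(One common way to keep that one-time cost out of the measurement, shown as a minimal sketch assuming the cblas_sgemm test harness quoted later in this thread and standard CBLAS/CUDA headers: run one warm-up call before starting the profiler, so registration happens outside the profiled region.)

```c
#include <cblas.h>
#include <cuda_profiler_api.h>

/* Sketch: the first call absorbs host registration and other one-time
 * initialization; only the second, steady-state call is profiled. */
void profile_sgemm(int M, int N, int K, float alpha, const float *A,
                   const float *B, float beta, float *C) {
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, M, N, K,
                alpha, A, M, B, K, beta, C, M);   /* warm-up, not profiled */
    cudaProfilerStart();
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, M, N, K,
                alpha, A, M, B, K, beta, C, M);   /* measured */
    cudaProfilerStop();
}
```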
I pointed to the repo that has the gemm2.c, but the main lines changed are:
int M = 10000; int N = M; int K = M;
But the problem occurs for any size.
In nsight, I selected the option for the profiler to not start profiling when the program starts, so it only starts profiling when the profiler-start function is called. But in any case, while the gemm is running, the other GPUs are mostly idle.
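(For anyone reproducing this from the command line: the nvprof equivalent of that Nsight setting is `--profile-from-start off`, which likewise defers collection until `cudaProfilerStart()` is called.)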
From the figure, the other GPUs are not computing at all. I will investigate this and get back to you.
FYI, can I ask why you are using this? What is it for? If it's okay with you, you can reach me at: wangnan318@gmail.com
Hi, in general a multi-GPU BLAS is a great idea as a drop-in replacement, similar to nvblas's cublasXT. However, I have problems with many programs segfaulting under nvblas, so I was trying yours. Nice paper :)
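(For context, a rough sketch of the cublasXT route mentioned above, in case it helps compare alternatives. This is the cuBLAS-XT API, not BLASX; the four-device list matches my setup and error checking is omitted.)

```c
#include <cublasXt.h>

/* cuBLAS-XT splits a single GEMM across the selected GPUs using
 * host-resident matrices, much like an nvblas-style drop-in. */
void xt_sgemm(int M, int N, int K, float alpha, const float *A,
              const float *B, float beta, float *C) {
    cublasXtHandle_t handle;
    cublasXtCreate(&handle);

    int devices[4] = {0, 1, 2, 3};            /* all four Titan-X cards */
    cublasXtDeviceSelect(handle, 4, devices);

    /* Column-major, no transpose, same shapes as the test code below. */
    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                  &alpha, A, M, B, K, &beta, C, M);

    cublasXtDestroy(handle);
}
```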
Hello,
Thank you for your interest. If you're not in a rush, it may take me a while to look into the case. If you're working on a paper and need to use the library ASAP, please let me know and I will prioritize this request. Thank you.
Hi, I'd love to use it ASAP, and I'm looking at other alternatives just in case.
I see, I will get back to you within this week.
Hello,
Can you please try this commit? I'm preparing a new project for NIPS right now, so it is quite unlikely I can actively support this project at the moment. Sorry about that.
9bc22cce65e00673a8265b2dd8c7c18f91c94299
Running the testing/gemm.c with only sgemm (commenting out dgemm code) and larger matrices:
```c
int loop = 0;
for (loop = 1; loop < 2; loop++) {
    int M = 10000; int N = M; int K = M;
    float alpha_f = (float)(((double)rand()/(double)RAND_MAX)*10)+1;
    float beta_f  = (float)(((double)rand()/(double)RAND_MAX)*10)+1;
    float *A_f, *B_f, *C_f;
    A_f = (float*)malloc(sizeof(float)*M*K);
    B_f = (float*)malloc(sizeof(float)*K*N);
    C_f = (float*)malloc(sizeof(float)*M*N);
    Fill_Float(A_f, M, K);
    Fill_Float(B_f, K, N);
    Fill_Float(C_f, M, N);
    fprintf(stderr, "START");
    cudaProfilerStart();
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, M, N, K,
                alpha_f, A_f, M, B_f, K, beta_f, C_f, M);
    cudaProfilerStop();
    fprintf(stderr, "END");
    free(A_f); free(B_f); free(C_f);
}
```
shows very little multi-GPU use in nvprof with my 4 Titan X (Pascal) cards. Also, even discounting the matrix filling, there is still a lot of wait time before any GEMM work happens.
I forced type=3 in blas/sgemm.c so that BLASX is always used, so this should be all BLASX and no CPU BLAS.
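(As a sanity check independent of the profiler, watching `nvidia-smi dmon` in another terminal during the call should show whether the other three GPUs ever ramp up at all.)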