pseudotensor opened this issue 7 years ago
What's your matrix size?
Please note the profile starts right away when the program executes. That means it has to conduct host registration and other one-time initializations before actually doing the computation.
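(One common way to keep that one-time cost out of the measurement, shown as a minimal sketch assuming the cblas_sgemm test harness quoted later in this thread and standard CBLAS/CUDA headers: run one warm-up call before starting the profiler, so registration happens outside the profiled region.)

```c
#include <cblas.h>
#include <cuda_profiler_api.h>

/* Sketch: the first call absorbs host registration and other one-time
 * initialization; only the second, steady-state call is profiled. */
void profile_sgemm(int M, int N, int K, float alpha, const float *A,
                   const float *B, float beta, float *C) {
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, M, N, K,
                alpha, A, M, B, K, beta, C, M);   /* warm-up, not profiled */
    cudaProfilerStart();
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, M, N, K,
                alpha, A, M, B, K, beta, C, M);   /* measured */
    cudaProfilerStop();
}
```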
I pointed to the repo that has the gemm2.c, but the main lines changed are:
int M = 10000; int N = M; int K = M;
But the problem occurs for any size.
In nsight, I selected the option for the profiler to not start profiling when the program starts, so it only starts profiling when the profiler-start function is called. But in any case, while the gemm is running, the other GPUs are mostly idle.
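(For anyone reproducing this from the command line: the nvprof equivalent of that Nsight setting is `--profile-from-start off`, which likewise defers collection until `cudaProfilerStart()` is called.)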
From the figure, the other GPUs are not computing at all. I will investigate this and get back to you.
FYI, can I ask why you are using this? What is it for? If it's okay with you, you can reach me at: wangnan318@gmail.com
Hi, in general a multi-GPU BLAS is a great idea as a drop-in replacement, similar to nvblas's cublasXT. However, I have problems with many programs segfaulting under nvblas, so I was trying yours. Nice paper :)
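(For context, a rough sketch of the cublasXT route mentioned above, in case it helps compare alternatives. This is the cuBLAS-XT API, not BLASX; the four-device list matches my setup and error checking is omitted.)

```c
#include <cublasXt.h>

/* cuBLAS-XT splits a single GEMM across the selected GPUs using
 * host-resident matrices, much like an nvblas-style drop-in. */
void xt_sgemm(int M, int N, int K, float alpha, const float *A,
              const float *B, float beta, float *C) {
    cublasXtHandle_t handle;
    cublasXtCreate(&handle);

    int devices[4] = {0, 1, 2, 3};            /* all four Titan-X cards */
    cublasXtDeviceSelect(handle, 4, devices);

    /* Column-major, no transpose, same shapes as the test code below. */
    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                  &alpha, A, M, B, K, &beta, C, M);

    cublasXtDestroy(handle);
}
```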
Hello,
Thank you for your interest. If you're not in a rush, it may take me a while to look into the case. If you're working on a paper and need to use the library ASAP, please let me know and I will prioritize this request. Thank you.
Hi, I'd love to use it ASAP, and I'm looking at other alternatives just in case.
I see, I will get back to you within this week.
Hello,
Can you please try this commit? I'm preparing a new project for NIPS right now, so it is quite unlikely I can actively support this project at the moment. Sorry about that.
9bc22cce65e00673a8265b2dd8c7c18f91c94299
Running the testing/gemm.c with only sgemm (commenting out dgemm code) and larger matrices:
```c
int loop = 0;
for (loop = 1; loop < 2; loop++) {
    int M = 10000; int N = M; int K = M;
    float alpha_f = (float)(((double)rand()/(double)RAND_MAX)*10)+1;
    float beta_f  = (float)(((double)rand()/(double)RAND_MAX)*10)+1;
    float *A_f, *B_f, *C_f;
    A_f = (float*)malloc(sizeof(float)*M*K);
    B_f = (float*)malloc(sizeof(float)*K*N);
    C_f = (float*)malloc(sizeof(float)*M*N);
    Fill_Float(A_f, M, K);
    Fill_Float(B_f, K, N);
    Fill_Float(C_f, M, N);
    fprintf(stderr, "START");
    cudaProfilerStart();
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, M, N, K,
                alpha_f, A_f, M, B_f, K, beta_f, C_f, M);
    cudaProfilerStop();
    fprintf(stderr, "END");
    free(A_f); free(B_f); free(C_f);
}
```
shows very little multi-GPU use in nvprof with my 4 Titan X (Pascal) cards. Also, even discounting the matrix filling, there is still a lot of wait time before any GEMM work happens.
I forced type=3 in blas/sgemm.c so that BLASX is always used, so this should be all BLASX and no CPU BLAS.
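(As a sanity check independent of the profiler, watching `nvidia-smi dmon` in another terminal during the call should show whether the other three GPUs ever ramp up at all.)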