Original comment by wnbell on 6 Sep 2010 at 7:32
There were a few problems with the code in main_cuda. For example, the kernel used a block size of 1 and attempted to launch a grid of 10M blocks (the maximum grid dimension is 64K). Also, one should not place the cudaMalloc() or copy() calls inside the timing loop, since those cost more than the transformation itself. Anyway, I've attached new versions of main_thrust.cu and main_cuda.cu that resolve these issues. As you can see, the resulting timings are much closer.
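For reference, here is a minimal sketch of the launch-configuration and timing-loop fixes described above; the kernel body, the 256-thread block size, and the 100 iterations are illustrative assumptions, not the contents of the attached files.

#include <cuda_runtime.h>
#include <vector>

// With 256 threads per block (rather than 1), a 10M-element problem needs
// only ~39K blocks, which is well below the 64K grid-dimension limit.
__global__ void transform_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];   // placeholder transformation
}

int main(void)
{
    const int N = 10 * 1000 * 1000;
    std::vector<float> h_in(N, 1.0f);

    float *d_in = 0, *d_out = 0;

    // allocation and the host->device copy happen once, outside the timed loop
    cudaMalloc((void**)&d_in,  N * sizeof(float));
    cudaMalloc((void**)&d_out, N * sizeof(float));
    cudaMemcpy(d_in, &h_in[0], N * sizeof(float), cudaMemcpyHostToDevice);

    const int block_size = 256;
    const int grid_size  = (N + block_size - 1) / block_size;

    // only the kernel launches are repeated in the timed region
    for (int iter = 0; iter < 100; ++iter)
        transform_kernel<<<grid_size, block_size>>>(d_in, d_out, N);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}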
nbell@nvresearch-test2:~/scratch/sandbox$ nvcc -O2 main_thrust.cu -I ~/scratch/thrust
nbell@nvresearch-test2:~/scratch/sandbox$ time ./a.out
real 0m2.770s
user 0m2.050s
sys 0m0.610s
nbell@nvresearch-test2:~/scratch/sandbox$ time ./a.out
real 0m2.616s
user 0m1.910s
sys 0m0.640s
nbell@nvresearch-test2:~/scratch/sandbox$ time ./a.out
real 0m2.588s
user 0m1.930s
sys 0m0.580s
nbell@nvresearch-test2:~/scratch/sandbox$ time ./a.out
real 0m2.602s
user 0m1.930s
sys 0m0.610s
nbell@nvresearch-test2:~/scratch/sandbox$ nvcc -O2 main_cuda.cu -I ~/scratch/thrust
nbell@nvresearch-test2:~/scratch/sandbox$ time ./a.out
real 0m2.562s
user 0m1.900s
sys 0m0.560s
nbell@nvresearch-test2:~/scratch/sandbox$ time ./a.out
real 0m2.460s
user 0m1.790s
sys 0m0.590s
nbell@nvresearch-test2:~/scratch/sandbox$ time ./a.out
real 0m2.465s
user 0m1.800s
sys 0m0.580s
nbell@nvresearch-test2:~/scratch/sandbox$
Original comment by wnbell on 7 Sep 2010 at 11:53
Attachments: main_thrust.cu, main_cuda.cu
Thanks for the fix. The malloc and copy in the loop were intentional, because memory allocation is also hidden inside the abstraction. Nonetheless, I reran the test with the copy and malloc inside the loop, on Linux and on a Mac with a CUDA-capable card.
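For clarity, a minimal sketch of what "malloc and copy inside the loop" means in the two versions being compared; the vector size and iteration count here are illustrative assumptions, not the actual benchmark code.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <cuda_runtime.h>

int main(void)
{
    const int N = 1 << 20;                       // illustrative size
    thrust::host_vector<float> h(N, 1.0f);

    for (int iter = 0; iter < 100; ++iter)
    {
        // raw CUDA version: allocation + copy are paid on every iteration
        float *d_raw = 0;
        cudaMalloc((void**)&d_raw, N * sizeof(float));
        cudaMemcpy(d_raw, &h[0], N * sizeof(float), cudaMemcpyHostToDevice);
        cudaFree(d_raw);

        // Thrust version: constructing a device_vector from the host data
        // performs the equivalent allocation + copy behind the abstraction
        thrust::device_vector<float> d(h);
    }
    return 0;
}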
wingsit@wingsit-desktop:~/MyCuda/et$ time ./main_cuda
real 2m19.160s
user 2m16.525s
sys 0m2.604s
wingsit@wingsit-desktop:~/MyCuda/et$ time ./main_thrust
real 2m22.535s
user 2m19.077s
sys 0m3.436s
But on my Mac I get:
time ./main_cuda
87.64 real 86.55 user 0.48 sys
time ./main_thrust
106.81 real 105.02 user 0.53 sys
What might cause this issue?
Original comment by wing1127aishi@gmail.com on 8 Sep 2010 at 2:01
Is there a reason this particular test is of interest to you?
Offhand, I can't think of an explanation for that performance discrepancy.
However, if you can pinpoint the cause we'd take another look at it.
As far as I can tell, your programs are just timing cudaMemcpy() + a simple
kernel. Accessing cudaMemcpy() via thrust::copy() should not add any
measurable overhead.
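To illustrate that point, a minimal sketch with an assumed size (not the attached benchmarks): for a contiguous host-to-device range like this, thrust::copy() dispatches to a single cudaMemcpy() rather than taking a separate code path.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>

int main(void)
{
    const int N = 1 << 20;                       // illustrative size
    thrust::host_vector<float>   h(N, 1.0f);
    thrust::device_vector<float> d(N);

    // equivalent to cudaMemcpy(..., cudaMemcpyHostToDevice) for this range
    thrust::copy(h.begin(), h.end(), d.begin());

    return 0;
}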
Original comment by wnbell on 8 Sep 2010 at 2:31
I am planning to experiment with some expression-template techniques in CUDA kernels and am looking for a good container class that handles all the memory details. This is why I am testing the overhead of the device_vector implementation.
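As a minimal sketch of that use case, here is device_vector handling the memory details for a simple fused operation; the saxpy_functor is an illustrative stand-in for whatever an expression-template layer would generate, not the poster's code.

#include <thrust/device_vector.h>
#include <thrust/transform.h>

// Illustrative functor; an expression-template layer would generate
// something like this at compile time from a y = a*x + y expression.
struct saxpy_functor
{
    float a;
    saxpy_functor(float a) : a(a) {}
    __host__ __device__ float operator()(float x, float y) const
    {
        return a * x + y;
    }
};

int main(void)
{
    const int N = 1 << 20;

    // device_vector owns the device allocation; no explicit cudaMalloc,
    // cudaMemcpy, or cudaFree appears anywhere in user code
    thrust::device_vector<float> x(N, 1.0f);
    thrust::device_vector<float> y(N, 2.0f);

    // y <- 3*x + y, computed in place on the device
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy_functor(3.0f));

    return 0;
}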
By any chance, can you point me to a technique I could use to investigate the problem further?
BTW, I am watching your lecture from Stanford. Thank you for your work.
Original comment by wing1127aishi@gmail.com on 8 Sep 2010 at 4:08
Original issue reported on code.google.com by wing1127aishi@gmail.com on 6 Sep 2010 at 6:14
Attachments: