Original comment by wnbell on 6 Sep 2010 at 7:32
There were a few problems with the code in main_cuda. For example, the kernel used a block size of 1 and attempted to launch a grid of 10M blocks (the maximum grid dimension is 64K). Also, one should not place the cudaMalloc() or copy() calls inside the timing loop, since those cost more than the transformation itself. Anyway, I've attached new versions of main_thrust.cu and main_cuda.cu that resolve these issues. As you can see, the resulting timings are much closer.
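For reference, here is a minimal sketch of the launch-configuration and timing-loop fixes described above; the kernel body, the 256-thread block size, and the 100 iterations are illustrative assumptions, not the contents of the attached files.

#include <cuda_runtime.h>
#include <vector>

// With 256 threads per block (rather than 1), a 10M-element problem needs
// only ~39K blocks, which is well below the 64K grid-dimension limit.
__global__ void transform_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];   // placeholder transformation
}

int main(void)
{
    const int N = 10 * 1000 * 1000;
    std::vector<float> h_in(N, 1.0f);

    float *d_in = 0, *d_out = 0;

    // allocation and the host->device copy happen once, outside the timed loop
    cudaMalloc((void**)&d_in,  N * sizeof(float));
    cudaMalloc((void**)&d_out, N * sizeof(float));
    cudaMemcpy(d_in, &h_in[0], N * sizeof(float), cudaMemcpyHostToDevice);

    const int block_size = 256;
    const int grid_size  = (N + block_size - 1) / block_size;

    // only the kernel launches are repeated in the timed region
    for (int iter = 0; iter < 100; ++iter)
        transform_kernel<<<grid_size, block_size>>>(d_in, d_out, N);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}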
nbell@nvresearch-test2:~/scratch/sandbox$ nvcc -O2 main_thrust.cu -I ~/scratch/thrust
nbell@nvresearch-test2:~/scratch/sandbox$ time ./a.out
real 0m2.770s
user 0m2.050s
sys 0m0.610s
nbell@nvresearch-test2:~/scratch/sandbox$ time ./a.out
real 0m2.616s
user 0m1.910s
sys 0m0.640s
nbell@nvresearch-test2:~/scratch/sandbox$ time ./a.out
real 0m2.588s
user 0m1.930s
sys 0m0.580s
nbell@nvresearch-test2:~/scratch/sandbox$ time ./a.out
real 0m2.602s
user 0m1.930s
sys 0m0.610s
nbell@nvresearch-test2:~/scratch/sandbox$ nvcc -O2 main_cuda.cu -I ~/scratch/thrust
nbell@nvresearch-test2:~/scratch/sandbox$ time ./a.out
real 0m2.562s
user 0m1.900s
sys 0m0.560s
nbell@nvresearch-test2:~/scratch/sandbox$ time ./a.out
real 0m2.460s
user 0m1.790s
sys 0m0.590s
nbell@nvresearch-test2:~/scratch/sandbox$ time ./a.out
real 0m2.465s
user 0m1.800s
sys 0m0.580s
nbell@nvresearch-test2:~/scratch/sandbox$
Original comment by wnbell on 7 Sep 2010 at 11:53
Attachments: main_thrust.cu, main_cuda.cu
Thanks for the fix. The malloc and copy in the loop were intentional, because memory allocation is also hidden inside the abstraction. Nonetheless, I reran the test with the copy and malloc inside the loop, on Linux and on a Mac with a CUDA-capable card.
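For clarity, a minimal sketch of what "malloc and copy inside the loop" means in the two versions being compared; the vector size and iteration count here are illustrative assumptions, not the actual benchmark code.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <cuda_runtime.h>

int main(void)
{
    const int N = 1 << 20;                       // illustrative size
    thrust::host_vector<float> h(N, 1.0f);

    for (int iter = 0; iter < 100; ++iter)
    {
        // raw CUDA version: allocation + copy are paid on every iteration
        float *d_raw = 0;
        cudaMalloc((void**)&d_raw, N * sizeof(float));
        cudaMemcpy(d_raw, &h[0], N * sizeof(float), cudaMemcpyHostToDevice);
        cudaFree(d_raw);

        // Thrust version: constructing a device_vector from the host data
        // performs the equivalent allocation + copy behind the abstraction
        thrust::device_vector<float> d(h);
    }
    return 0;
}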
wingsit@wingsit-desktop:~/MyCuda/et$ time ./main_cuda
real 2m19.160s
user 2m16.525s
sys 0m2.604s
wingsit@wingsit-desktop:~/MyCuda/et$ time ./main_thrust
real 2m22.535s
user 2m19.077s
sys 0m3.436s
But on my Mac I get:
time ./main_cuda
87.64 real 86.55 user 0.48 sys
time ./main_thrust
106.81 real 105.02 user 0.53 sys
What might cause this issue?
Original comment by wing1127aishi@gmail.com on 8 Sep 2010 at 2:01
Is there a reason this particular test is of interest to you?
Offhand, I can't think of an explanation for that performance discrepancy.
However, if you can pinpoint the cause we'd take another look at it.
As far as I can tell, your programs are just timing cudaMemcpy() + a simple
kernel. Accessing cudaMemcpy() via thrust::copy() should not add any
measurable overhead.
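To illustrate that point, a minimal sketch with an assumed size (not the attached benchmarks): for a contiguous host-to-device range like this, thrust::copy() dispatches to a single cudaMemcpy() rather than taking a separate code path.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>

int main(void)
{
    const int N = 1 << 20;                       // illustrative size
    thrust::host_vector<float>   h(N, 1.0f);
    thrust::device_vector<float> d(N);

    // equivalent to cudaMemcpy(..., cudaMemcpyHostToDevice) for this range
    thrust::copy(h.begin(), h.end(), d.begin());

    return 0;
}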
Original comment by wnbell on 8 Sep 2010 at 2:31
I am planning to experiment with some expression-template techniques in CUDA kernels and am looking for a good container class that handles all the memory details. This is why I am testing the overhead of the device_vector implementation.
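As a minimal sketch of that use case, here is device_vector handling the memory details for a simple fused operation; the saxpy_functor is an illustrative stand-in for whatever an expression-template layer would generate, not the poster's code.

#include <thrust/device_vector.h>
#include <thrust/transform.h>

// Illustrative functor; an expression-template layer would generate
// something like this at compile time from a y = a*x + y expression.
struct saxpy_functor
{
    float a;
    saxpy_functor(float a) : a(a) {}
    __host__ __device__ float operator()(float x, float y) const
    {
        return a * x + y;
    }
};

int main(void)
{
    const int N = 1 << 20;

    // device_vector owns the device allocation; no explicit cudaMalloc,
    // cudaMemcpy, or cudaFree appears anywhere in user code
    thrust::device_vector<float> x(N, 1.0f);
    thrust::device_vector<float> y(N, 2.0f);

    // y <- 3*x + y, computed in place on the device
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy_functor(3.0f));

    return 0;
}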
By any chance, can you point me to a technique I could use to investigate the problem further?
BTW, I am watching your lecture from Stanford. Thank you for your work.
Original comment by wing1127aishi@gmail.com on 8 Sep 2010 at 4:08
Original issue reported on code.google.com by wing1127aishi@gmail.com on 6 Sep 2010 at 6:14
Attachments: