Learn CUDA

The summary of the differences between the CPU and the GPU is quite incisive.

CPU: designed to minimize latency, so most of the silicon area goes to advanced control logic and large caches.
GPU: designed to maximize throughput, so most of the silicon area goes to a massive number of cores.

How to make use of the massive number of GPU cores.

Concepts:

Thread organization:

A grid and a block each have dimensions, which can be 1D, 2D, or 3D.
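As a minimal sketch of 2D thread organization, the kernel below maps each thread to one element of a width x height matrix using the built-in `blockIdx`, `blockDim`, and `threadIdx` variables. The names `fill_index` and `run_fill_index` and the 16x16 block size are assumptions made for illustration, not something taken from the course; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread computes its own (x, y) position in the grid and writes
// the corresponding row-major linear index into the output matrix.
__global__ void fill_index(int *out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (x < width && y < height) {                  // guard the partial edge blocks
        out[y * width + x] = y * width + x;
    }
}

// Hypothetical host helper: launches a 2D grid of 2D blocks and copies
// the result back to the host.
std::vector<int> run_fill_index(int width, int height) {
    std::vector<int> host(static_cast<size_t>(width) * height, -1);
    int *dev = nullptr;
    cudaMalloc(&dev, host.size() * sizeof(int));

    dim3 block(16, 16);  // 256 threads per block (illustrative choice)
    dim3 grid((width + block.x - 1) / block.x,    // round up so the grid
              (height + block.y - 1) / block.y);  // covers the whole matrix
    fill_index<<<grid, block>>>(dev, width, height);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(host.data(), dev, host.size() * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

The grid size is rounded up, so some threads in the edge blocks fall outside the matrix; the bounds check in the kernel keeps them from writing out of range.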

Memory model

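This section is explained very well... excellent!

Note that local memory is slow, much slower than shared memory, because local memory is off-chip.

In the figure, the green part is where the NVIDIA cores are, and the memory located there is on-chip. The blue part is DRAM, which is off-chip.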
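To make the on-chip versus off-chip distinction concrete, here is a small sketch in which each block stages its slice of the input in shared memory and then reads it back in reverse order. The kernel name `reverse_in_block`, the helper `run_reverse_in_block`, and the tile size `kTile` are hypothetical; where the compiler actually places per-thread variables (registers or local memory) is its decision, so the comments describe only the usual case.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kTile = 256;  // threads per block, also the shared tile size

__global__ void reverse_in_block(const float *in, float *out, int n) {
    // Shared memory: on-chip, visible to all threads of the same block.
    __shared__ float tile[kTile];

    // Per-thread scalar: normally kept in a register (on-chip).
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // `in` and `out` point to global memory, which lives in off-chip DRAM.
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // wait until the whole tile has been loaded

    if (i < n) {
        out[i] = tile[blockDim.x - 1 - threadIdx.x];
    }
}

// Hypothetical host helper that runs the kernel on a host vector.
std::vector<float> run_reverse_in_block(const std::vector<float> &host_in) {
    int n = static_cast<int>(host_in.size());
    std::vector<float> host_out(n);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    reverse_in_block<<<(n + kTile - 1) / kTile, kTile>>>(d_in, d_out, n);

    cudaMemcpy(host_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return host_out;
}
```

Each input element is read from DRAM once; the reversed access pattern then hits only the on-chip tile.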

Synchronization primitives are covered.

At the thread level there is __syncthreads(), a barrier for all threads in a block.
At the kernel level there is cudaDeviceSynchronize(), which blocks until the device has completed all preceding requested tasks.
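The two primitives can be seen together in the following sketch of a per-block sum. `__syncthreads()` separates the steps of the shared-memory reduction, and `cudaDeviceSynchronize()` makes the host wait for the asynchronous kernel launch before it reads the results. The names `block_sum`, `sum_ones`, and `kThreads`, and the use of managed memory via `cudaMallocManaged`, are assumptions chosen for illustration; error checking is omitted.

```cuda
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // threads per block; must be a power of two here

// Each block reduces its slice of `in` to one partial sum.
__global__ void block_sum(const float *in, float *partial, int n) {
    __shared__ float buf[kThreads];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // all loads must finish before any thread reads buf

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            buf[tid] += buf[tid + stride];
        }
        __syncthreads();  // barrier is outside the branch: every thread reaches it
    }
    if (tid == 0) {
        partial[blockIdx.x] = buf[0];
    }
}

// Hypothetical host helper: sums n ones on the GPU and returns the total.
float sum_ones(int n) {
    int blocks = (n + kThreads - 1) / kThreads;
    float *in = nullptr, *partial = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&partial, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    block_sum<<<blocks, kThreads>>>(in, partial, n);  // launch returns immediately
    cudaDeviceSynchronize();  // host waits here until the kernel has completed

    float total = 0.0f;
    for (int b = 0; b < blocks; ++b) total += partial[b];
    cudaFree(in);
    cudaFree(partial);
    return total;
}
```

Without the `cudaDeviceSynchronize()` call, the host loop could read `partial` before the kernel has written it, since kernel launches do not block the host.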

jasperzhong / cs-notes