EuroSys '21 | Accelerating Graph Sampling for Graph Machine Learning using GPUs

jasperzhong commented 1 year ago

https://dl.acm.org/doi/pdf/10.1145/3447786.3456244

jasperzhong commented 1 year ago

I didn't expect the background section of this paper to be what improved my understanding of GPUs....

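The first point is about warps. When an SM executes a thread block, it schedules a subset of the block's threads at a time, called a warp, which is usually 32 consecutive threads. GPUs use the SIMT execution model: all threads in a warp run the same instruction in lock-step. Note: the same instruction. This means that when a warp hits a branch, the threads that do not take the branch have to wait until the threads that do take it are done before they can continue. This is called warp divergence, and it leads to very poor performance.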
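To make the cost concrete, here is a toy Python model of one warp hitting an if/else. It is not GPU code, just a sketch under my own assumptions: `issued_steps`, `then_cost`, and `else_cost` are hypothetical names, and the model simply assumes each side of the branch is issued once if at least one lane needs it, with the other lanes masked off.

```python
WARP_SIZE = 32  # usual warp size mentioned above


def issued_steps(takes_branch, then_cost, else_cost):
    """Count instruction slots one warp spends on an if/else (toy model).

    takes_branch: one boolean per lane (thread) in the warp.
    Assumption: under lock-step execution, each side of the branch is
    issued once if any lane needs it; lanes on the other side sit idle.
    """
    assert len(takes_branch) == WARP_SIZE
    steps = 0
    if any(takes_branch):        # at least one lane runs the "then" side
        steps += then_cost
    if not all(takes_branch):    # at least one lane runs the "else" side
        steps += else_cost
    return steps


# Every lane takes the same side: only one side is issued.
uniform = [True] * WARP_SIZE
# Even lanes take one side, odd lanes the other: both sides are issued.
diverged = [lane % 2 == 0 for lane in range(WARP_SIZE)]

print(issued_steps(uniform, 10, 10))   # 10
print(issued_steps(diverged, 10, 10))  # 20
```

In this model the diverged warp spends twice as many slots as the uniform one, even though each individual thread still runs only one side of the branch.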

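The second point is that an SM cannot do context switching the way a CPU does. Say two thread blocks want to run on the same SM, and thread block A stalls while executing (for example, on memory latency). The SM cannot save A's state out to memory and switch to a thread block B that is not already resident on it; a block keeps its registers and shared memory until it finishes, and the SM can only pick among warps that are already resident. This is very different from a CPU, which can context switch easily (thread A saves its registers to memory).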

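The last point is something I knew about but did not understand well: when threads in the same warp access consecutive global memory addresses at the same time, the accesses can be coalesced. Global memory latency is high, so merging the accesses improves throughput. I know about this optimization, but I have never used it.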
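A toy Python model of coalescing, again not GPU code: it counts how many memory segments one warp's loads touch. The 128-byte segment size and 4-byte element size are assumptions for illustration (the real transaction granularity depends on the GPU architecture), and `transactions` is a hypothetical helper name.

```python
WARP_SIZE = 32
SEGMENT_BYTES = 128  # assumed transaction size; varies by architecture
ELEM_BYTES = 4       # assumed element size, e.g. a 32-bit float


def transactions(indices):
    """Count distinct segments touched by one warp's loads (toy model).

    indices: the array element index each lane reads.
    Loads that fall in the same segment are assumed to be merged
    into a single memory transaction.
    """
    return len({(i * ELEM_BYTES) // SEGMENT_BYTES for i in indices})


# Lane i reads element i: 32 * 4 bytes fit in one segment.
contiguous = [lane for lane in range(WARP_SIZE)]
# Lane i reads element 32 * i: every lane lands in a different segment.
strided = [lane * 32 for lane in range(WARP_SIZE)]

print(transactions(contiguous))  # 1
print(transactions(strided))     # 32
```

Under these assumptions the contiguous pattern needs one transaction for the whole warp, while the strided pattern needs one per thread.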

jasperzhong commented 1 year ago

I couldn't follow it.
