turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License
3.23k stars 238 forks source link

Possible problem in zero output #279

Open chu-tianxiang opened 5 months ago

chu-tianxiang commented 5 months ago

Currently, blocks where blockIdx.z = 0 will set the output to zero here. However, due to the absence of synchronization between blocks, is it possible that other blocks with different blockIdx.z might complete their calculations and update the output before it's zeroed out?

turboderp commented 5 months ago

Conventional wisdom would say yes, it's possible for blocks to launch in any order. In practice (and I tested this a lot) I've never seen block (x, y, 0) launch after (x, y, 1). There's some discussion of it here and some tests of the correlation between launch time and blockIdx, which turns out to be very strong.

Of course, relying on it is still a little hacky since there are no actual guarantees from NVIDIA. I definitely want to find a better solution, and I'm trying to rework the kernel to do FP32 accumulation anyway, which may make it a non-issue soon.

chu-tianxiang commented 5 months ago

Actually the concern is, modern GPU have enough SMs that block (x, y, 0) and (x, y, 1) are launched concurrently, it's hard to guarantee block 0 will finish the zero-fill first.