Closed: rayrayraykk closed this issue 10 months ago
@ggerganov @Green-Sky @leejet Looking forward to your help! :)
After I change the call to `ggml_flash_attn(ctx, q, k, v, false);` and guard the assertion with:

```c
if (masked) {
    GGML_ASSERT(P >= 0);
}
```

the program runs fine, but it produces images that make absolutely no sense... I'm really confused :(
Edit: `ggml_flash_attn` already scales `kq` internally. After commenting out the line below, everything works fine:

```c
q = ggml_scale_inplace(ctx, q, ggml_new_f32(ctx, 1.0f / sqrt((float)d_head)));
```
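For anyone else attempting this, here is a minimal before/after sketch of the change. The tensor names (`q`, `k`, `v`, `kq`, `kqv`, `d_head`) follow the snippets above rather than the exact stable-diffusion.cpp source, and the surrounding reshapes/permutes are omitted:

```c
// Before: explicit attention; q was manually pre-scaled by 1/sqrt(d_head).
// q = ggml_scale_inplace(ctx, q, ggml_new_f32(ctx, 1.0f / sqrt((float)d_head)));
// struct ggml_tensor * kq  = ggml_mul_mat(ctx, k, q);   // attention scores
// kq = ggml_soft_max_inplace(ctx, kq);                  // softmax over key positions
// struct ggml_tensor * kqv = ggml_mul_mat(ctx, v, kq);  // weighted sum of values

// After: ggml_flash_attn fuses the two matmuls and the softmax, and it applies
// the 1/sqrt(d_head) scale internally, so the ggml_scale_inplace call must be
// removed; otherwise q is scaled twice and the output decodes to noise.
struct ggml_tensor * kqv = ggml_flash_attn(ctx, q, k, v, /*masked=*/false);
```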
Did you observe any speed improvement?
> Did you observe any speed improvement?

There is a speed improvement with the CLBlast backend, but the gains on other backends are not obvious.
@rayrayraykk Once I finish my pull request adding the CUDA backend, I will see about using flash attention v2 and improving the conv2d algorithm with FFT. Additionally, there is a recent paper that uses tensor cores to accelerate convolutions in CUDA. The truth is that I don't have enough knowledge to translate the equations in papers into code, which makes it somewhat difficult for me to implement these things.
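For background on the FFT part: by the convolution theorem, a convolution in the spatial domain is an element-wise product in the frequency domain, which is what makes FFT-based conv2d attractive for large kernels. Below is a minimal, self-contained 1D illustration of that identity (a naive O(n^2) DFT is used for clarity; a real implementation would use an O(n log n) FFT applied in 2D):

```c
#include <complex.h>
#include <math.h>
#include <stdio.h>

#define LEN 8  // padded length, must be >= signal_len + kernel_len - 1

static const double PI = 3.14159265358979323846;

// Naive O(n^2) DFT; inverse=1 flips the exponent sign and normalizes.
static void dft(const double complex *in, double complex *out, int inverse) {
    for (int k = 0; k < LEN; k++) {
        out[k] = 0;
        for (int n = 0; n < LEN; n++) {
            double ang = 2.0 * PI * k * n / LEN * (inverse ? 1.0 : -1.0);
            out[k] += in[n] * cexp(I * ang);
        }
        if (inverse) out[k] /= LEN;
    }
}

int main(void) {
    // Signal [1,2,3,4] and kernel [1,1,1], zero-padded so the circular
    // convolution computed via the DFT equals the linear convolution.
    double complex a[LEN] = {1, 2, 3, 4};
    double complex b[LEN] = {1, 1, 1};
    double complex A[LEN], B[LEN], C[LEN], c[LEN];

    dft(a, A, 0);
    dft(b, B, 0);
    for (int k = 0; k < LEN; k++) C[k] = A[k] * B[k];  // pointwise product
    dft(C, c, 1);

    // Prints 1.0 3.0 6.0 9.0 7.0 4.0 0.0 0.0 -- the linear convolution.
    for (int k = 0; k < LEN; k++) printf("%.1f ", creal(c[k]));
    printf("\n");
    return 0;
}
```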
I tried to use `ggml_flash_attn` to accelerate the process, so I replaced the `ggml_mul_mat` attention in the cross-attention of the UNet in stable-diffusion.cpp. But it leads to an error. It looks like `max_position = 2`, `N = 64`, and `const int64_t P = nek1 - N;` ends up less than `0`. Can someone help me? Thanks a lot!
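For context on why `P` goes negative: inside ggml's flash-attention compute path there is, roughly (the exact code depends on the ggml version), a shape check of the form below, where `neq1` is the number of query positions and `nek1` the number of key positions. Self-attention with a KV cache always has at least as many keys as queries, but in UNet cross-attention the keys come from the short text context while the queries come from the image latents, so `nek1 < N`. A small runnable sketch of that check, with illustrative numbers loosely matching the report:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

// Simplified model of ggml's flash-attention shape check (version-dependent):
// neq1 = number of query positions (N), nek1 = number of key positions.
static void check_flash_attn_shapes(int64_t neq1, int64_t nek1) {
    const int64_t N = neq1;
    const int64_t P = nek1 - N;  // key positions beyond the query count
    printf("N = %lld, nek1 = %lld, P = %lld\n",
           (long long)N, (long long)nek1, (long long)P);
    assert(P >= 0);  // ggml does GGML_ASSERT(P >= 0) here
}

int main(void) {
    // Self-attention with a KV cache: more keys than queries, P >= 0 -> OK.
    check_flash_attn_shapes(/*neq1=*/64, /*nek1=*/512);

    // Cross-attention: 64 latent queries against a text context with fewer
    // positions (illustrative value), so P < 0 and the assertion aborts.
    check_flash_attn_shapes(/*neq1=*/64, /*nek1=*/2);
    return 0;
}
```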