mit-han-lab / Quest

[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Got error on bench_batch_decode #4

Closed: cyLi-Tiger closed this issue 3 months ago

cyLi-Tiger commented 3 months ago

Hi, thanks for the great work!

I got an error while benchmarking batch_decode's efficiency with the following command:

./bench_batch_decode -a seqlen=4096 -a page_budget=[64,512]

The logs are attached below. I didn't change the code; any clues on solving this?

Devices

[0] NVIDIA A100-SXM4-80GB

  • SM Version: 800 (PTX Version: -1)
  • Number of SMs: 108
  • SM Default Clock Rate: 1410 MHz
  • Global Memory: 80832 MiB Free / 81251 MiB Total
  • Global Memory Bus Peak: 2039 GB/sec (5120-bit DDR @1593MHz)
  • Max Shared Memory: 164 KiB/SM, 48 KiB/Block
  • L2 Cache Size: 40960 KiB
  • Maximum Active Blocks: 32/SM
  • Maximum Active Threads: 2048/SM, 1024/Block
  • Available Registers: 65536/SM, 65536/Block
  • ECC Enabled: Yes

Log

Run:  [1/2] bench_flashinfer_batch_decode_half [Device=0 seqlen=4096 page_budget=64 page_size=16 num_qo_heads=32 num_kv_heads=32 cooperative=1]
Fail: Unexpected error: Please call BatchDecodeHandler's BeginForward() before calling BatchDecodeWithPagedKVCacheWrapper()
Run:  [2/2] bench_flashinfer_batch_decode_half [Device=0 seqlen=4096 page_budget=512 page_size=16 num_qo_heads=32 num_kv_heads=32 cooperative=1]
Fail: Unexpected error: Please call BatchDecodeHandler's BeginForward() before calling BatchDecodeWithPagedKVCacheWrapper()

Benchmark Results

bench_flashinfer_batch_decode_half

[0] NVIDIA A100-SXM4-80GB

No data -- check log.

cyLi-Tiger commented 3 months ago

I recompiled with set(CMAKE_CUDA_ARCHITECTURES 80) and it's working now; that's quite weird lmao.

happierpig commented 3 months ago

Hi @cyLi-Tiger,

Just tried to reproduce the problem (see the attached figure). I suspect it comes from differing implementations of runtime APIs like cudaOccupancyMaxActiveBlocksPerMultiprocessor across compute capabilities, so the pre-kernel setup (like this) may fail and trigger the assertion. Sorry for the issue; Quest has only been tested on sm89 GPUs :)

[Screenshot: 2024-07-03 15:03:58]
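
For context, a minimal, self-contained sketch of this failure mode; `dummy_kernel` and the error handling are illustrative assumptions, not Quest's actual pre-kernel code:

```cpp
// Sketch: an occupancy query like the one run before kernel launch can fail
// when the binary was not compiled for the running GPU's architecture
// (e.g., built only for sm_89 but run on an sm_80 A100).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float* out) { out[threadIdx.x] = 1.0f; }

int main() {
  int num_blocks = 0;
  cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &num_blocks, dummy_kernel, /*blockSize=*/128, /*dynamicSMemSize=*/0);
  // With no matching SASS/PTX target embedded, this can fail (or report 0
  // blocks), which downstream code may surface as an unrelated-looking
  // assertion such as the BeginForward() error above.
  if (err != cudaSuccess || num_blocks == 0) {
    std::printf("occupancy query failed: %s (blocks=%d)\n",
                cudaGetErrorString(err), num_blocks);
    return 1;
  }
  std::printf("max active blocks per SM: %d\n", num_blocks);
  return 0;
}
```

Recompiling with set(CMAKE_CUDA_ARCHITECTURES 80) embeds code for sm_80, which would explain why the fix above works.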
cyLi-Tiger commented 3 months ago

Thanks for your prompt reply, @happierpig!

Another question: I'm new to FlashInfer and wonder why the current kernel doesn't support GQA. Is it because FlashInfer doesn't support GQA itself, or is GQA not currently suitable for Quest?

Also, can a similar approach be applied to the prefill stage to reduce TTFT?

happierpig commented 3 months ago

Thanks for your great questions!

FlashInfer does efficiently support various GQA settings (and even MLA). However, Quest currently does not. Query heads within the same group need to attend to the same set of KV tokens in order to use tensor cores for efficient GQA execution. We are running preliminary experiments to add GQA support.
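
As an illustration of that constraint, here is a hypothetical host-side sketch (not Quest's actual API) of aggregating criticality scores across the query heads of one KV group so that all heads select the same pages:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// scores[h][p] = estimated criticality of KV page p for query head h
// (all heads here belong to the same GQA group).
std::vector<int> select_pages_for_group(
    const std::vector<std::vector<float>>& scores, int page_budget) {
  const int num_pages = static_cast<int>(scores[0].size());
  // Reduce across the group's heads; max keeps any head's critical pages.
  std::vector<float> group_score(num_pages, 0.0f);
  for (const auto& head : scores)
    for (int p = 0; p < num_pages; ++p)
      group_score[p] = std::max(group_score[p], head[p]);
  // Pick the top page_budget pages, shared by every head in the group, so
  // the tensor-core MMA can process the whole group on one set of KV tokens.
  std::vector<int> idx(num_pages);
  std::iota(idx.begin(), idx.end(), 0);
  std::partial_sort(
      idx.begin(), idx.begin() + page_budget, idx.end(),
      [&](int a, int b) { return group_score[a] > group_score[b]; });
  idx.resize(page_budget);
  return idx;
}
```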

Regarding the prefill phase, it is similar to the GQA case. Since the basic shape of the tensor-core op is 16x8x16 (an M dimension of 16), it is better to align 2 consecutive query tokens so they attend to the same "critical" tokens; that way 2 x group_size can saturate the M dimension of 16. We could implement this with an aggregation op in the estimation step.
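
A quick sanity check on that arithmetic (the GQA shape below is a hypothetical example; the benchmark above uses MHA with num_kv_heads=32):

```cpp
#include <cstdio>

int main() {
  const int mma_m = 16;         // M dimension of the m16n8k16 tensor-core op
  const int num_qo_heads = 32;
  const int num_kv_heads = 4;   // hypothetical GQA config
  const int group_size = num_qo_heads / num_kv_heads;  // 8 query heads/group
  const int paired_query_tokens = 2;  // consecutive tokens sharing pages
  // 2 tokens x 8 heads = 16 rows: exactly fills the MMA M dimension.
  std::printf("%d rows vs M=%d\n", paired_query_tokens * group_size, mma_m);
  return 0;
}
```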

cyLi-Tiger commented 3 months ago

Since the basic shape of the tensor-core op is 16x8x16 (an M dimension of 16)

This might seem silly... but where does that shape come from? Any reference?

happierpig commented 3 months ago

You can check the official CUDA docs here. Since "mma.sync.m8n8k4 is optimized for target architecture sm_70 and may have substantially reduced performance on other target architectures", I suppose the effective minimal M dimension is 16.
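
For a concrete view of that 16-row M dimension, here is a minimal WMMA kernel; the 16x16x16 fragment shape is the smallest half-precision tile the CUDA C++ WMMA API exposes, and this sketch is illustrative rather than Quest's kernel code:

```cpp
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 half-precision tile: the M dimension
// of the fragment is 16, matching the "effective minimal M = 16" point.
__global__ void wmma_tile(const half* a, const half* b, float* c) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
  wmma::fill_fragment(fc, 0.0f);
  wmma::load_matrix_sync(fa, a, /*ldm=*/16);
  wmma::load_matrix_sync(fb, b, /*ldm=*/16);
  wmma::mma_sync(fc, fa, fb, fc);  // one tensor-core multiply-accumulate
  wmma::store_matrix_sync(c, fc, /*ldm=*/16, wmma::mem_row_major);
}
```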

cyLi-Tiger commented 3 months ago

All my questions are well answered, thanks!