mit-han-lab / Quest

[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Got error on bench_batch_decode #4

Closed: cyLi-Tiger closed this issue 3 months ago

cyLi-Tiger commented 3 months ago

Hi, thanks for the great work!

I got an error while benchmarking batch_decode's efficiency with the following command:

./bench_batch_decode -a seqlen=4096 -a page_budget=[64,512]

The logs are attached below. I didn't change the code; any clues on solving this?

Devices

[0] NVIDIA A100-SXM4-80GB

  • SM Version: 800 (PTX Version: -1)
  • Number of SMs: 108
  • SM Default Clock Rate: 1410 MHz
  • Global Memory: 80832 MiB Free / 81251 MiB Total
  • Global Memory Bus Peak: 2039 GB/sec (5120-bit DDR @1593MHz)
  • Max Shared Memory: 164 KiB/SM, 48 KiB/Block
  • L2 Cache Size: 40960 KiB
  • Maximum Active Blocks: 32/SM
  • Maximum Active Threads: 2048/SM, 1024/Block
  • Available Registers: 65536/SM, 65536/Block
  • ECC Enabled: Yes

Log

Run:  [1/2] bench_flashinfer_batch_decode_half [Device=0 seqlen=4096 page_budget=64 page_size=16 num_qo_heads=32 num_kv_heads=32 cooperative=1]
Fail: Unexpected error: Please call BatchDecodeHandler's BeginForward() before calling BatchDecodeWithPagedKVCacheWrapper()
Run:  [2/2] bench_flashinfer_batch_decode_half [Device=0 seqlen=4096 page_budget=512 page_size=16 num_qo_heads=32 num_kv_heads=32 cooperative=1]
Fail: Unexpected error: Please call BatchDecodeHandler's BeginForward() before calling BatchDecodeWithPagedKVCacheWrapper()

Benchmark Results

bench_flashinfer_batch_decode_half

[0] NVIDIA A100-SXM4-80GB

No data -- check log.

cyLi-Tiger commented 3 months ago

I recompiled with set(CMAKE_CUDA_ARCHITECTURES 80) and it's working now; that's quite weird lmao.

happierpig commented 3 months ago

Hi @cyLi-Tiger,

Just tried to reproduce the problem (see the attached figure). I suspect it comes from differing implementations of runtime APIs like cudaOccupancyMaxActiveBlocksPerMultiprocessor across compute capabilities, so the pre-kernel setup (like this) may fail and trigger the assertion. Sorry for the issue; Quest has only been tested on sm89 GPUs :)

[Screenshot: 2024-07-03 15:03:58]
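
For context, a minimal, self-contained sketch of this failure mode; `dummy_kernel` and the error handling are illustrative assumptions, not Quest's actual pre-kernel code:

```cpp
// Sketch: an occupancy query like the one run before kernel launch can fail
// when the binary was not compiled for the running GPU's architecture
// (e.g., built only for sm_89 but run on an sm_80 A100).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float* out) { out[threadIdx.x] = 1.0f; }

int main() {
  int num_blocks = 0;
  cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &num_blocks, dummy_kernel, /*blockSize=*/128, /*dynamicSMemSize=*/0);
  // With no matching SASS/PTX target embedded, this can fail (or report 0
  // blocks), which downstream code may surface as an unrelated-looking
  // assertion such as the BeginForward() error above.
  if (err != cudaSuccess || num_blocks == 0) {
    std::printf("occupancy query failed: %s (blocks=%d)\n",
                cudaGetErrorString(err), num_blocks);
    return 1;
  }
  std::printf("max active blocks per SM: %d\n", num_blocks);
  return 0;
}
```

Recompiling with set(CMAKE_CUDA_ARCHITECTURES 80) embeds code for sm_80, which would explain why the fix above works.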
cyLi-Tiger commented 3 months ago

Thanks for your prompt reply, @happierpig!

Another question: I'm new to FlashInfer and wonder why the current kernel doesn't support GQA. Is it because FlashInfer doesn't support GQA itself, or is GQA not currently suitable for Quest?

Also, can a similar approach be applied to the prefill stage to reduce TTFT?

happierpig commented 3 months ago

Thanks for your great questions!

FlashInfer does efficiently support various GQA settings (and even MLA). However, Quest currently does not. Query heads within the same group need to attend to the same set of KV tokens in order to use tensor cores for efficient GQA execution. We are running preliminary experiments to add GQA support.
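
As an illustration of that constraint, here is a hypothetical host-side sketch (not Quest's actual API) of aggregating criticality scores across the query heads of one KV group so that all heads select the same pages:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// scores[h][p] = estimated criticality of KV page p for query head h
// (all heads here belong to the same GQA group).
std::vector<int> select_pages_for_group(
    const std::vector<std::vector<float>>& scores, int page_budget) {
  const int num_pages = static_cast<int>(scores[0].size());
  // Reduce across the group's heads; max keeps any head's critical pages.
  std::vector<float> group_score(num_pages, 0.0f);
  for (const auto& head : scores)
    for (int p = 0; p < num_pages; ++p)
      group_score[p] = std::max(group_score[p], head[p]);
  // Pick the top page_budget pages, shared by every head in the group, so
  // the tensor-core MMA can process the whole group on one set of KV tokens.
  std::vector<int> idx(num_pages);
  std::iota(idx.begin(), idx.end(), 0);
  std::partial_sort(
      idx.begin(), idx.begin() + page_budget, idx.end(),
      [&](int a, int b) { return group_score[a] > group_score[b]; });
  idx.resize(page_budget);
  return idx;
}
```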

Regarding the prefill phase, it is similar to the GQA case. Since the basic shape of the tensor-core op is 16x8x16 (an M dimension of 16), it is better to align 2 consecutive query tokens so they attend to the same "critical" tokens; that way 2 x group_size can saturate the M dimension of 16. We could implement this with an aggregation op in the estimation step.
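
A quick sanity check on that arithmetic (the GQA shape below is a hypothetical example; the benchmark above uses MHA with num_kv_heads=32):

```cpp
#include <cstdio>

int main() {
  const int mma_m = 16;         // M dimension of the m16n8k16 tensor-core op
  const int num_qo_heads = 32;
  const int num_kv_heads = 4;   // hypothetical GQA config
  const int group_size = num_qo_heads / num_kv_heads;  // 8 query heads/group
  const int paired_query_tokens = 2;  // consecutive tokens sharing pages
  // 2 tokens x 8 heads = 16 rows: exactly fills the MMA M dimension.
  std::printf("%d rows vs M=%d\n", paired_query_tokens * group_size, mma_m);
  return 0;
}
```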

cyLi-Tiger commented 3 months ago

Since the basic shape of the tensor-core op is 16x8x16 (an M dimension of 16)

This might seem silly... but where does that shape come from? Any reference?

happierpig commented 3 months ago

You can check the official CUDA docs here. Since "mma.sync.m8n8k4 is optimized for target architecture sm_70 and may have substantially reduced performance on other target architectures", I suppose the effective minimal M dimension is 16.
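
For a concrete view of that 16-row M dimension, here is a minimal WMMA kernel; the 16x16x16 fragment shape is the smallest half-precision tile the CUDA C++ WMMA API exposes, and this sketch is illustrative rather than Quest's kernel code:

```cpp
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 half-precision tile: the M dimension
// of the fragment is 16, matching the "effective minimal M = 16" point.
__global__ void wmma_tile(const half* a, const half* b, float* c) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
  wmma::fill_fragment(fc, 0.0f);
  wmma::load_matrix_sync(fa, a, /*ldm=*/16);
  wmma::load_matrix_sync(fb, b, /*ldm=*/16);
  wmma::mma_sync(fc, fa, fb, fc);  // one tensor-core multiply-accumulate
  wmma::store_matrix_sync(c, fc, /*ldm=*/16, wmma::mem_row_major);
}
```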

cyLi-Tiger commented 3 months ago

All my questions are well answered, thanks!