The following edits were required to make llama3 8b fp16 work:
config["attn_head_count"] = 8 # 8 instead of 32
config["paged_kv_cache"] = {}
config["paged_kv_cache"]["block_seq_stride"] = config["block_seq_stride"]
del config["block_seq_stride"]
config["paged_kv_cache"]["device_block_count"] = 256
There are 2 main problems:
1. `attn_head_count` should be set to `attention_head_count_kv` from export_paged_llm_v1, not `attention_head_count`. This should be fixed in sharktank, at least by including both attention head counts in the exported config (see the sketch after this list).
2. The KV-cache parameters should live under `config["paged_kv_cache"]`.
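A rough sketch of what the sharktank-side fix could look like, i.e. exporting both head counts and nesting the KV-cache parameters. The helper and the `hp` attribute names are assumptions for illustration, not the actual export_paged_llm_v1 code:

```python
def generate_params_json(hp, block_seq_stride: int, device_block_count: int) -> dict:
    """Hypothetical helper: build the config dict written alongside the exported module."""
    return {
        # Export both head counts so shortfin can size the KV cache correctly.
        "attention_head_count": hp.attention_head_count,        # query heads (32 for llama3 8b)
        "attention_head_count_kv": hp.attention_head_count_kv,  # KV heads (8 for llama3 8b)
        # Keep KV-cache parameters grouped where shortfin looks for them.
        "paged_kv_cache": {
            "block_seq_stride": block_seq_stride,
            "device_block_count": device_block_count,
        },
    }
```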
We really need integration tests between sharktank and shortfin.