nod-ai / shark-ai

SHARK Inference Modeling and Serving
Apache License 2.0

Mismatches between config.json exported by export_paged_llm_v1.py and expected by shortfin #405

Open renxida opened 2 weeks ago

renxida commented 2 weeks ago

The following edits were required to make llama3 8b fp16 work:

config["attn_head_count"] = 8  # exporter writes 32 (query heads); shortfin expects the KV head count, which is 8 for llama3 8b (GQA)
config["paged_kv_cache"] = {}  # shortfin expects paged KV-cache settings nested under this key
config["paged_kv_cache"]["block_seq_stride"] = config["block_seq_stride"]
del config["block_seq_stride"]  # moved into the nested object
config["paged_kv_cache"]["device_block_count"] = 256  # missing entirely from the exported config
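For reference, the edits above can be applied programmatically. This is a minimal sketch, not a committed fix: `patch_exported_config` is a hypothetical helper, and the values (8 KV heads, 256 device blocks) match only the llama3 8b fp16 case reported here.

```python
import json

def patch_exported_config(config: dict) -> dict:
    """Apply the workaround edits from this issue to a config dict
    loaded from the config.json written by export_paged_llm_v1.py.
    Hypothetical helper; values are specific to llama3 8b fp16."""
    patched = dict(config)
    # shortfin expects the KV head count (8 for llama3 8b, which uses
    # GQA), not the query head count (32) that the exporter writes.
    patched["attn_head_count"] = 8
    # shortfin reads paged KV-cache settings from a nested
    # "paged_kv_cache" object rather than from top-level keys.
    patched["paged_kv_cache"] = {
        "block_seq_stride": patched.pop("block_seq_stride"),
        "device_block_count": 256,
    }
    return patched

# Example with a minimal stand-in for the exported config:
exported = {"attn_head_count": 32, "block_seq_stride": 16}
print(json.dumps(patch_exported_config(exported), indent=2))
```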

There are two main problems: the exporter writes the query head count where shortfin expects the KV head count, and the paged KV-cache settings are exported as top-level keys while shortfin expects them nested under paged_kv_cache (with device_block_count missing entirely).

More broadly, we really need integration tests between sharktank and shortfin.

renxida commented 2 weeks ago

This was triaged in #401.