nod-ai / shark-ai

SHARK Inference Modeling and Serving
Apache License 2.0

Mismatches between config.json exported by export_paged_llm_v1.py and expected by shortfin #405

Open renxida opened 2 weeks ago

renxida commented 2 weeks ago

The following edits were required to make llama3 8b fp16 work:

config["attn_head_count"] = 8  # exporter writes 32 (query heads); shortfin expects the KV head count, which is 8 for llama3 8b (GQA)
config["paged_kv_cache"] = {}  # shortfin expects paged KV-cache settings nested under this key
config["paged_kv_cache"]["block_seq_stride"] = config["block_seq_stride"]
del config["block_seq_stride"]  # moved into the nested object
config["paged_kv_cache"]["device_block_count"] = 256  # missing entirely from the exported config
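For reference, the edits above can be applied programmatically. This is a minimal sketch, not a committed fix: `patch_exported_config` is a hypothetical helper, and the values (8 KV heads, 256 device blocks) match only the llama3 8b fp16 case reported here.

```python
import json

def patch_exported_config(config: dict) -> dict:
    """Apply the workaround edits from this issue to a config dict
    loaded from the config.json written by export_paged_llm_v1.py.
    Hypothetical helper; values are specific to llama3 8b fp16."""
    patched = dict(config)
    # shortfin expects the KV head count (8 for llama3 8b, which uses
    # GQA), not the query head count (32) that the exporter writes.
    patched["attn_head_count"] = 8
    # shortfin reads paged KV-cache settings from a nested
    # "paged_kv_cache" object rather than from top-level keys.
    patched["paged_kv_cache"] = {
        "block_seq_stride": patched.pop("block_seq_stride"),
        "device_block_count": 256,
    }
    return patched

# Example with a minimal stand-in for the exported config:
exported = {"attn_head_count": 32, "block_seq_stride": 16}
print(json.dumps(patch_exported_config(exported), indent=2))
```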

There are two main problems: the exporter writes the query head count where shortfin expects the KV head count, and the paged KV-cache settings are exported as top-level keys while shortfin expects them nested under paged_kv_cache (with device_block_count missing entirely).

More broadly, we really need integration tests between sharktank and shortfin.

renxida commented 2 weeks ago

This was triaged in #401.