nod-ai / sharktank

SHARK Inference Modeling and Serving

Finish support for exporting batch size 1 llama using direct cache #40

Open · ScottTodd opened this issue 3 months ago

ScottTodd commented 3 months ago

https://github.com/nod-ai/sharktank/pull/39 added initial support for exporting a bs1-only variant of the llama model using a direct cache (instead of a paged cache). With that change, prefill can be exported at batch size 1:

```bash
# First comment out `generate_batch_decode(bs)`, then run:
python -m sharktank.examples.export_paged_llm_v1 \
    --hf-dataset=open_llama_3b_v2_f16_gguf \
    --output_mlir=/tmp/llama_prefill_bs1.mlir \
    --output_config=/tmp/llama_prefill_bs1.json \
    --bs=1
```
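For context, here is a rough sketch of the difference between the two cache layouts in plain PyTorch. This is not the sharktank API; every name and shape below is illustrative only.

```python
import torch

# Illustrative hyperparameters only; the real model derives these from the
# GGUF dataset's hyperparameters.
bs, max_seq_len, n_kv_heads, head_dim = 1, 2048, 32, 100

# Direct cache: one dense K and one dense V tensor per transformer layer,
# addressed directly by (batch, position).
direct_k = torch.zeros(bs, max_seq_len, n_kv_heads, head_dim)
direct_v = torch.zeros(bs, max_seq_len, n_kv_heads, head_dim)

# Paged cache: a flat pool of fixed-size pages plus a per-sequence page
# table mapping logical blocks to physical pages; reads and writes go
# through that indirection (hence the index ops mentioned below).
page_count, block_seq_stride = 256, 16
page_pool = torch.zeros(page_count, block_seq_stride, n_kv_heads, head_dim)
page_table = torch.zeros(bs, max_seq_len // block_seq_stride, dtype=torch.int64)
```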

Decode support will need this TODO resolved: https://github.com/nod-ai/sharktank/blob/54332d0276ce5e6fea8649a2da0b8c9598134e70/sharktank/sharktank/models/llama/llama.py#L432-L443
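For reference, a minimal sketch of what a direct-cache decode write could look like, assuming a dense `[bs, max_seq_len, heads, dim]` cache tensor and a per-sequence `start_positions` vector. The function and argument names here are hypothetical, not the actual kv_cache.py / llama.py API.

```python
import torch

def write_timestep_direct(
    cache_k: torch.Tensor,         # [bs, max_seq_len, heads, dim]
    cache_v: torch.Tensor,         # [bs, max_seq_len, heads, dim]
    new_k: torch.Tensor,           # [bs, 1, heads, dim] K for the token being decoded
    new_v: torch.Tensor,           # [bs, 1, heads, dim] V for the token being decoded
    start_positions: torch.Tensor, # [bs] position of the new token per sequence
) -> None:
    # Scatter the single new timestep into the dense cache. For bs=1 this is
    # effectively a slice assignment at one position.
    batch_idx = torch.arange(cache_k.shape[0])
    cache_k[batch_idx, start_positions] = new_k[:, 0]
    cache_v[batch_idx, start_positions] = new_v[:, 0]
```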

See the use of `index` ops here: https://github.com/nod-ai/sharktank/blob/54332d0276ce5e6fea8649a2da0b8c9598134e70/sharktank/sharktank/layers/kv_cache.py#L275-L310
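Roughly, those index ops come from gather/scatter-style addressing into the page pool. A generic PyTorch sketch of that pattern is below; names and shapes are illustrative, not the actual kv_cache.py signatures.

```python
import torch

def write_timestep_paged(
    page_pool: torch.Tensor,   # [page_count, block_seq_stride, heads, dim]
    page_table: torch.Tensor,  # [bs, blocks_per_seq] int64 physical page ids
    new_kv: torch.Tensor,      # [bs, heads, dim] K or V for the current token
    positions: torch.Tensor,   # [bs] absolute position of the current token
    block_seq_stride: int,
) -> None:
    # Translate the absolute position into (physical page, offset within page)
    # and scatter with index_put_; this kind of data-dependent gather/scatter
    # is what shows up as tensor 'index' ops in the exported IR.
    block_idx = positions // block_seq_stride   # [bs] logical block index
    offsets = positions % block_seq_stride      # [bs] slot within the block
    pages = page_table.gather(1, block_idx.unsqueeze(1)).squeeze(1)  # [bs] physical pages
    page_pool.index_put_((pages, offsets), new_kv)
```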

ScottTodd commented 3 months ago

If we end up wanting bs1 with a paged cache, that also fails to compile, with `<unknown>:0: error: 'vector.mask' op expects a 'vector<4x1xi1>' mask for the maskable operation`.