https://github.com/nod-ai/sharktank/pull/39 started support for exporting a bs1-only variant of the llama model using a direct cache (instead of a paged cache) and was able to export prefill using batch size 1:
```shell
# first comment out `generate_batch_decode(bs)`, then run
python -m sharktank.examples.export_paged_llm_v1 \
  --hf-dataset=open_llama_3b_v2_f16_gguf \
  --output_mlir=/tmp/llama_prefill_bs1.mlir \
  --output_config=/tmp/llama_prefill_bs1.json \
  --bs=1
```
If we end up wanting bs1 with a paged cache, that also fails to compile, with `<unknown>:0: error: 'vector.mask' op expects a 'vector<4x1xi1>' mask for the maskable operation`.
Decode support will need this TODO resolved: https://github.com/nod-ai/sharktank/blob/54332d0276ce5e6fea8649a2da0b8c9598134e70/sharktank/sharktank/models/llama/llama.py#L432-L443
See the use of `index` ops here: https://github.com/nod-ai/sharktank/blob/54332d0276ce5e6fea8649a2da0b8c9598134e70/sharktank/sharktank/layers/kv_cache.py#L275-L310
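For context on what a "direct" (non-paged) cache means here, a minimal sketch is below. This is *not* the sharktank implementation — the class name, shapes, and methods are illustrative assumptions — but it shows the general shape of a dense per-layer KV tensor written in place at the current sequence positions via index ops, which is the kind of indexing that surfaces in the linked `kv_cache.py` code and in the exported IR:

```python
# Hypothetical sketch of a direct KV cache (illustrative only, not the
# sharktank API): one dense tensor each for K and V, updated in place
# along the sequence dimension with an index op.
import torch


class DirectKVCache:
    def __init__(self, max_seq_len: int, n_heads: int, head_dim: int, bs: int = 1):
        # Dense storage: [bs, max_seq_len, n_heads, head_dim] for K and V.
        self.k = torch.zeros(bs, max_seq_len, n_heads, head_dim)
        self.v = torch.zeros(bs, max_seq_len, n_heads, head_dim)

    def write(self, k_new: torch.Tensor, v_new: torch.Tensor, start_pos: int):
        # k_new / v_new: [bs, seq_len, n_heads, head_dim].
        seq_len = k_new.shape[1]
        positions = torch.arange(start_pos, start_pos + seq_len)
        # index_copy_ along dim 1 (the sequence dimension); index-style
        # writes like this lower to the 'index' ops mentioned above.
        self.k.index_copy_(1, positions, k_new)
        self.v.index_copy_(1, positions, v_new)

    def read(self, end_pos: int):
        # Return everything cached up to the current position.
        return self.k[:, :end_pos], self.v[:, :end_pos]
```

The contrast with a paged cache is that the sequence dimension here is one contiguous allocation indexed by absolute position, rather than a table of fixed-size pages indexed indirectly through page IDs, which is what makes the bs1 export simpler in the direct case.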