nod-ai / sharktank

SHARK Inference Modeling and Serving

Finish support for exporting batch size 1 llama using direct cache #40

Open · ScottTodd opened this issue 3 months ago

ScottTodd commented 3 months ago

https://github.com/nod-ai/sharktank/pull/39 added initial support for exporting a bs1-only variant of the llama model using a direct cache (instead of a paged cache). With that change, prefill can be exported at batch size 1:

```bash
# First comment out `generate_batch_decode(bs)`, then run:
python -m sharktank.examples.export_paged_llm_v1 \
    --hf-dataset=open_llama_3b_v2_f16_gguf \
    --output_mlir=/tmp/llama_prefill_bs1.mlir \
    --output_config=/tmp/llama_prefill_bs1.json \
    --bs=1
```
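For context, here is a rough sketch of the difference between the two cache layouts in plain PyTorch. This is not the sharktank API; every name and shape below is illustrative only.

```python
import torch

# Illustrative hyperparameters only; the real model derives these from the
# GGUF dataset's hyperparameters.
bs, max_seq_len, n_kv_heads, head_dim = 1, 2048, 32, 100

# Direct cache: one dense K and one dense V tensor per transformer layer,
# addressed directly by (batch, position).
direct_k = torch.zeros(bs, max_seq_len, n_kv_heads, head_dim)
direct_v = torch.zeros(bs, max_seq_len, n_kv_heads, head_dim)

# Paged cache: a flat pool of fixed-size pages plus a per-sequence page
# table mapping logical blocks to physical pages; reads and writes go
# through that indirection (hence the index ops mentioned below).
page_count, block_seq_stride = 256, 16
page_pool = torch.zeros(page_count, block_seq_stride, n_kv_heads, head_dim)
page_table = torch.zeros(bs, max_seq_len // block_seq_stride, dtype=torch.int64)
```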

Decode support will need this TODO resolved: https://github.com/nod-ai/sharktank/blob/54332d0276ce5e6fea8649a2da0b8c9598134e70/sharktank/sharktank/models/llama/llama.py#L432-L443
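For reference, a minimal sketch of what a direct-cache decode write could look like, assuming a dense `[bs, max_seq_len, heads, dim]` cache tensor and a per-sequence `start_positions` vector. The function and argument names here are hypothetical, not the actual kv_cache.py / llama.py API.

```python
import torch

def write_timestep_direct(
    cache_k: torch.Tensor,         # [bs, max_seq_len, heads, dim]
    cache_v: torch.Tensor,         # [bs, max_seq_len, heads, dim]
    new_k: torch.Tensor,           # [bs, 1, heads, dim] K for the token being decoded
    new_v: torch.Tensor,           # [bs, 1, heads, dim] V for the token being decoded
    start_positions: torch.Tensor, # [bs] position of the new token per sequence
) -> None:
    # Scatter the single new timestep into the dense cache. For bs=1 this is
    # effectively a slice assignment at one position.
    batch_idx = torch.arange(cache_k.shape[0])
    cache_k[batch_idx, start_positions] = new_k[:, 0]
    cache_v[batch_idx, start_positions] = new_v[:, 0]
```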

See the use of `index` ops here: https://github.com/nod-ai/sharktank/blob/54332d0276ce5e6fea8649a2da0b8c9598134e70/sharktank/sharktank/layers/kv_cache.py#L275-L310
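Roughly, those index ops come from gather/scatter-style addressing into the page pool. A generic PyTorch sketch of that pattern is below; names and shapes are illustrative, not the actual kv_cache.py signatures.

```python
import torch

def write_timestep_paged(
    page_pool: torch.Tensor,   # [page_count, block_seq_stride, heads, dim]
    page_table: torch.Tensor,  # [bs, blocks_per_seq] int64 physical page ids
    new_kv: torch.Tensor,      # [bs, heads, dim] K or V for the current token
    positions: torch.Tensor,   # [bs] absolute position of the current token
    block_seq_stride: int,
) -> None:
    # Translate the absolute position into (physical page, offset within page)
    # and scatter with index_put_; this kind of data-dependent gather/scatter
    # is what shows up as tensor 'index' ops in the exported IR.
    block_idx = positions // block_seq_stride   # [bs] logical block index
    offsets = positions % block_seq_stride      # [bs] slot within the block
    pages = page_table.gather(1, block_idx.unsqueeze(1)).squeeze(1)  # [bs] physical pages
    page_pool.index_put_((pages, offsets), new_kv)
```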

ScottTodd commented 3 months ago

If we end up wanting bs1 with a paged cache, that also fails to compile, with `<unknown>:0: error: 'vector.mask' op expects a 'vector<4x1xi1>' mask for the maskable operation`.