This is an extension of the main Turbine refactoring work: https://github.com/nod-ai/SHARK/issues/1931. To enable future performance-related work, we should recreate the 1.0 benchmarking mode from vicuna.py:
Enablement
[x] Allow llama2 to be run with a single prompt using a CLI script (@raikonenfnu)
[ ] Port the benchmarking/statistics options from vicuna.py (e.g., setting the prompts, generating exactly K output tokens, running multiple iterations and reporting the averages, etc.)
[ ] Add a README with benchmarking instructions
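As a reference for the benchmarking item above, here is a minimal sketch of the kind of harness vicuna.py provided: run the model for a fixed number of output tokens over several iterations and report averages. The `generate_fn(prompt, max_tokens)` interface is a hypothetical stand-in, not the actual Turbine API.

```python
import time

def benchmark(generate_fn, prompt, num_tokens=128, iters=5):
    # Hypothetical harness: generate_fn(prompt, max_tokens) is an assumed
    # interface standing in for the real model entry point.
    latencies = []
    for _ in range(iters):
        start = time.perf_counter()
        # Force exactly num_tokens output tokens per iteration.
        generate_fn(prompt, max_tokens=num_tokens)
        latencies.append(time.perf_counter() - start)
    avg = sum(latencies) / len(latencies)
    return {
        "avg_latency_s": avg,
        "avg_tokens_per_s": num_tokens / avg,
    }
```

The real version would also want per-token latency breakdowns (prefill vs. decode) and warm-up iterations excluded from the averages.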
Correctness
[ ] Make sure the output is human-readable with 7b/13b/70b on the targets of interest (gfx9, gfx11, and others)
Performance
[x] Add ukernel for argmax for ROCm GFX9 (@raikonenfnu)
[x] Add ukernel for argmax for ROCm GFX11 (@raikonenfnu)
[ ] Add kernel for argmax for SPIR-V/Vulkan (@qedawkins)
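For context on the argmax kernel items: greedy decoding picks the next token as the index of the largest logit, and fast GPU implementations typically do this as a tree reduction over (index, value) pairs so it maps onto subgroup/wave operations. A pure-Python sketch of that reduction pattern (illustrative only, not the ukernel code):

```python
def argmax_pairwise(logits):
    # Carry (index, value) pairs and combine them pairwise, the same
    # shape of reduction a GPU argmax ukernel performs per subgroup.
    pairs = list(enumerate(logits))
    while len(pairs) > 1:
        nxt = []
        for i in range(0, len(pairs) - 1, 2):
            a, b = pairs[i], pairs[i + 1]
            # Ties resolve to the lower index, matching np.argmax semantics.
            nxt.append(a if a[1] >= b[1] else b)
        if len(pairs) % 2:
            nxt.append(pairs[-1])  # odd element passes through
        pairs = nxt
    return pairs[0][0]
```

On gfx9/gfx11 the pairwise combine step would use DPP/permlane shuffles; on SPIR-V/Vulkan it would use subgroup arithmetic, which is why each target needs its own kernel.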