This is an extension of the main Turbine refactoring work: https://github.com/nod-ai/SHARK/issues/1931. To enable future performance-related work, we should recreate the 1.0 benchmarking mode from vicuna.py:
Enablement
[x] Allow llama2 to be run with a single prompt using a CLI script (@raikonenfnu)
[ ] Port the benchmarking/statistics options from vicuna.py (e.g., setting the prompts, generating exactly K output tokens, running multiple iterations and reporting the averages, etc.)
[ ] Add a README with benchmarking instructions
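As a reference for the benchmarking item above, here is a minimal sketch of the kind of harness vicuna.py provided: run the model for a fixed number of output tokens over several iterations and report averages. The `generate_fn(prompt, max_tokens)` interface is a hypothetical stand-in, not the actual Turbine API.

```python
import time

def benchmark(generate_fn, prompt, num_tokens=128, iters=5):
    # Hypothetical harness: generate_fn(prompt, max_tokens) is an assumed
    # interface standing in for the real model entry point.
    latencies = []
    for _ in range(iters):
        start = time.perf_counter()
        # Force exactly num_tokens output tokens per iteration.
        generate_fn(prompt, max_tokens=num_tokens)
        latencies.append(time.perf_counter() - start)
    avg = sum(latencies) / len(latencies)
    return {
        "avg_latency_s": avg,
        "avg_tokens_per_s": num_tokens / avg,
    }
```

The real version would also want per-token latency breakdowns (prefill vs. decode) and warm-up iterations excluded from the averages.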
Correctness
[ ] Make sure the output is human-readable with 7b/13b/70b on the targets of interest (gfx9, gfx11, and others)
Performance
[x] Add ukernel for argmax for ROCm GFX9 (@raikonenfnu)
[x] Add ukernel for argmax for ROCm GFX11 (@raikonenfnu)
[ ] Add kernel for argmax for SPIR-V/Vulkan (@qedawkins)
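For context on the argmax kernel items: greedy decoding picks the next token as the index of the largest logit, and fast GPU implementations typically do this as a tree reduction over (index, value) pairs so it maps onto subgroup/wave operations. A pure-Python sketch of that reduction pattern (illustrative only, not the ukernel code):

```python
def argmax_pairwise(logits):
    # Carry (index, value) pairs and combine them pairwise, the same
    # shape of reduction a GPU argmax ukernel performs per subgroup.
    pairs = list(enumerate(logits))
    while len(pairs) > 1:
        nxt = []
        for i in range(0, len(pairs) - 1, 2):
            a, b = pairs[i], pairs[i + 1]
            # Ties resolve to the lower index, matching np.argmax semantics.
            nxt.append(a if a[1] >= b[1] else b)
        if len(pairs) % 2:
            nxt.append(pairs[-1])  # odd element passes through
        pairs = nxt
    return pairs[0][0]
```

On gfx9/gfx11 the pairwise combine step would use DPP/permlane shuffles; on SPIR-V/Vulkan it would use subgroup arithmetic, which is why each target needs its own kernel.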