llava-benchmark is a general-purpose benchmarking tool designed to evaluate the image and audio processing capabilities of LLaVA models with Ollama.
EvalRateBenchmark: Measure model image processing speed
LicensePlateBenchmark: Extract license plate numbers from processed images
CallAudioBenchmark: Transcribe phone calls to summarized call notes from audio files
By running these benchmarks, you can quickly assess how well different LLaVA models perform when asked to read license plate numbers from images, or to summarize an audio call recording as call notes.
Before running llava-benchmark, clone the repository to your local machine:
Open a Terminal: On Windows, you can use Command Prompt or PowerShell. On macOS or Linux, you can use Terminal.
Navigate to the Desired Directory: Use the cd command to navigate to the directory where you want to clone the repository.
Clone the Repository: Run the following command to clone the repository:
git clone https://github.com/jcassady/llava-benchmark.git
Follow these steps after cloning into the local llava-benchmark/ repo directory:
Create a Virtual Environment:
python -m venv .venv
Activate the Virtual Environment:
.venv\Scripts\Activate.ps1   (Windows PowerShell)
source .venv/bin/activate    (macOS/Linux)
Install the Dependencies:
pip install -r requirements.txt
The tool uses YAML configuration files in the data/ directory to specify the models, prompts, and media files for each benchmark to use.
Here's a brief explanation of each section:
models: This lists the models to be benchmarked
prompts: This lists the prompts to be used for each model
media: This lists the file names to be used in the benchmark
The LicensePlateBenchmark configuration lists images of license plates:
# data/config_license_plates.yml
models:
- llava:latest
- llava-llama3:8b
prompts:
- >-
Read and return the license plate number and letters
as text on a new line as plain text:
media:
- 1.jpg
- 2.jpg
The CallAudioBenchmark configuration lists audio files of phone calls:
# data/config_call_audio.yml
models:
- llava:latest
- llava-llama3:8b
prompts:
- >-
Summarize the key points of this audio call
transcript in point form as call notes:
media:
- 1.mp3
- 2.mp3
When you execute llava_benchmark.py, it performs a series of operations:
Checks if Ollama is Installed: The script checks if the ollama binary is present on your system. If not, it will print an error message and exit.
Checks if the Model is Installed: For each model specified in the YAML configuration file, the script checks if the model is installed. If a model is not found, it will print a message and skip that model.
Runs the Benchmark: For each model, prompt, and media file specified in the YAML configuration file, the script runs the ollama command and stores the evaluation rate and any relevant test result data.
Prints the Average Evaluation Rate: After running the benchmark for all models, prompts, and media files, the script prints the average evaluation rate for each model.
Plots the Evaluation Rate Chart: The script plots an ASCII line chart of the evaluation rates for visual analysis.
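As a rough illustration of this flow (not the script's actual internals), a minimal sketch might look like the following; the run_benchmarks helper, the benchmark.process/report methods, and the exact ollama invocation are assumptions for illustration:

# Minimal sketch of the benchmark flow described above (illustrative only;
# helper names and parsing details are assumptions, not the script's actual code).
import shutil
import subprocess
import yaml

def run_benchmarks(config_path, benchmarks):
    # 1. Check that the ollama binary is installed.
    if shutil.which("ollama") is None:
        print("Error: ollama is not installed.")
        return

    with open(config_path) as f:
        config = yaml.safe_load(f)

    for model in config["models"]:
        # 2. Skip models that are not installed locally.
        installed = subprocess.run(["ollama", "list"], capture_output=True, text=True)
        if model not in installed.stdout:
            print(f"Model {model} not found, skipping.")
            continue

        # 3. Run each prompt against each media file and collect results.
        for prompt in config["prompts"]:
            for media in config["media"]:
                result = subprocess.run(
                    ["ollama", "run", "--verbose", model, f"{prompt} {media}"],
                    capture_output=True, text=True,
                )
                for benchmark in benchmarks:
                    benchmark.process(result.stdout + result.stderr)

    # 4. Report averages and plot the eval rate chart.
    for benchmark in benchmarks:
        benchmark.report()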
To run the script, navigate to the root directory containing the llava_benchmark.py script and use the --media license_plates argument to run the LicensePlateBenchmark:
$ python llava_benchmark.py --media license_plates
========================================
MODEL: llava:latest
========================================
PROMPT:
Read and return the license plate
number and letters as text on a new
line as plain text:
DATA\IMAGES\1.JPG
  Tokens/s: 55.56
  Plate: K5210V
DATA\IMAGES\2.JPG
  Tokens/s: 54.73
  Plate: PAX 44
----------------------------------------
Average eval rate: 56.833
----------------------------------------
[ASCII line chart of evaluation rates renders here (Y-axis: Evaluation Rates, X-axis: Images)]
To run the CallAudioBenchmark, use the --media call_audio argument:
$ python llava_benchmark.py --media call_audio
========================================
MODEL: llava:latest
========================================
PROMPT:
Summarize the key points of this audio
call transcript in point form as call
notes:
DATA\CALL_AUDIO\1.MP3
  Tokens/s: 51.64
----------------------------------------
CALL NOTES:
| * Cloud network temporarily shut down
| due to non-payment of subscription
| * Circumstances can change and
| assistance is available
| * Payment needed to reactivate
| services
| * Internet connectivity issues can be
| addressed by contacting local provider
| * Once online, assistance will be
| provided
DATA\CALL_AUDIO\2.MP3
  Tokens/s: 51.23
----------------------------------------
CALL NOTES:
| Call Notes:
| * App being discussed is a meditation
| app
| * The app is described as more potent
| than a triple shot almond milk latte
| and is disrupting the sharing economy
| * Unicorn mascots in augmented reality
| glasses are mentioned
| * Flash mob IPO with dancers spelling
| out stock ticker in Times Square
| * Ocha, Man-Bun, Kombatcha, Aficionado
| terms listed
----------------------------------------
Average eval rate: 51.435
----------------------------------------
[ASCII line chart of evaluation rates renders here (Y-axis: Evaluation Rates, X-axis: Media)]
The source code for the project includes comprehensive documentation comments and docstrings. Automatically generated HTML docs can be viewed on GitHub Pages:
https://jcassady.github.io/llava-benchmark/
Please see the source files, including __init__.py files, for comments and additional information on the structure and organization of this project.
The EvalRateBenchmark class is initialized to process and store the evaluation rates from a benchmark's result. Eval rates provide metrics measured in tokens/s, and chart performance differences between models and media files under test.
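A simplified sketch of how such a class might accumulate eval rates is shown below; the class and method names and the regular expression are illustrative assumptions based on the "eval rate" figure that ollama reports in its verbose statistics:

# Illustrative sketch of an eval-rate collector; the real EvalRateBenchmark may differ.
import re

class EvalRateCollector:
    """Collects eval rates (tokens/s) parsed from ollama's verbose output."""

    def __init__(self):
        self.rates = []

    def process(self, output: str):
        # Verbose ollama output includes a line such as "eval rate: 55.56 tokens/s".
        match = re.search(r"eval rate:\s*([\d.]+)\s*tokens/s", output)
        if match:
            self.rates.append(float(match.group(1)))

    def average(self) -> float:
        return sum(self.rates) / len(self.rates) if self.rates else 0.0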
The LicensePlateBenchmark class is initialized to process the license plate from an image file. License plate numbers are read with a compatible LLaVA model, then returned alongside the benchmark result containing the eval rate produced by EvalRateBenchmark.
The CallAudioBenchmark class is initialized to process call audio recordings via speech-to-text transcription with OpenAI's whisper library. Local LLaVA models summarize the transcripts into call notes, which are returned alongside the benchmark result from EvalRateBenchmark.
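A rough sketch of that transcribe-then-summarize flow, assuming the openai-whisper package and an illustrative direct call to ollama (not the project's actual method names):

# Rough sketch of the transcribe-then-summarize flow (illustrative only).
import subprocess
import whisper

def summarize_call(audio_path: str, model: str, prompt: str) -> str:
    # Transcribe the recording locally with openai-whisper.
    transcript = whisper.load_model("base").transcribe(audio_path)["text"]

    # Ask the local model (via ollama) to turn the transcript into call notes.
    result = subprocess.run(
        ["ollama", "run", model, f"{prompt}\n{transcript}"],
        capture_output=True, text=True,
    )
    return result.stdout.strip()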
The project is designed to be easily extendable for other LLaVA-compatible tasks. This is done through the use of benchmark objects, which are instances of classes that define specific tasks.
In the main function of llava_benchmark.py, instances of EvalRateBenchmark and LicensePlateBenchmark are executed when the --media argument license_plates is used:
# python llava_benchmark.py --media license_plates
if args.media == "license_plates":
    benchmarks = [EvalRateBenchmark(), LicensePlateBenchmark()]
    llava_benchmark("data/config_license_plates.yml", benchmarks)
The --media argument call_audio can be used to run instances of EvalRateBenchmark and CallAudioBenchmark:
# python llava_benchmark.py --media call_audio
elif args.media == "call_audio":
    benchmarks = [EvalRateBenchmark(), CallAudioBenchmark()]
    llava_benchmark("data/config_call_audio.yml", benchmarks)
To extend the script for other LLaVA tasks, you can define new benchmark classes that implement the code needed for those tasks. Then, you can create instances of those classes and add them to the benchmarks list by processing args.media.
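For example, a new benchmark for a hypothetical receipt-reading task might look like the sketch below; the class name, config file, and --media value are invented for illustration:

# Hypothetical example of extending the script with a new benchmark task.
class ReceiptTotalBenchmark:
    """Stores the total amount a model reads from each receipt image."""

    def __init__(self):
        self.totals = []

    def process(self, output: str):
        # Keep whatever the model returned as the receipt total.
        self.totals.append(output.strip())

# Then wire it into the argument handling in llava_benchmark.py:
# elif args.media == "receipts":
#     benchmarks = [EvalRateBenchmark(), ReceiptTotalBenchmark()]
#     llava_benchmark("data/config_receipts.yml", benchmarks)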
The llava-benchmark module includes a suite of tests to ensure its functionality. These tests are written using the pytest framework and make use of fixtures and parameterization to test various aspects of the benchmarking process.
To run the tests, navigate to the llava-benchmark/tests/ directory and execute the following command:
$ pytest
======================================================== test session starts =========================================================
rootdir: ./llava-benchmark
collected 2 items
tests\test_eval_rate_benchmark.py . [ 50%]
tests\test_license_plate_benchmark.py . [100%]
========================================================= 2 passed in 0.08s ==========================================================
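A minimal parametrized test in that style might look like the following; the FakeEvalRate class and its average() method are stand-ins for illustration rather than the project's actual API:

# tests/test_eval_rate_example.py -- illustrative only; the project's real
# tests and class API may differ.
import pytest

class FakeEvalRate:
    """Stand-in eval-rate collector used to demonstrate parametrization."""

    def __init__(self, rates):
        self.rates = rates

    def average(self):
        return sum(self.rates) / len(self.rates)

@pytest.mark.parametrize(
    "rates, expected",
    [
        ([55.56, 54.73], 55.145),
        ([51.64, 51.23], 51.435),
    ],
)
def test_average_eval_rate(rates, expected):
    assert FakeEvalRate(rates).average() == pytest.approx(expected)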
The .github/workflows directory contains configuration for the following GitHub Actions:
archives.yml:
docs.yml:
code_coverage.yml:
pytests.yml: Runs the llava-benchmark tests.
code_coverage_ats.yml: Runs code coverage for llava-benchmark using ATS, identifying only the tests necessary to run for each pull request, reducing the number of tests and saving time. (experimental)
Contributions are welcome to the llava-benchmark project! If you're interested in contributing, here's how you can do it:
Open an Issue: If you have a suggestion for an improvement, or you've found a bug, start by opening an issue in the project repository. Describe your suggestion or bug report in detail.
Discussion: Once the issue is opened, maintainers of the project or other contributors will review the issue and discuss it.
Implementation: If your suggestion is accepted, you or someone else can start working on implementing it.
We appreciate your help in making the LLaVA Benchmark project better!
Jordan Cassady is a Canadian Network Engineer with a decade of startup experience automating test systems aligned to company KPIs. If you've got a puzzle to solve, a codebase to conquer, or a moonshot idea, count me in. Let's connect!
https://www.linkedin.com/in/jordancassady/
This project is licensed under the terms of the MIT license.