rbitr / ferrite

Simple, lightweight transformers in Fortran
MIT License

Performance #2

Open rbitr opened 11 months ago

rbitr commented 11 months ago

Currently no performance optimizations are included in the code. I have started benchmarking performance against HF transformers (which I think is the fair comparison for this project; vs. llama.cpp for the llama.f90 code *). Results (in /benchmark) show that on my machines it's slower with Linux + OpenBLAS and faster with macOS + Accelerate, though Fortran starts up and loads the weights faster. We'll see how that changes with any optimizations.
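Since the Linux/macOS split above comes down to which BLAS backs the matrix products, a hedged micro-benchmark sketch may be useful: it times the Fortran intrinsic `matmul` against BLAS `sgemm` on the same matrices. This is not code from ferrite or its benchmarks — the size `n` and the timing approach are illustrative; link with `-lopenblas` on Linux or `-framework Accelerate` on macOS.

```fortran
! Illustrative micro-benchmark: intrinsic matmul vs. BLAS sgemm.
! Build (assumptions): gfortran bench.f90 -lopenblas   (Linux)
!                      gfortran bench.f90 -framework Accelerate  (macOS)
program bench_matmul
    implicit none
    integer, parameter :: n = 512      ! illustrative size
    real :: a(n,n), b(n,n), c1(n,n), c2(n,n)
    real :: t0, t1
    external :: sgemm                  ! single-precision BLAS GEMM

    call random_number(a)
    call random_number(b)

    ! intrinsic matmul: compiler-generated (or compiler-lowered to BLAS)
    call cpu_time(t0)
    c1 = matmul(a, b)
    call cpu_time(t1)
    print '(a,f8.4,a)', 'matmul: ', t1 - t0, ' s'

    ! explicit BLAS call: c2 = 1.0*a*b + 0.0*c2
    call cpu_time(t0)
    call sgemm('N', 'N', n, n, n, 1.0, a, n, b, n, 0.0, c2, n)
    call cpu_time(t1)
    print '(a,f8.4,a)', 'sgemm:  ', t1 - t0, ' s'

    ! sanity check that both paths computed the same product
    print '(a,es10.3)', 'max abs diff: ', maxval(abs(c1 - c2))
end program bench_matmul
```

Comparing the two on each platform would show how much of the gap is the BLAS itself versus everything around it (startup, weight loading, the non-GEMM ops).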

Current potential changes:

@certik

* There is also this approach using GGML for embeddings, but it's GPU-focused: https://bloop.ai/blog/gpu_with_ggml

certik commented 11 months ago

You can also get some inspiration here: https://github.com/certik/fastGPT/blob/c2148fbd909c82ec72eaccc00d8ddc51e9106144/gpt2.f90 — I optimized quite a few things there. I noticed in https://github.com/rbitr/llama2.f90/blob/0d5b6234f20cd60e5de43655f37b2dfe2d5d1afd/llama2.f90 that you do the kv-cache a bit differently than I do. Maybe one way is faster than the other.
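For readers unfamiliar with the kv-cache being compared here, a minimal sketch of one possible layout may help: preallocated per-layer K/V arrays with the position as the fastest-growing index, so each new token writes one contiguous column. The module, type, and shapes below are hypothetical illustrations, not the actual ferrite, llama2.f90, or fastGPT code — the linked files choose their own layouts, and layout is exactly what can make one version faster than the other.

```fortran
! Hypothetical kv-cache sketch. All names and shapes are illustrative.
module kv_cache_mod
    implicit none
    type :: kv_cache
        ! (dim, max_seq_len, n_layers): position is the second index,
        ! so each layer's cache grows along contiguous columns
        real, allocatable :: k(:,:,:), v(:,:,:)
        integer :: len = 0          ! number of cached positions so far
    end type
contains
    subroutine cache_init(c, dim, max_seq, n_layers)
        type(kv_cache), intent(out) :: c
        integer, intent(in) :: dim, max_seq, n_layers
        allocate(c%k(dim, max_seq, n_layers))
        allocate(c%v(dim, max_seq, n_layers))
        c%len = 0
    end subroutine cache_init

    subroutine cache_store(c, layer, pos, k_new, v_new)
        ! store the K/V vectors for token `pos` at layer `layer`;
        ! attention for the new token then reads c%k(:,1:c%len,layer)
        type(kv_cache), intent(inout) :: c
        integer, intent(in) :: layer, pos
        real, intent(in) :: k_new(:), v_new(:)
        c%k(:, pos, layer) = k_new
        c%v(:, pos, layer) = v_new
        c%len = max(c%len, pos)
    end subroutine cache_store
end module kv_cache_mod
```

With this index order, the attention scores for one head reduce to a GEMV over a contiguous slice of the cache; an alternative layout (e.g. position as the slowest index) trades that contiguity for simpler appends, which is the kind of difference worth timing.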